Installation

Prerequisites

This package uses python 3.10.

Set-up

Install the autofeat package using pip:

pip install autofeat

Then you can use the package in your code by importing the FeatureDiscovery class in it:

from autofeat import FeatureDiscovery

Usage

To use the framework create an Autofeat Class

autofeat = FeatureDiscovery()

Then you have to set up the base table and the target column:

autofeat.set_base_table(base_table="school/base.csv", target_column="class")

Furthermore, you have to select what repositories you want to use in the feature discovery process. The repositories should be located in ./data/repository_name

autofeat.set_dataset_repository(dataset_repository=["school"])

Alternatively, you can select all repsitories in the ./data directory by using the following command:

autofeat.set_dataset_repository(all_tables=True)

Running

Finally you can run the entire feature discovery process:

autofeat.augment_dataset()

The function has multiple parameters to tune your feature discovery process.

Parameter

Description

Type

Default

algorithm

The algorithm to use for the evaluation of each found tree. Possible options are:
- RF (Random Forrest)
- GBM (Gradient Boosting Machine)
- XT (Extra Trees)
- XGB (XGBoost)
- KNN (K-Nearest Neighbors)
- LR (Logistic Regression)

String

GBM

relation_threshold

The threshold to select relations between columns.

float

0.5

non_null_threshold

The threshold of non-null values in the resulting table after a possible join.

float

0.5

matcher

The matcher to use for the join. Possible options are: | - COMA | - JACCARD

str

COMA

top_k_features

The number of top features to select from the feature discovery process.

int

10

top_k_paths

The number of top paths to select from the feature discovery process.

int

3

explain

If True, the function will print the explanation of the feature discovery process.

bool

False

verbose

If True, the function will print the progress of the feature discovery process.

bool

False

use_cache

If True, the function will use saved relationships to load the results of earlier relation discovery processes.

bool

True

save_cache

If True, the function will save the relationships found in the relation discovery process.

bool

True