Installation

Prerequisites

This package requires Python 3.10.
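A quick way to confirm the interpreter version before installing is sketched below. The source does not say whether newer Python versions are also supported, so this sketch assumes an exact major.minor match on 3.10; the helper name `python_matches` is hypothetical and not part of the package.

```python
import sys


def python_matches(required=(3, 10), current=None):
    """Return True when the major.minor version equals `required`.

    Assumption: the package targets exactly Python 3.10; relax the
    comparison if newer versions turn out to work as well.
    """
    if current is None:
        current = sys.version_info
    return tuple(current[:2]) == tuple(required)


if not python_matches():
    print(
        f"Warning: running Python {sys.version_info.major}."
        f"{sys.version_info.minor}, but this package targets 3.10"
    )
```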

Set-up

Install the autofeat package using pip:

pip install autofeat

Then import the FeatureDiscovery class in your code:

from autofeat import FeatureDiscovery

Usage

To use the framework, create a FeatureDiscovery instance:

autofeat = FeatureDiscovery()

Then set the base table and the target column:

autofeat.set_base_table(base_table="school/base.csv", target_column="class")

Next, select which repositories to use in the feature discovery process. Each repository should be located in ./data/repository_name:

autofeat.set_dataset_repository(dataset_repository=["school"])
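Because repositories are expected under ./data/repository_name, it can help to check that the directories exist before starting a run. The following is a minimal sketch using only the standard library; `missing_repositories` is a hypothetical helper, not part of the package, and it assumes the ./data directory is resolved relative to the current working directory.

```python
from pathlib import Path


def missing_repositories(names, data_dir="data"):
    """Return the repository names that have no directory under `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_dir()]


# Assumption: ./data is relative to the directory the script is run from.
missing = missing_repositories(["school"])
if missing:
    print(f"Not found under ./data: {missing}")
```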

Alternatively, you can select all repositories in the ./data directory with:

autofeat.set_dataset_repository(all_tables=True)
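To preview which repositories a run over the whole ./data directory might pick up, the directory can be listed first. This sketch assumes that every immediate subdirectory of ./data counts as a repository; the source does not specify how all_tables=True discovers them, and `list_repositories` is a hypothetical helper.

```python
from pathlib import Path


def list_repositories(data_dir="data"):
    """Return the sorted names of the immediate subdirectories of `data_dir`."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    # Assumption: each subdirectory is one repository; plain files are skipped.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


print(list_repositories())
```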

Running

Finally, run the entire feature discovery process:

autofeat.augment_dataset()
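The steps above can be collected into one small function. This is a sketch that uses only the calls shown in this README; the wrapper name `run_feature_discovery` is hypothetical, and since the return value of augment_dataset is not documented here, it is simply passed through.

```python
def run_feature_discovery(discovery, base_table, target_column, repositories):
    """Run the three steps in order on a FeatureDiscovery-like object."""
    # Step 1: point the framework at the base table and the target column.
    discovery.set_base_table(base_table=base_table, target_column=target_column)
    # Step 2: choose the repositories to search.
    discovery.set_dataset_repository(dataset_repository=list(repositories))
    # Step 3: run the discovery; the result is returned unchanged.
    return discovery.augment_dataset()


# Usage with the real package (requires `pip install autofeat`):
#   from autofeat import FeatureDiscovery
#   run_feature_discovery(FeatureDiscovery(), "school/base.csv", "class", ["school"])
```

Passing the object in, rather than constructing it inside the function, keeps the import optional and makes the sequence easy to exercise with a stand-in object.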

The function accepts several parameters for tuning the feature discovery process:

Parameter	Description	Type	Default
algorithm	The algorithm to use for the evaluation of each found tree. Possible options are: RF (Random Forest), GBM (Gradient Boosting Machine), XT (Extra Trees), XGB (XGBoost), KNN (K-Nearest Neighbors), LR (Logistic Regression).	str	GBM
relation_threshold	The threshold to select relations between columns.	float	0.5
non_null_threshold	The threshold of non-null values in the resulting table after a possible join.	float	0.5
matcher	The matcher to use for the join. Possible options are: COMA, JACCARD.	str	COMA
top_k_features	The number of top features to select from the feature discovery process.	int	10
top_k_paths	The number of top paths to select from the feature discovery process.	int	3
explain	If True, the function will print the explanation of the feature discovery process.	bool	False
verbose	If True, the function will print the progress of the feature discovery process.	bool	False
use_cache	If True, the function will use saved relationships to load the results of earlier relation discovery processes.	bool	True
save_cache	If True, the function will save the relationships found in the relation discovery process.	bool	True
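The defaults and allowed options in the table can be captured in a small helper that checks overrides before a run. This is a sketch, not part of the package: `build_params` is a hypothetical name, the keyword names are assumed to match the table exactly, and the [0, 1] range for the two thresholds is an assumption that the table does not state.

```python
DEFAULTS = {
    "algorithm": "GBM",
    "relation_threshold": 0.5,
    "non_null_threshold": 0.5,
    "matcher": "COMA",
    "top_k_features": 10,
    "top_k_paths": 3,
    "explain": False,
    "verbose": False,
    "use_cache": True,
    "save_cache": True,
}
ALGORITHMS = {"RF", "GBM", "XT", "XGB", "KNN", "LR"}
MATCHERS = {"COMA", "JACCARD"}


def build_params(**overrides):
    """Merge overrides into the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    if params["algorithm"] not in ALGORITHMS:
        raise ValueError(f"algorithm must be one of {sorted(ALGORITHMS)}")
    if params["matcher"] not in MATCHERS:
        raise ValueError(f"matcher must be one of {sorted(MATCHERS)}")
    # Assumption: both thresholds are fractions between 0 and 1.
    for name in ("relation_threshold", "non_null_threshold"):
        if not 0.0 <= params[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    for name in ("top_k_features", "top_k_paths"):
        value = params[name]
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    return params


params = build_params(algorithm="XGB", top_k_features=5)
# The result could then be passed on as keyword arguments, for example
# autofeat.augment_dataset(**params), assuming the names match the table.
```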