API

class src.autofeatinsights.autofeat_class.FeatureDiscovery[source]

Bases: object

add_relationship(table1: str, col1: str, table2: str, col2: str, weight: float)[source]

Adds a relationship between two columns in different tables.

Parameters:

table1 (str) – The name of the first table.
col1 (str) – The name of the column in the first table.
table2 (str) – The name of the second table.
col2 (str) – The name of the column in the second table.
weight (float) – The weight of the relationship.
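
To make the five parameters concrete, the sketch below models a relationship as a weighted edge between two table columns. The `Relationship` dataclass, the module-level `add_relationship` helper, and the table and column names are hypothetical stand-ins for illustration; they are not the library's `Weight` class or its actual storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """Hypothetical stand-in for a weighted link between two columns."""
    table1: str
    col1: str
    table2: str
    col2: str
    weight: float


def add_relationship(store: list, table1: str, col1: str,
                     table2: str, col2: str, weight: float) -> None:
    """Append a new weighted edge to an in-memory list (assumed storage)."""
    store.append(Relationship(table1, col1, table2, col2, weight))


relationships: list = []
add_relationship(relationships, "orders.csv", "customer_id",
                 "customers.csv", "id", 0.9)
print(relationships[0].weight)  # 0.9
```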

add_table(table: str)[source]

Adds an extra table to the list of tables used for feature generation.

Parameters: table (str) – The name of the table to be added.

adjust_non_null_ratio(tree_id: int, table: str, value: float)[source]

Adjusts the non-null ratio for a specific tree and table.

Parameters:

tree_id (int) – The ID of the tree.
table (str) – The name of the table.
value (float) – The new non-null ratio value.

adjust_redundancy_value(tree_id: int, feature: str, value: float)[source]

Adjusts the redundancy value for a specific feature in a given tree.

Parameters:

tree_id (int) – The ID of the tree.
feature (str) – The name of the feature.
value (float) – The new redundancy value.

adjust_relevance_value(tree_id: int, feature: str, value: float)[source]

Adjusts the relevance value of a feature for a specific tree.

Parameters:

tree_id (int) – The ID of the tree.
feature (str) – The name of the feature.
value (float) – The new relevance value.

Returns:

None

augment_dataset(algorithm='GBM', relation_threshold: float = 0.5, non_null_threshold=0.5, matcher='coma', top_k_features: int = 10, top_k_paths: int = 3, explain=True, verbose=True, use_cache=True)[source]

Augments the dataset by finding relationships between features, computing join trees, and evaluating the trees.

Parameters:

algorithm (str) – The algorithm to use for tree evaluation. Default is “GBM”.
relation_threshold (float) – The threshold for considering a relationship between features. Default is 0.5.
non_null_threshold (float) – The threshold for considering a feature as non-null. Default is 0.5.
matcher (str) – The matcher to use for finding relationships. Default is “coma”.
top_k_features (int) – The number of top features to select. Default is 10.
top_k_paths (int) – The number of top paths to select. Default is 3.
explain (bool) – Whether to explain the process. Default is True.
verbose (bool) – Whether to print verbose output. Default is True.
use_cache (bool) – Whether to use cached relationship weights. Default is True.
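
The description above names three stages, which correspond to the `find_relationships`, `compute_join_trees`, and `evaluate_trees` methods documented on this page. The sketch below shows one plausible way the arguments could be forwarded to those stages. It is an assumption based on the documented signatures, not the library's actual implementation; `RecordingPipeline` and `augment_dataset_sketch` are hypothetical names, and the stub only records calls.

```python
class RecordingPipeline:
    """Hypothetical stub that records which step receives which arguments."""

    def __init__(self):
        self.calls = []

    def find_relationships(self, **kwargs):
        self.calls.append(("find_relationships", kwargs))

    def compute_join_trees(self, **kwargs):
        self.calls.append(("compute_join_trees", kwargs))

    def evaluate_trees(self, **kwargs):
        self.calls.append(("evaluate_trees", kwargs))


def augment_dataset_sketch(fd, algorithm="GBM", relation_threshold=0.5,
                           non_null_threshold=0.5, matcher="coma",
                           top_k_features=10, top_k_paths=3,
                           explain=True, verbose=True, use_cache=True):
    # Step 1: discover candidate relationships between tables.
    fd.find_relationships(matcher=matcher,
                          relationship_threshold=relation_threshold,
                          explain=explain, use_cache=use_cache,
                          verbose=verbose)
    # Step 2: build join trees and select features.
    fd.compute_join_trees(top_k_features=top_k_features,
                          non_null_threshold=non_null_threshold,
                          explain=explain, verbose=verbose)
    # Step 3: evaluate the top join trees with the chosen algorithm.
    fd.evaluate_trees(algorithm=algorithm, top_k_paths=top_k_paths,
                      verbose=verbose, explain=explain)


fd = RecordingPipeline()
augment_dataset_sketch(fd, top_k_paths=2)
step_names = [name for name, _ in fd.calls]
print(step_names)
```

Note that the threshold is called `relation_threshold` in `augment_dataset` but `relationship_threshold` in `find_relationships`, so keyword arguments cannot be passed through unchanged.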

base_dataset: str

compute_join_trees(top_k_features: int = 10, non_null_threshold=0.5, explain=False, verbose=True)[source]

Compute join trees for feature selection.

Parameters:

top_k_features (int) – Number of top features to select. Defaults to 10.
non_null_threshold (float) – Threshold for non-null ratio. Defaults to 0.5.
explain (bool) – Whether to explain the join trees. Defaults to False.
verbose (bool) – Whether to print verbose output. Defaults to True.
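
This page does not define the non-null ratio. A common reading, assumed here, is the fraction of non-missing values in a column after a join. The sketch below illustrates that reading in plain Python; `non_null_ratio` and `passes_threshold` are hypothetical helpers, and whether the library compares with `>=` or `>` is also an assumption.

```python
def non_null_ratio(values):
    """Fraction of entries that are not None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def passes_threshold(values, non_null_threshold=0.5):
    # Assumption: a column is kept when its ratio meets or exceeds the threshold.
    return non_null_ratio(values) >= non_null_threshold


joined_column = [3.2, None, 1.5, None, None]  # 2 of 5 values present after the join
print(non_null_ratio(joined_column))          # 0.4
print(passes_threshold(joined_column))        # False
```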

display_best_relationships()[source]

Displays the best relationships found by FeatureDiscovery.

display_join_tree(tree_id)[source]

Display the join path with the given tree_id.

Parameters: tree_id (int) – The ID of the join path to display.

display_join_trees(top_k: int | None = None)[source]

Display the join trees for this FeatureDiscovery instance.

Parameters: top_k (int) – The number of join trees to display. If None, display all join trees.

display_table_relationship(table1: str, table2: str)[source]

Display the relationship between two tables.

Parameters:

table1 (str) – The name of the first table.
table2 (str) – The name of the second table.

evaluate_augmented_table(tree_id: int, algorithm='GBM', verbose=False)[source]

Evaluate the augmented table using the specified algorithm and tree ID.

Parameters:

tree_id (int) – The ID of the tree to use for evaluation.
algorithm (str) – The algorithm to use for evaluation. Default is ‘GBM’.
verbose (bool) – Whether to print verbose output. Default is False.

evaluate_trees(algorithm='GBM', top_k_paths: int = 3, verbose=True, explain=False)[source]

Evaluate the performance of the generated trees.

Parameters:

algorithm (str) – The algorithm to use for evaluation. Default is ‘GBM’.
top_k_paths (int) – The number of top paths to consider. Default is 3.
verbose (bool) – Whether to print verbose output. Default is True.
explain (bool) – Whether to explain the evaluation results. Default is False.

exlude_tables: [(<class 'str'>, <class 'str'>)]

explain_relationship(table1: str, table2: str)[source]

Explains the relationship between two tables.

Parameters:

table1 (str) – The name of the first table.
table2 (str) – The name of the second table.

explain_result(tree_id: int, model: str = 'GBM')[source]

Explain the result of a specific tree in the AutoFeat pipeline.

Parameters:

tree_id (int) – The ID of the tree to explain.
model (str, optional) – The model to use for explanation. Defaults to ‘GBM’.

explain_tree(tree_id: int)[source]

Explain the tree identified by the given tree_id.

Parameters:: tree_id (int) – The ID of the tree to explain.

explore: bool

extra_tables: [(<class 'str'>, <class 'str'>)]

find_relationships(matcher='coma', relationship_threshold: float = 0.5, explain=False, use_cache=True, verbose=True)[source]

Finds relationships between features in the dataset.

Parameters:

matcher (str, optional) – The name of the matcher to use for finding relationships. Defaults to “coma”.
relationship_threshold (float, optional) – The threshold value for determining the strength of a relationship. Defaults to 0.5.
explain (bool, optional) – Whether to provide an explanation for the relationships found. Defaults to False.
use_cache (bool, optional) – Whether to use a cache for storing previously computed relationships. Defaults to True.
verbose (bool, optional) – Whether to print verbose output during the process. Defaults to True.
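
As a rough illustration of what `relationship_threshold` controls, the sketch below keeps only the candidate column pairs whose similarity score reaches the threshold. The tuple layout, the scores, and the `filter_matches` helper are assumptions made for the example; the actual matcher output and comparison rule are not specified on this page.

```python
def filter_matches(candidates, relationship_threshold=0.5):
    """Keep candidate column pairs whose similarity score reaches the threshold.

    `candidates` is a list of (table1, col1, table2, col2, score) tuples;
    this shape is an assumption made for the example.
    """
    return [c for c in candidates if c[4] >= relationship_threshold]


candidates = [
    ("orders.csv", "customer_id", "customers.csv", "id", 0.92),
    ("orders.csv", "note", "customers.csv", "name", 0.18),
    ("orders.csv", "zip", "regions.csv", "zip_code", 0.50),
]
kept = filter_matches(candidates)
print(len(kept))  # 2
```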

get_best_result()[source]

Returns the best result obtained by the evaluation module.

get_tables_repository()[source]

Retrieves the tables from the repository.

Returns: A list of table paths.
Return type: list

get_weights_from_and_to_table(from_table, to_table)[source]

Returns a list of weights that have the specified ‘from_table’ and ‘to_table’ values.

Parameters:

from_table (str) – The source table name.
to_table (str) – The destination table name.

Returns:

A list of weights that match the specified ‘from_table’ and ‘to_table’ values.

Return type:

list
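
The lookup described above amounts to filtering the stored weights on two fields. The sketch below shows that filter on a hypothetical `Weight` record; the field names are assumptions chosen to mirror the parameter names, and the library's own `Weight` class may differ.

```python
from collections import namedtuple

# Hypothetical record; the library's Weight class may expose different fields.
Weight = namedtuple("Weight",
                    ["from_table", "to_table", "from_col", "to_col", "weight"])


def weights_from_and_to(weights, from_table, to_table):
    """Return the weights whose source and destination tables both match."""
    return [w for w in weights
            if w.from_table == from_table and w.to_table == to_table]


weights = [
    Weight("orders.csv", "customers.csv", "customer_id", "id", 0.9),
    Weight("orders.csv", "products.csv", "product_id", "id", 0.8),
    Weight("orders.csv", "customers.csv", "email", "email", 0.6),
]
matches = weights_from_and_to(weights, "orders.csv", "customers.csv")
print([w.weight for w in matches])  # [0.9, 0.6]
```

In this sketch the direction matters: swapping the two table names returns an empty list. Whether the library treats relationships as directed is not stated here.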

get_weights_from_table(table: str)[source]

Returns a list of weights from the specified table.

Parameters: table (str) – The name of the table.
Returns: A list of weights from the specified table.
Return type: list

inspect_join_tree(tree_id: int)[source]

Inspects the join tree with the given tree_id.

Parameters: tree_id (int) – The ID of the join tree to inspect.

join_keys: dict = {}

materialise_join_tree(tree_id: int)[source]

Materializes the join tree with the given tree_id.

Parameters: tree_id (int) – The ID of the join tree to materialize.
Returns: The materialized join tree.

move_features_to_discarded(tree_id: int, features: [<class 'str'>])[source]

Moves the specified features to the discarded list for the given tree.

Parameters:

tree_id (int) – The ID of the tree.
features (list[str]) – The list of features to be moved to the discarded list.

move_features_to_selected(tree_id: int, features: [<class 'str'>])[source]

Moves the specified features from discarded to the selected features list for the given tree.

Parameters:

tree_id (int) – The ID of the tree.
features (list[str]) – The list of features to be moved.
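
`move_features_to_discarded` and `move_features_to_selected` are mirror images: each moves names from one per-tree list to the other. The sketch below shows that bookkeeping on two plain lists. The `move_features` helper and the feature names are hypothetical, and silently skipping unknown names is an assumption; the library may raise instead.

```python
def move_features(source, destination, features):
    """Move each named feature from `source` to `destination`, in place.

    Names absent from `source` are skipped; that choice is an assumption.
    """
    for name in features:
        if name in source:
            source.remove(name)
            destination.append(name)


selected = ["age", "income"]
discarded = ["zip_code", "signup_year"]

# Equivalent in spirit to moving a feature from discarded back to selected.
move_features(discarded, selected, ["signup_year", "unknown"])
print(selected)   # ['age', 'income', 'signup_year']
print(discarded)  # ['zip_code']
```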

non_null_ratio_threshold: float

partial_join: DataFrame

partial_join_selected_features: dict = {}

paths: [<class 'src.autofeatinsights.functions.classes.Tree'>]

read_relationships(file_path)[source]

Reads the relationships from a file and updates the object’s internal state.

Parameters: file_path (str) – The path to the file containing the relationships.

remove_join_path_from_tree(tree_id: int, table: str)[source]

Removes a join path from the tree.

Parameters:

tree_id (int) – The ID of the tree.
table (str) – The name of the table to remove the join path from.

remove_relationship(table1: str, col1: str, table2: str, col2: str)[source]

Removes a relationship between two columns in different tables.

Parameters:

table1 (str) – The name of the first table.
col1 (str) – The name of the column in the first table.
table2 (str) – The name of the second table.
col2 (str) – The name of the column in the second table.

remove_table(table: str)[source]

Removes a table from the list of extra tables and adds it to the list of excluded tables.

Parameters: table (str) – The name of the table to be removed.

results: [<class 'src.autofeatinsights.functions.classes.Result'>]

set_base_table(base_table: str, target_column: str)[source]

Sets the base table and target column for feature generation.

Parameters:

base_table (str) – The name of the base table.
target_column (str) – The name of the target column.

Returns:

None

set_dataset_repository(dataset_repository: List[str] = [], all_tables: bool = False)[source]

Sets the dataset repository for the FeatureDiscovery object.

Parameters:

dataset_repository (List[str]) – A list of dataset paths.
all_tables (bool) – Flag indicating whether to use all tables in the repository.

Raises:

Exception – If both dataset_repository and all_tables are specified.
Exception – If neither dataset_repository nor all_tables is specified.
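
The two error conditions mean that exactly one of the arguments must be given. The sketch below reproduces that check in isolation. `check_repository_args`, the error messages, and the example path are hypothetical, and the sketch uses `None` rather than the mutable `[]` default shown in the signature.

```python
from typing import List, Optional


def check_repository_args(dataset_repository: Optional[List[str]] = None,
                          all_tables: bool = False) -> None:
    """Raise unless exactly one of the two options is specified."""
    has_paths = bool(dataset_repository)
    if has_paths and all_tables:
        raise Exception("Specify either dataset_repository or all_tables, not both.")
    if not has_paths and not all_tables:
        raise Exception("Specify dataset_repository or set all_tables=True.")


check_repository_args(dataset_repository=["data/school"])  # valid
check_repository_args(all_tables=True)                      # valid

try:
    check_repository_args()  # neither option given
except Exception as err:
    message = str(err)
print(message)
```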

show_features(tree_id: int, show_discarded_features: bool = False)[source]

Display the features for a given tree ID.

Parameters:

tree_id (int) – The ID of the tree.
show_discarded_features (bool) – Whether to show discarded features or not. Default is False.

targetColumn: str

threshold: float

update_relationship(table1: str, col1: str, table2: str, col2: str, weight: float)[source]

Update the relationship between two tables and their respective columns with a given weight.

Parameters:

table1 (str) – The name of the first table.
col1 (str) – The name of the column in the first table.
table2 (str) – The name of the second table.
col2 (str) – The name of the column in the second table.
weight (float) – The weight of the relationship.

weights: [<class 'src.autofeatinsights.functions.classes.Weight'>]