API

class src.autofeatinsights.autofeat_class.FeatureDiscovery[source]

Bases: object

add_relationship(table1: str, col1: str, table2: str, col2: str, weight: float)[source]

Adds a relationship between two columns in different tables.

Parameters:
  • table1 (str) – The name of the first table.

  • col1 (str) – The name of the column in the first table.

  • table2 (str) – The name of the second table.

  • col2 (str) – The name of the column in the second table.

  • weight (float) – The weight of the relationship.
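To make the shape of a relationship concrete, here is a minimal sketch of a store that maps a pair of columns to a weight. The `ColumnLink` and `RelationshipStore` names, and the choice to overwrite on a repeated add, are assumptions made for illustration; this is not the library's internal representation, which is not documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnLink:
    """Hypothetical key identifying a link between two columns."""
    table1: str
    col1: str
    table2: str
    col2: str


class RelationshipStore:
    """Illustrative stand-in; not the library's internal data structure."""

    def __init__(self):
        self._weights = {}  # maps ColumnLink -> float

    def add_relationship(self, table1, col1, table2, col2, weight):
        # Assumption: adding an existing link simply overwrites its weight.
        self._weights[ColumnLink(table1, col1, table2, col2)] = float(weight)

    def remove_relationship(self, table1, col1, table2, col2):
        # pop() with a default makes removing an unknown link a no-op.
        self._weights.pop(ColumnLink(table1, col1, table2, col2), None)

    def weight_of(self, table1, col1, table2, col2):
        # Returns None when the link is not stored.
        return self._weights.get(ColumnLink(table1, col1, table2, col2))


store = RelationshipStore()
store.add_relationship("orders.csv", "customer_id", "customers.csv", "id", 0.9)
```

The table and column names in the last two lines are made up. With the real class, the equivalent call would be `add_relationship` on a `FeatureDiscovery` instance with the same five arguments.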

add_table(table: str)[source]

Adds an extra table to the list of tables used for feature generation.

Parameters:

table (str) – The name of the table to be added.

adjust_non_null_ratio(tree_id: int, table: str, value: float)[source]

Adjusts the non-null ratio for a specific tree and table.

Parameters:
  • tree_id (int) – The ID of the tree.

  • table (str) – The name of the table.

  • value (float) – The new non-null ratio value.

adjust_redundancy_value(tree_id: int, feature: str, value: float)[source]

Adjusts the redundancy value for a specific feature in a given tree.

Parameters:
  • tree_id (int) – The ID of the tree.

  • feature (str) – The name of the feature.

  • value (float) – The new redundancy value.

adjust_relevance_value(tree_id: int, feature: str, value: float)[source]

Adjusts the relevance value of a feature for a specific tree.

Parameters:
  • tree_id (int) – The ID of the tree.

  • feature (str) – The name of the feature.

  • value (float) – The new relevance value.

Returns:

None

augment_dataset(algorithm='GBM', relation_threshold: float = 0.5, non_null_threshold=0.5, matcher='coma', top_k_features: int = 10, top_k_paths: int = 3, explain=True, verbose=True, use_cache=True)[source]

Augments the dataset by finding relationships between features, computing join trees, and evaluating the trees.

Parameters:
  • algorithm (str) – The algorithm to use for tree evaluation. Default is “GBM”.

  • relation_threshold (float) – The threshold for considering a relationship between features. Default is 0.5.

  • non_null_threshold (float) – The threshold for considering a feature as non-null. Default is 0.5.

  • matcher (str) – The matcher to use for finding relationships. Default is “coma”.

  • top_k_features (int) – The number of top features to select. Default is 10.

  • top_k_paths (int) – The number of top paths to select. Default is 3.

  • explain (bool) – Whether to explain the process. Default is True.

  • verbose (bool) – Whether to print verbose output. Default is True.

  • use_cache (bool) – Whether to use cached relationship weights. Default is True.
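The description above names three stages, each of which is also documented as a separate method on this page. The sketch below calls them in that order, using only the keyword names shown in their signatures. Whether `augment_dataset` forwards its arguments in exactly this way is an assumption. `augment_step_by_step` and `CallRecorder` are hypothetical helpers, and the recorder only logs calls so the example can run without the library installed.

```python
def augment_step_by_step(fd, algorithm="GBM", relation_threshold=0.5,
                         non_null_threshold=0.5, matcher="coma",
                         top_k_features=10, top_k_paths=3):
    """Call the three documented stages in the order augment_dataset describes."""
    fd.find_relationships(matcher=matcher,
                          relationship_threshold=relation_threshold)
    fd.compute_join_trees(top_k_features=top_k_features,
                          non_null_threshold=non_null_threshold)
    fd.evaluate_trees(algorithm=algorithm, top_k_paths=top_k_paths)


class CallRecorder:
    """Dry-run stand-in that records method calls instead of doing any work."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the stage methods.
        def record(*args, **kwargs):
            self.calls.append((name, kwargs))
        return record


recorder = CallRecorder()
augment_step_by_step(recorder, top_k_paths=5)
print([name for name, _ in recorder.calls])
# ['find_relationships', 'compute_join_trees', 'evaluate_trees']
```

Passing a real `FeatureDiscovery` instance in place of `recorder` would run the stages themselves. Calling them individually can help when you want to inspect or edit relationships before the join trees are computed.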

base_dataset: str

compute_join_trees(top_k_features: int = 10, non_null_threshold=0.5, explain=False, verbose=True)[source]

Compute join trees for feature selection.

Parameters:
  • top_k_features (int) – Number of top features to select. Defaults to 10.

  • non_null_threshold (float) – Threshold for non-null ratio. Defaults to 0.5.

  • explain (bool) – Whether to explain the join trees. Defaults to False.

  • verbose (bool) – Whether to print verbose output. Defaults to True.
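As a rough illustration of what a non-null ratio threshold does, the sketch below computes the fraction of present values per column and keeps the columns that meet the threshold. This page does not specify how the library computes the ratio, and `adjust_non_null_ratio` suggests it is tracked per table within a tree. The per-column view, the `>=` comparison, and the helper names here are therefore all assumptions.

```python
def non_null_ratio(values):
    """Fraction of entries that are not None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def columns_passing(table, threshold=0.5):
    """Names of columns whose non-null ratio is at least `threshold`."""
    return [name for name, values in table.items()
            if non_null_ratio(values) >= threshold]


# Made-up joined data: column name -> list of values, None marks a missing value.
joined = {
    "age": [31, None, 45, 52],        # 3 of 4 present -> 0.75
    "income": [None, None, None, 1],  # 1 of 4 present -> 0.25
}
print(columns_passing(joined, threshold=0.5))
# ['age']
```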

display_best_relationships()[source]

Displays the best relationships found by FeatureDiscovery.

display_join_tree(tree_id)[source]

Display the join path with the given tree_id.

Parameters:

tree_id – The ID of the join path to display.

display_join_trees(top_k: int | None = None)[source]

Display the join trees for this FeatureDiscovery instance.

Parameters:

top_k (int | None) – The number of join trees to display. If None, all join trees are displayed.

display_table_relationship(table1: str, table2: str)[source]

Display the relationship between two tables.

Parameters:
  • table1 (str) – The name of the first table.

  • table2 (str) – The name of the second table.

evaluate_augmented_table(tree_id: int, algorithm='GBM', verbose=False)[source]

Evaluate the augmented table using the specified algorithm and tree ID.

Parameters:
  • tree_id (int) – The ID of the tree to use for evaluation.

  • algorithm (str) – The algorithm to use for evaluation. Default is ‘GBM’.

  • verbose (bool) – Whether to print verbose output. Default is False.

evaluate_trees(algorithm='GBM', top_k_paths: int = 3, verbose=True, explain=False)[source]

Evaluate the performance of the generated trees.

Parameters:
  • algorithm (str) – The algorithm to use for evaluation. Default is ‘GBM’.

  • top_k_paths (int) – The number of top paths to consider. Default is 3.

  • verbose (bool) – Whether to print verbose output. Default is True.

  • explain (bool) – Whether to explain the evaluation results. Default is False.

exlude_tables: list[tuple[str, str]]

explain_relationship(table1: str, table2: str)[source]

Explains the relationship between two tables.

Parameters:
  • table1 (str) – The name of the first table.

  • table2 (str) – The name of the second table.

explain_result(tree_id: int, model: str = 'GBM')[source]

Explain the result of a specific tree in the AutoFeat pipeline.

Parameters:
  • tree_id (int) – The ID of the tree to explain.

  • model (str, optional) – The model to use for explanation. Defaults to ‘GBM’.

explain_tree(tree_id: int)[source]

Explain the tree identified by the given tree_id.

Parameters:

tree_id (int) – The ID of the tree to explain.

explore: bool

extra_tables: list[tuple[str, str]]

find_relationships(matcher='coma', relationship_threshold: float = 0.5, explain=False, use_cache=True, verbose=True)[source]

Finds relationships between features in the dataset.

Parameters:
  • matcher (str, optional) – The name of the matcher to use for finding relationships. Defaults to “coma”.

  • relationship_threshold (float, optional) – The threshold value for determining the strength of a relationship. Defaults to 0.5.

  • explain (bool, optional) – Whether to provide an explanation for the relationships found. Defaults to False.

  • use_cache (bool, optional) – Whether to use a cache for storing previously computed relationships. Defaults to True.

  • verbose (bool, optional) – Whether to print verbose output during the process. Defaults to True.

get_best_result()[source]

Returns the best result obtained by the evaluation module.

get_tables_repository()[source]

Retrieves the tables from the repository.

Returns:

A list of table paths.

Return type:

list

get_weights_from_and_to_table(from_table, to_table)[source]

Returns a list of weights that have the specified ‘from_table’ and ‘to_table’ values.

Parameters:
  • from_table (str) – The source table name.

  • to_table (str) – The destination table name.

Returns:

A list of weights that match the specified ‘from_table’ and ‘to_table’ values.

Return type:

list
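The lookup described above amounts to filtering a list of weight records on both endpoints. The sketch below shows that filter on a hypothetical `WeightRecord` dataclass. The field names and the directional matching (source table first, destination table second) are assumptions, since the fields of the library's `Weight` class are not listed on this page.

```python
from dataclasses import dataclass


@dataclass
class WeightRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    from_table: str
    from_col: str
    to_table: str
    to_col: str
    weight: float


def weights_from_and_to(weights, from_table, to_table):
    """Keep only the records whose endpoints match both table names."""
    return [w for w in weights
            if w.from_table == from_table and w.to_table == to_table]


# Made-up records for demonstration.
all_weights = [
    WeightRecord("orders.csv", "customer_id", "customers.csv", "id", 0.9),
    WeightRecord("orders.csv", "product_id", "products.csv", "id", 0.8),
    WeightRecord("customers.csv", "id", "orders.csv", "customer_id", 0.9),
]
matches = weights_from_and_to(all_weights, "orders.csv", "customers.csv")
print([m.weight for m in matches])
# [0.9]
```

In this sketch the reverse pair (`customers.csv` to `orders.csv`) is a separate record and is not returned. Whether the library treats relationships as directional in the same way is not stated here.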

get_weights_from_table(table: str)[source]

Returns a list of weights from the specified table.

Parameters:

table (str) – The name of the table.

Returns:

A list of weights from the specified table.

Return type:

list

inspect_join_tree(tree_id: int)[source]

Inspects the join tree with the given tree_id.

Parameters:

tree_id (int) – The ID of the join tree to inspect.

join_keys: dict = {}

materialise_join_tree(tree_id: int)[source]

Materializes the join tree with the given tree_id.

Parameters:

tree_id (int) – The ID of the join tree to materialize.

Returns:

The materialized join tree.

move_features_to_discarded(tree_id: int, features: list[str])[source]

Moves the specified features to the discarded list for the given tree.

Parameters:
  • tree_id (int) – The ID of the tree.

  • features (list[str]) – The list of features to be moved to the discarded list.

move_features_to_selected(tree_id: int, features: list[str])[source]

Moves the specified features from discarded to the selected features list for the given tree.

Parameters:
  • tree_id (int) – The ID of the tree.

  • features (list[str]) – The list of features to be moved.
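The two move operations are mirror images of each other: each takes feature names out of one list and places them in the other. A minimal sketch of that bookkeeping follows. The `move_features` helper and the plain lists are hypothetical, and skipping unknown names silently is an assumption, as the library's behaviour in that case is not documented here.

```python
def move_features(source, destination, features):
    """Move each named feature from `source` to `destination`, in place."""
    for feature in features:
        if feature in source:  # assumption: names not in `source` are skipped
            source.remove(feature)
            if feature not in destination:
                destination.append(feature)


# Made-up feature lists for one tree.
selected = ["age", "income", "zip_code"]
discarded = ["customer_name"]

move_features(selected, discarded, ["zip_code"])       # like move_features_to_discarded
move_features(discarded, selected, ["customer_name"])  # like move_features_to_selected
print(selected, discarded)
# ['age', 'income', 'customer_name'] ['zip_code']
```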

non_null_ratio_threshold: float

partial_join: DataFrame

partial_join_selected_features: dict = {}

paths: list[src.autofeatinsights.functions.classes.Tree]

read_relationships(file_path)[source]

Reads the relationships from a file and updates the object’s internal state.

Parameters:

file_path (str) – The path to the file containing the relationships.

remove_join_path_from_tree(tree_id: int, table: str)[source]

Removes a join path from the tree.

Parameters:
  • tree_id (int) – The ID of the tree.

  • table (str) – The name of the table to remove the join path from.

remove_relationship(table1: str, col1: str, table2: str, col2: str)[source]

Removes a relationship between two columns in different tables.

Parameters:
  • table1 (str) – The name of the first table.

  • col1 (str) – The name of the column in the first table.

  • table2 (str) – The name of the second table.

  • col2 (str) – The name of the column in the second table.

remove_table(table: str)[source]

Removes a table from the list of extra tables and adds it to the list of excluded tables.

Parameters:

table (str) – The name of the table to be removed.

results: list[src.autofeatinsights.functions.classes.Result]

set_base_table(base_table: str, target_column: str)[source]

Sets the base table and target column for feature generation.

Parameters:
  • base_table (str) – The name of the base table.

  • target_column (str) – The name of the target column.

Returns:

None

set_dataset_repository(dataset_repository: List[str] = [], all_tables: bool = False)[source]

Sets the dataset repository for the FeatureDiscovery object.

Parameters:
  • dataset_repository (List[str]) – A list of dataset paths.

  • all_tables (bool) – Flag indicating whether to use all tables in the repository.

Raises:
  • Exception – If both dataset_repository and all_tables are specified.

  • Exception – If neither dataset_repository nor all_tables is specified.
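The two error conditions make the arguments mutually exclusive: exactly one of them must be supplied. The sketch below shows one way such a check can be written. `resolve_repository` is a hypothetical helper, not the library's code. It raises `ValueError` (a subclass of `Exception`) purely for illustration, and it uses `None` rather than a mutable `[]` as the default.

```python
def resolve_repository(dataset_repository=None, all_tables=False):
    """Return "all" or the explicit list; reject ambiguous or empty requests."""
    datasets = list(dataset_repository or [])
    if datasets and all_tables:
        raise ValueError("Pass either dataset_repository or all_tables, not both.")
    if not datasets and not all_tables:
        raise ValueError("Pass dataset_repository or set all_tables=True.")
    return "all" if all_tables else datasets


# Made-up dataset names for demonstration.
print(resolve_repository(["school", "credit"]))
# ['school', 'credit']
print(resolve_repository(all_tables=True))
# all
```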

show_features(tree_id: int, show_discarded_features: bool = False)[source]

Display the features for a given tree ID.

Parameters:
  • tree_id (int) – The ID of the tree.

  • show_discarded_features (bool) – Whether to show discarded features or not. Default is False.

targetColumn: str

threshold: float

update_relationship(table1: str, col1: str, table2: str, col2: str, weight: float)[source]

Update the relationship between two tables and their respective columns with a given weight.

Parameters:
  • table1 (str) – The name of the first table.

  • col1 (str) – The name of the column in the first table.

  • table2 (str) – The name of the second table.

  • col2 (str) – The name of the column in the second table.

  • weight (float) – The weight of the relationship.

weights: list[src.autofeatinsights.functions.classes.Weight]