dfds_ds_toolbox.feature_selection.feat_selector module

class dfds_ds_toolbox.feature_selection.feat_selector.RegFeatureSelector(**kwargs)

Bases: object

Selects useful features.

Several strategies are possible (filter and wrapper methods). Works for regression problems only.

strategy: default = “l1” The strategy to select features. Available strategies = [“variance”, “l1”, “rf_feature_importance”, ‘rf_top_features’, “stepwise”]

threshold: defaut = 0.3 The percentage of variable to discard according the strategy. Must be between 0. and 1.

fit(df_train, y_train)

Fits RegFeatureSelector.

Parameters

df_train (DataFrame) – The train dataset with numerical features and no NA. With shape = (n_train, n_features).
y_train (Series) – The target for regression task. With shape = (n_train, ).

Return type

RegFeatureSelector

Returns

self

fit_transform(df_train, y_train)

Fits RegFeatureSelector and transforms the dataset.

Parameters

df_train (DataFrame) – The train dataset with numerical features and no NA. With shape = (n_train, n_features).
y_train (Series) – The target for regression task. With shape = (n_train, ).

Return type

DataFrame

Returns

pandas dataframe of shape = (n_train, n_features*(1-threshold)). The train dataset with relevant features.

transform(df)

Transforms the dataset

Parameters: df (DataFrame) – pandas dataframe of shape = (n, n_features). The dataset with numerical features and no NA
Return type: DataFrame
Returns: pandas dataframe of shape = (n, n_features*(1-threshold)). The dataset with relevant features.

dfds_ds_toolbox.feature_selection.feat_selector.rf_prim_columns(X, y, n_trees=10, top_cols=10)

DEPRECATED: Use sklearn.feature_selection.SelectFromModel instead.

Returns a dictionary counting the number of times each column appears among the top_cols most significant columns in each of the ten Random Forests applied to the data set. Also returns a list of the mean absolute error (MAE) values, for further validation of the ability to predict the data set.

Parameters

X (DataFrame) –
y (Series) –
n_trees (int) –
top_cols (int) –

Return type

Tuple[DataFrame, List[float]]

Returns

missing
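As a rough illustration of the recommended replacement, the sketch below keeps the top_cols most important features of a single random forest with scikit-learn's SelectFromModel. It is not equivalent to rf_prim_columns: it fits one forest rather than counting across several, and it reports no MAE values. The synthetic data and parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression data (illustrative): 20 features, 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

top_cols = 10  # number of features to keep, mirroring the top_cols argument
forest = RandomForestRegressor(n_estimators=50, random_state=0)

# threshold=-np.inf disables the importance cut-off, so max_features alone
# decides how many features are kept (the top_cols most important ones).
selector = SelectFromModel(forest, threshold=-np.inf, max_features=top_cols)
selector.fit(X, y)

selected_idx = np.flatnonzero(selector.get_support())  # indices of kept columns
X_selected = selector.transform(X)
print(X_selected.shape)  # (200, 10)
```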

dfds_ds_toolbox.feature_selection.feat_selector.stepwise_selection(X, y, initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=False)

DEPRECATED: Use sklearn.feature_selection.SequentialFeatureSelector instead.

Performs forward-backward feature selection.

Based on p-values from statsmodels.api.OLS.

Parameters

X (DataFrame) – DataFrame with candidate features
y (array) – list-like with the target
initial_list (Optional[List[str], None]) – list of features to start with (column names of X)
threshold_in (float) – include a feature if its p-value < threshold_in
threshold_out (float) – exclude a feature if its p-value > threshold_out
verbose (bool) – whether to print the sequence of inclusions and exclusions

Return type

List[str]

Returns

list of selected features

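Always set threshold_in < threshold_out to avoid infinite looping: otherwise a feature whose p-value lies between the two thresholds could be added and removed repeatedly. See https://en.wikipedia.org/wiki/Stepwise_regression for details.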
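For the recommended replacement, a minimal sketch with scikit-learn's SequentialFeatureSelector might look as follows. Note that it is not a drop-in equivalent: it selects features by cross-validated score rather than by OLS p-values, and each run goes in a single direction (forward or backward). The synthetic data, the estimator, and the number of features to select are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data (illustrative): 8 features, 3 of them informative.
X, y = make_regression(n_samples=150, n_features=8, n_informative=3, random_state=0)

# Greedy forward selection: at each step, add the feature that most improves
# the cross-validated score of the linear model, until 3 features are chosen.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
selector.fit(X, y)

mask = selector.get_support()       # boolean mask over the 8 columns
X_selected = selector.transform(X)  # only the selected columns
print(X_selected.shape)  # (150, 3)
```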