dfds_ds_toolbox.feature_selection.feat_selector module
- class dfds_ds_toolbox.feature_selection.feat_selector.RegFeatureSelector(**kwargs)
Bases: object
Selects useful features.
Several strategies are possible (filter and wrapper methods). Works for regression problems only.
- strategy
default = “l1”. The strategy used to select features. Available strategies: “variance”, “l1”, “rf_feature_importance”, “rf_top_features”, “stepwise”.
- threshold
default = 0.3. The fraction of variables to discard according to the strategy. Must be between 0 and 1.
- fit(df_train, y_train)
Fits RegFeatureSelector.
- Parameters
df_train (DataFrame) – The train dataset with numerical features and no NA. With shape = (n_train, n_features).
y_train (Series) – The target for regression task. With shape = (n_train, ).
- Return type
- Returns
self
- fit_transform(df_train, y_train)
Fits RegFeatureSelector and transforms the dataset.
- Parameters
df_train (DataFrame) – The train dataset with numerical features and no NA. With shape = (n_train, n_features).
y_train (DataFrame) – The target for regression task. With shape = (n_train, ).
- Return type
DataFrame
- Returns
pandas dataframe of shape = (n_train, n_features*(1-threshold)). The train dataset with relevant features.
- transform(df)
Transforms the dataset
- Parameters
df (DataFrame) – pandas dataframe of shape = (n, n_features). The dataset with numerical features and no NA.
- Return type
DataFrame
- Returns
pandas dataframe of shape = (n, n_features*(1-threshold)). The dataset with relevant features.
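The relation between threshold and the width of the returned dataset can be sketched in plain Python. This is not the toolbox implementation: it assumes that the “variance” strategy ranks columns by variance and discards the lowest-ranked fraction, and that the number of discarded columns is rounded down. The helper name keep_by_variance and the sample data are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def keep_by_variance(columns: Dict[str, List[float]], threshold: float = 0.3) -> List[str]:
    """Return the column names left after dropping the lowest-variance fraction.

    Hypothetical helper: the ranking rule and the rounding are assumptions,
    not the documented behaviour of RegFeatureSelector.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    n_drop = int(len(columns) * threshold)  # assumed rounding: floor
    # Rank column names from lowest to highest variance.
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]))
    dropped = set(ranked[:n_drop])
    # Keep the surviving columns in their original order.
    return [name for name in columns if name not in dropped]


# Made-up data: four numerical columns with no missing values.
data = {
    "a": [1.0, 1.0, 1.0, 1.0],    # constant column
    "b": [0.0, 10.0, 0.0, 10.0],  # largest spread
    "c": [1.0, 1.1, 1.0, 1.1],    # nearly constant
    "d": [0.0, 5.0, 0.0, 5.0],    # moderate spread
}
kept = keep_by_variance(data, threshold=0.5)
print(kept)
```

With four columns and threshold = 0.5, two columns remain, which matches n_features*(1-threshold) from the description above.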
- dfds_ds_toolbox.feature_selection.feat_selector.rf_prim_columns(X, y, n_trees=10, top_cols=10)
DEPRECATED: Use sklearn.feature_selection.SelectFromModel instead.
Returns a dictionary counting the number of times each column appears among the top_cols most significant columns in each of the ten Random Forests applied to the data set. Also returns a list of the MAE values for further validation of the ability to predict the data set.
- Parameters
X (DataFrame) –
y (Series) –
n_trees (int) –
top_cols (int) –
- Return type
Tuple[DataFrame, List[float]]
- Returns
missing
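The counting step described above can be illustrated without fitting any model. The sketch below takes one importance mapping per forest and counts how often each column lands among the top_cols highest-ranked ones. The function name count_top_columns and the importance values are hypothetical and made up for illustration; how rf_prim_columns obtains its importances is not shown here.

```python
from collections import Counter
from typing import Dict, List


def count_top_columns(
    importances_per_forest: List[Dict[str, float]], top_cols: int = 10
) -> Dict[str, int]:
    """Count how often each column ranks among the top_cols most important ones."""
    counts: Counter = Counter()
    for importances in importances_per_forest:
        # Sort column names by importance, highest first.
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:top_cols])
    return dict(counts)


# Illustrative importances for three forests (made-up numbers, not real results).
forests = [
    {"age": 0.5, "price": 0.3, "size": 0.2},
    {"age": 0.4, "price": 0.1, "size": 0.5},
    {"age": 0.6, "price": 0.3, "size": 0.1},
]
counts = count_top_columns(forests, top_cols=2)
print(counts)
```

A column that appears in the top ranks of every forest is a more stable candidate than one that appears only once.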
- dfds_ds_toolbox.feature_selection.feat_selector.stepwise_selection(X, y, initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=False)
DEPRECATED: Use sklearn.feature_selection.SequentialFeatureSelector instead.
Performs a forward-backward feature selection based on p-values from statsmodels.api.OLS.
- Parameters
X (DataFrame) – DataFrame with candidate features
y (array) – list-like with the target
initial_list (Optional[List[str],None]) – list of features to start with (column names of X)
threshold_in (float) – include a feature if its p-value < threshold_in
threshold_out (float) – exclude a feature if its p-value > threshold_out
verbose (bool) – whether to print the sequence of inclusions and exclusions
- Return type
List[str]
- Returns
list of selected features
Always set threshold_in < threshold_out to avoid infinite looping. See https://en.wikipedia.org/wiki/Stepwise_regression for details.
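The control flow of a forward-backward selection, and the reason threshold_in must stay below threshold_out, can be sketched as follows. This is a simplified illustration, not the toolbox code: the p-values come from a pluggable callable (here a fixed lookup table standing in for an OLS fit), and the names stepwise_sketch, fixed_pvalues and max_steps are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Maps a list of feature names to a p-value per feature. In real use this
# would fit a regression model; here it is an assumed stand-in.
PValueFn = Callable[[List[str]], Dict[str, float]]


def stepwise_sketch(
    candidates: List[str],
    pvalues: PValueFn,
    initial_list: Optional[List[str]] = None,
    threshold_in: float = 0.01,
    threshold_out: float = 0.05,
    max_steps: int = 100,
) -> List[str]:
    included = list(initial_list or [])
    for _ in range(max_steps):
        changed = False
        # Forward step: add the excluded feature with the lowest p-value,
        # provided it is below threshold_in.
        excluded = [c for c in candidates if c not in included]
        trial = {c: pvalues(included + [c])[c] for c in excluded}
        if trial:
            best = min(trial, key=trial.get)
            if trial[best] < threshold_in:
                included.append(best)
                changed = True
        # Backward step: drop the included feature with the highest p-value,
        # provided it is above threshold_out.
        if included:
            fitted = pvalues(included)
            worst = max(included, key=fitted.get)
            if fitted[worst] > threshold_out:
                included.remove(worst)
                changed = True
        if not changed:
            return included
    raise RuntimeError("no convergence: check threshold_in < threshold_out")


# Fixed, made-up p-values used only to exercise the loop.
TABLE = {"x1": 0.001, "x2": 0.03, "x3": 0.5}


def fixed_pvalues(features: List[str]) -> Dict[str, float]:
    return {f: TABLE[f] for f in features}


selected = stepwise_sketch(list(TABLE), fixed_pvalues)
print(selected)
```

If the thresholds are swapped (threshold_in=0.05, threshold_out=0.01), the feature x2 with p-value 0.03 is added by the forward step and removed by the backward step on every pass, so the loop never settles. The max_steps guard in this sketch turns that into an error instead of an endless loop.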