dfds_ds_toolbox.analysis.plotting module

dfds_ds_toolbox.analysis.plotting.get_trend_stats(data, target_col, features_list=None, bins=10, data_test=None)

Calculates trend changes for a list of features and, if test data is passed, the trend correlation between train and test.

Parameters

data (DataFrame) – dataframe containing features and target columns
target_col (str) – target column name
features_list (Optional[List[str], None]) – by default computes statistics for all features. If a list is passed, computes statistics for only those features.
bins (int) – number of bins to create from each continuous feature
data_test (Optional[DataFrame, None]) – test data to compare with the input data when computing trend correlation

Return type

DataFrame

Returns

dataframe with trend changes and trend correlation (if test data passed)
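
A minimal sketch of what a trend change and a trend correlation could look like, assuming a "trend change" is a flip in the direction of the binned target mean and "trend correlation" is the Pearson correlation between train and test bin means. This is not the toolbox's implementation; the helper names binned_target_means, count_trend_changes and pearson are hypothetical, and the test bin means are made-up illustration values.

```python
from math import sqrt
from statistics import mean


def binned_target_means(feature, target, bins=10):
    """Mean target value per equal-count bin of a feature (hypothetical helper)."""
    pairs = sorted(zip(feature, target))
    n = len(pairs)
    means = []
    for b in range(bins):
        chunk = pairs[b * n // bins:(b + 1) * n // bins]
        if chunk:  # skip empty bins when there are fewer rows than bins
            means.append(mean(t for _, t in chunk))
    return means


def count_trend_changes(means):
    """Count how often the bin-to-bin direction of the target mean flips."""
    diffs = [b - a for a, b in zip(means, means[1:]) if b != a]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if (d1 > 0) != (d2 > 0))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy train data: the target mean goes up, down, then up again across 4 bins.
feature = list(range(12))
target = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
train_means = binned_target_means(feature, target, bins=4)
test_means = [0.1, 0.9, 0.2, 0.8]  # made-up bin means for a test set

changes = count_trend_changes(train_means)
corr = pearson(train_means, test_means)
```

A feature with many trend changes, or a low train/test trend correlation, is a candidate for closer inspection.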

dfds_ds_toolbox.analysis.plotting.plot_classification_proba_histogram(y_true, y_pred, ax=None)

Plot histogram of predictions for binary classifiers.

Parameters

y_true (Sequence[int]) – 1D array of binary target values, 0 or 1.
y_pred (Sequence[float]) – 1D array of predicted target values, probability of class 1.
ax (Optional[Axes, None]) – Optional pre-existing axis to plot on

Return type

Figure
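
As a rough illustration of the data behind such a histogram, the sketch below counts predictions per probability bin, separately for each true class. The helper name proba_histograms and the n_bins argument are assumptions for this example; the documented function takes no bin count.

```python
def proba_histograms(y_true, y_pred, n_bins=10):
    """Per-class counts of predicted probabilities in equal-width bins on [0, 1]."""
    counts = {0: [0] * n_bins, 1: [0] * n_bins}
    for y, p in zip(y_true, y_pred):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[y][idx] += 1
    return counts


y_true = [0, 0, 1, 1]
y_pred = [0.05, 0.15, 0.95, 1.0]
hist = proba_histograms(y_true, y_pred)
```

For a well-separating classifier, class 0 counts concentrate in the low bins and class 1 counts in the high bins.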

dfds_ds_toolbox.analysis.plotting.plot_gain_chart(y_true, y_pred, n_bins=10, ax=None)

The cumulative gains chart shows the percentage of the overall number of cases in a given category “gained” by targeting a percentage of the total number of cases.

Parameters

y_true (Sequence[int]) – array with observed values, either 0 or 1.
y_pred (Sequence[float]) – array with predicted probabilities, float between 0 and 1.
n_bins (int) – number of bins to use
ax (Optional[Axes, None]) – Optional pre-existing axis to plot on

Return type

Figure

Returns

matplotlib Figure
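
The quantity plotted in a cumulative gains chart can be sketched as follows. This is an illustrative computation under the assumption that cases are ranked by descending predicted probability and cut into equal-count bins; it is not taken from the toolbox source, and cumulative_gains is a hypothetical name.

```python
def cumulative_gains(y_true, y_pred, n_bins=10):
    """Share of all positives captured in the top b/n_bins of ranked cases."""
    # Rank cases from highest to lowest predicted probability.
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked = [y_true[i] for i in order]
    total_pos = sum(ranked)  # assumes at least one positive case
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = b * n // n_bins
        gains.append(sum(ranked[:cutoff]) / total_pos)
    return gains


y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gains = cumulative_gains(y_true, y_pred, n_bins=5)
```

Here the top 20% of cases already contain two of the three positives, and the top 40% contain all of them.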

dfds_ds_toolbox.analysis.plotting.plot_lift_curve(y_true, y_pred, n_bins=10, ax=None)

Plot lift curve, i.e. how much better than the base rate the model is at different thresholds.

A lift of 1 corresponds to predicting the base rate for the whole sample.

Parameters

y_true (Sequence[int]) – array with observed values, either 0 or 1.
y_pred (Sequence[float]) – array with predicted probabilities, float between 0 and 1.
n_bins (int) – number of bins to use
ax (Optional[Axes, None]) – Optional pre-existing axis to plot on

Return type

Figure

Returns

matplotlib Figure
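
A minimal sketch of a lift computation, assuming cumulative lift: the positive rate among the top-ranked cases divided by the overall base rate. Whether the toolbox plots cumulative or per-bin lift is not stated here, so treat this as one plausible reading; cumulative_lift is a hypothetical name.

```python
def cumulative_lift(y_true, y_pred, n_bins=10):
    """Positive rate in the top b/n_bins of ranked cases, relative to the base rate."""
    # Rank cases from highest to lowest predicted probability.
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked = [y_true[i] for i in order]
    n = len(ranked)  # assumes n >= n_bins and at least one positive case
    base_rate = sum(ranked) / n
    lifts = []
    for b in range(1, n_bins + 1):
        top = ranked[:b * n // n_bins]
        lifts.append((sum(top) / len(top)) / base_rate)
    return lifts


y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lifts = cumulative_lift(y_true, y_pred, n_bins=5)
```

With a base rate of 0.3, the top 20% of cases are all positive, giving a lift of about 3.3; over the whole sample the lift falls back to 1.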

dfds_ds_toolbox.analysis.plotting.plot_regression_predicted_vs_actual(y_true, y_pred, alpha=0.2, ax=None)

Scatter plot of the predicted vs true targets for regression problems.

Parameters

y_true (Sequence[float]) – array with observed values
y_pred (Sequence[float]) – array with predicted values
alpha (float) – transparency of the dots on the scatter plot
ax (Optional[Axes, None]) – Optional pre-existing axis to plot on

Return type

Figure

Returns

matplotlib Figure

dfds_ds_toolbox.analysis.plotting.plot_roc_curve(y_true, y_pred, label='Train', ax=None)

Plot the ROC curve for one set of predictions, e.g. train or test.

Parameters

y_true (Sequence[int]) – array with observed classes
y_pred (Sequence[float]) – array with predicted probabilities
label (str) – extra text to add, e.g. “Train” or “Test”
ax (Optional[Axes, None]) – Optional pre-existing axis to plot on

Return type

Figure

Returns

matplotlib Figure
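
For reference, the points of a ROC curve and the area under it can be sketched in plain Python as below. This illustrates the general technique only and is not the toolbox's implementation; roc_points and auc are hypothetical names.

```python
def roc_points(y_true, y_pred):
    """(false positive rate, true positive rate) pairs, one per distinct threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos  # assumes both classes are present
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a case counts as positive if p >= threshold.
    for t in sorted(set(y_pred), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_pred) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_pred) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0]
y_pred = [0.9, 0.8, 0.7, 0.6, 0.1]
points = roc_points(y_true, y_pred)
area = auc(points)
```

The area equals the fraction of (positive, negative) pairs that the model ranks correctly, here 5 out of 6.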

dfds_ds_toolbox.analysis.plotting.plot_univariate_dependencies(data, target_col, features_list=None, bins=10, data_test=None)

Creates univariate dependence plots for features in the dataset.

Parameters

data (DataFrame) – dataframe containing features and target columns
target_col (str) – target column name
features_list (Optional[List[str], None]) – by default creates plots for all features. If a list is passed, creates plots for only those features.
bins (int) – number of bins to create from each continuous feature
data_test (Optional[DataFrame, None]) – test data to compare with the input data for correlation

Returns

Draws univariate plots for all columns in data, or only for those in features_list if it is passed