SmartDrift Object

class eurybia.core.smartdrift.SmartDrift(df_current=None, df_baseline=None, dataset_names={'df_baseline': 'Baseline dataset', 'df_current': 'Current dataset'}, deployed_model=None, encoding=None, palette_name='eurybia', colors_dict=None)

Bases: object

The SmartDrift class is the main object used to compute drift in the Eurybia library. It calculates data drift between two datasets using a data drift classification model.
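The idea behind a data drift classification model can be sketched without Eurybia: baseline rows are labelled 0, current rows are labelled 1, and a classifier tries to tell them apart. An AUC near 0.5 means the two datasets are hard to distinguish, while an AUC near 1 means they differ clearly. The sketch below is an illustration under assumptions, not Eurybia's implementation: it uses a single numeric feature directly as the classifier score, and `rank_auc` and the sample values are hypothetical.

```python
def rank_auc(scores_negative, scores_positive):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical feature values; baseline rows are class 0, current rows class 1.
baseline_age = [30, 35, 40, 45, 50]
same_distribution = [30, 35, 40, 45, 50]
shifted_distribution = [60, 65, 70, 75, 80]

auc_no_drift = rank_auc(baseline_age, same_distribution)   # indistinguishable datasets
auc_drift = rank_auc(baseline_age, shifted_distribution)   # fully separable datasets
```

Here `auc_no_drift` sits at chance level and `auc_drift` at the maximum, which is the scale on which a drift AUC is usually read.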

df_current

current (or production) dataset which is compared to df_baseline

Type: pandas.DataFrame

df_baseline

baseline (or learning) dataset which is compared to df_current

Type: pandas.DataFrame

datadrift_classifier

model used for binary classification of data drift

Type: model object

xpl

object used to compute explainability on datadrift_classifier

Type: Shapash object

df_predict

computed score on both datasets if a deployed_model is specified

Type: pandas.DataFrame

feature_importance

Feature importance of datadrift_classifier, and feature importance of the production model if one exists

Type: pandas.DataFrame

pb_cols

Dictionary that lists the column differences between df_current and df_baseline

Type: dict

err_mods

Dictionary that lists the modality differences in columns between df_current and df_baseline

Type: dict

auc

AUC of the data drift classification model

Type: int

historical_auc

DataFrame that contains the AUC history of datadrift_classifier over time

Type: pandas.DataFrame

data_modeldrift

Dataframe that contains performance history of deployed_model

Type: pandas.DataFrame

ignore_cols

List of features to ignore in the computation

Type: list

dataset_names

Dictionary used to specify the dataset names to display in the report.

Type: dict, optional

_df_concat

Dataframe that’s composed of both df_baseline and df_current concatenated

Type: pandas.DataFrame

plot

Instance of the Eurybia SmartPlotter class, used to display graphs.

Type: eurybia.core.smartplotter.SmartPlotter

deployed_model

Model in production, used to put drift in perspective and to compute predictions

Type: model object, optional

encoding

Preprocessing used before the training step

Type: preprocessing object, optional (default: None)

datadrift_stat_test

Data drift statistical tests for each feature. Each test identifies whether the feature has drifted. Two tests are implemented, depending on the feature type:

- Chi-square for discrete variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
- Kolmogorov-Smirnov for continuous variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)

For each feature, the datadrift_stat_test attribute gives the test performed, the test statistic, and the p-value.

Type: dict
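To make the two statistics concrete, here is a minimal pure-Python sketch of how a per-feature result could be assembled. It is an illustration under assumptions, not Eurybia's code: the helper names, the dictionary keys, and the rule used to pick a test are hypothetical, and p-values are omitted. Results may also differ from SciPy, since `scipy.stats.chi2_contingency` applies a continuity correction to 2x2 tables by default.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    sorted_a, sorted_b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for value in sorted_a + sorted_b:
        cdf_a = bisect_right(sorted_a, value) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, value) / len(sorted_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def chi_square_statistic(sample_a, sample_b):
    """Pearson chi-square statistic of the (dataset x category) contingency table."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    total = len(sample_a) + len(sample_b)
    statistic = 0.0
    for category in set(counts_a) | set(counts_b):
        column_total = counts_a[category] + counts_b[category]
        for observed, row_total in ((counts_a[category], len(sample_a)),
                                    (counts_b[category], len(sample_b))):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def stat_test(baseline_values, current_values):
    """Pick the test from the value types (an assumed rule) and return its statistic."""
    numeric = all(isinstance(v, (int, float)) for v in baseline_values + current_values)
    if numeric:
        return {"testname": "Kolmogorov-Smirnov",
                "statistic": ks_statistic(baseline_values, current_values)}
    return {"testname": "Chi-square",
            "statistic": chi_square_statistic(baseline_values, current_values)}


# Hypothetical columns: "age" has shifted completely, "city" is unchanged.
baseline = {"age": [20, 30, 40, 50], "city": ["Paris", "Paris", "Lyon", "Lyon"]}
current = {"age": [60, 70, 80, 90], "city": ["Paris", "Paris", "Lyon", "Lyon"]}

report = {column: stat_test(baseline[column], current[column]) for column in baseline}
```

In this toy data the continuous feature reaches the largest possible Kolmogorov-Smirnov statistic, while the discrete feature, whose category counts are identical in both datasets, has a chi-square statistic of zero.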

palette_name

Name of the palette used for the colors of the report (refer to style folder).

Type: str (default: ‘eurybia’)

colors_dict

Dict of the colors used in the different plots

Type: dict

datadrift_file

Name of the CSV file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged.

Type: str, optional

js_divergence

Jensen-Shannon divergence of probability distributions - ref: (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)

Type: float
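As a reminder of what the Jensen-Shannon divergence measures, the sketch below computes it for two discrete probability distributions. It is a minimal illustration, not Eurybia's implementation: the function names are hypothetical, the base-2 logarithm (which bounds the result between 0 and 1) is an assumption, and which distributions Eurybia compares is not shown here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the mixture (p + q) / 2."""
    mixture = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


identical = jensen_shannon([0.5, 0.5], [0.5, 0.5])   # same distribution
disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])    # no overlap at all
```

Identical distributions give the minimum value and fully disjoint ones the maximum, so the divergence can be read as a bounded distance-like score between two distributions.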

How to declare a new SmartDrift object?

Example

>>> from eurybia.core.smartdrift import SmartDrift
>>> SD = SmartDrift(df_current=df_production, df_baseline=df_learning)

add_data_modeldrift(dataset, metric='performance', reference_columns=[], year_col='annee', month_col='mois')

When model drift data is provided, the report displays several plots built from a DataFrame to analyse the drift of the deployed model. Each plot represents one computed metric by group, where groups are defined by date (year-month) and reference_columns.

Parameters

dataset (pd.DataFrame) – The DataFrame with all the computed metrics.
metric (str, (default: 'performance')) – The column name of the computed metric
reference_columns (list, (default: [])) – The column names used for aggregation, in addition to the computed date
year_col (str, (default: 'annee')) – The column name of the year in which the metric was computed
month_col (str, (default: 'mois')) – The column name of the month in which the metric was computed
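The grouping described above (by year-month date, then by reference columns) can be sketched in plain Python. This is only an illustration of the expected data layout, not what Eurybia does internally: `group_metric`, the `segment` column, and the sample values are hypothetical, and rows are plain dictionaries rather than a DataFrame.

```python
from collections import defaultdict

# Hypothetical performance history; column names match the method defaults.
rows = [
    {"annee": 2022, "mois": 1, "segment": "A", "performance": 0.80},
    {"annee": 2022, "mois": 1, "segment": "A", "performance": 0.90},
    {"annee": 2022, "mois": 2, "segment": "A", "performance": 0.70},
    {"annee": 2022, "mois": 1, "segment": "B", "performance": 0.60},
]


def group_metric(rows, metric="performance", reference_columns=(),
                 year_col="annee", month_col="mois"):
    """Collect the metric values of each (year-month, *reference_columns) group."""
    groups = defaultdict(list)
    for row in rows:
        date_key = f"{row[year_col]}-{row[month_col]:02d}"
        key = (date_key, *(row[column] for column in reference_columns))
        groups[key].append(row[metric])
    return dict(groups)


history = group_metric(rows, reference_columns=["segment"])
```

With a pandas DataFrame holding the same columns, the corresponding call would be `SD.add_data_modeldrift(dataset=df_performance, metric='performance', reference_columns=['segment'])`, where `df_performance` and `segment` are hypothetical names.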

compile(full_validation=False, ignore_cols: list = [], sampling=True, sample_size=100000, datadrift_file=None, date_compile_auc=None, hyperparameter: Dict = {'custom_loss': ['Accuracy', 'AUC', 'Logloss'], 'eval_metric': 'Logloss', 'iterations': 250, 'l2_leaf_reg': 19, 'learning_rate': 0.166905, 'loss_function': 'Logloss', 'max_depth': 8, 'use_best_model': True}, attr_importance='feature_importances_')

The compile method is the first step in computing data drift. It calculates data drift between two datasets using a data drift classification model. Most of the parameters are optional and help adapt the data drift calculation if necessary. This step can take some time with large datasets.

Parameters

full_validation (bool, optional (default: False)) – If True, also analyzes the consistency of modalities between columns
ignore_cols (list, optional) – List of features to ignore in the computation
sampling (bool, optional) – If True, applies sampling
sample_size (int, optional) – The size of the sample to build
date_compile_auc (str, optional) – Date, in dd/mm/yyyy format, to record as the date of the drift computation. Useful when computing drift for several periods in a single run
hyperparameter (dict, optional) – CatBoost hyperparameters, for users who want to change the defaults
attr_importance (string, optional (default: "feature_importances_")) – Name of the deployed_model attribute that holds its feature importances
datadrift_file (str, optional) – Name of the CSV file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged

Examples

>>> SD.compile()
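A call with non-default arguments can be prepared as a keyword dictionary. The sketch below is illustrative: the default hyperparameters are copied from the signature above, while the overridden values, the column name `customer_id`, and the file name `datadrift_auc.csv` are hypothetical.

```python
from datetime import date

# Default CatBoost hyperparameters, copied from the compile signature.
default_hyperparameter = {
    "custom_loss": ["Accuracy", "AUC", "Logloss"],
    "eval_metric": "Logloss",
    "iterations": 250,
    "l2_leaf_reg": 19,
    "learning_rate": 0.166905,
    "loss_function": "Logloss",
    "max_depth": 8,
    "use_best_model": True,
}

# Pass a complete dict: whether compile merges a partial dict with the
# defaults is not stated, so this sketch does not rely on it.
hyperparameter = {**default_hyperparameter, "iterations": 100, "max_depth": 6}

compile_kwargs = {
    "full_validation": True,
    "ignore_cols": ["customer_id"],         # hypothetical column name
    "sample_size": 50_000,
    "date_compile_auc": date(2022, 1, 31).strftime("%d/%m/%Y"),  # dd/mm/yyyy
    "datadrift_file": "datadrift_auc.csv",  # hypothetical file name
    "hyperparameter": hyperparameter,
}

# With a SmartDrift instance named SD: SD.compile(**compile_kwargs)
```

Building the date string with `strftime("%d/%m/%Y")` avoids formatting mistakes when drift is computed for several past periods in a loop.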

generate_report(output_file, project_info_file=None, title_story='Drift Report', title_description='', working_dir=None)

This method generates an HTML report that renders the compiled information. It can be combined with a project info YAML file that holds information about the project to display in the report.

Parameters

output_file (str) – Path to the HTML file to write
project_info_file (str, optional) – Path to the file used to display some information about the project in the report
title_story (str, optional) – Report title
title_description (str, optional) – Report title description (as written just below the title)
working_dir (str, optional) – Working directory in which the notebook used to create the report is generated and where the objects used to execute it are saved. This parameter can be useful for creating a custom report or debugging the notebook used to generate the HTML report. If None, a temporary directory is used

Examples

>>> SD.generate_report(
        output_file='report.html',
        project_info_file='project_info.yml',
        title_story="Drift project report",
        title_description="This document is a drift report of the score in production"
    )