SmartDrift Object

class eurybia.core.smartdrift.SmartDrift(df_current=None, df_baseline=None, dataset_names={'df_baseline': 'Baseline dataset', 'df_current': 'Current dataset'}, deployed_model=None, encoding=None, palette_name='eurybia', colors_dict=None)[source]

Bases: object

The SmartDrift class is the main object for computing drift in the Eurybia library. It calculates data drift between two datasets using a data drift classification model.

df_current

current (or production) dataset which is compared to df_baseline

Type

pandas.DataFrame

df_baseline

baseline (or learning) dataset which is compared to df_current

Type

pandas.DataFrame

datadrift_classifier

model used for binary classification of data drift

Type

model object

xpl

object used to compute explainability on datadrift_classifier

Type

Shapash object

df_predict

Scores computed on both datasets if a deployed_model is specified

Type

pandas.DataFrame

feature_importance

feature importance of the datadrift_classifier, and feature importance of the production model if one exists

Type

pandas.DataFrame

pb_cols

Dictionary referencing column differences between df_current and df_baseline

Type

dict

err_mods

Dictionary referencing modality differences within columns between df_current and df_baseline

Type

dict

auc

AUC value of the data drift classification model

Type

float
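As a rough reading aid (the thresholds below are arbitrary assumptions for illustration, not part of Eurybia): an AUC near 0.5 means the classifier cannot tell the two datasets apart (little or no drift), while an AUC approaching 1 means it separates them easily (strong drift).

```python
def drift_severity(auc):
    """Illustrative reading of the data drift classifier AUC.
    The cut-offs are arbitrary assumptions, not Eurybia defaults."""
    if auc < 0.6:
        return "low"       # datasets look nearly indistinguishable
    if auc < 0.75:
        return "moderate"  # some separation between the two datasets
    return "high"          # classifier separates the datasets easily

print(drift_severity(0.52))  # low
print(drift_severity(0.91))  # high
```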

historical_auc

DataFrame containing the AUC history of the datadrift_classifier over time

Type

pandas.DataFrame

data_modeldrift

DataFrame containing the performance history of the deployed_model

Type

pandas.DataFrame

ignore_cols

list of features to ignore during computation

Type

list

dataset_names

Dictionary used to specify the dataset names displayed in the report.

Type

dict, (Optional)

_df_concat

DataFrame composed of df_baseline and df_current concatenated

Type

pandas.DataFrame

plot

Instance of the Eurybia SmartPlotter class. It is used for plotting purposes.

Type

eurybia.core.smartplotter.SmartPlotter

deployed_model

production model, used to put drift into perspective and to compute predictions

Type

model object, optional

encoding

Preprocessing used before the training step

Type

preprocessing object, optional (default: None)

datadrift_stat_test

Data drift statistical tests for each feature. Each test identifies whether the feature has drifted. Two types of test are implemented, depending on the type of feature:
- Chi-square for discrete variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
- Kolmogorov-Smirnov for continuous variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)
For each feature, this attribute specifies the test performed, the test statistic and the p-value.

Type

dict
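A sketch of how one might inspect this attribute after compiling (the dictionary layout and field names below are assumptions for illustration only; check the actual attribute on a compiled SmartDrift object):

```python
# Hypothetical shape of the datadrift_stat_test attribute:
# one entry per feature, with the test name, statistic and p-value.
datadrift_stat_test = {
    "age":    {"testname": "K-Smirnov", "statistic": 0.12, "pvalue": 0.001},
    "gender": {"testname": "Chi2 test", "statistic": 1.70, "pvalue": 0.300},
}

# Flag features whose test rejects the "no drift" hypothesis at the 5% level.
drifted = [feat for feat, res in datadrift_stat_test.items()
           if res["pvalue"] < 0.05]
print(drifted)  # ['age']
```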

palette_name

Name of the palette used for the colors of the report (refer to style folder).

Type

str (default: ‘eurybia’)

colors_dict

Dict of the colors used in the different plots

Type

dict

datadrift_file

Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged

Type

str, optional

js_divergence

Jensen-Shannon divergence of probability distributions - ref: (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)

Type

float
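For intuition, the Jensen-Shannon divergence between two discrete distributions can be computed in a few lines of plain Python (an illustrative sketch using base-2 logarithms, which bounds the result to [0, 1]; Eurybia's internal computation may differ):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete probability
    distributions p and q (base-2 logs, so the result lies in [0, 1])."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions have zero divergence;
# disjoint distributions reach the maximum of 1.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # → 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```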

How to declare a new SmartDrift object?

Example

>>> SD = SmartDrift(df_current=df_production, df_baseline=df_learning)
add_data_modeldrift(dataset, metric='performance', reference_columns=[], year_col='annee', month_col='mois')[source]

Once this method has been called, the report will display several plots built from the given dataframe to analyse the drift of the deployed model. Each plot represents one possible computed metric, aggregated by its groups (date (year-month), reference_columns).

Parameters
  • dataset (pd.DataFrame) – The DataFrame with all the computed metrics.

  • metric (str, (default: 'performance')) – The column name of the metric computed

  • reference_columns (list, (default: [])) – The column names to use for aggregation alongside the computed date

  • year_col (str, (default: 'annee')) – The column name of the year in which the metric was computed

  • month_col (str, (default: 'mois')) – The column name of the month in which the metric was computed
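To convey the aggregation described above, here is a plain-Python sketch of grouping a metric by (year, month); Eurybia itself operates on a pandas DataFrame, so this is illustrative only:

```python
from collections import defaultdict

# Hypothetical performance rows, mimicking the expected dataframe layout:
# one metric value per record, with 'annee'/'mois' date columns.
rows = [
    {"annee": 2023, "mois": 1, "performance": 0.81},
    {"annee": 2023, "mois": 1, "performance": 0.79},
    {"annee": 2023, "mois": 2, "performance": 0.75},
]

# Aggregate the metric by date group (year, month), as the report plots do.
groups = defaultdict(list)
for r in rows:
    groups[(r["annee"], r["mois"])].append(r["performance"])
monthly = {k: sum(v) / len(v) for k, v in groups.items()}
print(monthly)
```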

compile(full_validation=False, ignore_cols: list = [], sampling=True, sample_size=100000, datadrift_file=None, date_compile_auc=None, hyperparameter: Dict = {'custom_loss': ['Accuracy', 'AUC', 'Logloss'], 'eval_metric': 'Logloss', 'iterations': 250, 'l2_leaf_reg': 19, 'learning_rate': 0.166905, 'loss_function': 'Logloss', 'max_depth': 8, 'use_best_model': True}, attr_importance='feature_importances_')[source]

The compile method is the first step in computing data drift. It calculates data drift between two datasets using a data drift classification model. Most of the parameters are optional but help adapt the data drift calculation if necessary. This step can take a few moments with large datasets.

Parameters
  • full_validation (bool, optional (default: False)) – If True, analyzes the consistency of modalities between columns

  • ignore_cols (list, optional) – List of features to ignore during computation

  • sampling (bool, optional) – If True, applies sampling

  • sample_size (int, optional) – The size of the sample to build

  • date_compile_auc (str (optional)) – Date in dd/mm/yyyy format specifying the date of the drift computation; useful when computing drift several times for different dates at the same moment

  • hyperparameter (dict, optional) – Dictionary to override the default CatBoost hyperparameters

  • attr_importance (string, optional (default: "feature_importances_")) – Name of the feature importance attribute of the deployed_model

  • datadrift_file (str, optional) – Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged

Examples

>>> SD.compile()
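The hyperparameter argument overrides the defaults shown in the compile signature. A dict merge like the following conveys the idea (an illustrative sketch, not Eurybia's actual implementation):

```python
# Default CatBoost hyperparameters, taken from the compile() signature.
defaults = {
    "iterations": 250, "learning_rate": 0.166905, "max_depth": 8,
    "l2_leaf_reg": 19, "loss_function": "Logloss",
    "eval_metric": "Logloss", "use_best_model": True,
}

# A user-supplied override, as one might pass via compile(hyperparameter=...).
user = {"iterations": 500, "max_depth": 6}

# Later keys win, so user values override the defaults
# while unspecified keys keep their default values.
params = {**defaults, **user}
print(params["iterations"], params["max_depth"])  # 500 6
```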
generate_report(output_file, project_info_file=None, title_story='Drift Report', title_description='', working_dir=None)[source]

This method will generate an HTML report containing different information about the project. It renders the compiled information. It can be associated with a project info YAML file containing various information about the project.

Parameters
  • output_file (str) – Path to the HTML file to write

  • project_info_file (str) – Path to the file used to display some information about the project in the report

  • title_story (str, optional) – Report title

  • title_description (str, optional) – Report title description (as written just below the title)

  • working_dir (str, optional) – Working directory in which the notebook used to create the report will be generated, and where the objects used to execute it will be saved. This parameter can be useful if one wants to create a custom report and debug the notebook used to generate the HTML report. If None, a temporary directory will be used

Examples

>>> SD.generate_report(
        output_file='report.html',
        project_info_file='project_info.yml',
        title_story="Drift project report",
        title_description="This document is a drift report of the score in production"
    )