SmartDrift Object

class eurybia.core.smartdrift.SmartDrift(df_current: pandas.core.frame.DataFrame, df_baseline: pandas.core.frame.DataFrame, dataset_names: tuple[str, str] = ('Current', 'Baseline'), deployed_model: Optional[Any] = None, encoding: Optional[Any] = None, palette_name: str = 'eurybia', colors_dict: Optional[dict] = None)[source]

Bases: object

The SmartDrift class is the main object for computing drift in the Eurybia library. It measures data drift between two datasets using a data drift classification model.

df_current: pandas.DataFrame

current (or production) dataset which is compared to df_baseline

df_baseline: pandas.DataFrame

baseline (or learning) dataset which is compared to df_current

datadrift_classifier: model object

model used for binary classification of data drift

xpl: Shapash object

object used to compute explainability on datadrift_classifier

df_predict: pandas.DataFrame

computed score on both datasets if a deployed_model is specified

feature_importance: pandas.DataFrame

feature importance of datadrift_classifier and, if a production model exists, feature importance of that model

pb_cols: dict

Dictionary that lists the column differences between df_current and df_baseline

err_mods: dict

Dictionary that lists the differences in modalities, column by column, between df_current and df_baseline

auc: float

AUC of datadrift_classifier
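
To make the meaning of this value concrete, here is a minimal pure-Python sketch; it is not Eurybia's implementation. It assumes the drift classifier scores each row with the probability that the row comes from the current dataset, and computes the AUC as the share of (baseline, current) pairs ranked in the right order. The function name pairwise_auc and the scores are hypothetical.

```python
def pairwise_auc(scores_baseline, scores_current):
    """AUC as the fraction of (baseline, current) pairs ranked correctly.

    Ties count for half a pair. The double loop is O(n*m), which is fine
    for an illustration but not for large datasets.
    """
    wins = 0.0
    for b in scores_baseline:
        for c in scores_current:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(scores_baseline) * len(scores_current))


# Hypothetical classifier scores for rows of each dataset.
no_drift = pairwise_auc([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])
strong_drift = pairwise_auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
print(no_drift, strong_drift)
```

An AUC close to 0.5 means the classifier cannot tell the two datasets apart; a value approaching 1 means they are easy to distinguish.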

historical_auc: pandas.DataFrame

Dataframe that contains auc history of datadrift_classifier over time

data_modeldrift: pandas.DataFrame

Dataframe that contains performance history of deployed_model

ignore_cols: list

list of features to ignore in the computation

dataset_names: dict, optional

Dictionary that specifies the dataset names to display in the report.

df_concat: pandas.DataFrame

Dataframe that’s composed of both df_baseline and df_current concatenated

plot: eurybia.core.smartplotter.SmartPlotter

Instance of the Eurybia SmartPlotter class, used to display graphs.

deployed_model: model object, optional

production model, used to put the drift in perspective and to compute predictions

encoding: preprocessing object, optional (default: None)

Preprocessing used before the training step

datadrift_stat_test: dict

Datadrift statistical tests for each feature. Each test identifies whether the feature has drifted. Two types of test are implemented, depending on the type of feature:

  • Chi-square for discrete variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)

  • Kolmogorov-Smirnov for continuous variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)

For each feature, the datadrift_stat_test attribute specifies the test performed, the test statistic and the p-value.
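
As an illustration of what the two statistics measure, the sketch below computes them in pure Python for a pair of samples. It is not Eurybia's code: the references above point to scipy for the actual tests, and this sketch computes only the statistics, not the p-values. The function names ks_statistic and chi2_statistic are hypothetical, and the chi-square version applies no continuity correction.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def chi2_statistic(sample_a, sample_b):
    """Pearson chi-square statistic of the 2 x k table of category counts."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    total = n_a + n_b
    stat = 0.0
    for category in set(counts_a) | set(counts_b):
        column_total = counts_a[category] + counts_b[category]
        for observed, row_total in ((counts_a[category], n_a), (counts_b[category], n_b)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Continuous feature: half of the values overlap between the two samples.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
# Discrete feature: the two samples share no category.
print(chi2_statistic(["a"] * 10, ["b"] * 10))
```

Both statistics are 0 when the two samples have identical distributions and grow as the distributions move apart.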

palette_name: str (default: 'eurybia')

Name of the palette used for the colors of the report (refer to style folder).

colors_dict: dict

Dict of the colors used in the different plots

js_divergence: float

Jensen-Shannon divergence of probability distributions - ref: (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)
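
For readers unfamiliar with the measure, here is a minimal sketch of the Jensen-Shannon divergence between two discrete probability distributions. It follows the standard definition from the reference above and does not reproduce how Eurybia builds the distributions it compares. The base-2 logarithm is a choice made here so that the result lies between 0 and 1; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and stays finite even when one distribution assigns zero probability to an outcome.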

How to declare a new SmartDrift object?

>>> from eurybia import SmartDrift
>>> SD = SmartDrift(df_current=df_production, df_baseline=df_learning)
add_data_modeldrift(dataset: pandas.core.frame.DataFrame, metric: str = 'performance', reference_columns: Optional[list[str]] = None, year_col: str = 'annee', month_col: str = 'mois')[source]

When model drift data is provided through this method, the report displays plots built from the dataframe to analyse the drift of the deployed model. Each plot represents one computed metric for its groups (grouped by year-month date and by reference_columns).

Parameters
  • dataset (pd.DataFrame) – The dataframe with all the computed metrics.

  • metric (str, (default: 'performance')) – The column name of the metric computed

  • reference_columns (list, optional (default: None)) – The column names to use for aggregation, in addition to the computed date

  • year_col (str, (default: 'annee')) – The column name of the year where the metric has been computed

  • month_col (str, (default: 'mois')) – The column name of the month where the metric has been computed
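
The sketch below shows one way to prepare rows matching the default column names above ('annee', 'mois', 'performance'). It is an illustration under assumptions: the scored records are made up, accuracy is used as the performance metric only as an example, and the final call is left as a comment because it needs pandas and an existing SmartDrift instance named SD.

```python
from collections import defaultdict

# Hypothetical scored records: (year, month, true label, predicted label).
records = [
    (2023, 1, 1, 1), (2023, 1, 0, 0), (2023, 1, 1, 0), (2023, 1, 0, 0),
    (2023, 2, 1, 0), (2023, 2, 0, 1), (2023, 2, 1, 1), (2023, 2, 0, 0),
]

# Collect, for each (year, month), whether each prediction was correct.
hits = defaultdict(list)
for year, month, y_true, y_pred in records:
    hits[(year, month)].append(y_true == y_pred)

# One row per month, with accuracy stored under the default metric name.
rows = [
    {"annee": year, "mois": month, "performance": sum(ok) / len(ok)}
    for (year, month), ok in sorted(hits.items())
]
print(rows)

# Assuming pandas is imported as pd and SD is a SmartDrift instance:
# SD.add_data_modeldrift(dataset=pd.DataFrame(rows), metric="performance")
```

If the year and month columns have other names, they can be passed through year_col and month_col instead of renaming the columns.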

compile(full_validation: bool = False, ignore_cols: Optional[list[str]] = None, sampling: bool = True, sample_size: int = 100000, datadrift_file: Optional[str] = None, date_compile_auc: Optional[datetime.date] = None, hyperparameter: Optional[dict] = None, attr_importance: str = 'feature_importances_')[source]

The compile method is the first step in computing data drift. It measures data drift between two datasets using a data drift classification model. Most of the parameters are optional but help to adapt the data drift calculation if necessary. This step can take a while with large datasets.

Parameters
  • full_validation (bool, optional (default: False)) – If True, analyze consistency on modalities between columns

  • ignore_cols (list, optional) – list of features to ignore in the computation

  • sampling (bool, optional) – If True, applies the sampling

  • sample_size (int, optional) – the size of the sample to build

  • date_compile_auc (date, optional) – date to associate with the drift computation; useful when computing drift for several different dates in the same run

  • hyperparameter (dict, optional) – CatBoost hyperparameters, if the user wants to modify them

  • attr_importance (string, optional (default: "feature_importances_")) – Name of the deployed_model attribute that holds its feature importances

  • datadrift_file (str, optional) – Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged

Examples

>>> SD.compile()
generate_report(output_file: str, project_info_file: Optional[str] = None, title_story: str = 'Drift Report', title_description: str = '', working_dir: Optional[str] = None)[source]

This method generates an HTML report that renders the compiled information. It can be combined with a project info YAML file that provides additional information about the project.

Parameters
  • output_file (str) – Path to the HTML file to write

  • project_info_file (str, optional) – Path to the file used to display some information about the project in the report

  • title_story (str, optional) – Report title

  • title_description (str, optional) – Report title description (as written just below the title)

  • working_dir (str, optional) – Working directory in which the notebook used to create the report is generated and where the objects used to execute it are saved. This parameter can be useful for creating a custom report and debugging the notebook used to generate the HTML report. If None, a temporary directory is used

Examples

>>> SD.generate_report(
        output_file='report.html',
        project_info_file='project_info.yml',
        title_story="Drift project report",
        title_description="This document is a drift report of the score in production"
    )