SmartDrift Object¶
- class eurybia.core.smartdrift.SmartDrift(df_current=None, df_baseline=None, dataset_names={'df_baseline': 'Baseline dataset', 'df_current': 'Current dataset'}, deployed_model=None, encoding=None, palette_name='eurybia', colors_dict=None)[source]¶
Bases:
object
The SmartDrift class is the main object used to compute drift in the Eurybia library. It calculates data drift between two datasets using a data drift classification model.
- df_current¶
current (or production) dataset which is compared to df_baseline
- Type
pandas.DataFrame
- df_baseline¶
baseline (or learning) dataset which is compared to df_current
- Type
pandas.DataFrame
- datadrift_classifier¶
model used for binary classification of data drift
- Type
model object
- xpl¶
object used to compute explainability on datadrift_classifier
- Type
Shapash object
- df_predict¶
computed score on both datasets if a deployed_model is specified
- Type
pandas.DataFrame
- feature_importance¶
feature importance of datadrift_classifier and, if a deployed model exists, feature importance of the production model
- Type
pandas.DataFrame
- pb_cols¶
Dictionary that references column differences between df_current and df_baseline
- Type
dict
- err_mods¶
Dictionary that references differences in modalities within columns between df_current and df_baseline
- Type
dict
- auc¶
AUC value of the data drift classifier (datadrift_classifier)
- Type
float
- historical_auc¶
Dataframe that contains auc history of datadrift_classifier over time
- Type
pandas.DataFrame
- data_modeldrift¶
Dataframe that contains performance history of deployed_model
- Type
pandas.DataFrame
- ignore_cols¶
list of features to ignore during the computation
- Type
list
- dataset_names¶
Dictionary used to specify the dataset names to display in the report.
- Type
dict, (Optional)
- _df_concat¶
Dataframe that’s composed of both df_baseline and df_current concatenated
- Type
pandas.DataFrame
- plot¶
Instance of the Eurybia SmartPlotter class. It is used to display graphs.
- deployed_model¶
model in production, used to put drift into perspective and to compute predictions
- Type
model object, optional
- encoding¶
Preprocessing used before the training step
- Type
preprocessing object, optional (default: None)
- datadrift_stat_test¶
Datadrift statistical tests for each feature. Each test identifies whether the feature has drifted. Two types of test are implemented, depending on the type of feature:
- Chi-square for discrete variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
- Kolmogorov-Smirnov for continuous variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)
For each feature, the datadrift_stat_test attribute specifies the test performed, the test statistic and the p-value.
- Type
dict
- palette_name¶
Name of the palette used for the colors of the report (refer to style folder).
- Type
str (default: ‘eurybia’)
- colors_dict¶
Dict of the colors used in the different plots
- Type
dict
- datadrift_file¶
Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged.
- Type
str, optional
- js_divergence¶
Jensen-Shannon divergence of probability distributions - ref: (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)
- Type
float
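The Jensen-Shannon divergence referenced by js_divergence can be sketched in a few lines. This is only an illustration of the textbook definition on two discrete distributions; the function name, the base-2 logarithm and the inputs are assumptions, and the source does not state which distributions Eurybia compares internally.

```python
import math


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    Hypothetical helper written for illustration; it is not Eurybia's code.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # Normalise so that each input sums to 1.
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with zero mass contribute nothing.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With a base-2 logarithm the value lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes it convenient to read as a drift indicator.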
- How to declare a new SmartDrift object?
Example
>>> SD = SmartDrift(df_current=df_production, df_baseline=df_learning)
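The idea behind a data drift classification model can be illustrated without the library: label baseline rows 0 and current rows 1, score them, and check how well the score separates the two groups. The sketch below is a simplification under stated assumptions: it uses a single numeric feature directly as the score instead of a trained classifier, and all names and data are hypothetical.

```python
import random


def auc_from_scores(scores_baseline, scores_current):
    """Probability that a random current row scores above a random
    baseline row (ties count for one half). This equals the ROC AUC
    of a classifier that separates "current" from "baseline"."""
    wins = 0.0
    for current in scores_current:
        for baseline in scores_baseline:
            if current > baseline:
                wins += 1.0
            elif current == baseline:
                wins += 0.5
    return wins / (len(scores_current) * len(scores_baseline))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
baseline = [rng.gauss(0.0, 1.0) for _ in range(300)]
same_distribution = [rng.gauss(0.0, 1.0) for _ in range(300)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(300)]

auc_no_drift = auc_from_scores(baseline, same_distribution)
auc_drift = auc_from_scores(baseline, shifted)
```

An AUC close to 0.5 means the two datasets cannot be told apart, while an AUC close to 1 means they are easy to distinguish, which is the signal read as data drift.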
- add_data_modeldrift(dataset, metric='performance', reference_columns=[], year_col='annee', month_col='mois')[source]¶
When model drift data is specified, the report displays several plots, built from a dataframe, to analyse the model drift of the deployed model. Each plot represents one computed metric according to its groups (grouped by date (year-month) and reference_columns).
- Parameters
dataset (pd.DataFrame) – The Dataframe with all the computed metrics.
metric (str, (default: 'performance')) – The column name of the metric computed
reference_columns (list, (default: [])) – the column names to use for aggregation with the Date computed
year_col (str, (default: 'annee')) – The column name of the year where the metric has been computed
month_col (str, (default: 'mois')) – The column name of the month where the metric has been computed
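The grouping described above (by year-month date and by reference columns) can be pictured with plain Python records. This is only a sketch of the expected data layout, not the library's implementation: the "segment" column, the values and the helper name are hypothetical, and in practice the rows would be held in a pandas.DataFrame before being passed to add_data_modeldrift.

```python
from collections import defaultdict

# One row per already-computed metric value; "segment" is a hypothetical
# reference column, "annee" and "mois" are the default year/month names.
rows = [
    {"annee": 2022, "mois": 2, "segment": "A", "performance": 0.74},
    {"annee": 2022, "mois": 1, "segment": "A", "performance": 0.80},
    {"annee": 2022, "mois": 1, "segment": "B", "performance": 0.65},
]


def series_by_group(rows, metric="performance", reference_columns=(),
                    year_col="annee", month_col="mois"):
    """Build one chronological (date, metric) series per reference group."""
    series = defaultdict(list)
    for row in rows:
        group = tuple(row[col] for col in reference_columns)
        date = f"{row[year_col]:04d}-{row[month_col]:02d}"
        series[group].append((date, row[metric]))
    # Sort each series so points appear in chronological order.
    return {group: sorted(points) for group, points in series.items()}


curves = series_by_group(rows, reference_columns=["segment"])
```

Each key of curves corresponds to one curve that a plot could draw, with the year-month date on the horizontal axis.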
- compile(full_validation=False, ignore_cols: list = [], sampling=True, sample_size=100000, datadrift_file=None, date_compile_auc=None, hyperparameter: Dict = {'custom_loss': ['Accuracy', 'AUC', 'Logloss'], 'eval_metric': 'Logloss', 'iterations': 250, 'l2_leaf_reg': 19, 'learning_rate': 0.166905, 'loss_function': 'Logloss', 'max_depth': 8, 'use_best_model': True}, attr_importance='feature_importances_')[source]¶
The compile method is the first step in computing data drift. It calculates data drift between two datasets using a data drift classification model. Most of the parameters are optional, but they help to adapt the data drift calculation if necessary. This step can take some time with large datasets.
- Parameters
full_validation (bool, optional (default: False)) – If True, analyze consistency on modalities between columns
ignore_cols (list, optional) – list of features to ignore during the computation
sampling (bool, optional) – If True, applies the sampling
sample_size (int, optional) – the size of the sample to build
date_compile_auc (str, optional) – Date, in dd/mm/yyyy format, to record as the date of the drift computation. Useful when drift for several different periods is computed in a single run
hyperparameter (dict, optional) – CatBoost hyperparameters to use if the user wants to modify the defaults
attr_importance (string, optional (default: "feature_importances_")) – Name of the attribute of deployed_model that holds its feature importances
datadrift_file (str, optional) – Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged
Examples
>>> SD.compile()
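The optional arguments can be prepared ahead of the call. The sketch below copies the default hyperparameter dictionary from the signature above, overrides two values, and checks that a date string matches the dd/mm/yyyy format. The column name "customer_id", the helper check_compile_date and the chosen values are assumptions for illustration; whether the library merges a partial dictionary with its defaults is not stated in the source, so the full dictionary is passed.

```python
from datetime import datetime

# Default values as listed in the compile() signature.
default_hyperparameter = {
    "custom_loss": ["Accuracy", "AUC", "Logloss"],
    "eval_metric": "Logloss",
    "iterations": 250,
    "l2_leaf_reg": 19,
    "learning_rate": 0.166905,
    "loss_function": "Logloss",
    "max_depth": 8,
    "use_best_model": True,
}

# Copy the defaults and override only what should change.
custom_hyperparameter = {**default_hyperparameter, "iterations": 100, "max_depth": 6}


def check_compile_date(text):
    """Hypothetical helper: raise ValueError unless text is dd/mm/yyyy."""
    datetime.strptime(text, "%d/%m/%Y")
    return text


compile_kwargs = {
    "ignore_cols": ["customer_id"],  # hypothetical column name
    "sample_size": 50000,
    "date_compile_auc": check_compile_date("01/02/2022"),
    "hyperparameter": custom_hyperparameter,
}

# With a SmartDrift instance named SD, the call would then be:
# SD.compile(**compile_kwargs)
```

Validating the date before calling compile keeps a formatting mistake from surfacing only after a long computation has started.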
- generate_report(output_file, project_info_file=None, title_story='Drift Report', title_description='', working_dir=None)[source]¶
This method generates an HTML report containing information about the project and renders the information compiled beforehand. It can be combined with a project info yml file that holds additional information about the project.
- Parameters
output_file (str) – Path to the HTML file to write
project_info_file (str) – Path to the file used to display some information about the project in the report
title_story (str, optional) – Report title
title_description (str, optional) – Report title description (as written just below the title)
working_dir (str, optional) – Working directory in which the notebook used to create the report is generated and where the objects used to execute it are saved. This parameter can be useful for creating a custom report or for debugging the notebook used to generate the HTML report. If None, a temporary directory is used
Examples
>>> SD.generate_report(
...     output_file='report.html',
...     project_info_file='project_info.yml',
...     title_story="Drift project report",
...     title_description="This document is a drift report of the score in production"
... )