SmartDrift Object
- class eurybia.core.smartdrift.SmartDrift(df_current: pandas.core.frame.DataFrame, df_baseline: pandas.core.frame.DataFrame, dataset_names: tuple[str, str] = ('Current', 'Baseline'), deployed_model: Optional[Any] = None, encoding: Optional[Any] = None, palette_name: str = 'eurybia', colors_dict: Optional[dict] = None)
Bases: object
The SmartDrift class is the main object for computing drift in the Eurybia library. It measures data drift between two datasets using a data drift classification model.
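The idea behind a data drift classification model can be illustrated with a minimal, standard-library sketch. It is not Eurybia's implementation: it assumes a single numeric feature and uses the raw feature value as the "classifier score", where a real drift classifier would be trained on all features. The helper name `auc_from_scores` and the toy samples are hypothetical. An AUC close to 0.5 means the two datasets cannot be told apart, and an AUC close to 1 means they are easy to separate, which indicates drift.

```python
import random


def auc_from_scores(scores_baseline, scores_current):
    """Probability that a random 'current' row scores higher than a random
    'baseline' row (ties count as 0.5). This equals the ROC AUC of a
    classifier whose positive class is the current dataset."""
    wins = 0.0
    for c in scores_current:
        for b in scores_baseline:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(scores_current) * len(scores_baseline))


# Toy data (assumption): one Gaussian feature, with and without a shift.
rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(300)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(300)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(300)]

auc_no_drift = auc_from_scores(baseline, same_dist)  # expected near 0.5
auc_drift = auc_from_scores(baseline, shifted)       # expected near 1.0
```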
- df_current: pandas.DataFrame
current (or production) dataset which is compared to df_baseline
- df_baseline: pandas.DataFrame
baseline (or learning) dataset which is compared to df_current
- datadrift_classifier: model object
model used for binary classification of data drift
- xpl: Shapash object
object used to compute explainability on datadrift_classifier
- df_predict: pandas.DataFrame
computed score on both datasets if a deployed_model is specified
- feature_importance: pandas.DataFrame
feature importance of datadrift_classifier and feature importance of the production model, if one exists
- pb_cols: dict
Dictionary that references column differences between df_current and df_baseline
- err_mods: dict
Dictionary that references modality differences in columns between df_current and df_baseline
- auc: float
AUC of the drift classification model
- historical_auc: pandas.DataFrame
Dataframe that contains the AUC history of datadrift_classifier over time
- data_modeldrift: pandas.DataFrame
Dataframe that contains the performance history of deployed_model
- ignore_cols: list
List of features to ignore in the computation
- dataset_names: dict, (Optional)
Dictionary used to specify the dataset names to display in the report.
- df_concat: pandas.DataFrame
Dataframe composed of df_baseline and df_current concatenated
- plot: eurybia.core.smartplotter.SmartPlotter
Instance of the Eurybia SmartPlotter class, used to display graphs.
- deployed_model: model object, optional
model in production, used to put drift into perspective and to compute predictions
- encoding: preprocessing object, optional (default: None)
Preprocessing used before the training step
- datadrift_stat_test: dict
Data drift statistical tests for each feature. Each test identifies whether the feature has drifted. Two tests are implemented, depending on the type of feature:
    - Chi-square for discrete variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
    - Kolmogorov-Smirnov for continuous variables - ref: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)
For each feature, the datadrift_stat_test attribute specifies the test performed, the test statistic and the p-value.
- palette_name: str (default: 'eurybia')
Name of the palette used for the colors of the report (refer to style folder).
- colors_dict: dict
Dict of the colors used in the different plots
- js_divergence: float
Jensen-Shannon divergence of probability distributions - ref: (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)
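Two of the statistical quantities listed above can be sketched with the standard library alone. This is an illustration under assumptions, not Eurybia's code: `ks_statistic` and `js_divergence` are hypothetical helper names, the Kolmogorov-Smirnov function returns only the statistic (no p-value), and the Jensen-Shannon divergence uses base-2 logarithms on distributions that are already binned. How Eurybia bins data and which logarithm base it uses is not stated here. In practice the SciPy functions linked above would be the natural choice.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) between two
    discrete probability distributions given as equal-length lists."""
    def kl(u, v):
        # Kullback-Leibler divergence; terms with u_i == 0 contribute 0.
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical inputs give 0 for both functions, while fully separated samples or distributions with disjoint support give 1.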
How to declare a new SmartDrift object?
>>> SD = SmartDrift(df_current=df_production, df_baseline=df_learning)
- add_data_modeldrift(dataset: pandas.core.frame.DataFrame, metric: str = 'performance', reference_columns: Optional[list[str]] = None, year_col: str = 'annee', month_col: str = 'mois')
When model drift data is provided, the report displays several plots built from the dataframe to analyze the model drift of the deployed model. Each plot represents one computed metric for each of its groups (grouped by date (year-month) and reference_columns).
- Parameters
dataset (pd.DataFrame) – The Dataframe with all the computed metrics.
metric (str, (default: 'performance')) – The column name of the metric computed
reference_columns (list, (default: None)) – The column names to use for aggregation, together with the computed date
year_col (str, (default: 'annee')) – The column name of the year where the metric has been computed
month_col (str, (default: 'mois')) – The column name of the month where the metric has been computed
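The kind of table this method expects can be sketched as follows. This is a hedged illustration: the scored records and the `monthly_accuracy` helper are hypothetical, and accuracy stands in for whichever metric is actually tracked. Only the column names `annee`, `mois` and `performance` come from the default parameter values above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical scored records: (prediction date, predicted label, true label).
records = [
    (date(2023, 1, 5), 1, 1),
    (date(2023, 1, 20), 0, 1),
    (date(2023, 2, 3), 1, 1),
    (date(2023, 2, 17), 0, 0),
]


def monthly_accuracy(rows):
    """Group records by (year, month) and compute accuracy for each group."""
    hits = defaultdict(list)
    for day, predicted, actual in rows:
        hits[(day.year, day.month)].append(predicted == actual)
    return [
        {"annee": year, "mois": month, "performance": sum(ok) / len(ok)}
        for (year, month), ok in sorted(hits.items())
    ]


metrics = monthly_accuracy(records)
# With pandas installed, these rows could be wrapped as pandas.DataFrame(metrics)
# and passed as the `dataset` argument of add_data_modeldrift.
```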
- compile(full_validation: bool = False, ignore_cols: Optional[list[str]] = None, sampling: bool = True, sample_size: int = 100000, datadrift_file: Optional[str] = None, date_compile_auc: Optional[datetime.date] = None, hyperparameter: Optional[dict] = None, attr_importance: str = 'feature_importances_')
The compile method is the first step in computing data drift. It measures data drift between two datasets using a data drift classification model. Most of the parameters are optional but help to adapt the data drift calculation if necessary. This step can take a while with large datasets.
- Parameters
full_validation (bool, optional (default: False)) – If True, analyze consistency on modalities between columns
ignore_cols (list, optional) – List of features to ignore in the computation
sampling (bool, optional (default: True)) – If True, applies sampling
sample_size (int, optional (default: 100000)) – The size of the sample to build
date_compile_auc (date, optional) – Date to associate with the drift computation; useful when computing drift for several periods in a single run
hyperparameter (dict, optional) – CatBoost hyperparameters to use if the user wants to override the defaults
attr_importance (str, optional (default: "feature_importances_")) – Name of the deployed_model attribute that holds its feature importances
datadrift_file (str, optional) – Name of the csv file that contains the performance history of data drift. If no datadrift file is given, the drift will not be logged
Examples
>>> SD.compile()
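The effect of the sampling and sample_size parameters can be pictured with a small standard-library sketch. It is an assumption about the general behaviour (cap a dataset at sample_size rows), not a description of how Eurybia samples internally; the `sample_rows` helper and its `seed` argument are hypothetical.

```python
import random


def sample_rows(rows, sample_size=100000, seed=0):
    """Return at most `sample_size` rows, drawn uniformly without replacement.
    Datasets that are already small enough are returned unchanged."""
    if len(rows) <= sample_size:
        return list(rows)
    return random.Random(seed).sample(rows, sample_size)


small = sample_rows(list(range(10)), sample_size=100)    # kept as-is
large = sample_rows(list(range(1000)), sample_size=50)   # reduced to 50 rows
```

Capping the number of rows bounds the training time of the drift classifier, which is why the step stays manageable on large datasets.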
- generate_report(output_file: str, project_info_file: Optional[str] = None, title_story: str = 'Drift Report', title_description: str = '', working_dir: Optional[str] = None)
This method generates an HTML report that renders the compiled drift information. It can be combined with a project info YAML file that holds additional information about the project.
- Parameters
output_file (str) – Path to the HTML file to write
project_info_file (str, optional) – Path to the file used to display some information about the project in the report
title_story (str, optional) – Report title
title_description (str, optional) – Report title description (as written just below the title)
working_dir (str, optional) – Working directory in which the notebook used to create the report is generated and where the objects used to execute it are saved. This parameter can be useful for creating a custom report or debugging the notebook used to generate the HTML report. If None, a temporary directory is used
Examples
>>> SD.generate_report(
...     output_file='report.html',
...     project_info_file='project_info.yml',
...     title_story="Drift project report",
...     title_description="This document is a drift report of the score in production"
... )