{ "cells": [ { "cell_type": "markdown", "id": "9b6f3ff7", "metadata": {}, "source": [ "# Detect High Model Drift\n", "With this tutorial, you will understand how to use Eurybia to detect data drift.\n", "\n", "Contents:\n", "- Detect data drift\n", "- Compile drift over years\n", "\n", "This public dataset comes from:\n", "\n", "https://www.kaggle.com/sobhanmoosavi/us-accidents/version/10\n", "\n", "---\n", "Acknowledgements\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", "---\n", "\n", "In this tutorial, the data are not loaded raw: they have already been prepared to make the tutorial easier to follow. You can find the data preparation notebook here:\n", "https://github.com/MAIF/eurybia/blob/master/eurybia/data/dataprep_US_car_accidents.ipynb" ] },
{ "cell_type": "markdown", "id": "6ee7dedd", "metadata": {}, "source": [ "**Requirements notice**: the following tutorial may use third-party modules not included in Eurybia.  \n", "You can find them all in one file [on our GitHub repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt), or you can manually install any that you are missing." 
] }, { "cell_type": "code", "execution_count": 2, "id": "8c767469", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from category_encoders import OrdinalEncoder\n", "import catboost\n", "from eurybia import SmartDrift\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import metrics\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "939acff0", "metadata": {}, "source": [ "## Import Dataset and split in training and production dataset" ] }, { "cell_type": "code", "execution_count": 3, "id": "e0b10b1b", "metadata": {}, "outputs": [], "source": [ "from eurybia.data.data_loader import data_loading" ] }, { "cell_type": "code", "execution_count": 4, "id": "6d3d1d90", "metadata": {}, "outputs": [], "source": [ "df_car_accident = data_loading(\"us_car_accident\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "8a0a6ef4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>Start_Lat</th><th>Start_Lng</th><th>Distance(mi)</th><th>Temperature(F)</th><th>Humidity(%)</th><th>Visibility(mi)</th><th>day_of_week_acc</th><th>Nautical_Twilight</th><th>season_acc</th><th>target</th><th>target_multi</th><th>year_acc</th><th>Description</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>33.0</td><td>-117.1</td><td>0.0</td><td>40.0</td><td>93.0</td><td>2.0</td><td>3</td><td>Day</td><td>winter</td><td>0</td><td>2</td><td>2019</td><td>At Carmel Mountain Rd - Accident.</td></tr>\n",
"    <tr><th>1</th><td>29.5</td><td>-98.5</td><td>0.0</td><td>83.0</td><td>65.0</td><td>10.0</td><td>4</td><td>Day</td><td>summer</td><td>1</td><td>3</td><td>2017</td><td>At TX-345-SP/Woodlawn Ave/Exit 567B - Accident.</td></tr>\n",
"    <tr><th>2</th><td>32.7</td><td>-96.8</td><td>0.0</td><td>88.0</td><td>57.0</td><td>10.0</td><td>0</td><td>Night</td><td>summer</td><td>0</td><td>2</td><td>2021</td><td>Incident on RUGGED DR near BERKLEY AVE Expect ...</td></tr>\n",
"    <tr><th>3</th><td>40.0</td><td>-76.3</td><td>0.0</td><td>61.0</td><td>58.0</td><td>10.0</td><td>4</td><td>Day</td><td>spring</td><td>0</td><td>2</td><td>2020</td><td>At PA-741/Rohrerstown Rd - Accident.</td></tr>\n",
"    <tr><th>4</th><td>41.5</td><td>-81.8</td><td>1.0</td><td>71.0</td><td>53.0</td><td>10.0</td><td>0</td><td>Day</td><td>summer</td><td>0</td><td>2</td><td>2020</td><td>At 117th St/Exit 166 - Accident.</td></tr>\n",
"  </tbody>\n", "</table>
\n" ], "text/plain": [ " Start_Lat Start_Lng Distance(mi) Temperature(F) Humidity(%) \\\n", "0 33.0 -117.1 0.0 40.0 93.0 \n", "1 29.5 -98.5 0.0 83.0 65.0 \n", "2 32.7 -96.8 0.0 88.0 57.0 \n", "3 40.0 -76.3 0.0 61.0 58.0 \n", "4 41.5 -81.8 1.0 71.0 53.0 \n", "\n", " Visibility(mi) day_of_week_acc Nautical_Twilight season_acc target \\\n", "0 2.0 3 Day winter 0 \n", "1 10.0 4 Day summer 1 \n", "2 10.0 0 Night summer 0 \n", "3 10.0 4 Day spring 0 \n", "4 10.0 0 Day summer 0 \n", "\n", " target_multi year_acc Description \n", "0 2 2019 At Carmel Mountain Rd - Accident. \n", "1 3 2017 At TX-345-SP/Woodlawn Ave/Exit 567B - Accident. \n", "2 2 2021 Incident on RUGGED DR near BERKLEY AVE Expect ... \n", "3 2 2020 At PA-741/Rohrerstown Rd - Accident. \n", "4 2 2020 At 117th St/Exit 166 - Accident. " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_car_accident.head()" ] },
{ "cell_type": "code", "execution_count": 7, "id": "05039303", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50000, 13)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_car_accident.shape" ] },
{ "cell_type": "code", "execution_count": 8, "id": "a1d226fa", "metadata": {}, "outputs": [], "source": [ "# Let us consider that the column \"year_acc\" corresponds to the reference date.\n", "# In 2016, a model was trained on the data from that year. In the following years, we want to detect data drift on the new production data that the model has to predict.\n", "df_accident_baseline = df_car_accident.loc[df_car_accident['year_acc'] == 2016]\n", "df_accident_2017 = df_car_accident.loc[df_car_accident['year_acc'] == 2017]\n", "df_accident_2018 = df_car_accident.loc[df_car_accident['year_acc'] == 2018]\n", "df_accident_2019 = df_car_accident.loc[df_car_accident['year_acc'] == 2019]\n", "df_accident_2020 = df_car_accident.loc[df_car_accident['year_acc'] == 2020]\n", "df_accident_2021 = df_car_accident.loc[df_car_accident['year_acc'] == 2021]" ] },
{ "cell_type": "code", "execution_count": 9, "id": "1e81bb4e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th>target</th><th>0</th><th>1</th></tr>\n", "    <tr><th>year_acc</th><th></th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>2016</th><td>71.406287</td><td>28.593713</td></tr>\n",
"    <tr><th>2017</th><td>67.254620</td><td>32.745380</td></tr>\n",
"    <tr><th>2018</th><td>66.634662</td><td>33.365338</td></tr>\n",
"    <tr><th>2019</th><td>79.551182</td><td>20.448818</td></tr>\n",
"    <tr><th>2020</th><td>89.944804</td><td>10.055196</td></tr>\n",
"    <tr><th>2021</th><td>98.259930</td><td>1.740070</td></tr>\n",
"  </tbody>\n", "</table>
\n" ], "text/plain": [ "target 0 1\n", "year_acc \n", "2016 71.406287 28.593713\n", "2017 67.254620 32.745380\n", "2018 66.634662 33.365338\n", "2019 79.551182 20.448818\n", "2020 89.944804 10.055196\n", "2021 98.259930 1.740070" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We will train a classification model to predict the severity of an accident: 0 for a less severe accident and 1 for a severe accident.\n", "# Let's check the percentage of classes 0 and 1 for each year\n", "pd.crosstab(df_car_accident.year_acc, df_car_accident.target, normalize = 'index')*100" ] },
{ "cell_type": "code", "execution_count": 10, "id": "c13ca2a5", "metadata": {}, "outputs": [], "source": [ "y_df_learning=df_accident_baseline['target'].to_frame()\n", "X_df_learning=df_accident_baseline[df_accident_baseline.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]\n", "\n", "y_df_2017=df_accident_2017['target'].to_frame()\n", "X_df_2017=df_accident_2017[df_accident_2017.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]\n", "\n", "y_df_2018=df_accident_2018['target'].to_frame()\n", "X_df_2018=df_accident_2018[df_accident_2018.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]\n", "\n", "y_df_2019=df_accident_2019['target'].to_frame()\n", "X_df_2019=df_accident_2019[df_accident_2019.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]\n", "\n", "y_df_2020=df_accident_2020['target'].to_frame()\n", "X_df_2020=df_accident_2020[df_accident_2020.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]\n", "\n", "y_df_2021=df_accident_2021['target'].to_frame()\n", "X_df_2021=df_accident_2021[df_accident_2021.columns.difference([\"target\", \"target_multi\", \"year_acc\", \"Description\"])]" ] },
{ "cell_type": "markdown", "id": "676b7cd8", "metadata": {}, "source": [ "## Building a Supervised Model" ] },
{ "cell_type": "code", "execution_count": 11, "id": "daba7f7d", "metadata": {}, "outputs": [], "source": [ "features = ['Start_Lat', 'Start_Lng', 'Distance(mi)', 'Temperature(F)',\n", "            'Humidity(%)', 'Visibility(mi)', 'day_of_week_acc', 'Nautical_Twilight',\n", "            'season_acc']" ] },
{ "cell_type": "code", "execution_count": 12, "id": "8f971d9b", "metadata": {}, "outputs": [], "source": [ "features_to_encode = [col for col in X_df_learning[features].columns if X_df_learning[col].dtype not in ('float64','int64')]\n", "\n", "encoder = OrdinalEncoder(cols=features_to_encode)\n", "encoder = encoder.fit(X_df_learning[features])\n", "\n", "X_df_learning_encoded=encoder.transform(X_df_learning)" ] },
{ "cell_type": "code", "execution_count": 13, "id": "1e7fc14c", "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)" ] },
{ "cell_type": "code", "execution_count": 14, "id": "d65eadcb", "metadata": {}, "outputs": [], "source": [ "train_pool_cat = catboost.Pool(data=Xtrain, label= ytrain, cat_features = features_to_encode)\n", "test_pool_cat = catboost.Pool(data=Xtest, label= ytest, cat_features = features_to_encode)" ] },
{ "cell_type": "code", "execution_count": 15, "id": "8bcecc82", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "69e0032963b14e3d8792d75564cd1a25", "version_major": 2, "version_minor": 0 }, "text/plain": [ "MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = catboost.CatBoostClassifier(loss_function= \"Logloss\", eval_metric=\"Logloss\",\n", "                                    learning_rate=0.143852,\n", "                                    iterations=300,\n", "                                    l2_leaf_reg=15,\n", "                                    max_depth = 4,\n", "                                    use_best_model=True,\n", "                                    custom_loss=['Accuracy', 'AUC', 'Logloss'])\n", "\n", "model = model.fit(train_pool_cat, plot=True,eval_set=test_pool_cat, verbose=False)" ] },
{ "cell_type": "code", "execution_count": 16, "id": "ae73b71a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7589233355711246\n" ] } ], "source": [ "proba = model.predict_proba(Xtest)\n", "print(metrics.roc_auc_score(ytest,proba[:,1]))" ] },
{ "cell_type": "markdown", "id": "f8010a48", "metadata": {}, "source": [ "## Use Eurybia for data validation" ] },
{ "cell_type": "code", "execution_count": 17, "id": "c7ae204e", "metadata": {}, "outputs": [], "source": [ "from eurybia import SmartDrift" ] },
{ "cell_type": "code", "execution_count": 18, "id": "f8456034", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2017,\n", "                df_baseline=X_df_learning,\n", "                deployed_model=model, # Optional: puts the drift results in perspective with the feature importance of the deployed model\n", "                encoding=encoder # Optional: the encoder required to use deployed_model, if there is one\n", "                )" ] },
{ "cell_type": "code", "execution_count": 19, "id": "c7ae204e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: total: 0 ns\n", "Wall time: 0 ns\n", "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.6585689489728102\n", "car_accident_auc.csv did not exist and was created. \n" ] } ], "source": [ "%time\n", "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", "           date_compile_auc = '01/01/2017', # Optional: useful when computing the drift for a time that is not now\n", "           datadrift_file = \"car_accident_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", "           )" ] },
{ "cell_type": "markdown", "id": "01c2f690", "metadata": {}, "source": [ "As soon as the compile() method is called, Eurybia displays the default consistency checks as warnings.  \n", "If some modalities are absent from the training data but present in the production dataset, the deployed model will not handle them correctly.  \n", "Conversely, if some modalities are present in the training data but absent from the production dataset, it means that some profiles are missing." ] },
{ "cell_type": "markdown", "id": "c733b40f", "metadata": {}, "source": [ "## Add model drift to the report" ] },
{ "cell_type": "markdown", "id": "ba8578c8", "metadata": {}, "source": [ "For the moment, the model drift part of Eurybia consists only of displaying the performance of the deployed model.  \n", "(We hope to bring new features to this part in the future.)" ] },
{ "cell_type": "markdown", "id": "65e4592d", "metadata": {}, "source": [ "### Put model performance in a DataFrame" ] },
{ "cell_type": "code", "execution_count": 20, "id": "f53935dd", "metadata": {}, "outputs": [], "source": [ "proba = model.predict_proba(X_df_2017)\n", "performance = metrics.roc_auc_score(y_df_2017,proba[:,1]).round(5)" ] },
{ "cell_type": "code", "execution_count": 21, "id": "4be8debb", "metadata": {}, "outputs": [], "source": [ "# Create a DataFrame to track performance over the years\n", "df_performance = pd.DataFrame({'annee': [2017], 'mois':[1], 'performance': [performance]})" ] },
{ "cell_type": "code", "execution_count": 22, "id": "136261b6", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance')" ] },
{ "cell_type": "code", "execution_count": 23, "id": "af9bf77a", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_car_accident_modeldrift_2017.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report(\n", "    output_file='report_car_accident_modeldrift_2017.html',\n", "    title_story=\"Model drift Report\",\n", "    title_description=\"\"\"US Car accident model drift 2017\"\"\", # Optional: add a subtitle to describe report\n", "    project_info_file=\"../../eurybia/data/project_info_car_accident.yml\" # Optional: add information on report\n", "    )" ] },
{ "cell_type": "markdown", "id": "0aca5ec4", "metadata": {}, "source": [ "This tutorial only covers the additional features related to model drift. For more detailed information on data drift, you can consult this tutorial: (https://github.com/MAIF/eurybia/tree/master/tutorial/model_drift/tutorial02-datadrift-high-datadrift.ipynb)" ] },
{ "cell_type": "markdown", "id": "6710b459", "metadata": {}, "source": [ "## Compile drift over years" ] },
{ "cell_type": "markdown", "id": "4bd535e1", "metadata": {}, "source": [ "### Compile drift and generate report for year 2018" ] },
{ "cell_type": "code", "execution_count": 24, "id": "756c9de1", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2018,\n", "                df_baseline=X_df_learning,\n", "                deployed_model=model, # Optional: puts the drift results in perspective with the feature importance of the deployed model\n", "                encoding=encoder # Optional: the encoder required to use deployed_model, if there is one\n", "                )" ] },
{ "cell_type": "code", "execution_count": 25, "id": "572b1f06", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.7036329129677259\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", "           date_compile_auc = '01/01/2018', # Optional: useful when computing the drift for a time that is not now\n", "           datadrift_file = \"car_accident_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", "           )" ] },
{ "cell_type": "code", "execution_count": 26, "id": "ecebfa0c", "metadata": {}, "outputs": [], "source": [ "proba = model.predict_proba(X_df_2018)\n", "performance = metrics.roc_auc_score(y_df_2018,proba[:,1]).round(5)\n", "# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n", "df_performance = pd.concat([df_performance, pd.DataFrame({'annee': [2018], 'mois': [1], 'performance': [performance]})], ignore_index=True)" ] },
{ "cell_type": "markdown", "id": "810c6da6", "metadata": {}, "source": [ "### Compile drift and generate report for year 2019" ] },
{ "cell_type": "code", "execution_count": 27, "id": "0912c225", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2019,\n", "                df_baseline=X_df_learning,\n", "                deployed_model=model, # Optional: puts the drift results in perspective with the feature importance of the deployed model\n", "                encoding=encoder # Optional: the encoder required to use deployed_model, if there is one\n", "                )" ] },
{ "cell_type": "code", "execution_count": 28, "id": "eacffb97", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.7856527709300022\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", "           date_compile_auc = '01/01/2019', # Optional: useful when computing the drift for a time that is not now\n", "           datadrift_file = \"car_accident_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", "           )" ] },
{ "cell_type": "code", "execution_count": 29, "id": "985c1960", "metadata": {}, "outputs": [], "source": [ "proba = model.predict_proba(X_df_2019)\n", "performance = metrics.roc_auc_score(y_df_2019,proba[:,1]).round(5)\n", "df_performance = pd.concat([df_performance, pd.DataFrame({'annee': [2019], 'mois': [1], 'performance': [performance]})], ignore_index=True)" ] },
{ "cell_type": "markdown", "id": "1fbd247b", "metadata": {}, "source": [ "### Compile drift and generate report for year 2020" ] },
{ "cell_type": "code", "execution_count": 30, "id": "bf363bc6", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2020,\n", "                df_baseline=X_df_learning,\n", "                deployed_model=model, # Optional: puts the drift results in perspective with the feature importance of the deployed model\n", "                encoding=encoder # Optional: the encoder required to use deployed_model, if there is one\n", "                )" ] },
{ "cell_type": "code", "execution_count": 31, "id": "f7b102bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.7902450838961592\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", "           date_compile_auc = '01/01/2020', # Optional: useful when computing the drift for a time that is not now\n", "           datadrift_file = \"car_accident_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", "           )" ] },
{ "cell_type": "code", "execution_count": 32, "id": "2636bcb7", "metadata": {}, "outputs": [], "source": [ "proba = model.predict_proba(X_df_2020)\n", "performance = metrics.roc_auc_score(y_df_2020,proba[:,1]).round(5)\n", "df_performance = pd.concat([df_performance, pd.DataFrame({'annee': [2020], 'mois': [1], 'performance': [performance]})], ignore_index=True)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "1846cdbe", "metadata": {}, "source": [ "### Compile drift and generate report for year 2021" ] },
{ "cell_type": "code", "execution_count": 33, "id": "da3c7624", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2021,\n", "                df_baseline=X_df_learning,\n", "                deployed_model=model, # Optional: puts the drift results in perspective with the feature importance of the deployed model\n", "                encoding=encoder # Optional: the encoder required to use deployed_model, if there is one\n", "                )" ] },
{ "cell_type": "code", "execution_count": 34, "id": "6b838b56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.7500011519622525\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", "           date_compile_auc = '01/01/2021', # Optional: useful when computing the drift for a time that is not now\n", "           datadrift_file = \"car_accident_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", "           )" ] },
{ "cell_type": "code", "execution_count": 35, "id": "ff3d4d8a", "metadata": {}, "outputs": [], "source": [ "proba = model.predict_proba(X_df_2021)\n", "performance = metrics.roc_auc_score(y_df_2021,proba[:,1]).round(5)\n", "df_performance = pd.concat([df_performance, pd.DataFrame({'annee': [2021], 'mois': [1], 'performance': [performance]})], ignore_index=True)" ] },
{ "cell_type": "code", "execution_count": 36, "id": "f9d09d5e", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance')" ] },
{ "cell_type": "code", "execution_count": 37, "id": "a936527c", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.plot.generate_historical_datadrift_metric() # works if date_compile_auc and/or datadrift_file are filled" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "2324467b", "metadata": {}, "source": [ "In 2019 and 2020, data drift is very high. Is there any impact on the performance of the model?" ] },
{ "cell_type": "code", "execution_count": 38, "id": "64665647", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.plot.generate_modeldrift_data() # works if add_data_modeldrift was called before" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "438706e2", "metadata": {}, "source": [ "While data drift was high in 2019, the impact on model performance is low. In 2020, data drift leads to a decrease in model performance." 
] }, { "cell_type": "code", "execution_count": 39, "id": "c9089d96", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_car_accident_modeldrift_2021.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_car_accident_modeldrift_2021.html', \n", " title_story=\"Model drift Report\",\n", " title_description=\"\"\"US Car accident model drift 2021\"\"\", # Optional: add a subtitle to describe report\n", " project_info_file=\"../../eurybia/data/project_info_car_accident.yml\" # Optional: add information on report\n", " ) " ] } ], "metadata": { "kernelspec": { "display_name": "eurybia_3_9", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "336px" }, "toc_section_display": true, "toc_window_display": true }, "vscode": { "interpreter": { "hash": "36c4204cc0170e083c18487e195263df35fcafba9d65a5415ab6b0958d51e154" } } }, "nbformat": 4, "nbformat_minor": 5 }