{ "cells": [ { "cell_type": "markdown", "id": "d3a55be4", "metadata": {}, "source": [ "# Modeldrift with Eurybia\n", "With this tutorial you:
\n", "Understand how to use Eurybia to detect model drift\n", "\n", "Contents:\n", "- Detect data drift \n", "- Display model drift over years\n", "\n", "This tutorial contains only additional features of model drift.\n", "For more detailed information on data drift, you can consult these tutorials :\n", "(https://github.com/MAIF/eurybia/tree/master/tutorial/data_drift)" ] }, { "cell_type": "markdown", "id": "7dab5e19", "metadata": {}, "source": [ "**Requirements notice** : the following tutorial may use third party modules not included in Eurybia. \n", "You can find them all in one file [on our Github repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt) or you can manually install those you are missing, if any." ] }, { "cell_type": "code", "execution_count": 2, "id": "ba3029c1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from category_encoders import OrdinalEncoder\n", "from lightgbm import LGBMRegressor\n", "from eurybia import SmartDrift\n", "from eurybia.data.data_loader import data_loading\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_log_error" ] }, { "cell_type": "markdown", "id": "a37f9001", "metadata": {}, "source": [ "## Import Dataset and split in training and production dataset" ] }, { "cell_type": "code", "execution_count": 3, "id": "5e301c02", "metadata": {}, "outputs": [], "source": [ "house_df, house_dict = data_loading('house_prices')" ] }, { "cell_type": "code", "execution_count": 4, "id": "fd3a5e27", "metadata": {}, "outputs": [], "source": [ "# Let us consider that the column \"YrSold\" corresponds to the reference date. \n", "#In 2006, a model was trained using data. 
And in 2007, we want to detect data drift on new data in production to predict\n", "# house prices\n", "house_df_learning = house_df.loc[house_df['YrSold'] == 2006]\n", "house_df_2007 = house_df.loc[house_df['YrSold'] == 2007]" ] }, { "cell_type": "code", "execution_count": 5, "id": "d747da67", "metadata": {}, "outputs": [], "source": [ "y_df_learning=house_df_learning['SalePrice'].to_frame()\n", "X_df_learning=house_df_learning[house_df_learning.columns.difference(['SalePrice','YrSold'])]\n", "\n", "y_df_2007=house_df_2007['SalePrice'].to_frame()\n", "X_df_2007=house_df_2007[house_df_2007.columns.difference(['SalePrice','YrSold'])]" ] }, { "cell_type": "markdown", "id": "f280f685", "metadata": {}, "source": [ "## Building Supervised Model\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2c9af09e", "metadata": {}, "outputs": [], "source": [ "categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']\n", "\n", "encoder = OrdinalEncoder(\n", " cols=categorical_features,\n", " handle_unknown='ignore',\n", " return_df=True).fit(X_df_learning)\n", "\n", "X_df_learning_encoded=encoder.transform(X_df_learning)" ] }, { "cell_type": "code", "execution_count": 7, "id": "ec4277c7", "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)" ] }, { "cell_type": "code", "execution_count": 8, "id": "d3f7cc5d", "metadata": {}, "outputs": [], "source": [ "regressor = LGBMRegressor(n_estimators=200).fit(Xtrain,ytrain)" ] }, { "cell_type": "markdown", "id": "086c7e3d", "metadata": {}, "source": [ "## Use Eurybia for data drift" ] }, { "cell_type": "code", "execution_count": 9, "id": "5bd64f9e", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2007,\n", " df_baseline=X_df_learning,\n", " deployed_model=regressor, # Optional: put in perspective 
result with importance on deployed model\n", " encoding=encoder # Optional: if deployed_model and encoder to use this model\n", " )" ] }, { "cell_type": "code", "execution_count": 10, "id": "bead8a97", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9 µs, sys: 0 ns, total: 9 µs\n", "Wall time: 30 µs\n", "The variable BsmtCond has mismatching unique values:\n", "['Poor -Severe cracking, settling, or wetness'] | []\n", "\n", "The variable Condition2 has mismatching unique values:\n", "['Near positive off-site feature--park, greenbelt, etc.', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad'] | ['Adjacent to feeder street']\n", "\n", "The variable Electrical has mismatching unique values:\n", "['Mixed'] | ['60 AMP Fuse Box and mostly knob & tube wiring (poor)']\n", "\n", "The variable ExterQual has mismatching unique values:\n", "['Fair'] | []\n", "\n", "The variable Exterior1st has mismatching unique values:\n", "[] | ['Stone', 'Imitation Stucco']\n", "\n", "The variable Exterior2nd has mismatching unique values:\n", "['Asphalt Shingles', 'Brick Common'] | ['Other']\n", "\n", "The variable Foundation has mismatching unique values:\n", "[] | ['Stone', 'Wood']\n", "\n", "The variable Functional has mismatching unique values:\n", "['Major Deductions 2', 'Severely Damaged'] | ['Moderate Deductions']\n", "\n", "The variable GarageQual has mismatching unique values:\n", "[] | ['Excellent']\n", "\n", "The variable Heating has mismatching unique values:\n", "[] | ['Wall furnace']\n", "\n", "The variable HeatingQC has mismatching unique values:\n", "['Poor'] | []\n", "\n", "The variable LotConfig has mismatching unique values:\n", "[] | ['Frontage on 3 sides of property']\n", "\n", "The variable MSSubClass has mismatching unique values:\n", "['1-Story w/Finished Attic All Ages'] | []\n", "\n", "The variable Neighborhood has mismatching unique values:\n", "['Northpark Villa'] | []\n", "\n", "The variable 
RoofMatl has mismatching unique values:\n", "['Roll'] | ['Metal']\n", "\n", "The variable RoofStyle has mismatching unique values:\n", "['Mansard', 'Shed'] | []\n", "\n", "The variable SaleType has mismatching unique values:\n", "['Warranty Deed - Cash'] | ['Contract Low Interest', 'Contract Low Down', 'Contract Low Down payment and low interest']\n", "\n", "The variable Street has mismatching unique values:\n", "['Gravel'] | []\n", "\n", "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.626082251082251\n" ] } ], "source": [ "%time \n", "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", " date_compile_auc = '01/01/2007', # Optional: useful when computing the drift for a time that is not now\n", " datadrift_file = \"house_price_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", " )\n", " " ] }, { "cell_type": "markdown", "id": "625d0912", "metadata": {}, "source": [ "As soon as the compile() method is called, Eurybia displays the default consistency checks as warnings.
\n", "If some modalities are absent from the training dataset but present in the production dataset, the deployed model will not handle them correctly.
\n", "Conversely, if some modalities are present in the training dataset but absent from the production dataset, it means that some profiles are missing." ] }, { "cell_type": "markdown", "id": "a8ad7820", "metadata": {}, "source": [ "## Add model drift in report" ] }, { "cell_type": "markdown", "id": "e39dc67c", "metadata": {}, "source": [ "For the moment, the model drift part of Eurybia only consists of displaying the performance of the deployed model. \n", "(We hope to add new features to this part in the future.)" ] }, { "cell_type": "markdown", "id": "82d0de33", "metadata": {}, "source": [ "### Put model performance in DataFrame" ] }, { "cell_type": "code", "execution_count": 11, "id": "79ae3c07", "metadata": {}, "outputs": [], "source": [ "y_pred = regressor.predict(Xtest)" ] }, { "cell_type": "code", "execution_count": 12, "id": "28635fd0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.031487" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "performance_test = mean_squared_log_error(ytest, y_pred).round(6)\n", "performance_test" ] }, { "cell_type": "code", "execution_count": 13, "id": "c12a14e7", "metadata": {}, "outputs": [], "source": [ "# Create a DataFrame to track performance over the years ('annee' is the year, 'mois' is the month)\n", "df_performance = pd.DataFrame({'annee': [2006], 'mois':[1], 'performance': [performance_test]})" ] }, { "cell_type": "code", "execution_count": 14, "id": "4f164198", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.03309" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2007_encode=encoder.transform(X_df_2007)\n", "y_pred_2007 = regressor.predict(df_2007_encode)\n", "performance_2007 = mean_squared_log_error(y_df_2007, y_pred_2007).round(6)\n", "performance_2007" ] }, { "cell_type": "code", "execution_count": null, "id": "bd9fe858", "metadata": {}, "outputs": [], "source": [ "# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n", "df_performance = pd.concat([df_performance, pd.DataFrame([{'annee': 2007, 'mois': 1, 'performance': performance_2007}])], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "52912cfe", "metadata": {}, "source": [ "### Add performance DataFrame to SmartDrift" ] }, { "cell_type": "code", "execution_count": 16, "id": "f0e96f82", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance') " ] }, { "cell_type": "code", "execution_count": 17, "id": "ef937e7f", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_modeldrift_2007.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_house_price_modeldrift_2007.html', \n", " title_story=\"Data drift\",\n", " title_description=\"\"\"House price model drift 2007\"\"\", # Optional: add a subtitle to describe report\n", " project_info_file=\"../../eurybia/data/project_info_house_price.yml\" # Optional: add information on report\n", " )" ] }, { "cell_type": "markdown", "id": "84c8883b", "metadata": {}, "source": [ "Eurybia is designed to generate an HTML report for analysis, and less for use in notebook mode. \n", "However, to illustrate functionalities, we will detail results with notebook mode analysis." ] }, { "cell_type": "markdown", "id": "4add0130", "metadata": {}, "source": [ "This tutorial contains only analysis on additional features of model drift. 
For more detailed information on data drift, you can consult these tutorials: (https://github.com/MAIF/eurybia/tree/master/tutorial/data_drift)" ] }, { "cell_type": "markdown", "id": "88cfeb49", "metadata": {}, "source": [ "### Display model drift" ] }, { "cell_type": "code", "execution_count": 18, "id": "6d33cabf", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.plot.generate_modeldrift_data() # works if add_data_modeldrift used before" ] }, { "cell_type": "markdown", "id": "5f1241e2", "metadata": {}, "source": [ "### Display model drift with multiple indicators" ] }, { "cell_type": "markdown", "id": "08d89b46", "metadata": {}, "source": [ "If you have several metrics or indicators for performance monitoring, you can distinguish them with reference columns.\n", "Let's create a dummy performance table to show how this works." ] }, { "cell_type": "code", "execution_count": 19, "id": "e5dff49d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>indicator</th><th>level_1</th><th>annee</th><th>mois</th><th>performance</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>rmse</td><td>0</td><td>2006.0</td><td>1.0</td><td>0.031487</td></tr>\n",
"    <tr><th>1</th><td>rmse</td><td>1</td><td>2007.0</td><td>1.0</td><td>0.033090</td></tr>\n",
"    <tr><th>2</th><td>mse</td><td>0</td><td>2006.0</td><td>1.0</td><td>1.031988</td></tr>\n",
"    <tr><th>3</th><td>mse</td><td>1</td><td>2007.0</td><td>1.0</td><td>1.033644</td></tr>\n",
"  </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " indicator level_1 annee mois performance\n", "0 rmse 0 2006.0 1.0 0.031487\n", "1 rmse 1 2007.0 1.0 0.033090\n", "2 mse 0 2006.0 1.0 1.031988\n", "3 mse 1 2007.0 1.0 1.033644" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_performance_mse = df_performance.copy()\n", "df_performance_mse['performance']= np.exp(df_performance_mse['performance'])\n", "df_performance2 = pd.concat([df_performance, df_performance_mse], keys=[\"rmse\", \"mse\"]).reset_index().rename(columns={\"level_0\": \"indicator\"})\n", "df_performance2" ] }, { "cell_type": "code", "execution_count": 20, "id": "d38140b2", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance2,metric='performance',reference_columns=['indicator']) " ] }, { "cell_type": "code", "execution_count": 21, "id": "f55d65b4", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_modeldrift_2007.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_house_price_modeldrift_2007.html', \n", " title_story=\"Data drift\",\n", " title_description=\"\"\"House price model drift 2007\"\"\", # Optional: add a subtitle to describe report\n", " project_info_file=\"../../eurybia/data/project_info_house_price.yml\" # Optional: add information on report \n", " )" ] }, { "cell_type": "markdown", "id": "745f1602", "metadata": {}, "source": [ "## Compile Drift over years" ] }, { "cell_type": "markdown", "id": "836e07cc", "metadata": {}, "source": [ "### Compile Drift and generate report for Year 2008" ] }, { "cell_type": "code", "execution_count": 22, "id": "4b495e2a", "metadata": {}, "outputs": [], "source": [ "house_df_2008 = house_df.loc[house_df['YrSold'] == 2008]\n", "\n", "y_df_2008=house_df_2008['SalePrice'].to_frame()\n", "X_df_2008=house_df_2008[house_df_2008.columns.difference(['SalePrice','YrSold'])]" ] }, { "cell_type": "code", "execution_count": 23, "id": "a0afc6d0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.028883" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2008_encode=encoder.transform(X_df_2008)\n", "y_pred_2008 = regressor.predict(df_2008_encode)\n", "performance_2008 = mean_squared_log_error(y_df_2008, y_pred_2008).round(6)\n", "performance_2008" ] }, { "cell_type": "code", "execution_count": null, "id": "2eb05bf9", "metadata": {}, "outputs": [], "source": [ "# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n", "df_performance = pd.concat([df_performance, pd.DataFrame([{'annee': 2008, 'mois': 1, 'performance': performance_2008}])], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 25, "id": "25926a75", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2008,\n", " df_baseline=X_df_learning,\n", " deployed_model=regressor, # Optional: put in perspective result with importance on deployed model\n", " encoding=encoder # Optional: if 
deployed_model and encoder to use this model\n", " )" ] }, { "cell_type": "code", "execution_count": 26, "id": "aba273ec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The variable Condition1 has mismatching unique values:\n", "[\"Within 200' of East-West Railroad\"] | []\n", "\n", "The variable Condition2 has mismatching unique values:\n", "['Adjacent to arterial street', \"Within 200' of North-South Railroad\", 'Adjacent to postive off-site feature', 'Near positive off-site feature--park, greenbelt, etc.'] | []\n", "\n", "The variable Electrical has mismatching unique values:\n", "['Mixed'] | []\n", "\n", "The variable ExterCond has mismatching unique values:\n", "['Excellent'] | []\n", "\n", "The variable ExterQual has mismatching unique values:\n", "['Fair'] | []\n", "\n", "The variable Exterior1st has mismatching unique values:\n", "[] | ['Imitation Stucco']\n", "\n", "The variable Exterior2nd has mismatching unique values:\n", "[] | ['Other', 'Stone']\n", "\n", "The variable Foundation has mismatching unique values:\n", "[] | ['Slab', 'Wood']\n", "\n", "The variable Functional has mismatching unique values:\n", "['Major Deductions 2'] | []\n", "\n", "The variable GarageCond has mismatching unique values:\n", "['Excellent'] | ['Poor']\n", "\n", "The variable GarageQual has mismatching unique values:\n", "[] | ['Poor']\n", "\n", "The variable GarageType has mismatching unique values:\n", "[] | ['More than one type of garage']\n", "\n", "The variable Heating has mismatching unique values:\n", "['Hot water or steam heat other than gas', 'Floor Furnace'] | ['Wall furnace']\n", "\n", "The variable MSSubClass has mismatching unique values:\n", "['1-Story w/Finished Attic All Ages'] | []\n", "\n", "The variable Neighborhood has mismatching unique values:\n", "['Northpark Villa', 'Bluestem'] | []\n", "\n", "The variable RoofMatl has mismatching unique values:\n", "['Membrane', 'Clay or Tile'] | ['Metal']\n", "\n", "The variable 
SaleCondition has mismatching unique values:\n", "[] | ['Sale between family members']\n", "\n", "The variable SaleType has mismatching unique values:\n", "['Contract 15% Down payment regular terms', 'Warranty Deed - Cash'] | ['Contract Low Interest', 'Other']\n", "\n", "The variable Street has mismatching unique values:\n", "['Gravel'] | []\n", "\n", "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.6877714667557634\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", " date_compile_auc = '01/01/2008', # Optional: useful when computing the drift for a time that is not now\n", " datadrift_file = \"house_price_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", " )" ] }, { "cell_type": "code", "execution_count": 27, "id": "6868d560", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance') " ] }, { "cell_type": "code", "execution_count": 28, "id": "46ad3795", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_modeldrift_2008.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_house_price_modeldrift_2008.html', \n", " title_story=\"Model drift\",\n", " title_description=\"\"\"House price model drift 2008\"\"\", # Optional: add a subtitle to describe report\n", " project_info_file=\"../../eurybia/data/project_info_house_price.yml\" # Optional: add information on report\n", " )" ] }, { "cell_type": "markdown", "id": "78b3758e", "metadata": {}, "source": [ "### Compile Drift and generate report for Year 2009" ] }, { "cell_type": "code", "execution_count": 29, "id": "c782c7de", "metadata": {}, "outputs": [], "source": [ "house_df_2009 = house_df.loc[house_df['YrSold'] == 2009]\n", "\n", "y_df_2009=house_df_2009['SalePrice'].to_frame()\n", "X_df_2009=house_df_2009[house_df_2009.columns.difference(['SalePrice','YrSold'])]" ] }, { "cell_type": "code", "execution_count": 30, "id": "854430e6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.031778" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2009_encode=encoder.transform(X_df_2009)\n", "y_pred_2009 = regressor.predict(df_2009_encode)\n", "performance_2009 = mean_squared_log_error(y_df_2009, y_pred_2009).round(6)\n", "performance_2009" ] }, { "cell_type": "code", "execution_count": 31, "id": "f4d82f70", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2009,\n", " df_baseline=X_df_learning,\n", " deployed_model=regressor, # Optional: put in perspective result with importance on deployed model\n", " encoding=encoder # Optional: if deployed_model and encoder to use this model\n", " )" ] }, { "cell_type": "code", "execution_count": 32, "id": "be02b63f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The variable BsmtCond has mismatching unique values:\n", "['Poor -Severe cracking, settling, or wetness'] | []\n", "\n", "The variable 
Condition1 has mismatching unique values:\n", "[] | ['Adjacent to East-West Railroad']\n", "\n", "The variable Condition2 has mismatching unique values:\n", "['Adjacent to arterial street'] | []\n", "\n", "The variable Electrical has mismatching unique values:\n", "[] | ['60 AMP Fuse Box and mostly knob & tube wiring (poor)']\n", "\n", "The variable ExterCond has mismatching unique values:\n", "['Excellent'] | []\n", "\n", "The variable ExterQual has mismatching unique values:\n", "['Fair'] | []\n", "\n", "The variable Exterior1st has mismatching unique values:\n", "['Brick Common', 'Cinder Block'] | ['Stone', 'Imitation Stucco']\n", "\n", "The variable Exterior2nd has mismatching unique values:\n", "['Brick Common', 'Cinder Block'] | ['Other']\n", "\n", "The variable Functional has mismatching unique values:\n", "['Major Deductions 2'] | []\n", "\n", "The variable GarageCond has mismatching unique values:\n", "['Excellent'] | ['Good']\n", "\n", "The variable GarageQual has mismatching unique values:\n", "[] | ['Poor']\n", "\n", "The variable GarageType has mismatching unique values:\n", "[] | ['More than one type of garage']\n", "\n", "The variable LotConfig has mismatching unique values:\n", "[] | ['Frontage on 3 sides of property']\n", "\n", "The variable MSSubClass has mismatching unique values:\n", "['1-Story w/Finished Attic All Ages'] | []\n", "\n", "The variable Neighborhood has mismatching unique values:\n", "['Northpark Villa', 'Bluestem'] | ['Veenker']\n", "\n", "The variable RoofMatl has mismatching unique values:\n", "[] | ['Metal', 'Wood Shakes']\n", "\n", "The variable RoofStyle has mismatching unique values:\n", "['Mansard'] | []\n", "\n", "The variable SaleCondition has mismatching unique values:\n", "[] | ['Adjoining Land Purchase']\n", "\n", "The variable SaleType has mismatching unique values:\n", "[] | ['Other']\n", "\n", "The variable Utilities has mismatching unique values:\n", "['Electricity and Gas Only'] | []\n", "\n", "The computed AUC on 
the X_test used to build datadrift_classifier is equal to: 0.5405695039804042\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", " date_compile_auc = '01/01/2009', # Optional: useful when computing the drift for a time that is not now\n", " datadrift_file = \"house_price_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", " )" ] }, { "cell_type": "code", "execution_count": null, "id": "f58ca3b1", "metadata": {}, "outputs": [], "source": [ "# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n", "df_performance = pd.concat([df_performance, pd.DataFrame([{'annee': 2009, 'mois': 1, 'performance': performance_2009}])], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 34, "id": "a14df209", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance') " ] }, { "cell_type": "code", "execution_count": 35, "id": "c54b73eb", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_modeldrift_2009.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_house_price_modeldrift_2009.html', \n", " title_story=\"Model drift\",\n", " title_description=\"\"\"House price model drift 2009\"\"\", # Optional: add a subtitle to describe report\n", " project_info_file=\"../../eurybia/data/project_info_house_price.yml\" # Optional: add information on report \n", " )" ] }, { "cell_type": "markdown", "id": "7701d3d9", "metadata": {}, "source": [ "### Compile Drift and generate report for Year 2010" ] }, { "cell_type": "code", "execution_count": 36, "id": "32b79b14", "metadata": {}, "outputs": [], "source": [ "house_df_2010 = house_df.loc[house_df['YrSold'] == 2010]\n", "\n", "y_df_2010=house_df_2010['SalePrice'].to_frame()\n", "X_df_2010=house_df_2010[house_df_2010.columns.difference(['SalePrice','YrSold'])]" ] }, { "cell_type": "code", "execution_count": 37, "id": "78d982b3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.023441" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2010_encode=encoder.transform(X_df_2010)\n", "y_pred_2010 = regressor.predict(df_2010_encode)\n", "performance_2010 = mean_squared_log_error(y_df_2010, y_pred_2010).round(6)\n", "performance_2010" ] }, { "cell_type": "code", "execution_count": null, "id": "3edb53b5", "metadata": {}, "outputs": [], "source": [ "# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n", "df_performance = pd.concat([df_performance, pd.DataFrame([{'annee': 2010, 'mois': 1, 'performance': performance_2010}])], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 39, "id": "13d0e1c8", "metadata": {}, "outputs": [], "source": [ "SD = SmartDrift(df_current=X_df_2010,\n", " df_baseline=X_df_learning,\n", " deployed_model=regressor, # Optional: put in perspective result with importance on deployed model\n", " encoding=encoder # Optional: if deployed_model and encoder to use this model\n", " )" ] }, { "cell_type": "code", "execution_count": 40, "id": 
"1157cabb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The variable Condition1 has mismatching unique values:\n", "[\"Within 200' of East-West Railroad\"] | []\n", "\n", "The variable Electrical has mismatching unique values:\n", "[] | ['60 AMP Fuse Box and mostly knob & tube wiring (poor)']\n", "\n", "The variable ExterCond has mismatching unique values:\n", "['Poor'] | []\n", "\n", "The variable ExterQual has mismatching unique values:\n", "['Fair'] | []\n", "\n", "The variable Exterior1st has mismatching unique values:\n", "['Asphalt Shingles'] | ['Stone', 'Imitation Stucco']\n", "\n", "The variable Exterior2nd has mismatching unique values:\n", "['Asphalt Shingles', 'Brick Common'] | ['Other', 'Stone']\n", "\n", "The variable Functional has mismatching unique values:\n", "[] | ['Major Deductions 1']\n", "\n", "The variable GarageCond has mismatching unique values:\n", "[] | ['Poor', 'Good']\n", "\n", "The variable GarageQual has mismatching unique values:\n", "[] | ['Good', 'Excellent', 'Poor']\n", "\n", "The variable GarageType has mismatching unique values:\n", "[] | ['More than one type of garage']\n", "\n", "The variable Heating has mismatching unique values:\n", "[] | ['Gas hot water or steam heat', 'Wall furnace']\n", "\n", "The variable HouseStyle has mismatching unique values:\n", "[] | ['Two and one-half story: 2nd level finished', 'One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished']\n", "\n", "The variable LotConfig has mismatching unique values:\n", "[] | ['Frontage on 3 sides of property']\n", "\n", "The variable LotShape has mismatching unique values:\n", "[] | ['Irregular']\n", "\n", "The variable MSSubClass has mismatching unique values:\n", "['1-Story w/Finished Attic All Ages'] | ['2-1/2 Story All Ages', '1-1/2 Story - Unfinished All Ages']\n", "\n", "The variable MSZoning has mismatching unique values:\n", "[] | ['Residential High Density']\n", "\n", "The 
variable Neighborhood has mismatching unique values:\n", "['Northpark Villa'] | ['Veenker']\n", "\n", "The variable RoofMatl has mismatching unique values:\n", "[] | ['Wood Shingles', 'Metal', 'Gravel & Tar']\n", "\n", "The variable RoofStyle has mismatching unique values:\n", "['Mansard', 'Shed'] | ['Flat']\n", "\n", "The variable SaleCondition has mismatching unique values:\n", "[] | ['Adjoining Land Purchase']\n", "\n", "The variable SaleType has mismatching unique values:\n", "['Contract 15% Down payment regular terms'] | ['Contract Low Down', 'Contract Low Down payment and low interest', 'Other']\n", "\n", "The variable Street has mismatching unique values:\n", "['Gravel'] | []\n", "\n", "The computed AUC on the X_test used to build datadrift_classifier is equal to: 0.6978632478632478\n" ] } ], "source": [ "SD.compile(full_validation=True, # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.\n", " date_compile_auc = '01/01/2010', # Optional: useful when computing the drift for a time that is not now\n", " datadrift_file = \"house_price_auc.csv\" # Optional: name of the csv file that contains the performance history of data drift\n", " )" ] }, { "cell_type": "code", "execution_count": 41, "id": "a2c985d0", "metadata": {}, "outputs": [], "source": [ "SD.add_data_modeldrift(dataset=df_performance,metric='performance') " ] }, { "cell_type": "code", "execution_count": 42, "id": "5651d11a", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_modeldrift_2010.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report( \n", " output_file='report_house_price_modeldrift_2010.html', \n", " title_story=\"Model drift\",\n", " title_description=\"\"\"House price model drift 2010\"\"\",\n", " project_info_file=\"../../eurybia/data/project_info_house_price.yml\" \n", " )" ] }, { "cell_type": "code", "execution_count": 43, "id": "c143a5a2", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.plot.generate_modeldrift_data() # works if add_data_modeldrift used before " ] }, { "cell_type": "markdown", "id": "7bb69515", "metadata": {}, "source": [ "----" ] } ], "metadata": { "interpreter": { "hash": "d08e6294e2d60f50397263035a337d71f3055486232bc02b45ce2785f62e7d8b" }, "kernelspec": { "display_name": "dev_eurybia", "language": "python", "name": "dev_eurybia" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 5 }