{ "cells": [ { "cell_type": "markdown", "id": "983686e9", "metadata": {}, "source": [ "# Eurybia - Overview\n", "This tutorial shows how Eurybia works through a simple use case.\n", "\n", "Contents:\n", "- Compile Eurybia\n", "- Generate report\n", "\n", "For more detailed tutorials, see:\n", "- [Data validation](https://github.com/MAIF/eurybia/tree/master/tutorial/data_validation)\n", "- [Data drift](https://github.com/MAIF/eurybia/tree/master/tutorial/data_drift)\n", "- [Model drift](https://github.com/MAIF/eurybia/tree/master/tutorial/model_drift)" ] }, { "cell_type": "markdown", "id": "9524ace9", "metadata": {}, "source": [ "**Requirements notice**: this tutorial may use third-party modules that are not included in Eurybia.\n", "You can find them all in one file [on our GitHub repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt), or you can manually install any that you are missing." ] }, { "cell_type": "code", "execution_count": 2, "id": "f8489bfa", "metadata": {}, "outputs": [], "source": [ "from category_encoders import OrdinalEncoder\n", "from lightgbm import LGBMRegressor\n", "from eurybia import SmartDrift\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "29ec936f", "metadata": {}, "source": [ "## Import the dataset and split it into training and production datasets" ] }, { "cell_type": "code", "execution_count": 3, "id": "3cb3a493", "metadata": {}, "outputs": [], "source": [ "from eurybia.data.data_loader import data_loading\n", "house_df, house_dict = data_loading('house_prices')" ] }, { "cell_type": "code", "execution_count": 4, "id": "019c6396", "metadata": {}, "outputs": [], "source": [ "# Let us consider that the column \"YrSold\" corresponds to the reference date.\n", "# In 2006, a model was trained on the 2006 data. 
In 2007, we want to detect data drift in the new production data used to predict\n", "# house prices.\n", "house_df_learning = house_df.loc[house_df['YrSold'] == 2006]\n", "house_df_2007 = house_df.loc[house_df['YrSold'] == 2007]" ] }, { "cell_type": "code", "execution_count": 5, "id": "4bda0775", "metadata": {}, "outputs": [], "source": [ "y_df_learning = house_df_learning['SalePrice'].to_frame()\n", "X_df_learning = house_df_learning[house_df_learning.columns.difference(['SalePrice', 'YrSold'])]\n", "\n", "y_df_2007 = house_df_2007['SalePrice'].to_frame()\n", "X_df_2007 = house_df_2007[house_df_2007.columns.difference(['SalePrice', 'YrSold'])]" ] }, { "cell_type": "markdown", "id": "e294d0b5", "metadata": {}, "source": [ "## Building a Supervised Model" ] }, { "cell_type": "code", "execution_count": null, "id": "2ca7381d", "metadata": {}, "outputs": [], "source": [ "from category_encoders import OrdinalEncoder\n", "\n", "categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']\n", "\n", "encoder = OrdinalEncoder(\n", "    cols=categorical_features,\n", "    handle_unknown='ignore',\n", "    return_df=True).fit(X_df_learning)\n", "\n", "X_df_learning_encoded = encoder.transform(X_df_learning)" ] }, { "cell_type": "code", "execution_count": 7, "id": "8ba398ad", "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)" ] }, { "cell_type": "code", "execution_count": 8, "id": "2dc04f3e", "metadata": {}, "outputs": [], "source": [ "regressor = LGBMRegressor(n_estimators=200).fit(Xtrain, ytrain)" ] }, { "cell_type": "markdown", "id": "12b12535", "metadata": {}, "source": [ "## Use Eurybia for data drift" ] }, { "cell_type": "code", "execution_count": 9, "id": "9faf9a5e", "metadata": {}, "outputs": [], "source": [ "from eurybia import SmartDrift" ] }, { "cell_type": "code", "execution_count": 10, "id": "493030c9", "metadata": {}, "outputs": [], 
"source": [ "SD = SmartDrift(\n", "    df_current=X_df_2007,\n", "    df_baseline=X_df_learning,\n", "    deployed_model=regressor,  # Optional: puts the drift results in perspective with the deployed model's feature importance\n", "    encoding=encoder,  # Optional: encoder used to prepare the data for deployed_model\n", "    dataset_names={\"df_current\": \"2007 dataset\", \"df_baseline\": \"Learning dataset\"},  # Optional: names used in the outputs\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "5c51a243", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 23s, sys: 32.1 s, total: 2min 55s\n", "Wall time: 10.5 s\n" ] } ], "source": [ "%time SD.compile()" ] }, { "cell_type": "code", "execution_count": 12, "id": "ead7d949", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Report saved to ./report_house_price_datadrift_2007.html. To upload and share your report, create a free Datapane account by running `!datapane signup`." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SD.generate_report(\n", "    output_file='report_house_price_datadrift_2007.html',\n", "    title_story=\"Data drift\",\n", "    title_description=\"\"\"House price Data drift 2007\"\"\",  # Optional: subtitle describing the report\n", "    project_info_file=\"../eurybia/data/project_info_house_price.yml\",  # Optional: project information to include in the report\n", ")" ] } ], "metadata": { "interpreter": { "hash": "d08e6294e2d60f50397263035a337d71f3055486232bc02b45ce2785f62e7d8b" }, "kernelspec": { "display_name": "dev_eurybia", "language": "python", "name": "dev_eurybia" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", 
"title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }