{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "983686e9",
   "metadata": {},
   "source": [
    "# Eurybia - Overview\n",
    "This tutorial walks through a simple use case to show how Eurybia works.\n",
    "\n",
    "Contents:\n",
    "- Compile Eurybia\n",
    "- Generate report\n",
    "\n",
    "More detailed tutorials are available on:\n",
    "- [Data validation](https://github.com/MAIF/eurybia/tree/master/tutorial/data_validation)\n",
    "- [Data drift](https://github.com/MAIF/eurybia/tree/master/tutorial/data_drift)\n",
    "- [Model drift](https://github.com/MAIF/eurybia/tree/master/tutorial/model_drift)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9524ace9",
   "metadata": {},
   "source": [
    "**Requirements notice**: this tutorial may use third-party modules that are not included in Eurybia.  \n",
    "You can find them all in one file [on our GitHub repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt), or you can manually install any that are missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f8489bfa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from category_encoders import OrdinalEncoder\n",
    "from lightgbm import LGBMRegressor\n",
    "from eurybia import SmartDrift\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29ec936f",
   "metadata": {},
   "source": [
    "## Import the dataset and split it into training and production datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3cb3a493",
   "metadata": {},
   "outputs": [],
   "source": [
    "from eurybia.data.data_loader import data_loading\n",
    "house_df, house_dict = data_loading('house_prices')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "019c6396",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Treat the \"YrSold\" column as the reference date.\n",
    "# A model was trained on the 2006 data. In 2007, we want to detect data drift\n",
    "# in the new production data used to predict house prices.\n",
    "house_df_learning = house_df.loc[house_df['YrSold'] == 2006]\n",
    "house_df_2007 = house_df.loc[house_df['YrSold'] == 2007]"
   ]
  },
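  {
   "cell_type": "markdown",
   "id": "b7e2d4a1",
   "metadata": {},
   "source": [
    "As a quick sanity check on the split above, the cell below prints the size of each yearly subset. It is only an illustrative sketch and assumes `house_df_learning` and `house_df_2007` were created by the previous cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3f09e56",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative check (not part of the original workflow): confirm both subsets are non-empty\n",
    "for name, df in [('2006 (learning)', house_df_learning), ('2007 (production)', house_df_2007)]:\n",
    "    print(f'{name}: {df.shape[0]} rows, {df.shape[1]} columns')"
   ]
  },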
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4bda0775",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_df_learning = house_df_learning['SalePrice'].to_frame()\n",
    "X_df_learning = house_df_learning[house_df_learning.columns.difference(['SalePrice', 'YrSold'])]\n",
    "\n",
    "y_df_2007 = house_df_2007['SalePrice'].to_frame()\n",
    "X_df_2007 = house_df_2007[house_df_2007.columns.difference(['SalePrice', 'YrSold'])]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e294d0b5",
   "metadata": {},
   "source": [
    "## Building a Supervised Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ca7381d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode every object-typed (categorical) column as integers\n",
    "categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']\n",
    "\n",
    "# Fit the encoder on the learning data only, so it can be reused on production data\n",
    "encoder = OrdinalEncoder(\n",
    "    cols=categorical_features,\n",
    "    handle_unknown='ignore',\n",
    "    return_df=True).fit(X_df_learning)\n",
    "\n",
    "X_df_learning_encoded = encoder.transform(X_df_learning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8ba398ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2dc04f3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor = LGBMRegressor(n_estimators=200).fit(Xtrain, ytrain)"
   ]
  },
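  {
   "cell_type": "markdown",
   "id": "d81a6b3f",
   "metadata": {},
   "source": [
    "The split above also yields a held-out set (`Xtest`, `ytest`) that the notebook does not otherwise use. As a minimal sketch, the next cell scores the fitted model on it with the standard scikit-learn `score` method, which returns the R2 coefficient for regressors. This check is an illustrative addition, not part of the Eurybia workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e46f2a97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative addition: R2 of the fitted regressor on the held-out 2006 data\n",
    "r2_test = regressor.score(Xtest, ytest.values.ravel())\n",
    "print(f'R2 on held-out 2006 data: {r2_test:.3f}')"
   ]
  },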
  {
   "cell_type": "markdown",
   "id": "12b12535",
   "metadata": {},
   "source": [
    "## Use Eurybia for data drift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "493030c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "SD = SmartDrift(df_current=X_df_2007,\n",
    "                df_baseline=X_df_learning,\n",
    "                deployed_model=regressor, # Optional: puts drift results in perspective with feature importance in the deployed model\n",
    "                encoding=encoder, # Optional: encoder needed to apply deployed_model, if one is provided\n",
    "                dataset_names={\"df_current\": \"2007 dataset\", \"df_baseline\": \"Learning dataset\"} # Optional: dataset names shown in outputs\n",
    "               )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5c51a243",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2min 23s, sys: 32.1 s, total: 2min 55s\n",
      "Wall time: 10.5 s\n"
     ]
    }
   ],
   "source": [
    "%time SD.compile()"
   ]
  },
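  {
   "cell_type": "markdown",
   "id": "f05c8e21",
   "metadata": {},
   "source": [
    "Once `compile()` has finished, the drift measure can be inspected directly in the notebook before generating the report. The sketch below assumes the installed Eurybia version exposes the `auc` attribute (the AUC of the classifier Eurybia trains to tell the two datasets apart) and the `plot.scatter_feature_importance()` helper used in the detailed data drift tutorial. Check the API of your version if either is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a92d7c40",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumption: SmartDrift stores the datadrift classifier AUC in `auc` after compile()\n",
    "# An AUC close to 0.5 suggests the two datasets are hard to tell apart (little drift)\n",
    "print(f'Datadrift classifier AUC: {SD.auc:.3f}')\n",
    "\n",
    "# Assumption: this helper exists in your version and requires deployed_model to be set\n",
    "# It relates each feature's importance for drift to its importance in the deployed model\n",
    "SD.plot.scatter_feature_importance()"
   ]
  },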
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ead7d949",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Report saved to ./report_house_price_datadrift_2007.html. To upload and share your report, create a free Datapane account by running `!datapane signup`."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "SD.generate_report(\n",
    "    output_file='report_house_price_datadrift_2007.html',\n",
    "    title_story=\"Data drift\",\n",
    "    title_description=\"House price Data drift 2007\",  # Optional: subtitle describing the report\n",
    "    project_info_file=\"../eurybia/data/project_info_house_price.yml\"  # Optional: YAML file with project information to include in the report\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "d08e6294e2d60f50397263035a337d71f3055486232bc02b45ce2785f62e7d8b"
  },
  "kernelspec": {
   "display_name": "dev_eurybia",
   "language": "python",
   "name": "dev_eurybia"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}