Use Shapash Webapp with Eurybia
In this tutorial, you will learn how to use Eurybia and the Shapash webapp to understand your datadrift classifier.
Contents:
- Build a model to deploy
- Do data validation between learning dataset and production dataset
- Generate Report
- Run Webapp
Data from Kaggle Titanic
Requirements notice: this tutorial may use third-party modules that are not included in Eurybia.
You can find them all in one file on our GitHub repository, or you can manually install any that you are missing.
[1]:
from category_encoders import OrdinalEncoder
import catboost
from sklearn.model_selection import train_test_split
Building a Supervised Model
[3]:
from eurybia.data.data_loader import data_loading
[4]:
titan_df = data_loading('titanic')
[5]:
features = ['Pclass', 'Age', 'Embarked', 'Sex', 'SibSp', 'Parch', 'Fare']
features_to_encode = ['Pclass', 'Embarked', 'Sex']
[6]:
encoder = OrdinalEncoder(cols=features_to_encode)
encoder.fit(titan_df[features], verbose=False)
[6]:
OrdinalEncoder(cols=['Pclass', 'Embarked', 'Sex'],
mapping=[{'col': 'Pclass', 'data_type': dtype('O'),
'mapping': Third class 1
First class 2
Second class 3
NaN -2
dtype: int64},
{'col': 'Embarked', 'data_type': dtype('O'),
'mapping': Southampton 1
Cherbourg 2
Queenstown 3
NaN -2
dtype: int64},
{'col': 'Sex', 'data_type': dtype('O'),
'mapping': male 1
female 2
NaN -2
dtype: int64}])
[7]:
titan_df_encoded = encoder.transform(titan_df[features])
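To make the mapping printed above concrete, here is a minimal pure-Python sketch of an ordinal encoding. It is not the `category_encoders` implementation. It assumes that codes are assigned in order of first appearance, which is consistent with the output shown above, and it simplifies by sending every value that is absent from the mapping (including missing values) to `-2`. The helper names `build_ordinal_mapping` and `encode` are hypothetical.

```python
# Illustrative sketch only: NOT the category_encoders code.
def build_ordinal_mapping(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode(values, mapping, missing_code=-2):
    """Replace each category by its integer code.

    Simplification (assumption): any value absent from the mapping,
    including None, receives missing_code.
    """
    return [mapping.get(value, missing_code) for value in values]


# Toy column with the same category order as the Titanic 'Pclass' output above.
pclass = ["Third class", "First class", "Third class", None, "Second class"]
pclass_mapping = build_ordinal_mapping(pclass)
pclass_codes = encode(pclass, pclass_mapping)

print(pclass_mapping)
print(pclass_codes)
```

The integer codes carry no numeric meaning of their own, which is why the categorical columns are later declared to CatBoost through `cat_features`.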
[8]:
X_train, X_test, y_train, y_test = train_test_split(
    titan_df_encoded,
    titan_df['Survived'].to_frame(),
    test_size=0.2,
    random_state=11
)
[9]:
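# Collect the column positions of the categorical features for CatBoost
i = 0
indice_cat = []
for feature in titan_df_encoded:  # iterating a DataFrame yields its column names
    if feature in features_to_encode:
        indice_cat.append(i)
    i = i + 1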
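The same positions can be obtained with `enumerate`. The sketch below works on the plain `features` list rather than on the DataFrame, assuming that `titan_df_encoded` keeps its columns in the order of `features`. The name `indice_cat_check` is introduced here for illustration only.

```python
features = ['Pclass', 'Age', 'Embarked', 'Sex', 'SibSp', 'Parch', 'Fare']
features_to_encode = ['Pclass', 'Embarked', 'Sex']

# Position of every categorical feature within the feature list
indice_cat_check = [
    position
    for position, name in enumerate(features)
    if name in features_to_encode
]

print(indice_cat_check)  # [0, 2, 3]
```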
[10]:
model = catboost.CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="Logloss",
    learning_rate=0.143852,
    iterations=500,
    l2_leaf_reg=15,
    max_depth=4
)
[11]:
train_pool_cat = catboost.Pool(data=X_train, label=y_train, cat_features=indice_cat)
test_pool_cat = catboost.Pool(data=X_test, label=y_test, cat_features=indice_cat)
[12]:
model.fit(train_pool_cat, eval_set=test_pool_cat, silent=True)
y_pred = model.predict(X_test)
Creating a fake dataset as a production dataset
[13]:
import random
[14]:
df_production = titan_df.copy()
[15]:
df_production['Age'] = df_production['Age'].apply(lambda x: random.randrange(10, 76)).astype(float)
df_production['Fare'] = df_production['Fare'].apply(lambda x: random.randrange(1, 100)).astype(float)
list_sex= ["male", "female"]
df_production['Sex'] = df_production['Sex'].apply(lambda x: random.choice(list_sex))
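The cell above draws from the global `random` module without a seed, so the fake production values change on every run and will not match the tables displayed below. If you want reproducible values, one option is a dedicated seeded generator. This is a minimal sketch: the helper `fake_ages` and the seed value `11` are illustrative choices, not part of the original notebook.

```python
import random


def fake_ages(n_rows, seed):
    """Draw n_rows ages in [10, 75] from a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    return [float(rng.randrange(10, 76)) for _ in range(n_rows)]


first_run = fake_ages(5, seed=11)
second_run = fake_ages(5, seed=11)

# Same seed, same values: the fake dataset is now reproducible
print(first_run == second_run)
```

Calling `random.seed(...)` once before the cell above would also make it reproducible, at the cost of touching the global generator.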
[16]:
df_baseline = titan_df[features]
df_current = df_production[features]
[17]:
df_current.head()
[17]:
| PassengerId | Pclass | Age | Embarked | Sex | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 1 | Third class | 36.0 | Southampton | male | 1 | 0 | 57.0 |
| 2 | First class | 11.0 | Cherbourg | female | 1 | 0 | 94.0 |
| 3 | Third class | 12.0 | Southampton | female | 0 | 0 | 25.0 |
| 4 | First class | 60.0 | Southampton | female | 1 | 0 | 94.0 |
| 5 | Third class | 22.0 | Southampton | female | 0 | 0 | 84.0 |
[18]:
df_baseline.head()
[18]:
| PassengerId | Pclass | Age | Embarked | Sex | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 1 | Third class | 22.0 | Southampton | male | 1 | 0 | 7.25 |
| 2 | First class | 38.0 | Cherbourg | female | 1 | 0 | 71.28 |
| 3 | Third class | 26.0 | Southampton | female | 0 | 0 | 7.92 |
| 4 | First class | 35.0 | Southampton | female | 1 | 0 | 53.10 |
| 5 | Third class | 35.0 | Southampton | male | 0 | 0 | 8.05 |
Use Eurybia for data validation
[19]:
from eurybia import SmartDrift
[20]:
sd = SmartDrift(df_current=df_current, df_baseline=df_baseline, deployed_model=model, encoding=encoder)
[21]:
%time sd.compile(full_validation=True)
INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x11d28cad0>
CPU times: user 3.6 s, sys: 690 ms, total: 4.29 s
Wall time: 848 ms
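For intuition about the datadrift classifier mentioned in the introduction: the assumption here, which this page does not spell out, is that it is a binary classifier trained to tell baseline rows from current rows, with an AUC close to 0.5 meaning the two datasets are hard to distinguish. The sketch below does not use Eurybia or a fitted model. It swaps in a much simpler technique, the pairwise (Mann-Whitney) AUC of a single feature, to show how an AUC can quantify drift. The helper `pairwise_auc` is hypothetical.

```python
def pairwise_auc(baseline_scores, current_scores):
    """AUC of the rule 'a higher score means the row is from the current dataset'.

    Computed as the fraction of (current, baseline) pairs in which the
    current value is larger, counting ties as half a win.
    """
    wins = 0.0
    for current_value in current_scores:
        for baseline_value in baseline_scores:
            if current_value > baseline_value:
                wins += 1.0
            elif current_value == baseline_value:
                wins += 0.5
    return wins / (len(current_scores) * len(baseline_scores))


# 'Fare' values copied from the first five rows displayed in this tutorial
baseline_fare = [7.25, 71.28, 7.92, 53.10, 8.05]
current_fare = [57.0, 94.0, 25.0, 94.0, 84.0]

drift_auc = pairwise_auc(baseline_fare, current_fare)
no_drift_auc = pairwise_auc(baseline_fare, baseline_fare)

print(f"Fare, baseline vs current: {drift_auc:.2f}")
print(f"Fare, baseline vs itself:  {no_drift_auc:.2f}")
```

On these five rows the shifted `Fare` values are easy to separate from the baseline, while comparing the baseline with itself gives exactly 0.5.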
[22]:
sd.generate_report(
output_file='report_titanic.html',
title_story="Data validation",
title_description="""Titanic Data validation"""
)
Launch the Shapash WebApp from SmartDrift
After the compile step, you can launch a Shapash WebApp directly from your SmartDrift object. It gives you access to several dynamic plots that help you understand where drift has been detected in your data. For more information on the Shapash WebApp, see https://github.com/MAIF/shapash
[23]:
app = sd.xpl.run_app(title_story='Eurybia datadrift classifier')
INFO:root:Your Shapash application run on http://PMP01204:8050/
INFO:root:Use the method .kill() to down your app.
Stop the WebApp when you are done with it.
[24]:
app.kill()