Eurybia - Overview

This tutorial shows how Eurybia works through a simple use case: detecting data drift on a house price dataset.

Contents:
- Compile Eurybia
- Generate report

For more detailed tutorials, see:
- Data validation: https://github.com/MAIF/eurybia/tree/master/tutorial/data_validation
- Data drift: https://github.com/MAIF/eurybia/tree/master/tutorial/data_drift
- Model drift: https://github.com/MAIF/eurybia/tree/master/tutorial/model_drift

Requirements notice: this tutorial may use third-party modules that are not included in Eurybia.

You can find them all in one file on our GitHub repository, or you can manually install any that you are missing.

[2]:

from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from eurybia import SmartDrift
from sklearn.model_selection import train_test_split

Import the dataset and split it into training and production datasets

[3]:

from eurybia.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')

[4]:

# Treat the "YrSold" column as the reference date.
# A model was trained on the 2006 data; in 2007, we want to detect data drift
# on the new production data used to predict house prices.
house_df_learning = house_df.loc[house_df['YrSold'] == 2006]
house_df_2007 = house_df.loc[house_df['YrSold'] == 2007]
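
The year-based split above relies on boolean masks. As a minimal sketch on a toy frame (the values below are made up for illustration and do not come from the house prices data), the same pattern looks like this:

```python
import pandas as pd

# Toy data standing in for house_df; values are illustrative only.
toy_df = pd.DataFrame({
    "YrSold": [2006, 2006, 2007, 2008],
    "SalePrice": [200000, 150000, 180000, 210000],
})

# A boolean mask keeps only the rows whose reference year matches.
toy_learning = toy_df.loc[toy_df["YrSold"] == 2006]
toy_2007 = toy_df.loc[toy_df["YrSold"] == 2007]

# Checking the sizes guards against an empty baseline or current dataset.
print(len(toy_learning), len(toy_2007))
```

Printing the two lengths is a cheap sanity check before going further, since drift detection needs rows on both sides of the comparison.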

[5]:

y_df_learning=house_df_learning['SalePrice'].to_frame()
X_df_learning=house_df_learning[house_df_learning.columns.difference(['SalePrice','YrSold'])]

y_df_2007=house_df_2007['SalePrice'].to_frame()
X_df_2007=house_df_2007[house_df_2007.columns.difference(['SalePrice','YrSold'])]
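
One detail worth knowing about the cell above: pandas' `Index.difference` returns its result sorted by default, so the feature columns come out in alphabetical order rather than in their original order. Because the same expression is applied to both frames, the learning and 2007 features end up with the same column order. A small sketch on a hypothetical toy frame (column values are illustrative only):

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    "YrSold": [2006, 2007],
    "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
    "BldgType": ["1Fam", "Duplex"],
})

# Drop the target and the reference date; the remaining names come back sorted.
feature_cols = toy_df.columns.difference(["SalePrice", "YrSold"])
X_toy = toy_df[feature_cols]

# to_frame() keeps the target as a one-column DataFrame instead of a Series.
y_toy = toy_df["SalePrice"].to_frame()

print(list(X_toy.columns))
```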

Building a Supervised Model

[ ]:

from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df_learning)

X_df_learning_encoded=encoder.transform(X_df_learning)
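
To make the encoding step concrete, here is a hand-rolled illustration of what ordinal encoding does in principle: each category seen at fit time is mapped to an integer. This is a sketch, not the `category_encoders` implementation. The helper names are hypothetical, and mapping unseen categories to -1 is an assumption made for this example; the library's actual treatment of unknown values depends on its `handle_unknown` setting and version.

```python
def fit_ordinal_mapping(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def apply_ordinal_mapping(values, mapping, unknown=-1):
    """Encode values; categories unseen at fit time get `unknown` (assumed -1)."""
    return [mapping.get(value, unknown) for value in values]


train_categories = ["1Fam", "Duplex", "1Fam", "TwnhsE"]
new_categories = ["Duplex", "2fmCon"]  # "2fmCon" was not seen at fit time

# Fit on the learning data only, then reuse the same mapping on new data.
mapping = fit_ordinal_mapping(train_categories)
encoded_train = apply_ordinal_mapping(train_categories, mapping)
encoded_new = apply_ordinal_mapping(new_categories, mapping)
print(encoded_train, encoded_new)
```

Fitting on the learning data and reusing the mapping mirrors the cell above, where the encoder is fitted on `X_df_learning` and later passed to Eurybia.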

[7]:

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)

[8]:

regressor = LGBMRegressor(n_estimators=200).fit(Xtrain,ytrain)

Use Eurybia for data drift

[9]:

from eurybia import SmartDrift

[10]:

SD = SmartDrift(df_current=X_df_2007,
                df_baseline=X_df_learning,
                deployed_model=regressor, # Optional: puts drift results in perspective with the deployed model's feature importance
                encoding=encoder, # Optional: encoder used with deployed_model, if the model needs one
                dataset_names={"df_current": "2007 dataset", "df_baseline": "Learning dataset"} # Optional: Names for outputs
               )

[11]:

%time SD.compile()

CPU times: user 2min 23s, sys: 32.1 s, total: 2min 55s
Wall time: 10.5 s
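
As background for what `compile()` computes: the Eurybia documentation describes drift detection as training a binary "datadrift classifier" that tries to tell baseline rows from current rows, and the classifier's AUC serves as the drift measure (a value near 0.5 means the two datasets are hard to tell apart). The sketch below is not Eurybia's implementation. It replaces the classifier with a single numeric feature used directly as the score, on synthetic data, to show how the AUC behaves with and without a shift. All names and values are assumptions for illustration.

```python
import random


def auc_from_scores(baseline_scores, current_scores):
    """Probability that a current row scores higher than a baseline row.

    Ties count as one half (Mann-Whitney formulation of the AUC).
    """
    wins = 0.0
    for c in current_scores:
        for b in baseline_scores:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(current_scores) * len(baseline_scores))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
baseline = [rng.gauss(0.0, 1.0) for _ in range(400)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(400)]  # no drift
shifted = [rng.gauss(1.5, 1.0) for _ in range(400)]    # mean has drifted

auc_no_drift = auc_from_scores(baseline, same_dist)
auc_drift = auc_from_scores(baseline, shifted)
print(f"no drift: {auc_no_drift:.2f}, drift: {auc_drift:.2f}")
```

The first AUC should land close to 0.5 and the second well above it, which is the intuition behind reading a classifier's AUC as a drift signal.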

[12]:

SD.generate_report(
    output_file='report_house_price_datadrift_2007.html',
    title_story="Data drift",
    title_description="""House price Data drift 2007""", # Optional: add a subtitle to describe report
    project_info_file="../eurybia/data/project_info_house_price.yml" # Optional: add information on report
    )

Report saved to ./report_house_price_datadrift_2007.html. To upload and share your report, create a free Datapane account by running !datapane signup.