Iterate on Data validation with display analysis

With this tutorial you will understand how to use Eurybia to iterate on the different phases of data validation. We go into more detail about the use of Eurybia.

Contents:
- Do data validation
- Generate Report
- Iterate on analysis of results, data validation, data preparation

Data from Kaggle House Prices

[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from eurybia.core.smartdrift import SmartDrift
from sklearn.model_selection import train_test_split

Import the dataset and split it into training and production datasets

[2]:
from eurybia.data.data_loader import data_loading
[3]:
house_df, house_dict = data_loading('house_prices')
[4]:
house_df.tail()
[4]:
MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1456 2-Story 1946 & Newer Residential Low Density 7917 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope Gilbert ... 0 0 0 0 0 8 2007 Warranty Deed - Conventional Normal Sale 175000
1457 1-Story 1946 & Newer All Styles Residential Low Density 13175 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope Northwest Ames ... 0 0 0 0 0 2 2010 Warranty Deed - Conventional Normal Sale 210000
1458 2-Story 1945 & Older Residential Low Density 9042 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope Crawford ... 0 0 0 0 2500 5 2010 Warranty Deed - Conventional Normal Sale 266500
1459 1-Story 1946 & Newer All Styles Residential Low Density 9717 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope North Ames ... 112 0 0 0 0 4 2010 Warranty Deed - Conventional Normal Sale 142125
1460 1-Story 1946 & Newer All Styles Residential Low Density 9937 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope Edwards ... 0 0 0 0 0 6 2008 Warranty Deed - Conventional Normal Sale 147500

5 rows × 73 columns

[5]:
# For the purpose of the tutorial, split the dataset into training and production datasets
# To get an interesting analysis, let's introduce a bias on the construction date between the two datasets
house_df_learning = house_df.loc[house_df['YearBuilt'] < 1980]
house_df_production = house_df.loc[house_df['YearBuilt'] >= 1980]

[6]:
y_df_learning=house_df_learning['SalePrice'].to_frame()
X_df_learning=house_df_learning[house_df_learning.columns.difference(['SalePrice','YearBuilt'])]

y_df_production=house_df_production['SalePrice'].to_frame()
X_df_production=house_df_production[house_df_production.columns.difference(['SalePrice','YearBuilt'])]

Use Eurybia for data validation

[7]:
from eurybia import SmartDrift
[8]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning)
[9]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
                                     has mismatching possible values:

                                     [] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
                                     has mismatching possible values:

                                     [] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
                                     has mismatching possible values:

                                     [] ['No']
INFO:root:The variable Condition1
                                     has mismatching possible values:

                                     ["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
                                     has mismatching possible values:

                                     ['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
                                     has mismatching possible values:

                                     [] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
                                     has mismatching possible values:

                                     [] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable ExterQual
                                     has mismatching possible values:

                                     [] ['Fair']
INFO:root:The variable Exterior1st
                                     has mismatching possible values:

                                     ['Imitation Stucco'] ['Asbestos Shingles', 'Brick Common', 'Asphalt Shingles', 'Stone', 'Cinder Block']
INFO:root:The variable Exterior2nd
                                     has mismatching possible values:

                                     ['Other'] ['Asbestos Shingles', 'Brick Common', 'Asphalt Shingles', 'Stone', 'Cinder Block']
INFO:root:The variable Foundation
                                     has mismatching possible values:

                                     ['Wood'] ['Brick & Tile', 'Stone']
INFO:root:The variable Functional
                                     has mismatching possible values:

                                     [] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
                                     has mismatching possible values:

                                     [] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
                                     has mismatching possible values:

                                     [] ['Excellent', 'Poor']
INFO:root:The variable GarageType
                                     has mismatching possible values:

                                     [] ['Car Port']
INFO:root:The variable Heating
                                     has mismatching possible values:

                                     [] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HeatingQC
                                     has mismatching possible values:

                                     [] ['Fair', 'Poor']
INFO:root:The variable HouseStyle
                                     has mismatching possible values:

                                     [] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable KitchenQual
                                     has mismatching possible values:

                                     [] ['Fair']
INFO:root:The variable LandSlope
                                     has mismatching possible values:

                                     [] ['Severe Slope']
INFO:root:The variable MSSubClass
                                     has mismatching possible values:

                                     [] ['2-Story 1945 & Older', '2 Family Conversion - All Styles and Ages', '1-1/2 Story - Unfinished All Ages', '1-Story 1945 & Older', '2-1/2 Story All Ages', '1-Story w/Finished Attic All Ages']
INFO:root:The variable MSZoning
                                     has mismatching possible values:

                                     ['Floating Village Residential'] ['Commercial']
INFO:root:The variable MasVnrType
                                     has mismatching possible values:

                                     [] ['Brick Common']
INFO:root:The variable Neighborhood
                                     has mismatching possible values:

                                     ['Northridge', 'Somerset', 'Northridge Heights', 'Stone Brook', 'Bloomington Heights', 'Bluestem'] ['Brookside', 'Iowa DOT and Rail Road', 'Meadow Village', 'Northpark Villa', 'Briardale', 'South & West of Iowa State University']
INFO:root:The variable PavedDrive
                                     has mismatching possible values:

                                     [] ['Partial Pavement']
INFO:root:The variable RoofMatl
                                     has mismatching possible values:

                                     ['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
                                     has mismatching possible values:

                                     [] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
                                     has mismatching possible values:

                                     [] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
                                     has mismatching possible values:

                                     ['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
                                     has mismatching possible values:

                                     [] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 2min 39s, sys: 32.4 s, total: 3min 11s
Wall time: 11.1 s

As soon as the compile() method is called, Eurybia runs default consistency checks and logs the mismatches it finds. If some modalities are absent from the training dataset but present in the production dataset, the deployed model has never seen them and may handle them incorrectly. Conversely, if some modalities are present in the training dataset but absent from the production dataset, some profiles are missing from production.
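The consistency check on modalities can be pictured as a set comparison per categorical variable. The following is a minimal sketch in plain Python, not Eurybia's implementation; the helper name `mismatching_modalities` and the toy values are hypothetical.

```python
def mismatching_modalities(baseline_values, current_values):
    """Return the modalities seen in only one of the two datasets.

    Hypothetical helper for illustration; not part of Eurybia's API.
    """
    baseline_set = set(baseline_values)
    current_set = set(current_values)
    return {
        # Unseen at training time: the model never learned from them.
        "only_in_current": sorted(current_set - baseline_set),
        # Seen at training time but absent in production: missing profiles.
        "only_in_baseline": sorted(baseline_set - current_set),
    }


# Toy columns (made-up values, not taken from the house prices dataset).
baseline_column = ["Gable", "Hip", "Flat", "Gable"]
current_column = ["Gable", "Hip", "Dome"]

result = mismatching_modalities(baseline_column, current_column)
print(result)
```

Both directions are reported because they call for different actions: an unseen modality is a risk for the deployed model, while a missing one only signals that part of the training population is not represented in production.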

[10]:
SD.generate_report(
    output_file='../output/report_house_price_v1.html',
    title_story="Data validation V1",
    title_description="""House price Data validation V1"""
    )

Eurybia is designed primarily to generate an HTML report for analysis rather than for use in notebook mode. However, to illustrate its functionalities, we detail the results here in notebook mode.

First Analysis of results of the data validation

The data validation methodology is based on the ability of a model to discriminate whether an individual belongs to one of the two datasets. For this purpose, a target of 0 is assigned to the baseline dataset and a target of 1 to the current dataset. A classification model (CatBoost) is then trained to predict this target. The better this data drift classifier can tell which dataset an individual belongs to, the more the two datasets differ.
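To make the idea concrete, here is a minimal sketch in plain Python. It is not Eurybia's code: instead of a CatBoost classifier, it assumes a single numeric feature whose raw value is used directly as the classifier score, and the data are synthetic. The names `auc_from_scores` and `drift_auc` are hypothetical.

```python
import random


def auc_from_scores(labels, scores):
    """AUC as the probability that a random 'current' row (label 1)
    scores higher than a random 'baseline' row (label 0); ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def drift_auc(baseline_rows, current_rows):
    # Target 0 for the baseline dataset, target 1 for the current dataset.
    labels = [0] * len(baseline_rows) + [1] * len(current_rows)
    # Stand-in "classifier": the raw feature value is used as the score.
    scores = list(baseline_rows) + list(current_rows)
    return auc_from_scores(labels, scores)


rng = random.Random(0)

# Synthetic single-feature datasets, for illustration only.
baseline = [rng.gauss(0.0, 1.0) for _ in range(200)]
current_same = [rng.gauss(0.0, 1.0) for _ in range(200)]
current_shifted = [rng.gauss(5.0, 1.0) for _ in range(200)]

auc_same = drift_auc(baseline, current_same)
auc_shifted = drift_auc(baseline, current_shifted)
print(f"similar datasets: AUC = {auc_same:.2f}")
print(f"shifted datasets: AUC = {auc_shifted:.2f}")
```

When the two samples come from the same distribution, the score carries no information about the dataset of origin and the AUC stays close to 0.5; when they are clearly shifted, the AUC approaches 1.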

Detection data drift performance

[11]:
#Performance of data drift classifier
SD.auc
[11]:
0.9976525821596245

Such a high AUC means that the two datasets are not similar. The differences should be analysed before deploying the model in production.

Importance of features in data drift

This graph represents the variables in the data drift classification model that are most important to differentiate between the two datasets.

[12]:
SD.xpl.plot.features_importance()

This gives us the features with the largest gaps, which are the most important ones to analyse. Because a date bias was introduced, it is expected that date features are the most impacted. We will therefore remove them. Let’s analyse the other important variables.

Univariate analysis

This graph shows a particular feature’s distribution over its possible values. In the drop-down menu, the variables are sorted by their importance in the data drift classification. For categorical features, the possible values are sorted by descending difference between the two datasets.
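The ordering of categorical values can be sketched as follows: compute the share of each modality in both datasets, then sort by the absolute gap. This is an illustrative sketch in plain Python, not Eurybia's implementation; `modality_gaps` and the toy values are hypothetical.

```python
from collections import Counter


def modality_gaps(baseline_values, current_values):
    """Share of each modality in both datasets, sorted by descending
    absolute difference. Hypothetical helper, not Eurybia's code."""
    baseline_counts = Counter(baseline_values)
    current_counts = Counter(current_values)
    rows = []
    for modality in set(baseline_counts) | set(current_counts):
        baseline_share = baseline_counts[modality] / len(baseline_values)
        current_share = current_counts[modality] / len(current_values)
        gap = abs(current_share - baseline_share)
        rows.append((modality, baseline_share, current_share, gap))
    # Largest gap first; the modality name breaks ties deterministically.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


# Toy categorical column (made-up values).
baseline_column = ["Typical"] * 7 + ["Good"] * 2 + ["Excellent"] * 1
current_column = ["Typical"] * 2 + ["Good"] * 3 + ["Excellent"] * 5

gaps = modality_gaps(baseline_column, current_column)
for modality, baseline_share, current_share, gap in gaps:
    print(f"{modality}: baseline={baseline_share:.0%} current={current_share:.0%} gap={gap:.0%}")
```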

[13]:
SD.plot.generate_fig_univariate('BsmtQual')

This feature, which rates the height of the basement, seems to be correlated with the construction date. To avoid introducing too much bias, the data scientist decides to remove it from training.

[14]:
SD.plot.generate_fig_univariate('Neighborhood')

This feature, the neighborhood, seems to be correlated with the construction date. To avoid introducing too much bias, the data scientist decides to remove it from training.

[15]:
SD.plot.generate_fig_univariate('Foundation')

This feature, the type of foundation, seems to be correlated with the construction date. To avoid introducing too much bias, the data scientist decides to remove it from training.

The data scientist thus discards all features whose distributions will not be similar between training and production.

Second data validation after cleaning data preparation

[16]:
# Target, construction date, and features discarded after the first analysis
columns_to_exclude = ['SalePrice','YearBuilt','BsmtQual',
                      'Neighborhood','Foundation','GarageYrBlt','YearRemodAdd',
                      'GarageFinish','OverallCond','MSZoning','BsmtFinType1','MSSubClass',
                      'ExterQual','KitchenQual','Exterior2nd','Exterior1st','OverallQual',
                      'HeatingQC','FullBath','OpenPorchSF','GarageType','GrLivArea','GarageArea']

y_df_learning=house_df_learning['SalePrice'].to_frame()
X_df_learning=house_df_learning[house_df_learning.columns.difference(columns_to_exclude)]

y_df_production=house_df_production['SalePrice'].to_frame()
X_df_production=house_df_production[house_df_production.columns.difference(columns_to_exclude)]
[17]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning)
[18]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
                                     has mismatching possible values:

                                     [] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
                                     has mismatching possible values:

                                     [] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
                                     has mismatching possible values:

                                     [] ['No']
INFO:root:The variable Condition1
                                     has mismatching possible values:

                                     ["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
                                     has mismatching possible values:

                                     ['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
                                     has mismatching possible values:

                                     [] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
                                     has mismatching possible values:

                                     [] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable Functional
                                     has mismatching possible values:

                                     [] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
                                     has mismatching possible values:

                                     [] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
                                     has mismatching possible values:

                                     [] ['Excellent', 'Poor']
INFO:root:The variable Heating
                                     has mismatching possible values:

                                     [] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HouseStyle
                                     has mismatching possible values:

                                     [] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable LandSlope
                                     has mismatching possible values:

                                     [] ['Severe Slope']
INFO:root:The variable MasVnrType
                                     has mismatching possible values:

                                     [] ['Brick Common']
INFO:root:The variable PavedDrive
                                     has mismatching possible values:

                                     [] ['Partial Pavement']
INFO:root:The variable RoofMatl
                                     has mismatching possible values:

                                     ['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
                                     has mismatching possible values:

                                     [] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
                                     has mismatching possible values:

                                     [] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
                                     has mismatching possible values:

                                     ['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
                                     has mismatching possible values:

                                     [] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 1min 52s, sys: 23.2 s, total: 2min 15s
Wall time: 7.71 s
[19]:
SD.generate_report(
    output_file='../output/report_house_price_v2.html',
    title_story="Data validation V2",
    title_description="""House price Data validation V2"""
    )

Second Analysis of results of the data validation

[20]:
SD.auc
[20]:
0.9004200642451199
[21]:
SD.xpl.plot.features_importance()
[22]:
SD.plot.generate_fig_univariate('2ndFlrSF')

Let’s assume that the data scientist is fine with these distribution gaps.

Let’s look at the impact on the deployed model. To do this, let’s first build the model.

Building Supervised Model

[23]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df_learning)

X_df_learning_encoded=encoder.transform(X_df_learning)
is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
[24]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)
[25]:
regressor = LGBMRegressor(n_estimators=200).fit(Xtrain,ytrain)

Third Analysis of results of the data validation

Let’s add the model to be deployed to SmartDrift, to put the differences in dataset distributions into perspective with the importance of the features in the model. To get the distribution of predicted values, we also need to provide the encoding used.

[26]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning,
                deployed_model=regressor, encoding=encoder)
[27]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
                                     has mismatching possible values:

                                     [] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
                                     has mismatching possible values:

                                     [] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
                                     has mismatching possible values:

                                     [] ['No']
INFO:root:The variable Condition1
                                     has mismatching possible values:

                                     ["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
                                     has mismatching possible values:

                                     ['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
                                     has mismatching possible values:

                                     [] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
                                     has mismatching possible values:

                                     [] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable Functional
                                     has mismatching possible values:

                                     [] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
                                     has mismatching possible values:

                                     [] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
                                     has mismatching possible values:

                                     [] ['Excellent', 'Poor']
INFO:root:The variable Heating
                                     has mismatching possible values:

                                     [] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HouseStyle
                                     has mismatching possible values:

                                     [] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable LandSlope
                                     has mismatching possible values:

                                     [] ['Severe Slope']
INFO:root:The variable MasVnrType
                                     has mismatching possible values:

                                     [] ['Brick Common']
INFO:root:The variable PavedDrive
                                     has mismatching possible values:

                                     [] ['Partial Pavement']
INFO:root:The variable RoofMatl
                                     has mismatching possible values:

                                     ['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
                                     has mismatching possible values:

                                     [] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
                                     has mismatching possible values:

                                     [] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
                                     has mismatching possible values:

                                     ['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
                                     has mismatching possible values:

                                     [] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 1min 51s, sys: 24.7 s, total: 2min 15s
Wall time: 7.91 s
[28]:
SD.generate_report(
    output_file='../output/report_house_price_v3.html',
    title_story="Data validation V3",
    title_description="""House price Data validation V3"""
    )

Feature importance overview

This graph compares the importance of variables between the data drift classifier model and the deployed model. This allows us to put into perspective the importance of data drift in relation to the impacts to be expected on the deployed model. If the variable is at the top left, it means that the variable is very important for data drift classification, but that the variable has little influence on the deployed model. If the variable is at the bottom right, it means that the variable has little importance for data drift classification, and that the variable has a lot of influence on the deployed model.
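The reading of this scatter plot can be sketched as a simple quadrant rule. The snippet below is an illustration in plain Python, not Eurybia's code; it assumes importances scaled to [0, 1], an arbitrary 0.5 threshold, and made-up values for three placeholder features. The name `position_on_scatter` is hypothetical.

```python
def position_on_scatter(drift_importance, model_importance, threshold=0.5):
    """Label each feature by its quadrant on a drift-vs-model importance plot.

    Both inputs map feature name -> importance scaled to [0, 1].
    The 0.5 threshold is an arbitrary choice for this sketch.
    """
    labels = {}
    for feature, drift in drift_importance.items():
        high_drift = drift >= threshold
        high_model = model_importance.get(feature, 0.0) >= threshold
        if high_drift and high_model:
            labels[feature] = "high drift, high model impact"
        elif high_drift:
            labels[feature] = "high drift, low model impact"
        elif high_model:
            labels[feature] = "low drift, high model impact"
        else:
            labels[feature] = "low drift, low model impact"
    return labels


# Made-up importances for three placeholder features.
drift_importance = {"feature_a": 0.9, "feature_b": 0.1, "feature_c": 0.8}
model_importance = {"feature_a": 0.05, "feature_b": 0.7, "feature_c": 0.9}

labels = position_on_scatter(drift_importance, model_importance)
for feature, label in labels.items():
    print(f"{feature}: {label}")
```

Features that are high on both axes are the ones to analyse first, since their drift is the most likely to affect the deployed model.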

[29]:
SD.plot.scatter_feature_importance()

Putting the importance of the drift into perspective with the feature importance in the model to be deployed can help the data scientist validate that the model can be deployed. Here we see that some features need to be analysed.

[30]:
SD.plot.generate_fig_univariate('LotArea')
[31]:
SD.plot.generate_fig_univariate('1stFlrSF')

We see that, for important features, the distribution of the data in production will not be similar to that in training.

Distribution of predicted values

This graph shows distributions of the production model outputs on both baseline and current datasets.

[32]:
SD.plot.generate_fig_univariate(df_all=SD.df_predict,col='Score',hue="dataset")

Differences between the two datasets generate a difference in the distribution of the deployed model’s predictions. These differences can have a significant impact on the performance of the model in production. Such differences in predicted values may call into question the decision to deploy the model as is.
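A quick numerical complement to the graph is to compare summary statistics of the predictions on both datasets. The sketch below uses only the Python standard library and made-up prediction values; it is not part of Eurybia, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(predictions):
    """Mean and median of a list of predicted values."""
    return {
        "mean": statistics.mean(predictions),
        "median": statistics.median(predictions),
    }


# Made-up predicted sale prices, for illustration only.
baseline_predictions = [120000, 135000, 150000, 160000, 175000]
current_predictions = [180000, 210000, 230000, 260000, 300000]

summary_baseline = summarize(baseline_predictions)
summary_current = summarize(current_predictions)

# A large shift suggests the model is scoring a different population.
median_shift = summary_current["median"] - summary_baseline["median"]
mean_shift = summary_current["mean"] - summary_baseline["mean"]
print(f"median shift: {median_shift}")
print(f"mean shift: {mean_shift}")
```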

With this tutorial, we hope to have detailed how Eurybia can be used in a data validation phase before deploying a model.