Iterate on Data validation with display analysis¶
With this tutorial you will understand how to use Eurybia to iterate on the different phases of data validation. We go into more detail about the use of Eurybia.
Contents:
- Do data validation
- Generate report
- Iterate on analysis of results, data validation, data preparation
Data from Kaggle House Prices
[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from eurybia.core.smartdrift import SmartDrift
from sklearn.model_selection import train_test_split
Import Dataset and split in training and production dataset¶
[2]:
from eurybia.data.data_loader import data_loading
[3]:
house_df, house_dict = data_loading('house_prices')
[4]:
house_df.tail()
[4]:
Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1456 | 2-Story 1946 & Newer | Residential Low Density | 7917 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | Gilbert | ... | 0 | 0 | 0 | 0 | 0 | 8 | 2007 | Warranty Deed - Conventional | Normal Sale | 175000 |
1457 | 1-Story 1946 & Newer All Styles | Residential Low Density | 13175 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | Northwest Ames | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2010 | Warranty Deed - Conventional | Normal Sale | 210000 |
1458 | 2-Story 1945 & Older | Residential Low Density | 9042 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | Crawford | ... | 0 | 0 | 0 | 0 | 2500 | 5 | 2010 | Warranty Deed - Conventional | Normal Sale | 266500 |
1459 | 1-Story 1946 & Newer All Styles | Residential Low Density | 9717 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | North Ames | ... | 112 | 0 | 0 | 0 | 0 | 4 | 2010 | Warranty Deed - Conventional | Normal Sale | 142125 |
1460 | 1-Story 1946 & Newer All Styles | Residential Low Density | 9937 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | Edwards | ... | 0 | 0 | 0 | 0 | 0 | 6 | 2008 | Warranty Deed - Conventional | Normal Sale | 147500 |
5 rows × 73 columns
[5]:
# For the purpose of the tutorial split dataset in training and production dataset
# To see an interesting analysis, let's test for a bias between date of construction of training and production dataset
house_df_learning = house_df.loc[house_df['YearBuilt'] < 1980]
house_df_production = house_df.loc[house_df['YearBuilt'] >= 1980]
[6]:
y_df_learning=house_df_learning['SalePrice'].to_frame()
X_df_learning=house_df_learning[house_df_learning.columns.difference(['SalePrice','YearBuilt'])]
y_df_production=house_df_production['SalePrice'].to_frame()
X_df_production=house_df_production[house_df_production.columns.difference(['SalePrice','YearBuilt'])]
Use Eurybia for data validation¶
[7]:
from eurybia import SmartDrift
[8]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning)
[9]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
has mismatching possible values:
[] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
has mismatching possible values:
[] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
has mismatching possible values:
[] ['No']
INFO:root:The variable Condition1
has mismatching possible values:
["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
has mismatching possible values:
['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
has mismatching possible values:
[] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
has mismatching possible values:
[] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable ExterQual
has mismatching possible values:
[] ['Fair']
INFO:root:The variable Exterior1st
has mismatching possible values:
['Imitation Stucco'] ['Asbestos Shingles', 'Brick Common', 'Asphalt Shingles', 'Stone', 'Cinder Block']
INFO:root:The variable Exterior2nd
has mismatching possible values:
['Other'] ['Asbestos Shingles', 'Brick Common', 'Asphalt Shingles', 'Stone', 'Cinder Block']
INFO:root:The variable Foundation
has mismatching possible values:
['Wood'] ['Brick & Tile', 'Stone']
INFO:root:The variable Functional
has mismatching possible values:
[] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
has mismatching possible values:
[] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
has mismatching possible values:
[] ['Excellent', 'Poor']
INFO:root:The variable GarageType
has mismatching possible values:
[] ['Car Port']
INFO:root:The variable Heating
has mismatching possible values:
[] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HeatingQC
has mismatching possible values:
[] ['Fair', 'Poor']
INFO:root:The variable HouseStyle
has mismatching possible values:
[] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable KitchenQual
has mismatching possible values:
[] ['Fair']
INFO:root:The variable LandSlope
has mismatching possible values:
[] ['Severe Slope']
INFO:root:The variable MSSubClass
has mismatching possible values:
[] ['2-Story 1945 & Older', '2 Family Conversion - All Styles and Ages', '1-1/2 Story - Unfinished All Ages', '1-Story 1945 & Older', '2-1/2 Story All Ages', '1-Story w/Finished Attic All Ages']
INFO:root:The variable MSZoning
has mismatching possible values:
['Floating Village Residential'] ['Commercial']
INFO:root:The variable MasVnrType
has mismatching possible values:
[] ['Brick Common']
INFO:root:The variable Neighborhood
has mismatching possible values:
['Northridge', 'Somerset', 'Northridge Heights', 'Stone Brook', 'Bloomington Heights', 'Bluestem'] ['Brookside', 'Iowa DOT and Rail Road', 'Meadow Village', 'Northpark Villa', 'Briardale', 'South & West of Iowa State University']
INFO:root:The variable PavedDrive
has mismatching possible values:
[] ['Partial Pavement']
INFO:root:The variable RoofMatl
has mismatching possible values:
['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
has mismatching possible values:
[] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
has mismatching possible values:
[] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
has mismatching possible values:
['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
has mismatching possible values:
[] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 2min 39s, sys: 32.4 s, total: 3min 11s
Wall time: 11.1 s
As soon as the compile() method runs, Eurybia displays the default consistency checks as warnings. If some modalities are absent from the training dataset but present in the production dataset, the deployed model will handle them incorrectly. Conversely, if some modalities are present in the training dataset but absent from the production dataset, some profiles are missing in production.
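To make the modality check concrete, here is a minimal sketch of the idea using plain Python sets. It is not Eurybia's implementation: the helper name `mismatching_modalities` and the toy columns are assumptions made for illustration.

```python
def mismatching_modalities(baseline_values, current_values):
    """Return (only_in_current, only_in_baseline) as sorted lists."""
    baseline_set = set(baseline_values)
    current_set = set(current_values)
    only_in_current = sorted(current_set - baseline_set)
    only_in_baseline = sorted(baseline_set - current_set)
    return only_in_current, only_in_baseline


# Toy columns invented for this sketch (not the House Prices data).
baseline = {"CentralAir": ["Yes", "No", "Yes"], "Street": ["Paved", "Gravel"]}
current = {"CentralAir": ["Yes", "Yes"], "Street": ["Paved", "Gravel"]}

# Keep only the columns where at least one side has a value the other lacks.
report = {}
for column in baseline:
    only_current, only_baseline = mismatching_modalities(baseline[column], current[column])
    if only_current or only_baseline:
        report[column] = (only_current, only_baseline)

print(report)
```

A value that appears only in the current data is one the model never saw during training; a value that appears only in the baseline data is a profile that production no longer contains.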
[10]:
SD.generate_report(
output_file='../output/report_house_price_v1.html',
title_story="Data validation V1",
title_description="""House price Data validation V1"""
)
Eurybia is designed primarily to generate an HTML report for analysis rather than for use in notebook mode. However, to illustrate its functionalities, we detail the results here in notebook mode.
First Analysis of results of the data validation¶
The data validation methodology is based on the ability of a model to discriminate whether an individual belongs to one of the two datasets. For this purpose, a target of 0 is assigned to the baseline dataset and a target of 1 to the current dataset. Then a classification model (CatBoost) is trained to predict this target. The better the data drift classifier can tell which of the two datasets an individual belongs to, the more the two datasets differ.
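The labelling scheme described above can be sketched on a toy example. This is only an illustration under strong assumptions: a single synthetic numeric feature stands in for the dataset, and the feature value itself is used as the classifier score instead of a trained CatBoost model. The helper names `auc` and `drift_auc` are hypothetical.

```python
import random


def auc(labels, scores):
    """Probability that a random target-1 row scores higher than a random
    target-0 row (ties count for one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def drift_auc(baseline_rows, current_rows):
    """Assign target 0 to baseline rows, target 1 to current rows, then score."""
    labels = [0] * len(baseline_rows) + [1] * len(current_rows)
    scores = baseline_rows + current_rows  # the feature value acts as the score
    return auc(labels, scores)


rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(200)]
similar = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution
shifted = [rng.gauss(3.0, 1.0) for _ in range(200)]  # drifted distribution

print("AUC, similar datasets:", round(drift_auc(baseline, similar), 3))
print("AUC, drifted datasets:", round(drift_auc(baseline, shifted), 3))
```

An AUC close to 0.5 means the two samples cannot be told apart, while an AUC close to 1 means almost every row can be assigned to its dataset.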
Detection data drift performance¶
[11]:
#Performance of data drift classifier
SD.auc
[11]:
0.9976525821596245
Such a high AUC means that the datasets are not similar. The differences should be analysed before deploying the model in production.
Importance of features in data drift¶
This graph represents the variables in the data drift classification model that are most important to differentiate between the two datasets.
[12]:
SD.xpl.plot.features_importance()
We get the features with the largest gaps, which are the most important ones to analyse. Given the date bias we introduced, it is normal that date features are the most impacted. We will therefore decide to remove them. Let’s analyse the other important variables.
Univariate analysis¶
This graph shows a particular feature’s distribution over its possible values. In the drop-down menu, the variables are sorted by their importance in the data drift classification. For categorical features, the possible values are sorted by descending difference between the two datasets.
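The sorting of categorical values can be sketched as follows. The exact difference measure used by Eurybia is not stated here, so this sketch assumes the absolute difference between the share of each value in the two datasets; the helper name `modality_gaps` and the toy data are hypothetical.

```python
from collections import Counter


def modality_gaps(baseline_values, current_values):
    """List (modality, baseline_share, current_share, gap), largest gap first."""
    base_counts = Counter(baseline_values)
    curr_counts = Counter(current_values)
    n_base = len(baseline_values)
    n_curr = len(current_values)
    gaps = []
    for modality in set(base_counts) | set(curr_counts):
        base_share = base_counts[modality] / n_base
        curr_share = curr_counts[modality] / n_curr
        gaps.append((modality, base_share, curr_share, abs(curr_share - base_share)))
    # Sort by descending gap; the name breaks ties so the order is deterministic.
    return sorted(gaps, key=lambda row: (-row[3], row[0]))


# Toy categorical feature invented for this sketch.
baseline = ["Brick"] * 6 + ["Concrete"] * 3 + ["Stone"] * 1
current = ["Brick"] * 2 + ["Concrete"] * 6 + ["Wood"] * 2

for modality, base_share, curr_share, gap in modality_gaps(baseline, current):
    print(f"{modality:<10} baseline={base_share:.0%} current={curr_share:.0%} gap={gap:.0%}")
```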
[13]:
SD.plot.generate_fig_univariate('BsmtQual')
This feature, which describes the height of the basement, seems to be correlated with the construction date. To avoid creating too much bias, the data scientist decides to remove it from the training data.
[14]:
SD.plot.generate_fig_univariate('Neighborhood')
This feature, which describes the neighborhood, seems to be correlated with the construction date. To avoid creating too much bias, the data scientist decides to remove it from the training data.
[15]:
SD.plot.generate_fig_univariate('Foundation')
This feature, which describes the foundation, seems to be correlated with the construction date. To avoid creating too much bias, the data scientist decides to remove it from the training data.
The data scientist thus discards all features whose distribution in production will not be similar to the one seen in training.
Second data validation after cleaning data preparation¶
[16]:
y_df_learning=house_df_learning['SalePrice'].to_frame()
X_df_learning=house_df_learning[house_df_learning.columns.difference(['SalePrice','YearBuilt','BsmtQual',
'Neighborhood','Foundation','GarageYrBlt','YearRemodAdd',
'GarageFinish','OverallCond','MSZoning','BsmtFinType1','MSSubClass',
'ExterQual','KitchenQual','Exterior2nd','Exterior1st','OverallQual',
'HeatingQC','FullBath','OpenPorchSF','GarageType','GrLivArea','GarageArea'])]
y_df_production=house_df_production['SalePrice'].to_frame()
X_df_production=house_df_production[house_df_production.columns.difference(['SalePrice','YearBuilt','BsmtQual',
'Neighborhood','Foundation','GarageYrBlt','YearRemodAdd',
'GarageFinish','OverallCond','MSZoning','BsmtFinType1','MSSubClass',
'ExterQual','KitchenQual','Exterior2nd','Exterior1st','OverallQual',
'HeatingQC','FullBath','OpenPorchSF','GarageType','GrLivArea','GarageArea'])]
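The two selections above repeat the same exclusion list, which makes it easy for the training and production feature sets to fall out of sync. One possible way to avoid this, sketched here on plain column names, is to write the list once and reuse it. The helper `select_features` and the shortened lists are hypothetical.

```python
# Columns to exclude, written once so that both datasets stay in sync.
# Shortened for the sketch; the full list is the one used in the cell above.
EXCLUDED = ["SalePrice", "YearBuilt", "BsmtQual", "Neighborhood", "Foundation"]


def select_features(columns, excluded):
    """Return the column names that are not excluded, in sorted order."""
    excluded_set = set(excluded)
    return sorted(c for c in columns if c not in excluded_set)


# Toy column list standing in for house_df.columns.
all_columns = ["LotArea", "SalePrice", "Street", "YearBuilt", "Neighborhood"]
kept = select_features(all_columns, EXCLUDED)
print(kept)
```

With pandas, the same single list can be passed to `columns.difference(...)` for both the learning and the production dataframes.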
[17]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning)
[18]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
has mismatching possible values:
[] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
has mismatching possible values:
[] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
has mismatching possible values:
[] ['No']
INFO:root:The variable Condition1
has mismatching possible values:
["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
has mismatching possible values:
['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
has mismatching possible values:
[] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
has mismatching possible values:
[] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable Functional
has mismatching possible values:
[] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
has mismatching possible values:
[] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
has mismatching possible values:
[] ['Excellent', 'Poor']
INFO:root:The variable Heating
has mismatching possible values:
[] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HouseStyle
has mismatching possible values:
[] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable LandSlope
has mismatching possible values:
[] ['Severe Slope']
INFO:root:The variable MasVnrType
has mismatching possible values:
[] ['Brick Common']
INFO:root:The variable PavedDrive
has mismatching possible values:
[] ['Partial Pavement']
INFO:root:The variable RoofMatl
has mismatching possible values:
['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
has mismatching possible values:
[] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
has mismatching possible values:
[] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
has mismatching possible values:
['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
has mismatching possible values:
[] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 1min 52s, sys: 23.2 s, total: 2min 15s
Wall time: 7.71 s
[19]:
SD.generate_report(
output_file='../output/report_house_price_v2.html',
title_story="Data validation V2",
title_description="""House price Data validation V2"""
)
Second Analysis of results of the data validation¶
[20]:
SD.auc
[20]:
0.9004200642451199
[21]:
SD.xpl.plot.features_importance()
[22]:
SD.plot.generate_fig_univariate('2ndFlrSF')
Let’s assume that the data scientist is OK with these distribution gaps.
Let’s look at the impact on the deployed model. To do this, let’s first build the model.
Building Supervised Model¶
[23]:
from category_encoders import OrdinalEncoder
categorical_features = [col for col in X_df_learning.columns if X_df_learning[col].dtype == 'object']
encoder = OrdinalEncoder(
cols=categorical_features,
handle_unknown='ignore',
return_df=True).fit(X_df_learning)
X_df_learning_encoded=encoder.transform(X_df_learning)
is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
[24]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df_learning_encoded, y_df_learning, train_size=0.75, random_state=1)
[25]:
regressor = LGBMRegressor(n_estimators=200).fit(Xtrain,ytrain)
Third Analysis of results of the data validation¶
Let’s add the model to be deployed to SmartDrift, in order to put the differences in dataset distributions into perspective with the importance of the features for the model. To get the distribution of the predicted values, we also need to pass the encoding used.
[26]:
SD = SmartDrift(df_current=X_df_production, df_baseline=X_df_learning,
deployed_model=regressor, encoding=encoder)
[27]:
%time SD.compile(full_validation=True)
INFO:root:The variable BldgType
has mismatching possible values:
[] ['Two-family Conversion; originally built as one-family dwelling']
INFO:root:The variable BsmtCond
has mismatching possible values:
[] ['Poor -Severe cracking, settling, or wetness']
INFO:root:The variable CentralAir
has mismatching possible values:
[] ['No']
INFO:root:The variable Condition1
has mismatching possible values:
["Within 200' of East-West Railroad"] ['Adjacent to arterial street', 'Adjacent to postive off-site feature']
INFO:root:The variable Condition2
has mismatching possible values:
['Near positive off-site feature--park, greenbelt, etc.'] ['Adjacent to arterial street', "Within 200' of North-South Railroad", 'Adjacent to feeder street', 'Adjacent to postive off-site feature', 'Adjacent to North-South Railroad', 'Adjacent to East-West Railroad']
INFO:root:The variable Electrical
has mismatching possible values:
[] ['60 AMP Fuse Box and mostly Romex wiring (Fair)', 'Fuse Box over 60 AMP and all Romex wiring (Average)', '60 AMP Fuse Box and mostly knob & tube wiring (poor)']
INFO:root:The variable ExterCond
has mismatching possible values:
[] ['Fair', 'Poor', 'Excellent']
INFO:root:The variable Functional
has mismatching possible values:
[] ['Major Deductions 2', 'Severely Damaged']
INFO:root:The variable GarageCond
has mismatching possible values:
[] ['Poor', 'Excellent']
INFO:root:The variable GarageQual
has mismatching possible values:
[] ['Excellent', 'Poor']
INFO:root:The variable Heating
has mismatching possible values:
[] ['Gas hot water or steam heat', 'Gravity furnace', 'Wall furnace', 'Hot water or steam heat other than gas', 'Floor Furnace']
INFO:root:The variable HouseStyle
has mismatching possible values:
[] ['One and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level unfinished', 'Two and one-half story: 2nd level finished']
INFO:root:The variable LandSlope
has mismatching possible values:
[] ['Severe Slope']
INFO:root:The variable MasVnrType
has mismatching possible values:
[] ['Brick Common']
INFO:root:The variable PavedDrive
has mismatching possible values:
[] ['Partial Pavement']
INFO:root:The variable RoofMatl
has mismatching possible values:
['Clay or Tile'] ['Metal', 'Membrane', 'Gravel & Tar', 'Roll']
INFO:root:The variable RoofStyle
has mismatching possible values:
[] ['Gabrel (Barn)', 'Mansard', 'Flat', 'Shed']
INFO:root:The variable SaleCondition
has mismatching possible values:
[] ['Adjoining Land Purchase']
INFO:root:The variable SaleType
has mismatching possible values:
['Contract 15% Down payment regular terms'] []
INFO:root:The variable Utilities
has mismatching possible values:
[] ['Electricity and Gas Only']
Backend: Shap TreeExplainer
CPU times: user 1min 51s, sys: 24.7 s, total: 2min 15s
Wall time: 7.91 s
[28]:
SD.generate_report(
output_file='../output/report_house_price_v3.html',
title_story="Data validation V3",
title_description="""House price Data validation V3"""
)
Feature importance overview¶
This graph compares the importance of variables between the data drift classifier model and the deployed model. This allows us to put into perspective the importance of data drift in relation to the impacts to be expected on the deployed model. If the variable is at the top left, it means that the variable is very important for data drift classification, but that the variable has little influence on the deployed model. If the variable is at the bottom right, it means that the variable has little importance for data drift classification, and that the variable has a lot of influence on the deployed model.
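The reading of the graph can be summarised with a small helper that places a feature in one of the four corners. This is only a sketch: it assumes both importances are normalised between 0 and 1, and the 0.5 thresholds, the function name `quadrant`, and the feature names are arbitrary choices made for illustration.

```python
def quadrant(drift_importance, model_importance,
             drift_threshold=0.5, model_threshold=0.5):
    """Locate a feature on the graph: drift importance on the vertical axis,
    deployed-model importance on the horizontal axis."""
    vertical = "top" if drift_importance >= drift_threshold else "bottom"
    horizontal = "right" if model_importance >= model_threshold else "left"
    return f"{vertical}_{horizontal}"


# (drift importance, deployed-model importance), invented values.
importances = {
    "feature_a": (0.9, 0.1),  # drifts a lot, barely used by the model
    "feature_b": (0.8, 0.7),  # drifts a lot and matters to the model
    "feature_c": (0.1, 0.9),  # stable and matters to the model
    "feature_d": (0.1, 0.1),  # stable and barely used
}

positions = {name: quadrant(d, m) for name, (d, m) in importances.items()}
to_review = sorted(name for name, pos in positions.items() if pos == "top_right")
print(positions)
print("Analyse first:", to_review)
```

Features in the top right corner are the ones where drift is most likely to affect the deployed model, so they are natural candidates for a closer univariate analysis.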
[29]:
SD.plot.scatter_feature_importance()
Putting the importance of the drift into perspective with the importance of the features for the model to be deployed can help the data scientist validate that the model can be deployed. Here we see that some features need to be analysed.
[30]:
SD.plot.generate_fig_univariate('LotArea')
[31]:
SD.plot.generate_fig_univariate('1stFlrSF')
We see that, for important features, the distributions of the production data will not be similar to those of the training data.
Distribution of predicted values¶
This graph shows distributions of the production model outputs on both baseline and current datasets.
[32]:
SD.plot.generate_fig_univariate(df_all=SD.df_predict,col='Score',hue="dataset")
Differences between the two datasets generate a difference in the distribution of the predictions of the deployed model. These differences can have a significant impact on the performance of the model in production. Such differences in predicted values may call into question the decision to deploy the model as is.
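Beyond the visual comparison, the gap between two sets of predictions can be summarised by a single number. The sketch below uses the two-sample Kolmogorov-Smirnov statistic, which is a choice made for this illustration and not a metric taken from Eurybia; the function name `ks_statistic` and the prediction values are invented.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Invented predictions of a model on baseline and current data.
baseline_predictions = [100, 120, 130, 150]
current_predictions = [200, 220, 240, 260]

print(ks_statistic(baseline_predictions, baseline_predictions))
print(ks_statistic(baseline_predictions, current_predictions))
```

A value of 0 means the two prediction samples have the same empirical distribution, and a value of 1 means they do not overlap at all.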
With this tutorial, we hope to have detailed how Eurybia can be used in a data validation phase before deploying a model.