{
"cells": [
{
"cell_type": "markdown",
"id": "996903eb",
"metadata": {},
"source": [
"# Iterate on Data validation with display analysis\n"
]
},
{
"cell_type": "markdown",
"id": "463ecee0",
"metadata": {},
"source": [
"With this tutorial you:
\n",
"Understand how to use Eurybia to iterate on different phases of data validation
\n",
"We propose to go into more detail about the use of Eurybia
\n",
"\n",
"Contents:\n",
"- Validate your data \n",
"- Generate Report \n",
"- Iterate on analysis of results, data validation, data preparation\n",
"\n",
"Data from Kaggle [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)"
]
},
{
"cell_type": "markdown",
"id": "b239e1e0",
"metadata": {},
"source": [
"**Requirements notice** : the following tutorial may use third party modules not included in Eurybia. \n",
"You can find them all in one file [on our Github repository](https://github.com/MAIF/eurybia/blob/master/requirements.dev.txt) or you can manually install those you are missing, if any."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cd5f25fb",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from category_encoders import OrdinalEncoder\n",
"from lightgbm import LGBMRegressor\n",
"from eurybia.core.smartdrift import SmartDrift\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"id": "6aed9f4b",
"metadata": {},
"source": [
"## Import Dataset and split in training and production dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1c6aca48",
"metadata": {},
"outputs": [],
"source": [
"from eurybia.data.data_loader import data_loading"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7c55e4fa",
"metadata": {},
"outputs": [],
"source": [
"house_df, house_dict = data_loading('house_prices')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d4a2e665",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>MSSubClass</th><th>MSZoning</th><th>LotArea</th><th>Street</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>LotConfig</th><th>LandSlope</th><th>Neighborhood</th><th>...</th><th>EnclosedPorch</th><th>3SsnPorch</th><th>ScreenPorch</th><th>PoolArea</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
"    <tr><th>Id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1456</th><td>2-Story 1946 &amp; Newer</td><td>Residential Low Density</td><td>7917</td><td>Paved</td><td>Regular</td><td>Near Flat/Level</td><td>All public Utilities (E,G,W,&amp; S)</td><td>Inside lot</td><td>Gentle slope</td><td>Gilbert</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>8</td><td>2007</td><td>Warranty Deed - Conventional</td><td>Normal Sale</td><td>175000</td></tr>\n",
"    <tr><th>1457</th><td>1-Story 1946 &amp; Newer All Styles</td><td>Residential Low Density</td><td>13175</td><td>Paved</td><td>Regular</td><td>Near Flat/Level</td><td>All public Utilities (E,G,W,&amp; S)</td><td>Inside lot</td><td>Gentle slope</td><td>Northwest Ames</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>2010</td><td>Warranty Deed - Conventional</td><td>Normal Sale</td><td>210000</td></tr>\n",
"    <tr><th>1458</th><td>2-Story 1945 &amp; Older</td><td>Residential Low Density</td><td>9042</td><td>Paved</td><td>Regular</td><td>Near Flat/Level</td><td>All public Utilities (E,G,W,&amp; S)</td><td>Inside lot</td><td>Gentle slope</td><td>Crawford</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2500</td><td>5</td><td>2010</td><td>Warranty Deed - Conventional</td><td>Normal Sale</td><td>266500</td></tr>\n",
"    <tr><th>1459</th><td>1-Story 1946 &amp; Newer All Styles</td><td>Residential Low Density</td><td>9717</td><td>Paved</td><td>Regular</td><td>Near Flat/Level</td><td>All public Utilities (E,G,W,&amp; S)</td><td>Inside lot</td><td>Gentle slope</td><td>North Ames</td><td>...</td><td>112</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td><td>2010</td><td>Warranty Deed - Conventional</td><td>Normal Sale</td><td>142125</td></tr>\n",
"    <tr><th>1460</th><td>1-Story 1946 &amp; Newer All Styles</td><td>Residential Low Density</td><td>9937</td><td>Paved</td><td>Regular</td><td>Near Flat/Level</td><td>All public Utilities (E,G,W,&amp; S)</td><td>Inside lot</td><td>Gentle slope</td><td>Edwards</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>2008</td><td>Warranty Deed - Conventional</td><td>Normal Sale</td><td>147500</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 73 columns</p>
\n", "