Antonello Calamea

A kind of “Hello, World!” in ML (using a basic workflow)


Photo by Martim Braz on Unsplash


Some time ago, a friend of mine told me she had to start dealing with ML topics and asked me about them, so I prepared this little example, a kind of “Hello, World!”, to show her the process of finding and predicting information from data.


IMHO, the most important thing is to define a workflow, something to follow during the analysis, because having one helps A LOT. This is mine:


  1. Define objectives

  2. Collect data

  3. Understand and prepare the data

  4. Create and evaluate the Model


We’ll get as far as step 4 in this post, but it’s not over… you then have to:


  1. Refine the Model

  2. Deploy

Very important: it’s an iterative process and every step can be improved, affecting the outcome of the next ones.


Let’s start with the example!


1) Define objectives

What do I have to do, and what kind of problem do I have to solve?


The objective is to predict the price of a house (the target), based on several variables describing the characteristics of the building (the features).

As the prediction is a continuous value and both feature and target values are available in the dataset, this is a supervised regression problem.


In simpler words: if someone gives me new values for the features (how big the house is, its overall quality, the number of bathrooms, etc.), I want a model that can answer with an estimated sale price.


Nothing more to add, so let’s dive into…


2) Collect data

The test and train datasets are available on Kaggle.


We’ll use Colaboratory from Google, a Jupyter cloud environment that comes with a lot of libraries preinstalled and offers free GPUs to run complex stuff… though we won’t need them here.


Let’s start with boilerplate code to retrieve the files (previously downloaded from Kaggle) from Google Drive.

# code to retrieve the files from Google Drive
!pip install -U -q PyDrive
 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
 
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_list = drive.ListFile({'q': "'<folder id>' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))
  
# create local file
house_prices_train_downloaded = drive.CreateFile({'id': '<file id>'})
house_prices_train_downloaded.GetContentFile('house_prices_train.csv')  
house_prices_test_downloaded = drive.CreateFile({'id': '<file id>'})
house_prices_test_downloaded.GetContentFile('house_prices_test.csv')

Let’s import some libraries and take a look at the data

# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
pd.set_option("display.max_columns",100)
# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None
# Display up to 60 columns of a dataframe
pd.set_option('display.max_columns', 60)
# Matplotlib visualization
import matplotlib.pyplot as plt
%matplotlib inline
# Set default font size
plt.rcParams['font.size'] = 24
# Internal ipython tool for setting figure size
from IPython.core.pylabtools import figsize
# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)
from IPython.display import display
original_train_set = pd.read_csv('house_prices_train.csv')
display(original_train_set.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

The column descriptions are available at this link


Other observations:

  • SalePrice is the target

  • There are 80 feature columns, both categorical and numerical

  • There is a sufficient number of samples (1460 rows) relative to the number of features

  • 19 columns have missing values (we’ll deal with this in the next step)


To help with data preparation, let’s use a library called SpeedML, which allows several operations to be done with fewer commands.


Let’s install it (with pip!) and initialize it with the train and test dataframes

!pip install speedml
from speedml import Speedml
sml = Speedml('house_prices_train.csv', 
              'house_prices_test.csv', 
              target = 'SalePrice',
              uid = 'Id')

Collecting speedml
  Downloading https://files.pythonhosted.org/packages/b1/72/91dcc93415b09829897b3d34a87383a946b720771b6d1662fbc017782b6c/speedml-0.9.3-py2.py3-none-any.whl
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from speedml) (0.16.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.6/dist-packages (from speedml) (0.7.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from speedml) (0.22.0)
Collecting sklearn (from speedml)
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from speedml) (1.14.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from speedml) (2.1.2)
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (from speedml) (0.7.post4)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas->speedml) (2018.4)
Requirement already satisfied: python-dateutil>=2 in /usr/local/lib/python3.6/dist-packages (from pandas->speedml) (2.5.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from sklearn->speedml) (0.19.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (1.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (2.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (0.10.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost->speedml) (0.19.1)
Building wheels for collected packages: sklearn
  Running setup.py bdist_wheel for sklearn ... done
  Stored in directory: /content/.cache/pip/wheels/76/03/bb/589d421d27431bcd2c6da284d5f2286c8e3b2ea3cf1594c074
Successfully built sklearn
Installing collected packages: sklearn, speedml
Successfully installed sklearn-0.0 speedml-0.9.3

3) Understand and prepare the data


This is a very important step, because here you set the foundation for the whole work.


We can divide the process into sub-steps:

  1. Basic data preparation (deal with missing values, outliers, etc)

  2. EDA (exploratory data analysis) to gather more information about the dataset (distributions, correlations, etc.) and get to know the data better

  3. Feature selection — choose the most relevant features

  4. Feature engineering — create new features from existing ones or other available data


3.1 Data preparation

Let’s deal with the missing values first

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only the columns with missing values, sorted by percentage descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print summary information (matches the output shown below)
    print('Your selected dataframe has ' + str(df.shape[1]) + ' columns.\n'
          'There are ' + str(mis_val_table_ren_columns.shape[0]) +
          ' columns that have missing values.')
    return mis_val_table_ren_columns
missing_values_table(sml.train)

In this case, we’ll just drop the 4 columns with the highest number of missing values

sml.feature.drop(['PoolQC','MiscFeature','Alley','Fence'])
'Dropped 4 features with 76 features available.'

SpeedML drops the columns from both the train and test datasets, to avoid inconsistency.
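
A quick sanity check (my addition; the Speedml object exposes both dataframes as sml.train and sml.test):

# Sanity check: train and test should now differ only by the target column
print(sml.train.shape, sml.test.shape)
print(set(sml.train.columns) - set(sml.test.columns))  # expected: {'SalePrice'}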


With a single command (impute), let’s fill the remaining missing values with the median (for numeric columns) or the most frequent value (for text columns), and check the results. Being an example, we can get away with this, but choosing the best strategy for every column can be important, especially to improve results.

sml.feature.impute()
missing_values_table(sml.train)
display(sml.train.info())
'Imputed 1558 empty values to 0.'

Your selected dataframe has 76 columns.
There are 0 columns that have missing values.
Missing Values	% of Total Values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 76 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1460 non-null object
MasVnrArea       1460 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1460 non-null object
BsmtCond         1460 non-null object
BsmtExposure     1460 non-null object
BsmtFinType1     1460 non-null object
BsmtFinSF1       1460 non-null float64
BsmtFinType2     1460 non-null object
BsmtFinSF2       1460 non-null float64
BsmtUnfSF        1460 non-null float64
TotalBsmtSF      1460 non-null float64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1460 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null float64
BsmtHalfBath     1460 non-null float64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      1460 non-null object
GarageType       1460 non-null object
GarageYrBlt      1460 non-null float64
GarageFinish     1460 non-null object
GarageCars       1460 non-null float64
GarageArea       1460 non-null float64
GarageQual       1460 non-null object
GarageCond       1460 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(11), int64(26), object(39)
memory usage: 878.3+ KB

Nice, no more missing data.
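
If you need finer control than the single impute() call, a hypothetical plain-pandas sketch (my addition, not run in this post) could impute column by column:

# Hypothetical per-column imputation: median for numeric columns,
# most frequent value for text columns (the same would apply to sml.test)
for c in sml.train.columns:
    if sml.train[c].dtype == 'object':
        sml.train[c] = sml.train[c].fillna(sml.train[c].mode()[0])
    else:
        sml.train[c] = sml.train[c].fillna(sml.train[c].median())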


3.2 EDA

To keep it simple, let’s find the features most correlated with the target

sml.train.corr()['SalePrice'][:-1].sort_values()
KitchenAbvGr    -0.135907
EnclosedPorch   -0.128578
MSSubClass      -0.084284
OverallCond     -0.077856
YrSold          -0.028923
LowQualFinSF    -0.025606
MiscVal         -0.021190
BsmtHalfBath    -0.016844
BsmtFinSF2      -0.011378
3SsnPorch        0.044584
MoSold           0.046432
PoolArea         0.092404
ScreenPorch      0.111447
BedroomAbvGr     0.168213
BsmtUnfSF        0.214479
BsmtFullBath     0.227122
LotArea          0.263843
HalfBath         0.284108
OpenPorchSF      0.315856
2ndFlrSF         0.319334
WoodDeckSF       0.324413
LotFrontage      0.334544
BsmtFinSF1       0.386420
Fireplaces       0.466929
GarageYrBlt      0.469056
MasVnrArea       0.472614
YearRemodAdd     0.507101
YearBuilt        0.522897
TotRmsAbvGrd     0.533723
FullBath         0.560664
1stFlrSF         0.605852
TotalBsmtSF      0.613581
GarageArea       0.623431
GarageCars       0.640409
GrLivArea        0.708624
OverallQual      0.790982
Name: SalePrice, dtype: float64

The correlation can be positive or negative (in the range [-1, 1]). The features with the highest positive correlation (OverallQual, GrLivArea, …) make sense, because the price grows with their values.
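
As a side note, a small snippet (mine, not part of the original analysis) can pick the strongest correlations programmatically:

# The 10 features most correlated (in absolute value) with the target;
# non-numeric columns are ignored by corr() in this pandas version
corr = sml.train.corr()['SalePrice'].drop('SalePrice')
print(corr.abs().sort_values(ascending=False).head(10))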


Let’s visualize all the correlations with a correlation matrix
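
The original heatmap image isn’t embedded here; a minimal sketch to reproduce it with seaborn:

# Correlation matrix of the numeric columns as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(sml.train.corr(), cmap='coolwarm', square=True)
plt.show()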

Quite a puzzle!

Here it’s possible to see correlations not only between the target and the features, but also between the features themselves (look for example at the high correlation between GrLivArea and TotRmsAbvGrd, which makes perfect sense: more above-grade living area means more rooms can fit in it)


3.3 Feature selection

Let’s focus on the most correlated features and remove outliers, defining them as values more than 3 interquartile ranges (IQR) below the first quartile or above the third. This is an operation to do with caution, because outliers can be useful data too…


columns_of_interest = ['OverallQual','GrLivArea','GarageCars','GarageArea',
                'TotalBsmtSF','1stFlrSF','FullBath','TotRmsAbvGrd',
                 'YearBuilt','YearRemodAdd']
sml.train.loc[:,columns_of_interest].describe()
def remove_outliers(df, columns):
  for c in columns:
    print('Removing outliers from ', c)
    first_quartile = df[c].describe()['25%']
    third_quartile = df[c].describe()['75%']
    # Interquartile range
    iqr = third_quartile - first_quartile
    # Keep only the rows within 3 IQRs of the quartiles
    df = df[(df[c] > (first_quartile - 3 * iqr)) &
            (df[c] < (third_quartile + 3 * iqr))]
  return df
sml.train = remove_outliers(sml.train, columns_of_interest)
sml.train.loc[:,columns_of_interest].describe()
sml.train.shape
Removing outliers from  OverallQual
Removing outliers from  GrLivArea
Removing outliers from  GarageCars
Removing outliers from  GarageArea
Removing outliers from  TotalBsmtSF
Removing outliers from  1stFlrSF
Removing outliers from  FullBath
Removing outliers from  TotRmsAbvGrd
Removing outliers from  YearBuilt
Removing outliers from  YearRemodAdd

We can gain some insights here (the mean overall quality is around 6, the oldest house was built in 1872, and so on). But let’s visualize the data to find more info

Let’s see the distribution of the target (the sale price)

_ = sns.distplot(original_train_set['SalePrice'])

Most values are under 400K. Another way to see this is to plot the ECDF (empirical cumulative distribution function), showing the cumulative distribution of the price.


def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n
    return x, y
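
The plotting call itself isn’t shown in the original; using the function would look like this:

# Plot the ECDF of the sale price
x, y = ecdf(sml.train['SalePrice'])
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('SalePrice')
_ = plt.ylabel('ECDF')
plt.show()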

It’s even clearer here: almost all of the values are under a 400K sale price, and roughly 75% of them are under 200K


Let’s now do some multivariate analysis between the target and the most relevant features, expecting to see positive correlations

_ = sns.jointplot(x="GrLivArea", y="SalePrice", data=sml.train)
sml.plot.bar("OverallQual", "SalePrice")
sml.plot.bar("GarageCars", "SalePrice");
plt.show()

Yep, the sale price definitely rises with the living area (the Pearson correlation coefficient of 0.72 tells us there is quite a strong positive correlation), with the overall quality and with the garage size, with the exception of the 4-car garages (something that could be interesting to investigate)


As a final step, let’s transform the categorical columns into something numeric, so they can be used by an ML algorithm


# Select the object columns
object_columns = sml.train.select_dtypes('object').columns
sml.train = pd.get_dummies(sml.train, columns = object_columns)
sml.train.shape
(1449, 275)

The number of columns increased a lot after encoding the categorical features (there are now 275), but only this way can the model be trained, because everything is numeric.
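
To make the encoding concrete, here is a toy example (mine, not from the dataset) of what get_dummies does to a single categorical column:

# Each category becomes its own 0/1 indicator column
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
print(pd.get_dummies(toy, columns=['Street']))
#    Street_Grvl  Street_Pave
# 0            0            1
# 1            1            0
# 2            0            1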


3.4 Feature engineering

Being a “Hello, World!”, we’ll use the data as is, without creating new features, but this step is very important because it can feed the model with more useful data. Just to show the idea, see the sketch below.
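
A hypothetical engineered feature (my example, not used in this post) could combine the area columns into a single total:

# Hypothetical new feature: total square footage of the house
# (the same line would be needed on sml.test to keep the frames consistent)
sml.train['TotalSF'] = (sml.train['TotalBsmtSF']
                        + sml.train['1stFlrSF']
                        + sml.train['2ndFlrSF'])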


4) Create and evaluate the model

Let’s split the train data into 70% (train) and 30% (test)

from sklearn.model_selection import train_test_split
features = sml.train.drop(columns='SalePrice')
targets = pd.DataFrame(sml.train['SalePrice'])
# Replace the inf and -inf with nan (required for later imputation)
features = features.replace({np.inf: np.nan, -np.inf: np.nan})
# Split into 70% training and 30% testing set
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size = 0.3, 
                                        random_state = 42)

Now we can start working with a model, but first… let’s define a baseline to beat.


4.1 Define a baseline

Let’s choose the mean absolute error (MAE) as the KPI and evaluate it with a naive model that always predicts the median sale price ($163,250)

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
  
baseline_guess = np.median(y_test)
print('The baseline guess is %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))
The baseline guess is 163250.00
Baseline Performance on the test set: MAE = 51501.8644

So the MAE to beat is 51501.86.


4.2 Train the simplest model

In this case, let’s use a linear regression with default parameters

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
    
# Train the model
lr.fit(X_train, y_train)
    
# Make predictions and evalute
lr_pred = lr.predict(X_test)
lr_mae = mae(y_test, lr_pred)
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
Linear Regression Performance on the test set: MAE = 17273.8701

4.3 Evaluate the model

Wow, we beat the naive baseline, cutting the MAE to about a third of the baseline value… let’s compare the real values with the predicted ones

_ = plt.plot(list(y_test.iloc[:,0]), marker='o', linestyle='none', 
             alpha=0.2, label='real values')
_ = plt.plot(lr_pred, marker='.', linestyle='none', label='predicted')
_ = plt.xlabel('number of samples')
_ = plt.ylabel('SalePrice')
_ = plt.legend()
plt.show()
ax = sns.distplot(lr_pred, color='red', kde=True)
ax = sns.distplot(list(y_test.iloc[:,0]), kde=True)
ax.set(xlabel='SalePrice', ylabel='probability')

The model seems to perform poorly roughly under 200K, but it’s a start and definitely a good result for a hello-world example.


Speaking of which…

Hello, World! :)
