House price prediction with Machine Learning in Python

Updated: Nov 8, 2020

Using Ridge, Bayesian, Lasso, Elastic Net, and OLS regression models for prediction



Introduction


Estimating the sale prices of houses is one of the basic projects to have on your Data Science CV. By the end of this article, you will be able to predict continuous variables using several types of linear regression algorithms.


Why linear regression? Linear regression is an algorithm used to predict continuous values. It is also one of the best algorithms to start with if you are new to machine learning, which is a large part of its popularity.


To predict the sale prices we are going to use the following linear regression algorithms: the Ordinary Least Squares (OLS) algorithm, the Ridge regression algorithm, the Lasso regression algorithm, the Bayesian regression algorithm, and lastly the Elastic Net regression algorithm. All of these algorithms can be implemented easily in Python with the scikit-learn package.
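For context, the main practical difference between these models is the penalty term that each one adds to the ordinary least-squares objective. Roughly, and ignoring the scaling constants that scikit-learn applies internally:

OLS:          minimize  Σᵢ (yᵢ − xᵢᵀβ)²
Ridge:        minimize  Σᵢ (yᵢ − xᵢᵀβ)²  +  α Σⱼ βⱼ²
Lasso:        minimize  Σᵢ (yᵢ − xᵢᵀβ)²  +  α Σⱼ |βⱼ|
Elastic Net:  minimize  Σᵢ (yᵢ − xᵢᵀβ)²  +  α·ρ Σⱼ |βⱼ|  +  α(1 − ρ)/2 · Σⱼ βⱼ²

Here ρ corresponds to scikit-learn’s ‘l1_ratio’ parameter. Bayesian regression (BayesianRidge) takes a different route: instead of a fixed penalty, it places Gaussian priors on the coefficients, which shrinks them in much the same way as the Ridge penalty.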


Finally, we conclude which model is best suited to the given case by evaluating each of them with the evaluation metrics provided by the scikit-learn package.
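As a quick preview of that evaluation step, here is a minimal, self-contained sketch of the two metrics we will rely on, explained variance and R-squared, run on toy data rather than on the house price dataset (the ‘X_toy’ and ‘y_toy’ arrays below are made up purely for illustration):

# EVALUATION METRICS PREVIEW (toy data, for illustration only)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, r2_score

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]]) # one feature, four samples
y_toy = np.array([2.0, 4.1, 5.9, 8.2]) # target values

toy_model = LinearRegression().fit(X_toy, y_toy)
toy_pred = toy_model.predict(X_toy)

print('Explained variance :', explained_variance_score(y_toy, toy_pred))
print('R-squared          :', r2_score(y_toy, toy_pred))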


Why Python?


Python is a general-purpose, high-level programming language best known for its readability and its powerful ecosystem. Data scientists love Python for its ease of use, which makes it very accessible, and for the extensive range of tools and packages it provides for building machine learning models. One of its strengths is that we can build a wide variety of machine learning models with very little code.


Steps Involved

  1. Importing the required packages into our python environment

  2. Importing the house price data and doing some EDA on it

  3. Data Visualization on the house price data

  4. Feature Selection & Data Split

  5. Modeling the data using the algorithms

  6. Evaluating the built model using the evaluation metrics

Let’s now jump into our coding part!

Importing required packages


Our primary packages for this project are going to be pandas for data processing, NumPy to work with arrays, matplotlib & seaborn for data visualizations, and finally scikit-learn for building and evaluating our ML models. Let’s import all the required packages into our Python environment.


Python Implementation:


# IMPORTING PACKAGES

import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
import seaborn as sb # visualization
from termcolor import colored as cl # text customization 

from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LinearRegression # OLS algorithm
from sklearn.linear_model import Ridge # Ridge algorithm
from sklearn.linear_model import Lasso # Lasso algorithm
from sklearn.linear_model import BayesianRidge # Bayesian algorithm
from sklearn.linear_model import ElasticNet # ElasticNet algorithm
from sklearn.metrics import explained_variance_score as evs # evaluation metric
from sklearn.metrics import r2_score as r2 # evaluation metric

sb.set_style('whitegrid') 
plt.rcParams['figure.figsize'] = (20, 10) 

Importing Data & EDA


As I said before, we are going to work with the house price dataset, which contains various features and information about the houses and their sale prices. Using the ‘read_csv’ method provided by the Pandas package, we can import the data into our Python environment. After importing the data, we can use the ‘head’ method to get a glimpse of our dataset.


Python Implementation:


# IMPORTING DATA

df = pd.read_csv('house.csv')
df.set_index('Id', inplace=True)

df.head(5)

Output:



Now let’s move on to the EDA part. We begin our EDA process by removing all the rows that contain null values from our dataset. We can do this in Python using the ‘dropna’ method.


Python Implementation:


df.dropna(inplace = True) 

print(cl(df.isnull().sum(), attrs = ['bold']))

Output:



Now, using the ‘describe’ method we can get a statistical summary of the data, including the mean, standard deviation, quartiles, and so on.


Python Implementation:


df.describe()

Output:



Our final step in the EDA process is to check the data types of the variables present in our dataset. If there is any float or object type variable, we will convert it to an integer type. Now, let’s have a look at the data types of the variables in our dataset using the ‘dtypes’ attribute in Python.


Python Implementation:


print(cl(df.dtypes, attrs = ['bold']))

Output:



We can see that the variable ‘MasVnrArea’ is stored as a float data type. Strictly speaking, scikit-learn’s linear models can also work with float values, but to keep every feature as a whole-number area measurement we will convert this column to an integer type. The conversion can be done using the ‘astype’ method in Python.


Python Implementation:


df['MasVnrArea'] = pd.to_numeric(df['MasVnrArea'], errors = 'coerce') # non-numeric entries become NaN
df['MasVnrArea'] = df['MasVnrArea'].astype('int64') # cast the column to integer

print(cl(df.dtypes, attrs = ['bold']))

Output:



With that, our EDA process is over. Our next process is to visualize the data using the matplotlib and seaborn packages.


Data Visualization


In this process, we are going to produce three different types of charts: a heatmap, scatter plots, and a distribution plot.


(i) Heatmap:


Heatmaps are very useful for spotting how strongly any two variables in a dataset are related. A heatmap of the correlation matrix can be easily produced using the ‘heatmap’ method provided by the seaborn package in Python.


Python Implementation:


# 1. Heatmap

sb.heatmap(df.corr(), annot = True, cmap = 'magma')

plt.savefig('heatmap.png')
plt.show()

Output:



(ii) Scatter plot


Like the heatmap, a scatter plot is also used to observe linear relations between two variables in a dataset. In a scatter plot, the independent variable is marked on the x-axis and the dependent variable is marked on the y-axis. In our case, the ‘SalePrice’ attribute is the dependent variable, and every other attribute is an independent variable. It would be tedious to write separate code for each plot, so we can define a function that takes only the dependent variable and produces a scatter plot against every independent variable present in the dataset.


Python Implementation:


# 2. Scatter plot

def scatter_df(y_var):
    x_vars = df.drop(y_var, axis = 1).columns # every independent variable
    colors = ['orange', 'yellow', 'aquamarine', 'deepskyblue', 'crimson',
              'darkviolet', 'khaki', 'gold', 'r', 'deeppink']
    
    for n, x_var in enumerate(x_vars):
        sb.scatterplot(x = x_var, 
                       y = y_var, 
                       data = df, 
                       color = colors[n % len(colors)], 
                       edgecolor = 'b', 
                       s = 150)
        plt.title('{} / Sale Price'.format(x_var), fontsize = 16)
        plt.xlabel('{}'.format(x_var), fontsize = 14)
        plt.ylabel('Sale Price', fontsize = 14)
        plt.xticks(fontsize = 12)
        plt.yticks(fontsize = 12)
        plt.savefig('scatter{}.png'.format(n + 1))
        plt.show()
        
scatter_df('SalePrice')

Output:












(iii) Distribution Plot


Distribution plots are very useful for checking how a variable is distributed in the dataset. Let’s now produce a distribution plot using the ‘distplot’ method to check the distribution of the ‘SalePrice’ variable in the dataset.


Python Implementation:


# 3. Distribution plot

sb.distplot(df['SalePrice'], color = 'r')
plt.title('Sale Price Distribution', fontsize = 16)
plt.xlabel('Sale Price', fontsize = 14)
plt.ylabel('Frequency', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.savefig('distplot.png')
plt.show()

Output:



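A quick side note on versions: ‘distplot’ works in the seaborn release used here, but it has been deprecated since seaborn 0.11. If you are on a newer release, a roughly equivalent chart can be produced with ‘histplot’, for example:

sb.histplot(df['SalePrice'], kde = True, color = 'r') # histogram with a KDE curve
plt.title('Sale Price Distribution', fontsize = 16)
plt.xlabel('Sale Price', fontsize = 14)
plt.ylabel('Frequency', fontsize = 14)
plt.show()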
With that, we have finished the Data Visualization process. Our next step is to select and define the dependent variable and the independent variables and split them into a train set and a test set.


Feature Selection & Data Split


As I said before, in this process we are going to define the ‘X’ variables (the independent variables) and the ‘y’ variable (the dependent variable). After defining the variables, we will use them to split the data into a train set and a test set. Splitting the data can be done using the ‘train_test_split’ method provided by scikit-learn in Python.


Python Implementation:


# FEATURE SELECTION & DATA SPLIT

X_var = df[['LotArea', 'MasVnrArea', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF']].values

y_var = df['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(X_var, y_var,
                                                    test_size = 0.2, 
                                                    random_state = 0)

print(cl('X_train samples : ', attrs= ['bold']), X_train[0:5])
print(cl('X_test samples : ', attrs= ['bold']), X_test[0:5])
print(cl('y_train samples : ', attrs= ['bold']), y_train[0:5])
print(cl('y_test samples : ', attrs= ['bold']), y_test[0:5])

Output:



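One optional refinement that this article does not apply: Ridge, Lasso, and Elastic Net penalize the size of the coefficients, so their results depend on the scale of the features. If you want the penalty to treat every feature equally, you can standardize the split data first. A minimal sketch, reusing the ‘X_train’ and ‘X_test’ arrays defined above (this scaling step is not part of the workflow used in the rest of the article):

# OPTIONAL: standardize the features before fitting the penalized models

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit the scaler on the train set only
X_test_scaled = scaler.transform(X_test) # reuse the train-set statistics on the test set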
Now we have all the elements we need to build our linear regression models. So, let’s proceed to our next step, which is building the models using scikit-learn in Python.


Modeling


In this process, we are going to build and train five different types of linear regression models: the OLS model, the Ridge regression model, the Lasso regression model, the Bayesian regression model, and the Elastic Net regression model. For all of them, we are going to use the pre-built algorithms provided by the scikit-learn package in Python. The process is the same for each model: first, we define a variable to store the model algorithm; next, we fit the model to the train set; and finally, we make predictions on the test set.


Python Implementation:


# MODELING

# 1. OLS

ols = LinearRegression()
ols.fit(X_train, y_train)
ols_yhat = ols.predict(X_test)

# 2. Ridge

ridge = Ridge(alpha = 0.5) # alpha controls the strength of the L2 penalty
ridge.fit(X_train, y_train)
ridge_yhat = ridge.predict(X_test)