We will be using a dataset that contains vehicles from the manufacturer Subaru.
Download it from the following link and copy the file into your working directory.
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
In EDA we use visualisation and transformation to explore our data in a systematic way.
It helps you understand and clean up the given dataset.
It gives you a clear picture of the features and the relationships between them.
Identifying essential variables and removing non-essential ones.
Handling missing values and human errors.
Identifying outliers.
Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention.
Machine learning algorithms use historical data as input to predict new output values.
Supervised learning uses labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y).
Classification is used to predict the outcome of a categorical variable.
If the label column (target variable) in your data is qualitative and takes values in discrete categories, we use classification algorithms.
Regression is used to predict the outcome of a continuous variable. When the target variable in our data is continuous, we use regression algorithms.
Unsupervised learning models are used when we only have the input variables (X) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.
Clustering
Dimensionality Reduction
Association
Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best next action based on its current state by learning behaviors that will maximize a reward.
Reinforcement algorithms usually learn optimal actions through trial and error.
Problem Formulation
Data Tidying
Feature Pre-processing
Data Splitting
Model Building
Model Evaluation
Prediction
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.1.3
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
# load the dataset into a dataframe
df <- read.csv(file.choose(),header=TRUE)
# Number of rows and columns
dim(df)
## [1] 276 6
# first 6 rows of the df
head(df)
## title model condition
## 1 Subaru Forester 2014 Gray Forester Foreign Used
## 2 Subaru Forester 2014 Blue Forester Foreign Used
## 3 Subaru XV 2014 Sport Package Blue XV Foreign Used
## 4 Subaru Forester 2014 Black Forester Foreign Used
## 5 New Subaru Impreza 2012 WRX Hatchback STI Limited White Impreza Brand New
## 6 Subaru Forester 2014 Green Forester Foreign Used
## yom mileage price
## 1 2014 NA 2890000
## 2 2014 100862 2400000
## 3 2014 115000 1850000
## 4 2014 38000 2400000
## 5 2012 6683 1950000
## 6 2014 89021 2700000
# Return the dataframe in a table
# View(df)
To see the structure of the dataframe use str()
# structure of the dataframe
str(df)
## 'data.frame': 276 obs. of 6 variables:
## $ title : chr "Subaru Forester 2014 Gray" "Subaru Forester 2014 Blue" "Subaru XV 2014 Sport Package Blue" "Subaru Forester 2014 Black" ...
## $ model : chr "Forester" "Forester" "XV" "Forester" ...
## $ condition: chr "Foreign Used" "Foreign Used" "Foreign Used" "Foreign Used" ...
## $ yom : chr "2014" "2014" "2014" "2014" ...
## $ mileage : num NA 100862 115000 38000 6683 ...
## $ price : int 2890000 2400000 1850000 2400000 1950000 2700000 870000 1350000 1350000 1950000 ...
I am going to walk you through how we can train a model that will help us predict car prices. You'll practice the machine learning workflow you've learned so far to predict a car's market price from its attributes. The dataset we will be working with contains information on various cars of the Subaru make.
This study deals with predicting the price of used cars in the Kenyan market.
In the world of Machine Learning, this is a regression problem: we predict the price of a used car given features such as mileage, model, condition, and year of manufacture.
# change model column to categorical type
df$model <- as.factor(df$model)
# change condition column to categorical type
df$condition <- as.factor(df$condition)
# change yom column to numeric type
df$yom <- as.numeric(df$yom)
## Warning: NAs introduced by coercion
# structure of dataframe
str(df)
## 'data.frame': 276 obs. of 6 variables:
## $ title : chr "Subaru Forester 2014 Gray" "Subaru Forester 2014 Blue" "Subaru XV 2014 Sport Package Blue" "Subaru Forester 2014 Black" ...
## $ model : Factor w/ 10 levels "Exiga","Forester",..: 2 2 10 2 3 2 4 3 3 5 ...
## $ condition: Factor w/ 4 levels "Brand New","Foreign Used",..: 2 2 2 2 1 2 3 2 2 2 ...
## $ yom : num 2014 2014 2014 2014 2012 ...
## $ mileage : num NA 100862 115000 38000 6683 ...
## $ price : int 2890000 2400000 1850000 2400000 1950000 2700000 870000 1350000 1350000 1950000 ...
Observe which columns have missing values and deal with them. You can drop the rows with missing values, or replace the missing values with the mean, mode, or median.
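For reference, a minimal sketch of median imputation for the mileage column (not applied in this tutorial, since we drop the incomplete rows in a later step):
# alternative (not run here): fill missing mileage with the column median
# df$mileage[is.na(df$mileage)] <- median(df$mileage, na.rm = TRUE)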
# summary of the dataframe
summary(df)
## title model condition yom
## Length:276 Forester:94 Brand New : 12 Min. :2000
## Class :character Outback :60 Foreign Used:212 1st Qu.:2013
## Mode :character Impreza :51 Kenyan Used : 51 Median :2014
## Legacy :26 NULL : 1 Mean :2013
## XV :22 3rd Qu.:2014
## Levorg :16 Max. :2016
## (Other) : 7 NA's :1
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1450000
## Median : 80000 Median :2140000
## Mean : 96104 Mean :2137547
## 3rd Qu.:101431 3rd Qu.:2749250
## Max. :868301 Max. :3650000
## NA's :41
# condition has 1 NULL
# yom has 1 NA
# mileage has 41 missing values
Dropping observations with missing values. All cars without mileage are dropped.
# df[is.na(df$mileage),]
df <- df[!is.na(df$mileage),]
Fill in the missing value for yom column
# notice how the title of the row has the year of make 2014
df[is.na(df$yom),]
## title model condition yom mileage price
## 271 Subaru Outback 2014 black Outback NULL NA 180097 1649999
# set the NA value to 2014
df[is.na(df$yom),'yom'] <- 2014
Another way to achieve the same result is to fill the missing value with the mode of the column.
We can find the mode easily by visualizing the counts with a barplot.
# table of year and frequency
freq_yom <- table(df$yom)
# sort in descending order
freq_yom <- sort(freq_yom,decreasing=TRUE)
barplot(freq_yom,col="lightblue",
ylab="Number of cars",xlab="yom",las=2,
main="Distribution of cars and their year of make")
Use the mode year and replace the missing value
# table of year and frequency
freq_yom <- table(df$yom)
# sort in descending order
freq_yom <- sort(freq_yom,decreasing=TRUE)
# get the mode year
mode_year <- names(freq_yom)[1]
# change to numeric
mode_year <- as.numeric(mode_year)
# replace the missing value
df[is.na(df$yom),'yom'] <- mode_year
The condition column does not contain any missing values. However, one of the observations is the string "NULL".
We will replace the "NULL" using the mode method discussed above.
# table of condition and frequency
freq_cond <- table(df$condition)
# sort in descending order
freq_cond <- sort(freq_cond,decreasing=TRUE)
barplot(freq_cond,col="lightblue",
ylab="Number of cars",xlab="condition",
main="Distribution of cars and their condition")
Foreign Used is the mode, or the majority class.
# table of condition and frequency
freq_cond <- table(df$condition)
# sort in descending order
freq_cond <- sort(freq_cond,decreasing=TRUE)
# grab the name of the top condition
mode_cond <- names(freq_cond)[1]
# update the row with NULL with the mode condition
df[df$condition=="NULL",'condition'] <- mode_cond
# summary of data frame
summary(df)
## title model condition yom
## Length:235 Forester:79 Brand New : 12 Min. :2000
## Class :character Outback :51 Foreign Used:178 1st Qu.:2013
## Mode :character Impreza :46 Kenyan Used : 45 Median :2014
## Legacy :22 NULL : 0 Mean :2013
## XV :19 3rd Qu.:2014
## Levorg :11 Max. :2016
## (Other) : 7
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1435000
## Median : 80000 Median :2100000
## Mean : 96104 Mean :2126740
## 3rd Qu.:101431 3rd Qu.:2715000
## Max. :868301 Max. :3650000
##
# re-create the factor to drop the now-unused NULL level from condition
df$condition <- as.factor(as.character(df$condition))
summary(df)
## title model condition yom
## Length:235 Forester:79 Brand New : 12 Min. :2000
## Class :character Outback :51 Foreign Used:178 1st Qu.:2013
## Mode :character Impreza :46 Kenyan Used : 45 Median :2014
## Legacy :22 Mean :2013
## XV :19 3rd Qu.:2014
## Levorg :11 Max. :2016
## (Other) : 7
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1435000
## Median : 80000 Median :2100000
## Mean : 96104 Mean :2126740
## 3rd Qu.:101431 3rd Qu.:2715000
## Max. :868301 Max. :3650000
##
Outliers, as the name suggests, are data points that lie far away from the other points in the dataset. These values sit apart from the rest of the data and hence distort the overall distribution of the dataset.
Outliers in data can distort predictions and affect accuracy if you don't detect and handle them appropriately, especially in regression models.
For a given continuous variable, outliers are the observations that lie more than 1.5*IQR below the first quartile or above the third quartile, where IQR, the Inter Quartile Range, is the difference between the 75th and 25th percentiles.
A boxplot graphically represents the distribution of a quantitative variable by displaying its five-number summary (minimum, first quartile, median, third quartile, maximum) together with any observations classified as suspected outliers by the interquartile range (IQR) criterion.
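As a small sketch, the IQR rule can be wrapped in a reusable helper (the is_outlier name is ours, not from any package):
# flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)
}
# example: count suspected mileage outliers
sum(is_outlier(df$mileage))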
# create boxplot of mileage
boxplot(df$mileage)
# get the lower quartile of mileage
lq <- quantile(df$mileage,0.25)
lq
## 25%
## 65566
# get the upper quartile value
uq <- quantile(df$mileage,0.75)
uq
## 75%
## 101431
For values that lie outside the 1.5*IQR limits, we could cap them by replacing observations below the lower limit with the 5th percentile value, and those above the upper limit with the 95th percentile value.
# get the lower capping value (note: the 50th percentile is used here, not the 5th)
cap_lower <- quantile(df$mileage,0.5)
cap_lower
## 50%
## 80000
# get the 95 percentile value
cap_upper <- quantile(df$mileage,0.95)
cap_upper
## 95%
## 180151.9
We can either cap the outliers at these values or drop them. Below we drop the outlier rows; the capping alternative is left commented out.
# interquartile range
# iqr <- uq - lq
iqr <- IQR(df$mileage)
# show the outliers to the lower side
df[df$mileage<(lq-(1.5*iqr)),]
## title model condition
## 5 New Subaru Impreza 2012 WRX Hatchback STI Limited White Impreza Brand New
## yom mileage price
## 5 2012 6683 1950000
# drop the rows
df <- df[!df$mileage<(lq-(1.5*iqr)),]
# replace it with 5th percentile value
# df[df$mileage<(lq-(1.5*iqr)),'mileage'] <- cap_lower
# show the outliers to the upper side
# df[df$mileage > (uq+(1.5*iqr)),]
# drop the rows
df <- df[!df$mileage > (uq+(1.5*iqr)),]
# replace values that are greater than 1.5 upper quartile
# df[df$mileage > (uq+(1.5*iqr)),'mileage'] <- cap_upper
summary(df$mileage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15000 65000 78633 81713 92500 155000
Let’s see the boxplot of the cleaned mileage column
boxplot(df$mileage)
We can use a scatter plot to visualize the distribution of one variable in relation to another variable. Let’s see how mileage and price are related
plot(df$mileage,df$price)
ggplot(df,aes(x=mileage,y=price)) + geom_point() + scale_y_continuous("Price of vehicle",labels = scales::comma) + labs(title="Distribution of Price vs Mileage of Subaru Cars")
Let's create a new column age_years that captures the number of years that have passed since the year of manufacture (yom).
# create a new column
df$age_years <- (2022 - df$yom)
# using pipe syntax
df <- df %>% mutate(age_years = (2022-yom))
Summary of the new column
# summary of age_years
summary(df$age_years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.000 7.000 8.000 8.344 8.000 17.000
We will drop the vehicles whose age_years is 15 or more.
# box plot of age_years
boxplot(df$age_years)
# show vehicles that age_years is less than 15 years
# df[df$age_years<15,]
# keep only cars with less than 15 age_years
df <- df[df$age_years<15,]
# number of rows
dim(df)
## [1] 210 7
We identify the dependent and independent variables.
library(corrplot)
# correlation plot of numeric variables
# df %>% select_if(is.numeric) %>% cor() %>% corrplot()
df %>%
select_if(is.numeric) %>%
cor() %>%
corrplot(method = 'square', order = 'FPC',
type = 'lower', diag = FALSE)
yom has a perfect (negative) correlation with age_years, since age_years is derived directly from yom.
We will keep only one of these two columns in our model to avoid multicollinearity, so let's drop the yom column.
We can get the index of the column and use the index to exclude it, or use dplyr's select().
# get column names
names(df)
## [1] "title" "model" "condition" "yom" "mileage" "price"
## [7] "age_years"
# it's on the fourth index
# df <- df[, -4]
# using dplyr
df <- df %>% select(-yom) # select all columns except yom
head(df)
## title model condition mileage price
## 2 Subaru Forester 2014 Blue Forester Foreign Used 100862 2400000
## 3 Subaru XV 2014 Sport Package Blue XV Foreign Used 115000 1850000
## 4 Subaru Forester 2014 Black Forester Foreign Used 38000 2400000
## 6 Subaru Forester 2014 Green Forester Foreign Used 89021 2700000
## 9 Subaru Impreza 2014 White Impreza Foreign Used 83000 1350000
## 11 Subaru Impreza 2014 Silver Impreza Foreign Used 64000 1240000
## age_years
## 2 8
## 3 8
## 4 8
## 6 8
## 9 8
## 11 8
# model and associated frequency table
freq_model <- table(df$model)
freq_model <- sort(freq_model,decreasing = TRUE)
barplot(freq_model,main="Number of cars by model",las=2,col="lightblue")
How does the model affect the price?
# use dplyr group_by to group by model
avg_price.model <- df %>% group_by(model) %>% summarise(avg_price=mean(price)) %>% arrange(desc(avg_price))
barplot(avg_price.model$avg_price, names.arg=avg_price.model$model,
main="Average Price of Subaru Models",
col="lightblue",las=2)
Does the condition affect the price?
# use dplyr group_by to group by condition
avg_price.cond <- df %>% group_by(condition) %>% summarise(avg_price=mean(price)) %>% arrange(desc(avg_price))
barplot(avg_price.cond$avg_price, names.arg=avg_price.cond$condition,
main="Average Price of Subaru Based on Condition",
col="lightblue",las=2)
A dummy variable is a type of variable that we create in regression analysis so that we can represent a categorical variable as a numerical variable that takes on one of two values: zero or one.
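For intuition, base R's model.matrix() shows what dummy encoding of the condition column looks like (one 0/1 column per level, with the first level as the reference); this is only an illustration, the fastDummies package below is what we actually use:
# quick illustration: dummy-encode condition with base R
head(model.matrix(~ condition, data = df))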
Install the fastDummies package:
install.packages('fastDummies')
## Warning: package 'fastDummies' is in use and will not be installed
Load the fastDummies package into your session:
# load the package to your session
library(fastDummies)
Use the dummy_cols() function to make the dummy variables.
#create dummy variables
df <- dummy_cols(
df,
select_columns = c('condition', 'model'),
remove_selected_columns = TRUE,
remove_first_dummy = TRUE
)
# let's see the first 6 rows of our new df
head(df)
## title mileage price age_years
## 1 Subaru Forester 2014 Blue 100862 2400000 8
## 2 Subaru XV 2014 Sport Package Blue 115000 1850000 8
## 3 Subaru Forester 2014 Black 38000 2400000 8
## 4 Subaru Forester 2014 Green 89021 2700000 8
## 5 Subaru Impreza 2014 White 83000 1350000 8
## 6 Subaru Impreza 2014 Silver 64000 1240000 8
## condition_Foreign Used condition_Kenyan Used model_Forester model_Impreza
## 1 1 0 1 0
## 2 1 0 0 0
## 3 1 0 1 0
## 4 1 0 1 0
## 5 1 0 0 1
## 6 1 0 0 1
## model_Legacy model_Levorg model_Outback model_SVX model_Trezia model_Tribeca
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## model_XV
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
Before we split the data into a training and testing set, we should drop the title
variable since it does not add any valuable information to our model.
# drop the first column
df <- df[,-1]
head(df)
## mileage price age_years condition_Foreign Used condition_Kenyan Used
## 1 100862 2400000 8 1 0
## 2 115000 1850000 8 1 0
## 3 38000 2400000 8 1 0
## 4 89021 2700000 8 1 0
## 5 83000 1350000 8 1 0
## 6 64000 1240000 8 1 0
## model_Forester model_Impreza model_Legacy model_Levorg model_Outback
## 1 1 0 0 0 0
## 2 0 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## model_SVX model_Trezia model_Tribeca model_XV
## 1 0 0 0 0
## 2 0 0 0 1
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
We will also shift the price column to the end of the data frame
# shift price column to end of df
df <- df %>%
select(-price,everything())
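An equivalent alternative (available in dplyr 1.0 and later) is relocate():
# move price to the last position
# df <- df %>% relocate(price, .after = last_col())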
install.packages("caTools") # for data splitting
library(caTools)
# set seed to ensure you always have same random numbers generated
set.seed(123)
# split the data in the ratio given by SplitRatio (note: sample.split() is normally given the target column, e.g. df$price, rather than the whole data frame)
sample = sample.split(df,SplitRatio = 0.8)
# create a training set
train_set =subset(df,sample ==TRUE)
# create a testing set
test_set=subset(df, sample==FALSE)
# number of rows and columns for training data
dim(train_set)
## [1] 165 14
# number of rows and columns for test data
dim(test_set)
## [1] 45 14
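If you prefer not to depend on caTools, a minimal base-R split sketch looks like this (the alt_train/alt_test names are ours, and the selected rows will differ from the split used above):
# base-R alternative: sample 80% of row indices for training
set.seed(123)
idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
alt_train <- df[idx, ]
alt_test  <- df[-idx, ]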
Now that our data is ready for regression analysis, let’s choose a regression algorithm.
Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.
No noise — model assumes that the input and output variables are not noisy — so remove outliers if possible
No collinearity — model will overfit when you have highly correlated input variables
Normal distribution — the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking
Normality of residuals - The residual errors are assumed to be normally distributed.
Rescaled inputs — use scalers or normalizer to make more reliable predictions
You should be aware of these assumptions every time you’re creating linear models. We’ll ignore most of them for the purpose of this tutorial, as the goal is to show you the general syntax you can copy-paste between your projects.
There are two main types of linear regression:
Simple linear regression : uses only one independent variable
Multiple linear regression : uses two or more independent variables
In this process, a relationship is established between the independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by a linear equation:
y = mX + c
In this equation:
Y – Dependent Variable
m – Slope or Gradient
X – Independent variable
c – Y Intercept
The coefficients m and c are derived by minimizing the sum of the squared differences between the data points and the regression line.
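To make the least-squares idea concrete, here is a small sketch (the x, y, m, c0 names are ours) that computes the slope and intercept from their closed-form formulas; the values should match the coefficients lm() reports below:
# closed-form least-squares estimates for a single predictor
x <- train_set$age_years
y <- train_set$price
m <- cov(x, y) / var(x)      # slope
c0 <- mean(y) - m * mean(x)  # intercept
c(intercept = c0, slope = m)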
# scatter plot of price against age_years
plot(train_set$age_years,train_set$price)
We will use the lm() function to fit a linear model.
# fit a linear model using the training data
linear_model <- lm(price~age_years, data=train_set)
# see the output of the model
summary(linear_model)
##
## Call:
## lm(formula = price ~ age_years, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1294542 -524542 75458 515172 1365172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4642254 292129 15.891 < 2e-16 ***
## age_years -289714 35006 -8.276 4.33e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 636600 on 163 degrees of freedom
## Multiple R-squared: 0.2959, Adjusted R-squared: 0.2916
## F-statistic: 68.5 on 1 and 163 DF, p-value: 4.331e-14
# scatter plot with the fitted regression line
plot(train_set$age_years,train_set$price)
# draw the regression line
abline(linear_model,col="blue")
Using ggplot2
# fit a regression line on the scatter plot
ggplot(data=train_set,aes(x=age_years,y=price)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
# fit a multiple linear model
multi_model <- lm(price ~ ., data=train_set)
summary(multi_model)
##
## Call:
## lm(formula = price ~ ., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -847779 -223446 6425 223238 1119803
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.187e+06 4.067e+05 7.836 7.51e-13 ***
## mileage -4.219e-01 1.310e+00 -0.322 0.747737
## age_years -2.055e+05 2.679e+04 -7.671 1.91e-12 ***
## `condition_Foreign Used` -1.296e+04 1.199e+05 -0.108 0.914058
## `condition_Kenyan Used` -2.774e+05 1.458e+05 -1.903 0.058917 .
## model_Forester 1.153e+06 3.412e+05 3.380 0.000922 ***
## model_Impreza 1.357e+05 3.435e+05 0.395 0.693373
## model_Legacy 3.453e+05 3.516e+05 0.982 0.327577
## model_Levorg 3.815e+05 3.648e+05 1.046 0.297326
## model_Outback 1.472e+06 3.417e+05 4.308 2.94e-05 ***
## model_SVX 1.523e+06 4.788e+05 3.181 0.001782 **
## model_Trezia -2.806e+05 3.906e+05 -0.718 0.473606
## model_Tribeca NA NA NA NA
## model_XV 4.429e+05 3.484e+05 1.271 0.205592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 336500 on 152 degrees of freedom
## Multiple R-squared: 0.8165, Adjusted R-squared: 0.802
## F-statistic: 56.36 on 12 and 152 DF, p-value: < 2.2e-16
The residuals are the difference between the actual values and the predicted values.
For a given predictor, the t-statistic (and its associated p-value) tests whether there is a statistically significant relationship between that predictor and the outcome variable, that is, whether the predictor's beta coefficient is significantly different from zero.
The statistical hypotheses are as follow:
Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship between x and y)
Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is some relationship between x and y)
Another aspect to pay attention to in your linear models is the p-value of the coefficients.
A p-value indicates whether or not you can reject the null hypothesis.
A very small p-value means that the predictor is probably a valuable addition to your model.
A standard way to check whether a predictor is meaningful is to look at whether its p-value is smaller than 0.05.
To the right of the p-values you’ll see several asterisks (or none if the coefficient is not significant to the model). The number of asterisks corresponds with the significance of the coefficient as described in the legend just under the coefficients section. The more asterisks, the more significant.
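You can also pull these numbers out programmatically; a small sketch (coef_table is our own name):
# extract the coefficient table (estimate, std. error, t value, p-value)
coef_table <- summary(multi_model)$coefficients
# predictors whose p-value is below 0.05
rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]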
R^2, the coefficient of determination, is a measure of the goodness of fit of the model. It measures the proportion of the total variability that is explained by the model, or how well the model fits the data.
R^2 varies between 0 and 1:
R^2 = 0: the model explains nothing
R^2 = 1: the model explains everything
0 < R^2 < 1: the model explains part of the variability
The higher the R^2, the better the model explains the dependent variable. As a rule of thumb, an R^2 > 0.7 indicates a good fit of the model.
The Multiple R-squared value is most often used for simple linear regression (one predictor). It tells us what percentage of the variation in our dependent variable the independent variable explains. In other words, it's another way to determine how well our model is fitting the data.
The Adjusted R-squared value is used when running multiple linear regression.
If we add variables, R-squared will increase whether or not they are significant for prediction. This is why Adjusted R-squared is used: if an added variable is not significant for the prediction, the Adjusted R-squared will decrease. It is one of the most helpful tools for avoiding overfitting of the model.
A small p-value for the F-statistic confirms that there is a linear relationship between the predictors and the response variable.
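These goodness-of-fit numbers can likewise be read straight from the model summary; a quick sketch (s is our own name):
# extract R-squared, adjusted R-squared and the F-statistic
s <- summary(multi_model)
s$r.squared
s$adj.r.squared
s$fstatistic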
# predict price using test_set
prediction <- predict(multi_model, newdata = test_set)
## Warning in predict.lm(multi_model, newdata = test_set): prediction from a rank-
## deficient fit may be misleading
prediction
## 5 8 11 19 22 25 33 36
## 1630618.9 1873753.9 2650166.2 2644385.5 1881968.7 1502528.7 2859577.1 1841631.4
## 39 47 50 53 61 64 67 75
## 1874738.3 2860981.2 2953736.3 1960876.8 2661136.2 941424.1 2870822.6 1877691.7
## 78 81 89 92 95 103 106 109
## 2653177.5 2058610.7 2860316.7 2991915.1 2866600.0 2048914.6 2643837.4 1949185.2
## 117 120 123 131 134 137 145 148
## 2868291.0 2137667.5 2663034.8 1627821.6 1640465.8 2374764.6 2837912.7 1946649.0
## 151 159 162 165 173 176 179 187
## 2272041.5 1656514.3 3170764.2 2977785.9 1843621.2 2461452.6 2653541.6 2983270.8
## 190 193 201 204 207
## 1637369.7 2441103.0 2644866.9 2950361.0 2049879.9
If you want a more concrete way of evaluating your regression models, look no further than RMSE (Root Mean Squared Error). This metric tells you how wrong your model is on average; in this case, it reports the average error in Kenyan Shillings (Kshs).
# compute the mean squared error
mse <- mean((test_set$price - prediction)^2)
paste("Mean Squared Error",mse)
## [1] "Mean Squared Error 90931834152.6189"
rmse <- sqrt(mse)
paste("Root Mean Squared Error",rmse)
## [1] "Root Mean Squared Error 301549.057621838"
R^2 for the simple linear regression model is 0.29, which means that 29% of the variability of the price of a Subaru car is explained by the age of the car.
The Adjusted R^2 of the multiple regression model is 0.8, which means that 80% of the variability of the price of a Subaru car is explained by the mileage, age_years, condition and the model of the car.
The relatively high R^2 means that the mileage, age_years, condition and model of a car are good characteristics for explaining its price.
One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.
When the normality assumption is violated, interpretation and inference may not be reliable, or may not be valid at all.
It is important that we examine the normality of the residuals.
Plot the distribution of the residuals
# get the residuals from the simple linear regression model
linear_residuals <- residuals(linear_model)
linear_residuals <- as.data.frame(linear_residuals)
# distribution of simple linear model
ggplot(linear_residuals, aes(residuals(linear_model))) +
geom_histogram(aes(y=..density..),fill = "#0099f9", color = "black") +
geom_density() +
theme_classic() +
labs(title = "Residuals of Simple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Get residuals
multi_residuals <- as.data.frame(residuals(multi_model))
# Visualize residuals
ggplot(multi_residuals, aes(residuals(multi_model))) +
geom_histogram(fill = "#0099f9", color = "black") +
theme_classic() +
labs(title = "Residuals of Multiple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#plot(multi_model$residuals, pch = 16, col = "red")
# Get residuals
multi_residuals <- as.data.frame(residuals(multi_model))
# Visualize residuals
ggplot(multi_residuals, aes(residuals(multi_model))) +
geom_histogram(aes(y=..density..),fill = "#0099f9", color = "black") +
geom_density() +
theme_classic() +
labs(title = "Residuals of Multiple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
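Besides histograms, a normal Q-Q plot and the Shapiro-Wilk test are common ways to check the normality of residuals; a brief sketch:
# Q-Q plot of the multiple regression residuals
qqnorm(residuals(multi_model))
qqline(residuals(multi_model), col = "blue")
# Shapiro-Wilk normality test (a small p-value suggests non-normal residuals)
shapiro.test(residuals(multi_model))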
We have learned how to train linear regression models in R. We fitted both a simple linear regression model and a multiple linear regression model on a real-world dataset.
You’ve also learned how to evaluate the model through summary functions, residuals plots, and various metrics such as MSE and RMSE.