We will be using a dataset that contains vehicles from the manufacturer Subaru.
Download it from the following link and copy the file into your working directory.
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
In EDA we use visualisation and transformation to explore our data in a systematic way.
It helps you understand and clean up the given dataset.
It gives you a clear picture of the features and the relationships between them.
Identifying essential variables and removing non-essential ones.
Handling missing values and human errors.
Identifying outliers.
Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention.
Machine learning algorithms use historical data as input to predict new output values.
Supervised learning uses labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y).
Classification is used to predict the outcome of a categorical variable.
If the label column (target variable) in your data is qualitative and takes values in discrete categories, we use classification algorithms.
Regression is used to predict the outcome of a continuous variable. When the target variable in our data is continuous, we use regression algorithms.
Unsupervised learning models are used when we only have the input variables (X) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.
Clustering
Dimensionality Reduction
Association
Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best next action based on its current state by learning behaviors that will maximize a reward.
Reinforcement algorithms usually learn optimal actions through trial and error.
Problem Formulation
Data Tidying
Feature Pre-processing
Data Splitting
Model Building
Model Evaluation
Prediction
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.1.3
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
# load the dataset into a dataframe
df <- read.csv(file.choose(),header=TRUE)
# Number of rows and columns
dim(df)
## [1] 276 6
# first 6 rows of the df
head(df)
## title model condition
## 1 Subaru Forester 2014 Gray Forester Foreign Used
## 2 Subaru Forester 2014 Blue Forester Foreign Used
## 3 Subaru XV 2014 Sport Package Blue XV Foreign Used
## 4 Subaru Forester 2014 Black Forester Foreign Used
## 5 New Subaru Impreza 2012 WRX Hatchback STI Limited White Impreza Brand New
## 6 Subaru Forester 2014 Green Forester Foreign Used
## yom mileage price
## 1 2014 NA 2890000
## 2 2014 100862 2400000
## 3 2014 115000 1850000
## 4 2014 38000 2400000
## 5 2012 6683 1950000
## 6 2014 89021 2700000
# Return the dataframe in a table
# View(df)
To see the structure of the dataframe use str()
# structure of the dataframe
str(df)
## 'data.frame': 276 obs. of 6 variables:
## $ title : chr "Subaru Forester 2014 Gray" "Subaru Forester 2014 Blue" "Subaru XV 2014 Sport Package Blue" "Subaru Forester 2014 Black" ...
## $ model : chr "Forester" "Forester" "XV" "Forester" ...
## $ condition: chr "Foreign Used" "Foreign Used" "Foreign Used" "Foreign Used" ...
## $ yom : chr "2014" "2014" "2014" "2014" ...
## $ mileage : num NA 100862 115000 38000 6683 ...
## $ price : int 2890000 2400000 1850000 2400000 1950000 2700000 870000 1350000 1350000 1950000 ...
I am going to walk you through how we can train a model that will help us predict car prices. You'll practice the machine learning workflow you've learned so far to predict a car's market price from its attributes. The dataset we will be working with contains information on various cars of the Subaru make.
This study deals with predicting the price of used cars in the Kenyan market.
In the world of Machine Learning, this is a regression problem: we predict the price of a used car given features such as mileage, model, condition, and year of manufacture.
# change model column to categorical type
df$model <- as.factor(df$model)
# change condition column to categorical type
df$condition <- as.factor(df$condition)
# change yom column to numeric type
df$yom <- as.numeric(df$yom)
## Warning: NAs introduced by coercion
# structure of dataframe
str(df)
## 'data.frame': 276 obs. of 6 variables:
## $ title : chr "Subaru Forester 2014 Gray" "Subaru Forester 2014 Blue" "Subaru XV 2014 Sport Package Blue" "Subaru Forester 2014 Black" ...
## $ model : Factor w/ 10 levels "Exiga","Forester",..: 2 2 10 2 3 2 4 3 3 5 ...
## $ condition: Factor w/ 4 levels "Brand New","Foreign Used",..: 2 2 2 2 1 2 3 2 2 2 ...
## $ yom : num 2014 2014 2014 2014 2012 ...
## $ mileage : num NA 100862 115000 38000 6683 ...
## $ price : int 2890000 2400000 1850000 2400000 1950000 2700000 870000 1350000 1350000 1950000 ...
Observe which columns have missing values and deal with them. You can drop the rows with missing values, or replace the missing values with the mean, mode, or median.
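For reference, a minimal sketch of median imputation for the mileage column (not applied in this tutorial, since we drop the incomplete rows in a later step):
# alternative (not run here): fill missing mileage with the column median
# df$mileage[is.na(df$mileage)] <- median(df$mileage, na.rm = TRUE)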
# summary of the dataframe
summary(df)
## title model condition yom
## Length:276 Forester:94 Brand New : 12 Min. :2000
## Class :character Outback :60 Foreign Used:212 1st Qu.:2013
## Mode :character Impreza :51 Kenyan Used : 51 Median :2014
## Legacy :26 NULL : 1 Mean :2013
## XV :22 3rd Qu.:2014
## Levorg :16 Max. :2016
## (Other) : 7 NA's :1
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1450000
## Median : 80000 Median :2140000
## Mean : 96104 Mean :2137547
## 3rd Qu.:101431 3rd Qu.:2749250
## Max. :868301 Max. :3650000
## NA's :41
# condition has 1 NULL
# yom has 1 NA
# mileage has 41 missing values
Dropping observations with missing values. All cars without mileage are dropped.
# df[is.na(df$mileage),]
df <- df[!is.na(df$mileage),]
Fill in the missing value for yom column
# notice how the title of the row has the year of make 2014
df[is.na(df$yom),]
## title model condition yom mileage price
## 271 Subaru Outback 2014 black Outback NULL NA 180097 1649999
# set the NA value to 2014
df[is.na(df$yom),'yom'] <- 2014
Another way to achieve the same result is to fill the missing value with the mode of the column.
We can find the mode easily by visualizing the counts with a barplot.
# table of year and frequency
freq_yom <- table(df$yom)
# sort in descending order
freq_yom <- sort(freq_yom,decreasing=TRUE)
barplot(freq_yom,col="lightblue",
ylab="Number of cars",xlab="yom",las=2,
main="Distribution of cars and their year of make")
Use the mode year and replace the missing value
# table of year and frequency
freq_yom <- table(df$yom)
# sort in descending order
freq_yom <- sort(freq_yom,decreasing=TRUE)
# get the mode year
mode_year <- names(freq_yom)[1]
# change to numeric
mode_year <- as.numeric(mode_year)
# replace the missing value
df[is.na(df$yom),'yom'] <- mode_year
The condition column does not contain any missing values. However, one of the observations is the string "NULL".
We will replace the "NULL" using the mode method discussed above.
# table of condition and frequency
freq_cond <- table(df$condition)
# sort in descending order
freq_cond <- sort(freq_cond,decreasing=TRUE)
barplot(freq_cond,col="lightblue",
ylab="Number of cars",xlab="condition",
main="Distribution of cars and their condition")
Foreign Used is the mode, or the majority class.
# table of condition and frequency
freq_cond <- table(df$condition)
# sort in descending order
freq_cond <- sort(freq_cond,decreasing=TRUE)
# grab the name of the top condition
mode_cond <- names(freq_cond)[1]
# update the row with NULL with the mode condition
df[df$condition=="NULL",'condition'] <- mode_cond
# summary of data frame
summary(df)
## title model condition yom
## Length:235 Forester:79 Brand New : 12 Min. :2000
## Class :character Outback :51 Foreign Used:178 1st Qu.:2013
## Mode :character Impreza :46 Kenyan Used : 45 Median :2014
## Legacy :22 NULL : 0 Mean :2013
## XV :19 3rd Qu.:2014
## Levorg :11 Max. :2016
## (Other) : 7
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1435000
## Median : 80000 Median :2100000
## Mean : 96104 Mean :2126740
## 3rd Qu.:101431 3rd Qu.:2715000
## Max. :868301 Max. :3650000
##
# re-create the factor to drop the now-unused NULL level from condition
df$condition <- as.factor(as.character(df$condition))
summary(df)
## title model condition yom
## Length:235 Forester:79 Brand New : 12 Min. :2000
## Class :character Outback :51 Foreign Used:178 1st Qu.:2013
## Mode :character Impreza :46 Kenyan Used : 45 Median :2014
## Legacy :22 Mean :2013
## XV :19 3rd Qu.:2014
## Levorg :11 Max. :2016
## (Other) : 7
## mileage price
## Min. : 6683 Min. : 465000
## 1st Qu.: 65566 1st Qu.:1435000
## Median : 80000 Median :2100000
## Mean : 96104 Mean :2126740
## 3rd Qu.:101431 3rd Qu.:2715000
## Max. :868301 Max. :3650000
##
Outliers, as the name suggests, are data points that lie far away from the other points in the dataset. These values sit apart from the rest of the data and hence distort the overall distribution of the dataset.
Outliers in data can distort predictions and affect accuracy if you don't detect and handle them appropriately, especially in regression models.
For a given continuous variable, outliers are the observations that lie more than 1.5*IQR below the first quartile or above the third quartile, where IQR, the Inter Quartile Range, is the difference between the 75th and 25th percentiles.
A boxplot graphically represents the distribution of a quantitative variable by displaying its five-number summary (minimum, first quartile, median, third quartile, maximum) together with any observations classified as suspected outliers by the interquartile range (IQR) criterion.
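As a small sketch, the IQR rule can be wrapped in a reusable helper (the is_outlier name is ours, not from any package):
# flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)
}
# example: count suspected mileage outliers
sum(is_outlier(df$mileage))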
# create boxplot of mileage
boxplot(df$mileage)
# get the lower quartile of mileage
lq <- quantile(df$mileage,0.25)
lq
## 25%
## 65566
# get the upper quartile value
uq <- quantile(df$mileage,0.75)
uq
## 75%
## 101431
For values that lie outside the 1.5*IQR limits, we could cap them by replacing observations below the lower limit with the 5th percentile value, and those above the upper limit with the 95th percentile value.
# get the lower capping value (note: the 50th percentile is used here, not the 5th)
cap_lower <- quantile(df$mileage,0.5)
cap_lower
## 50%
## 80000
# get the 95 percentile value
cap_upper <- quantile(df$mileage,0.95)
cap_upper
## 95%
## 180151.9
We can either cap the outliers at these values or drop them. Below we drop the outlier rows; the capping alternative is left commented out.
# interquartile range
# iqr <- uq - lq
iqr <- IQR(df$mileage)
# show the outliers to the lower side
df[df$mileage<(lq-(1.5*iqr)),]
## title model condition
## 5 New Subaru Impreza 2012 WRX Hatchback STI Limited White Impreza Brand New
## yom mileage price
## 5 2012 6683 1950000
# drop the rows
df <- df[!df$mileage<(lq-(1.5*iqr)),]
# replace it with 5th percentile value
# df[df$mileage<(lq-(1.5*iqr)),'mileage'] <- cap_lower
# show the outliers to the upper side
# df[df$mileage > (uq+(1.5*iqr)),]
# drop the rows
df <- df[!df$mileage > (uq+(1.5*iqr)),]
# replace values that are greater than 1.5 upper quartile
# df[df$mileage > (uq+(1.5*iqr)),'mileage'] <- cap_upper
summary(df$mileage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15000 65000 78633 81713 92500 155000
Let’s see the boxplot of the cleaned mileage column
boxplot(df$mileage)
We can use a scatter plot to visualize the distribution of one variable in relation to another variable. Let’s see how mileage and price are related
plot(df$mileage,df$price)
ggplot(df,aes(x=mileage,y=price)) + geom_point() + scale_y_continuous("Price of vehicle",labels = scales::comma) + labs(title="Distribution of Price vs Mileage of Subaru Cars")
Let's create a new column age_years that captures the number of years that have passed since the year of manufacture (yom).
# create a new column
df$age_years <- (2022 - df$yom)
# using pipe syntax
df <- df %>% mutate(age_years = (2022-yom))
Summary of the new column
# summary of age_years
summary(df$age_years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.000 7.000 8.000 8.344 8.000 17.000
We will drop the vehicles whose age_years is 15 or more.
# box plot of age_years
boxplot(df$age_years)
# show vehicles that age_years is less than 15 years
# df[df$age_years<15,]
# keep only cars with less than 15 age_years
df <- df[df$age_years<15,]
# number of rows
dim(df)
## [1] 210 7
We identify the dependent and independent variables.
library(corrplot)
# correlation plot of numeric variables
# df %>% select_if(is.numeric) %>% cor() %>% corrplot()
df %>%
select_if(is.numeric) %>%
cor() %>%
corrplot(method = 'square', order = 'FPC',
type = 'lower', diag = FALSE)
yom has a perfect (negative) correlation with age_years, since age_years is derived directly from yom.
We will keep only one of these two columns in our model to avoid multicollinearity, so let's drop the yom column.
We can get the index of the column and use the index to exclude it, or use dplyr's select().
# get column names
names(df)
## [1] "title" "model" "condition" "yom" "mileage" "price"
## [7] "age_years"
# it's on the fourth index
# df <- df[, -4]
# using dplyr
df <- df %>% select(-yom) # select all columns except yom
head(df)
## title model condition mileage price
## 2 Subaru Forester 2014 Blue Forester Foreign Used 100862 2400000
## 3 Subaru XV 2014 Sport Package Blue XV Foreign Used 115000 1850000
## 4 Subaru Forester 2014 Black Forester Foreign Used 38000 2400000
## 6 Subaru Forester 2014 Green Forester Foreign Used 89021 2700000
## 9 Subaru Impreza 2014 White Impreza Foreign Used 83000 1350000
## 11 Subaru Impreza 2014 Silver Impreza Foreign Used 64000 1240000
## age_years
## 2 8
## 3 8
## 4 8
## 6 8
## 9 8
## 11 8
# model and associated frequency table
freq_model <- table(df$model)
freq_model <- sort(freq_model,decreasing = TRUE)
barplot(freq_model,main="Number of cars by model",las=2,col="lightblue")
How does the model affect the price?
# use dplyr group_by to group by model
avg_price.model <- df %>% group_by(model) %>% summarise(avg_price=mean(price)) %>% arrange(desc(avg_price))
barplot(avg_price.model$avg_price, names.arg=avg_price.model$model,
main="Average Price of Subaru Models",
col="lightblue",las=2)
Does the condition affect the price?
# use dplyr group_by to group by condition
avg_price.cond <- df %>% group_by(condition) %>% summarise(avg_price=mean(price)) %>% arrange(desc(avg_price))
barplot(avg_price.cond$avg_price, names.arg=avg_price.cond$condition,
main="Average Price of Subaru Based on Condition",
col="lightblue",las=2)
A dummy variable is a type of variable that we create in regression analysis so that we can represent a categorical variable as a numerical variable that takes on one of two values: zero or one.
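For intuition, base R's model.matrix() shows what dummy encoding of the condition column looks like (one 0/1 column per level, with the first level as the reference); this is only an illustration, the fastDummies package below is what we actually use:
# quick illustration: dummy-encode condition with base R
head(model.matrix(~ condition, data = df))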
Install the fastDummies package:
install.packages('fastDummies')
## Warning: package 'fastDummies' is in use and will not be installed
Load the fastDummies package into your session:
# load the package to your session
library(fastDummies)
Use the dummy_cols() function to make the dummy variables.
#create dummy variables
df <- dummy_cols(
df,
select_columns = c('condition', 'model'),
remove_selected_columns = TRUE,
remove_first_dummy = TRUE
)
# let's see the first 6 rows of our new df
head(df)
## title mileage price age_years
## 1 Subaru Forester 2014 Blue 100862 2400000 8
## 2 Subaru XV 2014 Sport Package Blue 115000 1850000 8
## 3 Subaru Forester 2014 Black 38000 2400000 8
## 4 Subaru Forester 2014 Green 89021 2700000 8
## 5 Subaru Impreza 2014 White 83000 1350000 8
## 6 Subaru Impreza 2014 Silver 64000 1240000 8
## condition_Foreign Used condition_Kenyan Used model_Forester model_Impreza
## 1 1 0 1 0
## 2 1 0 0 0
## 3 1 0 1 0
## 4 1 0 1 0
## 5 1 0 0 1
## 6 1 0 0 1
## model_Legacy model_Levorg model_Outback model_SVX model_Trezia model_Tribeca
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## model_XV
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
Before we split the data into a training and testing set, we should drop the title
variable since it does not add any valuable information to our model.
# drop the first column
df <- df[,-1]
head(df)
## mileage price age_years condition_Foreign Used condition_Kenyan Used
## 1 100862 2400000 8 1 0
## 2 115000 1850000 8 1 0
## 3 38000 2400000 8 1 0
## 4 89021 2700000 8 1 0
## 5 83000 1350000 8 1 0
## 6 64000 1240000 8 1 0
## model_Forester model_Impreza model_Legacy model_Levorg model_Outback
## 1 1 0 0 0 0
## 2 0 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## model_SVX model_Trezia model_Tribeca model_XV
## 1 0 0 0 0
## 2 0 0 0 1
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
We will also shift the price column to the end of the data frame
# shift price column to end of df
df <- df %>%
select(-price,everything())
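An equivalent alternative (available in dplyr 1.0 and later) is relocate():
# move price to the last position
# df <- df %>% relocate(price, .after = last_col())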
install.packages("caTools") # for data splitting
library(caTools)
# set seed to ensure you always have same random numbers generated
set.seed(123)
# split the data in the ratio given by SplitRatio (note: sample.split() is normally given the target column, e.g. df$price, rather than the whole data frame)
sample = sample.split(df,SplitRatio = 0.8)
# create a training set
train_set =subset(df,sample ==TRUE)
# create a testing set
test_set=subset(df, sample==FALSE)
# number of rows and columns for training data
dim(train_set)
## [1] 165 14
# number of rows and columns for test data
dim(test_set)
## [1] 45 14
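If you prefer not to depend on caTools, a minimal base-R split sketch looks like this (the alt_train/alt_test names are ours, and the selected rows will differ from the split used above):
# base-R alternative: sample 80% of row indices for training
set.seed(123)
idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
alt_train <- df[idx, ]
alt_test  <- df[-idx, ]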
Now that our data is ready for regression analysis, let’s choose a regression algorithm.
Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.
No noise — model assumes that the input and output variables are not noisy — so remove outliers if possible
No collinearity — model will overfit when you have highly correlated input variables
Normal distribution — the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking
Normality of residuals - The residual errors are assumed to be normally distributed.
Rescaled inputs — use scalers or normalizer to make more reliable predictions
You should be aware of these assumptions every time you’re creating linear models. We’ll ignore most of them for the purpose of this tutorial, as the goal is to show you the general syntax you can copy-paste between your projects.
There are two main types of linear regression:
Simple linear regression : uses only one independent variable
Multiple linear regression : uses two or more independent variables
In this process, a relationship is established between the independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by a linear equation:
y = mX + c
In this equation:
Y – Dependent Variable
m – Slope or Gradient
X – Independent variable
c – Y Intercept
The coefficients m and c are derived by minimizing the sum of the squared differences between the data points and the regression line.
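To make the least-squares idea concrete, here is a small sketch (the x, y, m, c0 names are ours) that computes the slope and intercept from their closed-form formulas; the values should match the coefficients lm() reports below:
# closed-form least-squares estimates for a single predictor
x <- train_set$age_years
y <- train_set$price
m <- cov(x, y) / var(x)      # slope
c0 <- mean(y) - m * mean(x)  # intercept
c(intercept = c0, slope = m)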
# scatter plot of price against age_years
plot(train_set$age_years,train_set$price)
We will use the lm() function to fit a linear model.
# fit a linear model using the training data
linear_model <- lm(price~age_years, data=train_set)
# see the output of the model
summary(linear_model)
##
## Call:
## lm(formula = price ~ age_years, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1294542 -524542 75458 515172 1365172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4642254 292129 15.891 < 2e-16 ***
## age_years -289714 35006 -8.276 4.33e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 636600 on 163 degrees of freedom
## Multiple R-squared: 0.2959, Adjusted R-squared: 0.2916
## F-statistic: 68.5 on 1 and 163 DF, p-value: 4.331e-14
# scatter plot with the fitted regression line
plot(train_set$age_years,train_set$price)
# draw the regression line
abline(linear_model,col="blue")
Using ggplot2
# fit a regression line on the scatter plot
ggplot(data=train_set,aes(x=age_years,y=price)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
# fit a multiple linear model
multi_model <- lm(price ~ ., data=train_set)
summary(multi_model)
##
## Call:
## lm(formula = price ~ ., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -847779 -223446 6425 223238 1119803
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.187e+06 4.067e+05 7.836 7.51e-13 ***
## mileage -4.219e-01 1.310e+00 -0.322 0.747737
## age_years -2.055e+05 2.679e+04 -7.671 1.91e-12 ***
## `condition_Foreign Used` -1.296e+04 1.199e+05 -0.108 0.914058
## `condition_Kenyan Used` -2.774e+05 1.458e+05 -1.903 0.058917 .
## model_Forester 1.153e+06 3.412e+05 3.380 0.000922 ***
## model_Impreza 1.357e+05 3.435e+05 0.395 0.693373
## model_Legacy 3.453e+05 3.516e+05 0.982 0.327577
## model_Levorg 3.815e+05 3.648e+05 1.046 0.297326
## model_Outback 1.472e+06 3.417e+05 4.308 2.94e-05 ***
## model_SVX 1.523e+06 4.788e+05 3.181 0.001782 **
## model_Trezia -2.806e+05 3.906e+05 -0.718 0.473606
## model_Tribeca NA NA NA NA
## model_XV 4.429e+05 3.484e+05 1.271 0.205592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 336500 on 152 degrees of freedom
## Multiple R-squared: 0.8165, Adjusted R-squared: 0.802
## F-statistic: 56.36 on 12 and 152 DF, p-value: < 2.2e-16
The residuals are the difference between the actual values and the predicted values.
For a given predictor, the t-statistic (and its associated p-value) tests whether there is a statistically significant relationship between that predictor and the outcome variable, that is, whether the predictor's beta coefficient is significantly different from zero.
The statistical hypotheses are as follow:
Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship between x and y)
Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is some relationship between x and y)
Another aspect to pay attention to in your linear models is the p-value of the coefficients.
A p-value indicates whether or not you can reject the null hypothesis.
A very small p-value means that the predictor is probably a valuable addition to your model.
A standard way to check whether a predictor is meaningful is to look at whether its p-value is smaller than 0.05.
To the right of the p-values you’ll see several asterisks (or none if the coefficient is not significant to the model). The number of asterisks corresponds with the significance of the coefficient as described in the legend just under the coefficients section. The more asterisks, the more significant.
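You can also pull these numbers out programmatically; a small sketch (coef_table is our own name):
# extract the coefficient table (estimate, std. error, t value, p-value)
coef_table <- summary(multi_model)$coefficients
# predictors whose p-value is below 0.05
rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]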
R^2, the coefficient of determination, is a measure of the goodness of fit of the model. It measures the proportion of the total variability that is explained by the model, or how well the model fits the data.
R^2 varies between 0 and 1:
R^2 = 0: the model explains nothing
R^2 = 1: the model explains everything
0 < R^2 < 1: the model explains part of the variability
The higher the R^2, the better the model explains the dependent variable. As a rule of thumb, an R^2 > 0.7 indicates a good fit of the model.
The Multiple R-squared value is most often used for simple linear regression (one predictor). It tells us what percentage of the variation in our dependent variable the independent variable explains. In other words, it's another way to determine how well our model is fitting the data.
The Adjusted R-squared value is used when running multiple linear regression.
If we add variables, R-squared will increase whether or not they are significant for prediction. This is why Adjusted R-squared is used: if an added variable is not significant for the prediction, the Adjusted R-squared will decrease. It is one of the most helpful tools for avoiding overfitting of the model.
A small p-value for the F-statistic confirms that there is a linear relationship between the predictors and the response variable.
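These goodness-of-fit numbers can likewise be read straight from the model summary; a quick sketch (s is our own name):
# extract R-squared, adjusted R-squared and the F-statistic
s <- summary(multi_model)
s$r.squared
s$adj.r.squared
s$fstatistic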
# predict price using test_set
prediction <- predict(multi_model, newdata = test_set)
## Warning in predict.lm(multi_model, newdata = test_set): prediction from a rank-
## deficient fit may be misleading
prediction
## 5 8 11 19 22 25 33 36
## 1630618.9 1873753.9 2650166.2 2644385.5 1881968.7 1502528.7 2859577.1 1841631.4
## 39 47 50 53 61 64 67 75
## 1874738.3 2860981.2 2953736.3 1960876.8 2661136.2 941424.1 2870822.6 1877691.7
## 78 81 89 92 95 103 106 109
## 2653177.5 2058610.7 2860316.7 2991915.1 2866600.0 2048914.6 2643837.4 1949185.2
## 117 120 123 131 134 137 145 148
## 2868291.0 2137667.5 2663034.8 1627821.6 1640465.8 2374764.6 2837912.7 1946649.0
## 151 159 162 165 173 176 179 187
## 2272041.5 1656514.3 3170764.2 2977785.9 1843621.2 2461452.6 2653541.6 2983270.8
## 190 193 201 204 207
## 1637369.7 2441103.0 2644866.9 2950361.0 2049879.9
If you want a more concrete way of evaluating your regression models, look no further than RMSE (Root Mean Squared Error). This metric tells you how wrong your model is on average; in this case, it reports the average error in Kenyan Shillings (Kshs).
# compute the mean squared error
mse <- mean((test_set$price - prediction)^2)
paste("Mean Squared Error",mse)
## [1] "Mean Squared Error 90931834152.6189"
rmse <- sqrt(mse)
paste("Root Mean Squared Error",rmse)
## [1] "Root Mean Squared Error 301549.057621838"
R^2 for the simple linear regression model is 0.29, which means that 29% of the variability of the price of a Subaru car is explained by the age of the car.
The Adjusted R^2 of the multiple regression model is 0.8, which means that 80% of the variability of the price of a Subaru car is explained by the mileage, age_years, condition and the model of the car.
The relatively high R^2 means that the mileage, age_years, condition and model of a car are good characteristics for explaining its price.
One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.
When the normality assumption is violated, interpretation and inference may not be reliable, or may not be valid at all.
It is important that we examine the normality of the residuals.
Plot the distribution of the residuals
# get the residuals from the simple linear regression model
linear_residuals <- residuals(linear_model)
linear_residuals <- as.data.frame(linear_residuals)
# distribution of simple linear model
ggplot(linear_residuals, aes(residuals(linear_model))) +
geom_histogram(aes(y=..density..),fill = "#0099f9", color = "black") +
geom_density() +
theme_classic() +
labs(title = "Residuals of Simple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Get residuals
multi_residuals <- as.data.frame(residuals(multi_model))
# Visualize residuals
ggplot(multi_residuals, aes(residuals(multi_model))) +
geom_histogram(fill = "#0099f9", color = "black") +
theme_classic() +
labs(title = "Residuals of Multiple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#plot(multi_model$residuals, pch = 16, col = "red")
# Get residuals
multi_residuals <- as.data.frame(residuals(multi_model))
# Visualize residuals
ggplot(multi_residuals, aes(residuals(multi_model))) +
geom_histogram(aes(y=..density..),fill = "#0099f9", color = "black") +
geom_density() +
theme_classic() +
labs(title = "Residuals of Multiple Linear Regression Model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
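Besides histograms, a normal Q-Q plot and the Shapiro-Wilk test are common ways to check the normality of residuals; a brief sketch:
# Q-Q plot of the multiple regression residuals
qqnorm(residuals(multi_model))
qqline(residuals(multi_model), col = "blue")
# Shapiro-Wilk normality test (a small p-value suggests non-normal residuals)
shapiro.test(residuals(multi_model))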
We have learned how to train linear regression models in R. We fitted both a simple linear regression model and a multiple linear regression model on a real-world dataset.
You’ve also learned how to evaluate the model through summary functions, residuals plots, and various metrics such as MSE and RMSE.