Validation & Cross-Validation for Predictive Modeling with Linear and Multiple Linear Models
Before starting, let’s get familiar with some terms.
Validation: The act of confirming that something is true or correct.
More specifically, validation is the process of establishing documentary evidence
that a procedure, process, or activity was tested before
being put into production.
Cross-Validation: Cross-validation, also known as rotation
estimation or out-of-sample testing, is a set of model validation
procedures for determining how well the results of a statistical
analysis will generalize to new data.
Linear Model: The term “linear model” refers to a model that assumes a
linear relationship between the target variable and the independent
variable.
Multiple Linear Model: A regression model that uses a linear function to
estimate the relationship between a quantitative dependent variable and
two or more independent variables is known as multiple linear
regression.
Here we will use R’s built-in mtcars data
for the code examples. First,
let’s divide the data into a training set and a test set in a ratio of roughly 70% to
30%. When doing this, never forget to call the set.seed()
function first.
set.seed(): The random number generator is initialized using the set.seed()
function. To generate random numbers, the generator
requires a starting value (the seed). If no seed is set, R
creates one from the current system time and the process ID, so results differ between sessions.
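To see what the seed buys you, here is a small sketch in base R. The variable names (first_draw and so on) are made up for illustration and are not part of the original analysis.

```r
# Same seed, same draws: the two samples below are identical
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)

# Without resetting the seed, the next call continues the random stream
third_draw <- sample(1:100, 5)
third_draw
```

Resetting the seed right before the sampling step is what makes a train/test split repeatable from one run to the next.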
# Define the mtcars data as "data"
data <- mtcars
# Use a random seed so the result can be replicated
set.seed(123)
# Randomly assign each row to group 1 (about 70%) or group 2 (about 30%)
ind <- sample(2, nrow(mtcars), replace = TRUE, prob = c(0.7, 0.3))
# Data partition
train.data <- data[ind == 1, ]
test.data <- data[ind == 2, ]
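The data is now divided into a training set and a test set in a ratio of roughly 70%
to 30%. Because each row is assigned independently with probability 0.7, the realized split is only approximately 70/30.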
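If an exact split size matters, one alternative is to sample row positions without replacement. This is only a sketch of that variant, not the method used in the rest of this post; train_exact and test_exact are hypothetical names.

```r
data <- mtcars
set.seed(123)

n <- nrow(data)
n_train <- floor(0.7 * n)   # 22 of the 32 rows

# Draw the training row positions without replacement
train_idx <- sample(seq_len(n), size = n_train)

train_exact <- data[train_idx, ]
test_exact  <- data[-train_idx, ]

c(train = nrow(train_exact), test = nrow(test_exact))
```

With this approach the two sets always have 22 and 10 rows, whatever the seed.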
Let’s fit a linear model
Set miles per gallon (mpg) as the dependent variable and weight (wt) as
the independent variable.
lmodel <- lm(mpg ~ wt, data = train.data)
Now predict on the test set.
pred <- predict(lmodel, newdata = test.data)
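One detail is worth checking when predicting: for lm objects, predict() takes the new rows through the argument named newdata. Any other name, such as data, is silently ignored and the fitted values for the training rows are returned. The sketch below shows the difference; the row ranges are arbitrary and chosen only for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars[1:20, ])   # illustrative training rows
new_rows <- mtcars[21:32, ]                  # illustrative held-out rows

# Unrecognized argument name: ignored, so the 20 training fits come back
length(predict(fit, data = new_rows))

# Correct argument name: one prediction per held-out row
length(predict(fit, newdata = new_rows))
```

Comparing the length of the predictions with the number of rows in the test set is a quick sanity check before computing any metric.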
Check the R-squared and the error. To do so, first load the caret package
into the R session.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Fitted values for the training rows
pred.train <- predict(lmodel, newdata = train.data)
R2 <- R2(pred.train, train.data$mpg)
R2
## [1] 0.7377021
Here the R-squared is 0.7377, which means the model explains about 73.77% of the
variance in mpg on the training data. Now let’s check the error on the test set.
pred <- predict(lmodel, newdata = test.data)
RMSE <- RMSE(pred, test.data$mpg)
RMSE
RMSE is expressed in the same units as mpg, so lower values mean better predictions.
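For reference, both metrics can be written out in a few lines of base R. This is a sketch under the assumption that caret's R2() is used with its default squared-correlation formula; rmse_manual and r2_manual are hypothetical helper names, and the four mpg values are illustrative.

```r
# Root mean squared error: typical size of a prediction error
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R-squared as the squared correlation between predictions and observations
r2_manual <- function(pred, obs) cor(pred, obs)^2

obs  <- c(21.0, 22.8, 18.7, 14.3)   # illustrative observed mpg values
pred <- c(20.0, 23.8, 19.7, 13.3)   # illustrative predictions, each off by 1

rmse_manual(pred, obs)
r2_manual(pred, obs)
```

Both helpers expect the two vectors to have the same length and the same row order.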
Leave-One-Out Cross-Validation approach
It’s usual practice when building a machine learning model to validate
your methods by setting aside a subset of your data as a test set.
LOOCV (leave-one-out cross-validation) is a type of cross-validation
that uses each individual observation as a “test” set in turn. It’s a form of
k-fold cross-validation in which the number of folds, k, equals the
number of observations in the dataset.
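Before turning to caret, the procedure can be written out by hand. The loop below is a minimal base-R sketch of LOOCV for mpg ~ wt on mtcars; held_out_pred and loocv_rmse are hypothetical names, and the errors are pooled over all 32 held-out predictions before taking the root.

```r
n <- nrow(mtcars)
held_out_pred <- numeric(n)

for (i in seq_len(n)) {
  # Fit on every row except row i
  fit_i <- lm(mpg ~ wt, data = mtcars[-i, ])
  # Predict the single row that was left out
  held_out_pred[i] <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
}

# Pool the 32 held-out errors into one RMSE
loocv_rmse <- sqrt(mean((mtcars$mpg - held_out_pred)^2))
loocv_rmse
```

The model is refit 32 times here, which is cheap for mtcars but grows with the number of rows.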
library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model1 <- train(mpg ~ wt, data = mtcars, method = "lm",
                trControl = train.control)
print(model1)
## Linear Regression
##
## 32 samples
## 1 predictor
##
## No preprocessing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 31, 31, 31, 31, 31, 31, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   3.201673  0.7104641  2.517436
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
pred1 <- predict(model1, test.data)
R2 <- R2(pred1, test.data$mpg)
R2
## [1] 0.7864736
We get an R-squared of 78.65% on the test rows,
compared with 73.77% for the earlier linear
regression model. Note that model1 was trained on all of mtcars, including the test rows, so this figure is optimistic.
RMSE <- RMSE(pred1, test.data$mpg)
RMSE
## [1] 2.843768
The RMSE on the test rows is about 2.84.
Let’s fit the model using the K-fold Cross-Validation approach
K-fold CV is a procedure in which a given data set is divided into K
sections/folds, with each fold serving as the testing set at some point.
Let’s look at a 10-fold cross-validation case (K = 10). The data set is
divided into ten folds here. In the first iteration, the first fold is used to test the model,
while the others are used to train it. The second
iteration uses the second fold as the testing set and the rest as the
training set. This procedure is repeated until each of the ten folds has
been used as a test set.
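The iteration just described can be sketched by hand in base R. This is an illustration under simple assumptions (random fold labels, RMSE averaged over folds), not a copy of caret's internals; fold_id and fold_rmse are hypothetical names.

```r
set.seed(123)
k <- 10
n <- nrow(mtcars)

# Shuffle fold labels 1..k so every row belongs to exactly one fold
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (f in seq_len(k)) {
  train_rows <- mtcars[fold_id != f, ]   # the other k - 1 folds
  test_rows  <- mtcars[fold_id == f, ]   # the held-out fold
  fit_f  <- lm(mpg ~ wt, data = train_rows)
  pred_f <- predict(fit_f, newdata = test_rows)
  fold_rmse[f] <- sqrt(mean((test_rows$mpg - pred_f)^2))
}

# Average the per-fold errors into one estimate
mean(fold_rmse)
```

With 32 rows and 10 folds, each held-out fold contains only 3 or 4 cars, so the per-fold values vary a lot.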
# K-fold cross-validation
library(caret)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model2 <- train(mpg ~ wt, data = train.data, method = "lm",
                trControl = train.control)
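Calculate the R-squared and check whether it
differs from the previous one.
library(caret)
pred2 <- predict(model2, train.data)
R2 <- R2(pred2, train.data$mpg)
R2
## [1] 0.7377021
This method gives an R-squared of 73.77%, meaning the model explains about 73.77% of the variance in mpg
on the training data. The value matches the earlier one because cross-validation changes only how performance is estimated; the final model is still the same lm fit on train.data.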
Fit the model using the Repeated K-fold Cross-Validation approach
Repeated k-fold cross-validation is a technique for improving the estimate of a machine
learning model’s predictive performance. Simply repeat the
cross-validation procedure several times and report the mean result
across all folds from all runs.
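As a rough sketch of that idea in base R, the helper below runs one round of k-fold CV and replicate() repeats it with a fresh random fold assignment each time. kfold_rmse and all_rmse are hypothetical names, and the model formula is fixed to mpg ~ wt for brevity.

```r
# One round of k-fold CV: returns the RMSE of each held-out fold
kfold_rmse <- function(data, k) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(f) {
    fit <- lm(mpg ~ wt, data = data[fold_id != f, ])
    held_out <- data[fold_id == f, ]
    sqrt(mean((held_out$mpg - predict(fit, newdata = held_out))^2))
  })
}

set.seed(123)
# Three repeats of 10-fold CV: one column per repeat, one row per fold
all_rmse <- replicate(3, kfold_rmse(mtcars, k = 10))
dim(all_rmse)

# Mean over all folds from all repeats
mean(all_rmse)
```

Averaging over repeats reduces the dependence of the estimate on one particular fold assignment.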
# Repeated k-fold cross-validation
library(caret)
# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
# Train the model
model <- train(mpg ~ wt, data = mtcars, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 32 samples
## 1 predictor
##
## No preprocessing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 28, 28, 29, 29, 29, 30, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   2.975392  0.8351572  2.539797
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Hence we get an R-squared of 83.52% and an RMSE of 2.98.
Summary: Which one should be used, based on the R-squared values of the “lm” model?

R-squared for training set: 0.7013
R-squared for training with LOOCV: 0.7104641
R-squared for training with k-fold CV: 0.7346939
R-squared for training with repeated k-fold CV: 0.8351572
R-squared for testing set: 0.9031085
R-squared for testing with LOOCV: 0.9031085
R-squared for testing with k-fold CV: 0.9031085
R-squared for testing with repeated k-fold CV: 0.9031085
Which one should be used, based on the RMSE values?

RMSE for training set: 3.08648
RMSE for training with LOOCV: 3.201673
RMSE for training with k-fold CV: 2.85133
RMSE for training with repeated k-fold CV: 2.975392
RMSE for testing set: 2.279303
RMSE for testing with LOOCV: 2.244232
RMSE for testing with k-fold CV: 2.244232
RMSE for testing with repeated k-fold CV: 2.244232
Let’s repeat the same process for the Multiple Linear Regression Model
Multiple linear regression is an extension of simple linear regression. It
has two or more independent variables
and one continuous dependent variable. It is a supervised learning method. All the assumptions of simple linear regression also apply here, with one additional condition:
multicollinearity must not be present, i.e. correlations between
independent variables must not be “high”.
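A quick first look at multicollinearity is the correlation matrix of the predictors. The sketch below lists predictor pairs in mtcars whose absolute correlation exceeds 0.8; that cutoff is an assumption chosen for illustration, not a rule from this post, and cor_mat and high are hypothetical names.

```r
# All columns except the response
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
cor_mat <- cor(predictors)

# Positions in the upper triangle with |r| above the illustrative cutoff
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 3)
)
```

Pairwise correlations only catch two-variable relationships, which is why the variance inflation factor is checked as well.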
Fitting the Multiple Linear Regression Model
mlr <- lm(mpg ~ ., data = mtcars)
Let’s check the variance inflation factor (VIF) of mlr.
The VIF of a predictor is
the ratio of the variance of its coefficient estimate in the full model
to the variance it would have in a model containing only that predictor.
The vif() function is available in the car package.
library(car)
## Loading required package: carData
vif(mlr)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
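The same numbers can be reproduced without the car package from the textbook identity VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The following is a sketch of that identity, not car's implementation; vif_manual is a hypothetical name.

```r
predictor_names <- setdiff(names(mtcars), "mpg")

vif_manual <- sapply(predictor_names, function(v) {
  others <- setdiff(predictor_names, v)
  # Regress predictor v on the remaining predictors
  r2_v <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2_v)
})

round(vif_manual, 3)
```

A predictor that is almost a linear combination of the others has R_j^2 close to 1 and therefore a very large VIF.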
We need to drop the independent variable with the highest VIF and run the
model again, repeating until all VIFs are below 10.
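This drop-and-refit rule can be automated. The sketch below uses the fact that, for numeric predictors, the VIFs are the diagonal of the inverse of the predictor correlation matrix. vif_from_cor, keep and dropped are hypothetical names, and the order of removals it finds is whatever the data dictate, which may differ from a manual choice.

```r
# VIFs as the diagonal of the inverse correlation matrix
vif_from_cor <- function(x) setNames(diag(solve(cor(x))), colnames(x))

keep <- setdiff(names(mtcars), "mpg")
dropped <- character(0)

repeat {
  v <- vif_from_cor(mtcars[, keep])
  # Stop when every VIF is below 10 (or only two predictors remain)
  if (max(v) < 10 || length(keep) <= 2) break
  worst <- names(v)[which.max(v)]
  dropped <- c(dropped, worst)
  keep <- setdiff(keep, worst)
}

dropped
round(vif_from_cor(mtcars[, keep]), 2)
```

Dropping by VIF alone ignores subject-matter knowledge, so the automated result is best treated as a starting point.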
# Removing the "disp" variable:
mlr1 <- lm(mpg ~ cyl + hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
# Re-check the VIFs on the reduced model
vif(mlr1)
# Removing the "cyl" variable:
mlr2 <- lm(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(mlr1)
##
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt + qsec + vs + am + gear +
## carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7863 -1.4055 0.2635 1.2029 4.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.55052 18.52585 0.677 0.5052
## cyl 0.09627 0.99715 0.097 0.9240
## hp 0.01295 0.01834 0.706 0.4876
## drat 0.92864 1.60794 0.578 0.5694
## wt -2.62694 1.19800 -2.193 0.0392 *
## qsec 0.66523 0.69335 0.959 0.3478
## vs 0.16035 2.07277 0.077 0.9390
## am 2.47882 2.03513 1.218 0.2361
## gear 0.74300 1.47360 0.504 0.6191
## carb 0.61686 0.60566 1.018 0.3195
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8105
## F-statistic: 15.73 on 9 and 22 DF, p-value: 1.183e-07
vif(mlr2)
## hp drat wt qsec vs am gear carb
## 6.015788 3.111501 6.051127 5.918682 4.270956 4.285815 4.690187 4.290468
Now all VIFs are below 10, so the data is ready for fitting the different prediction
models.
Leave-One-Out Cross-Validation approach on the Multiple Linear Regression Model
# Leave-one-out CV
library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
mlr <- train(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars, method = "lm",
             trControl = train.control)
# Summarize
summary(mlr)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp 0.01225 0.01649 0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb 0.61286 0.59109 1.037 0.3106
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
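The final model has a multiple R-squared of 86.55% and a residual standard error of 2.566 on 23 degrees
of freedom. Note that summary() describes the final model refit on all 32 rows, so these are in-sample figures; print(mlr) reports the cross-validated RMSE and R-squared.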
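To make the gap between in-sample and cross-validated error concrete, here is a base-R sketch for the same eight-predictor model. It relies on the standard identity for least-squares fits that the leave-one-out residual equals the ordinary residual divided by (1 - leverage); fit, in_sample_rmse and loocv_rmse are hypothetical names.

```r
fit <- lm(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)

# Error on the same rows the model was fitted to
in_sample_rmse <- sqrt(mean(residuals(fit)^2))

# Leave-one-out error via the leverage shortcut, without refitting 32 times
loocv_rmse <- sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))

c(in_sample = in_sample_rmse, loocv = loocv_rmse)
```

The leave-one-out value is never smaller than the in-sample value for a least-squares fit, which is why the cross-validated figures are the ones to compare across models.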
Let’s fit the model using the K-fold Cross-Validation approach on the Multiple Linear Regression Model.
# K-fold cross-validation
library(caret)
# Define training control
train.control <- trainControl(method = "cv", number = 10)
# Train the model
mlr1 <- train(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars, method = "lm",
              trControl = train.control)
# Summarize
summary(mlr1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp 0.01225 0.01649 0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb 0.61286 0.59109 1.037 0.3106
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
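Again, the R-squared is 86.55% and the residual standard error is
2.566. The numbers are identical because train() refits the same final model on all of mtcars whatever the resampling method.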
Fit the model using the Repeated K-fold Cross-Validation approach
set.seed(224)
# Repeated k-fold cross-validation
library(caret)
# Define training control
train.control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
# Train the model
mlr2 <- train(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars, method = "lm",
              trControl = train.control)
# Summarize
summary(mlr2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp 0.01225 0.01649 0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb 0.61286 0.59109 1.037 0.3106
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
Once more, the R-squared is 86.55% and the residual standard error is 2.566.
Thank you for reading.