Polynomial Regression Model in R

# Polynomial Regression

Polynomial regression is also known as curve fitting or curvilinear regression. It is used when a scatterplot shows a non-linear relationship. It is most commonly applied to time series data, but it works in a variety of other situations as well.

## Let’s use the Nepal Covid data and fit polynomial models to Covid deaths using R

To do this, first import the Excel file into RStudio using the readxl library, as shown below.

library(readxl)
data <- read_excel("F:/MDS-Private-Study-Materials/First Semester/Statistical Computing with R/Assignments/Data/covid_tbl_final.xlsx")
head(data)
## # A tibble: 6 x 14
##      SN Date                Confirmed_cases_~ Confirmed_cases~ Confirmed _case~
##   <dbl> <dttm>                          <dbl>            <dbl>            <dbl>
## 1     1 2020-01-23 00:00:00                 1                1                1
## 2     2 2020-01-24 00:00:00                 1                0                1
## 3     3 2020-01-25 00:00:00                 1                0                1
## 4     4 2020-01-26 00:00:00                 1                0                1
## 5     5 2020-01-27 00:00:00                 1                0                1
## 6     6 2020-01-28 00:00:00                 1                0                1
## # ... with 9 more variables: Recoveries_total <dbl>, Recoveries_daily <dbl>,
## #   Deaths_total <dbl>, Deaths_daily <dbl>, RT-PCR_tests_total <dbl>,
## #   RT-PCR_tests_daily <dbl>, Test_positivity_rate <dbl>, Recovery_rate <dbl>,
## #   Case_fatality_rate <dbl>

The head() function returns the first six rows of the data frame along with all of its columns.

str(data)
## tibble [495 x 14] (S3: tbl_df/tbl/data.frame)
##  $ SN                     : num [1:495] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Date                   : POSIXct[1:495], format: "2020-01-23" "2020-01-24" ...
##  $ Confirmed_cases_total  : num [1:495] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Confirmed_cases_new    : num [1:495] 1 0 0 0 0 0 0 0 0 0 ...
##  $ Confirmed _cases_active: num [1:495] 1 1 1 1 1 1 0 0 0 0 ...
##  $ Recoveries_total       : num [1:495] 0 0 0 0 0 0 1 1 1 1 ...
##  $ Recoveries_daily       : num [1:495] 0 0 0 0 0 0 1 0 0 0 ...
##  $ Deaths_total           : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths_daily           : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RT-PCR_tests_total     : num [1:495] NA NA NA NA NA 3 4 5 5 NA ...
##  $ RT-PCR_tests_daily     : num [1:495] NA NA NA NA NA NA 1 1 0 NA ...
##  $ Test_positivity_rate   : num [1:495] NA NA NA NA NA ...
##  $ Recovery_rate          : num [1:495] 0 0 0 0 0 0 100 100 100 100 ...
##  $ Case_fatality_rate     : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...

The str() function shows each column’s data type. Apart from Date, which is a POSIXct date-time, every column, including Confirmed_cases_total, is stored as numeric.

## Let us plot the daily deaths by date and see what is causing the problem

plot(data$Date, data$Deaths_daily,
     main = "Daily Deaths: 23 Jan 2020 - 31 May 2021",
     xlab = "Date",
     ylab = "Daily Deaths")

The problem comes from three outliers: on those three days, all previously missed deaths were added to the data at once.
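Such backlog spikes can also be flagged programmatically rather than by eye. A minimal sketch on simulated counts (the numbers here are illustrative, not from the Covid file), using the standard 1.5 × IQR boxplot rule:

```r
# Simulated daily death counts: a stable series plus three huge
# one-day spikes, mimicking backlogged deaths added all at once
daily <- c(rep(c(8, 10, 12, 9, 11), 20), 450, 380, 520)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as outliers
q <- quantile(daily, c(0.25, 0.75))
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- which(daily > upper)

outliers        # positions 101, 102, 103 -- the three spikes
daily[outliers] # 450 380 520
```

In the Covid data the analogous check would run on data$Deaths_daily; the tutorial instead simply truncates the series at SN 398.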

plot.data <- data[data$SN <= 398,]
plot(plot.data$Date, plot.data$Deaths_total,
     main = "Covid Deaths, Nepal: 23 Jan 2020 - 23 Feb 2021",
     xlab = "Date",
     ylab = "Total Deaths")

As a result, we eliminate the outliers. Our data is now ready to be fitted to a model. Let’s divide it into a train set and a test set in the proportions of 70% to 30%.

set.seed(132)
ind <- sample(2, nrow(plot.data), replace = T, prob = c(0.7,0.3))
train_data <- plot.data[ind==1,]
test_data <- plot.data[ind==2,]

The set.seed() function is used to make results reproducible, i.e. it produces the same sample again and again. Generating random numbers without set.seed() produces a different sample on every execution.

## Let us fit a linear model in the filtered data (plot.data) using SN as the time variable

library(caret)
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: ggplot2
## Loading required package: lattice

lm1 <- lm(plot.data$Deaths_total~plot.data$SN, data= train_data)

We fit a linear model with lm() (caret is loaded here, but the fit itself uses base R). Note that because the formula refers to the plot.data$ columns directly, the data = train_data argument is ignored and the model is actually fitted on all of plot.data. Before reading the model summary, two concepts are worth reviewing.

Coefficient of determination (R²): a statistical measure of how much of the variation in the dependent variable is explained by the independent variable. The higher the R², the better the model.

Residual standard error: a measure of how well a regression model fits a dataset. The lower the residual standard error, the better the model.

summary(lm1)
## 
## Call:
## lm(formula = plot.data$Deaths_total ~ plot.data$SN, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.91 -344.76   22.38  351.50  582.90 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -588.8326    35.1575  -16.75   <2e-16 ***

When we fit a linear model, we get an R² of 79.21%, meaning the independent variable explains only 79.21% of the variance in the dependent variable. The residual standard error is 350 on 396 degrees of freedom.

## Let’s plot the linear model

plot(plot.data$SN, plot.data$Deaths_total,
     main = "Covid Deaths, Nepal: 23 Jan 2020 - 23 Feb 2021",
     xlab = "Day (SN)",
     ylab = "Total Deaths")
abline(lm(plot.data$Deaths_total~plot.data$SN, data= plot.data), col="red", lwd=2)

## Let us fit a quadratic model in the filtered data

qlm <- lm(plot.data$Deaths_total~ poly(plot.data$SN,2), data= train_data)
summary(qlm)
## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 2), 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.04 -110.87    8.94   81.97  282.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  594.495      6.763   87.90   <2e-16 ***

An R² of 96.92% was obtained in this case: the independent variable accounts for 96.92% of the variability in the dependent variable. The residual standard error is 134.9 on 395 degrees of freedom. Compared with the linear model, the R² is higher and the error is lower.
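Because the formulas above reference plot.data$ columns, the 70/30 split is never actually exercised and every model is fitted on all 398 filtered rows (hence the 396, 395, ... degrees of freedom). A sketch of how the split would normally be used, on simulated data with an out-of-sample RMSE check (the column names and trend here are illustrative):

```r
set.seed(132)

# Simulated cumulative-deaths-style series with a quadratic trend
df <- data.frame(SN = 1:398)
df$Deaths_total <- 0.01 * df$SN^2 + rnorm(398, sd = 20)

# 70/30 split, as in the tutorial
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.7, 0.3))
train_data <- df[ind == 1, ]
test_data  <- df[ind == 2, ]

# Use bare column names in the formula so data= is respected
fit <- lm(Deaths_total ~ poly(SN, 2), data = train_data)

# Out-of-sample error: predict() can score unseen rows because the
# model was fitted through column names, not plot.data$ vectors
pred <- predict(fit, newdata = test_data)
rmse <- sqrt(mean((test_data$Deaths_total - pred)^2))
```

With this pattern, the test-set RMSE gives an honest estimate of prediction error, whereas the residual standard error in the summaries above is an in-sample quantity.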
## Let’s plot the quadratic model

plot(plot.data$SN, plot.data$Deaths_total,
     main = "Covid Deaths, Nepal: 23 Jan 2020 - 23 Feb 2021",
     xlab = "Day (SN)",
     ylab = "Total Deaths")
lines(fitted(qlm)~SN, data=plot.data, col= "red", lwd=2)

The quadratic model fits the data better than the linear model.

## Let’s Fit the Cubic Model

We fit the cubic model in the following way.

clm <- lm(plot.data$Deaths_total~poly(SN,3), data= plot.data)

Let’s calculate the summary of the cubic model and observe what changes.

summary(clm)
##
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(SN, 3), data = plot.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -369.58 -123.49   12.82   99.36  267.65 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    594.495      6.696  88.789  < 2e-16 ***
## poly(SN, 3)1 13595.485    133.576 101.781  < 2e-16 ***
## poly(SN, 3)2  6428.710    133.576  48.128  < 2e-16 ***
## poly(SN, 3)3  -401.539    133.576  -3.006  0.00282 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.6 on 394 degrees of freedom
## Multiple R-squared:  0.9699, Adjusted R-squared:  0.9696 
## F-statistic:  4228 on 3 and 394 DF,  p-value: < 2.2e-16

The R² is 96.99%, and the residual standard error is 133.6. Comparing this model with the previous one, the changes are immediately visible.

## Let’s Plot the Cubic Model

plot(plot.data$SN, plot.data$Deaths_total,
     main = "Covid Deaths, Nepal: 23 Jan 2020 - 23 Feb 2021",
     xlab = "Day (SN)",
     ylab = "Total Deaths")
lines(fitted(clm)~plot.data$SN, data = plot.data, col= "red", lwd= 2)

From the figure we can see that the fitted curve follows the actual data more closely than in the quadratic case.
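One detail worth noting in these summaries: poly() builds orthogonal polynomials by default, which is why the intercept (594.495) and the lower-order coefficients (13595.485, 6428.710) stay identical as the degree increases. With raw powers (raw = TRUE) the coefficients change, but the fitted curve is the same. A quick self-contained check on toy data:

```r
x <- 1:50
y <- 2 + 3 * x + 0.5 * x^2

fit_orth <- lm(y ~ poly(x, 2))              # orthogonal basis (default)
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))  # raw powers x and x^2

# Coefficient vectors differ...
coef(fit_orth)
coef(fit_raw)

# ...but the fitted values agree
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)),
                             unname(fitted(fit_raw))))
```

The orthogonal basis also keeps the regressors uncorrelated, which makes the t-tests on the individual polynomial terms easier to interpret.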

## Let’s Fit the Double Quadratic (Quartic) Model

dlm <- lm(plot.data$Deaths_total~poly(plot.data$SN,4))

Let’s calculate its summary.

summary(dlm)
##
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 4))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -105.44  -53.22  -12.50   53.61  159.13
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)              594.50       3.13  189.92  < 2e-16 ***
## poly(plot.data$SN, 4)1 13595.49      62.45  217.71  < 2e-16 ***
## poly(plot.data$SN, 4)2  6428.71      62.45  102.94  < 2e-16 ***
## poly(plot.data$SN, 4)3  -401.54      62.45   -6.43 3.71e-10 ***
## poly(plot.data$SN, 4)4 -2344.63      62.45  -37.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.45 on 393 degrees of freedom
## Multiple R-squared:  0.9934, Adjusted R-squared:  0.9934
## F-statistic: 1.486e+04 on 4 and 393 DF,  p-value: < 2.2e-16

In this model, the independent variable explains 99.34 percent of the variability in the dependent variable. In addition, the residual standard error is 62.45, roughly half that of the cubic model.
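The pattern across these fits, rising R² and falling residual standard error, can be tabulated in one loop. A sketch on simulated data; summary()$r.squared and summary()$sigma are the standard extractors for the two statistics:

```r
set.seed(42)
x <- 1:200
y <- 500 + 0.02 * x^2 + rnorm(200, sd = 25)

# Fit polynomial models of degree 1 through 5 and collect R^2 / RSE
stats <- t(sapply(1:5, function(d) {
  s <- summary(lm(y ~ poly(x, d)))
  c(degree = d, r.squared = s$r.squared, sigma = s$sigma)
}))

stats  # R^2 never decreases with the degree; sigma shrinks toward sd = 25
```

Because the models are nested, R² can only rise as terms are added; on simulated data sigma bottoms out near the true noise level rather than continuing to fall.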

## Let’s Plot the Double Quadratic Model

plot(plot.data$SN, plot.data$Deaths_total,
     main = "Covid Deaths, Nepal: 23 Jan 2020 - 23 Feb 2021",
     xlab = "Day (SN)",
     ylab = "Total Deaths")
lines(fitted(dlm)~plot.data$SN, data = plot.data, col= "red", lwd= 2)

Here the fitted curve and the actual data nearly overlap.

## Let’s Fit a Fifth-Order Polynomial

flm <- lm(plot.data$Deaths_total~poly(plot.data$SN,5), data= plot.data)

Let’s calculate the summary of flm to see the R² and residual standard error.

summary(flm)
## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 5),
##     data = plot.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -77.300 -16.980  -3.571  19.199 140.089
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              594.495      1.716  346.36   <2e-16 ***

In this case, the residual standard error is approximately half that of the double quadratic model, and the R² is 99.98 percent. Since we used a higher-order polynomial, this model fits the data more closely than the previous one. In that sense, higher-order polynomial models are preferred here: each added term reduces the residual error and improves the fit to the observed data.