ISLR Applied Exercises Chapter 3 — Question 09

Uvini Ranaweera
9 min read · Jan 2, 2023


This article walks through my Kaggle notebook, which answers question 09 of chapter 3 of “An Introduction to Statistical Learning”.

The data set used: the Auto data set from the ISLR2 package.

Language: R

Let's go through the question one part at a time.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

# import libraries
library(ISLR2)   # for the Auto data set
library(GGally)  # for ggpairs()
# scatterplot matrix of every variable except the qualitative name column (column 9)
ggpairs(Auto[, -9])

Note: I have used the GGally package to generate a richer scatterplot matrix (it adds correlations and density plots to the panels). Instead of ggpairs(), you can use the base-R pairs() function to generate a plain scatterplot matrix.
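For reference, the base-R alternative mentioned above is a one-liner (a minimal sketch, assuming ISLR2 is already loaded):

# base-R scatterplot matrix, again excluding the name column
pairs(Auto[, -9])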

Scatterplot Matrix

According to the scatterplot matrix above, the relationship between mpg and acceleration appears more non-linear than that of the other pairs.

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

# correlation matrix, excluding the qualitative name column (column 9)
cor(Auto[, -9])

                      mpg    cylinders displacement   horsepower       weight acceleration         year       origin
mpg             1.0000000   -0.7776175   -0.8051269   -0.7784268   -0.8322442    0.4233285    0.5805410    0.5652088
cylinders      -0.7776175    1.0000000    0.9508233    0.8429834    0.8975273   -0.5046834   -0.3456474   -0.5689316
displacement   -0.8051269    0.9508233    1.0000000    0.8972570    0.9329944   -0.5438005   -0.3698552   -0.6145351
horsepower     -0.7784268    0.8429834    0.8972570    1.0000000    0.8645377   -0.6891955   -0.4163615   -0.4551715
weight         -0.8322442    0.8975273    0.9329944    0.8645377    1.0000000   -0.4168392   -0.3091199   -0.5850054
acceleration    0.4233285   -0.5046834   -0.5438005   -0.6891955   -0.4168392    1.0000000    0.2903161    0.2127458
year            0.5805410   -0.3456474   -0.3698552   -0.4163615   -0.3091199    0.2903161    1.0000000    0.1815277
origin          0.5652088   -0.5689316   -0.6145351   -0.4551715   -0.5850054    0.2127458    0.1815277    1.0000000

Based on the correlation matrix above, the following conclusions can be made (a quick programmatic check is sketched after the list):

  • weight is highly correlated with cylinders.
  • displacement is highly correlated with cylinders, weight and horsepower.
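As a sketch of how such pairs can be listed programmatically (the 0.8 cutoff is my own choice, not from the book):

# list variable pairs whose absolute correlation exceeds 0.8
cors <- cor(Auto[, -9])
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = round(cors[high], 3))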

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

# multiple regression of mpg on all predictors except name
model2 <- lm(mpg ~ . - name, data = Auto)

# view the summary
summary(model2)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Out of the 7 predictor variables, we can see that some have a statistically significant impact on the response variable mpg. This is decided by looking at the p-values.

For this study, let's take the significance level as 0.05. Accordingly, when a coefficient's p-value is less than 0.05, the chance that the respective predictor has no effect on the response is very low, and such predictors are assumed to have a relationship with the response variable. In addition, the overall F-statistic of 252.4 with a p-value below 2.2e-16 indicates that at least one of the predictors is related to mpg.
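As a small sketch, these quantities can be pulled straight out of the fitted model using the standard summary.lm components:

# coefficient table of the fitted model
coefs <- summary(model2)$coefficients
# coefficients whose p-value falls below the 0.05 significance level
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]
# overall F-statistic with its degrees of freedom
summary(model2)$fstatistic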

ii. Which predictors appear to have a statistically significant relationship to the response?

  • displacement
  • weight
  • year
  • origin

iii. What does the coefficient for the year variable suggest?

The coefficient of year is 0.75. This implies that when year increases by one unit (one model year), while all the other predictors are held constant, mpg increases by about 0.75 miles per gallon on average.
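For reference, a minimal sketch that reads this coefficient off the fitted model:

# expected change in mpg for a one-year increase, other predictors held fixed
coef(model2)["year"]   # roughly 0.75 mpg per model year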

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

# arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model2, col = "blue", pch = 20)
Diagnostic Plots

The residuals-versus-fitted plot shows the residuals (prediction errors) of the model plotted against the fitted values. If the plot shows a pattern, it may indicate that the model is not adequately capturing the underlying relationships in the data. Since the residual plot above shows a funnel-shaped pattern, the homoscedasticity (constant-variance) assumption is violated.

The normal Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the residuals are normally distributed, the points on the plot should lie approximately on a straight line. In our study, the Q-Q plot follows an approximately straight line, with only a few points deviating (observations 326, 327, and 328). Thus, the residuals approximately meet the normality assumption.

The scale-location plot shows the square root of the standardized residuals plotted against the fitted values and can help identify outliers. As a rule of thumb, observations with standardized residuals outside the range (-3, 3) are treated as outliers. Since no data points fall beyond that range, the data set has no outliers.

The residuals-versus-leverage plot shows the influence of each observation on the fit of the model. Observations with high leverage can have a disproportionate influence on the fitted coefficients, and the dashed red Cook's distance contours flag observations with unusually high influence. In the plot above, no data point falls beyond the dashed red line, so the data set has no problematic high-leverage points.
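As a complement to reading the plots, here is a minimal numeric cross-check (the cutoffs are the usual rules of thumb, chosen by me):

# standardized residuals beyond +/- 3 flag possible outliers
which(abs(rstandard(model2)) > 3)
# leverage above twice the average leverage flags possible high-leverage points
hv <- hatvalues(model2)
which(hv > 2 * mean(hv))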

Conclusion: Since the assumption of homoscedasticity is not met, in further studies, we need to adjust the model (for example, by transforming the response or the predictors).
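One possible adjustment, purely as my own hedged sketch (it is not part of the book's question), is to refit the model on a log-transformed response and re-check the diagnostics:

# copy the data and log-transform the response, which often tames a funnel-shaped residual pattern
Auto_log <- Auto
Auto_log$mpg <- log(Auto_log$mpg)
model2_log <- lm(mpg ~ . - name, data = Auto_log)
# redraw the diagnostic plots for the adjusted model
par(mfrow = c(2, 2))
plot(model2_log, col = "blue", pch = 20)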

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# interaction between horsepower and displacement
modeli1 <- lm(mpg ~ . - name + horsepower * displacement, data = Auto)
summary(modeli1)

Call:
lm(formula = mpg ~ . - name + horsepower * displacement, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-8.7010 -1.6009 -0.0967 1.4119 12.6734

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.894e+00 4.302e+00 -0.440 0.66007
cylinders 6.466e-01 3.017e-01 2.143 0.03275 *
displacement -7.487e-02 1.092e-02 -6.859 2.80e-11 ***
horsepower -1.975e-01 2.052e-02 -9.624 < 2e-16 ***
weight -3.147e-03 6.475e-04 -4.861 1.71e-06 ***
acceleration -2.131e-01 9.062e-02 -2.351 0.01921 *
year 7.379e-01 4.463e-02 16.534 < 2e-16 ***
origin 6.891e-01 2.527e-01 2.727 0.00668 **
displacement:horsepower 5.236e-04 4.813e-05 10.878 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.912 on 383 degrees of freedom
Multiple R-squared: 0.8636, Adjusted R-squared: 0.8608
F-statistic: 303.1 on 8 and 383 DF, p-value: < 2.2e-16


# interaction between horsepower and origin
modeli2 <- lm(mpg ~ . - name + horsepower * origin, data = Auto)
summary(modeli2)

Call:
lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.277 -1.875 -0.225 1.570 12.080

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
displacement -1.486e-03 7.607e-03 -0.195 0.8452
horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
acceleration -1.124e-01 9.617e-02 -1.168 0.2434
year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.116 on 383 degrees of freedom
Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16

# interaction between origin and cylinders
modeli5 <- lm(mpg ~ . - name + origin * cylinders, data = Auto)
summary(modeli5)

Call:
lm(formula = mpg ~ . - name + origin * cylinders, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.7908 -2.1679 -0.1234 1.9043 13.0306

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792e+01 5.169e+00 -3.466 0.000587 ***
cylinders -3.572e-01 5.459e-01 -0.654 0.513327
displacement 1.919e-02 7.865e-03 2.439 0.015169 *
horsepower -1.656e-02 1.386e-02 -1.195 0.232926
weight -6.462e-03 6.541e-04 -9.879 < 2e-16 ***
acceleration 8.068e-02 9.896e-02 0.815 0.415449
year 7.527e-01 5.141e-02 14.640 < 2e-16 ***
origin 1.838e+00 1.358e+00 1.353 0.176836
cylinders:origin -9.985e-02 3.223e-01 -0.310 0.756898
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.332 on 383 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8178
F-statistic: 220.4 on 8 and 383 DF, p-value: < 2.2e-16

As you can see, the interaction terms between horsepower and displacement, and between horsepower and origin, are statistically significant. This means the effect of one of these variables on mpg depends on the value of the other variable it interacts with.

In contrast, the interaction between origin and cylinders is not significant, so there is no evidence of an interaction effect between those two variables.

Likewise, we can try out several other pairs to find further significant interaction terms, as sketched below.
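As a hedged sketch of how that screening could look (weight:year is just a hypothetical extra pair picked for illustration, and modeli_wy is a name of my own):

# add one more interaction to the base model and test it with a nested-model F-test
modeli_wy <- update(model2, . ~ . + weight:year)
anova(model2, modeli_wy)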

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

# squared transformation of horsepower
modelt1 <- lm(mpg ~ . - name + I(horsepower^2), data = Auto)
summary(modelt1)

Call:
lm(formula = mpg ~ . - name + I(horsepower^2), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-8.5497 -1.7311 -0.2236 1.5877 11.9955

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3236564 4.6247696 0.286 0.774872
cylinders 0.3489063 0.3048310 1.145 0.253094
displacement -0.0075649 0.0073733 -1.026 0.305550
horsepower -0.3194633 0.0343447 -9.302 < 2e-16 ***
weight -0.0032712 0.0006787 -4.820 2.07e-06 ***
acceleration -0.3305981 0.0991849 -3.333 0.000942 ***
year 0.7353414 0.0459918 15.989 < 2e-16 ***
origin 1.0144130 0.2545545 3.985 8.08e-05 ***
I(horsepower^2) 0.0010060 0.0001065 9.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.001 on 383 degrees of freedom
Multiple R-squared: 0.8552, Adjusted R-squared: 0.8522
F-statistic: 282.8 on 8 and 383 DF, p-value: < 2.2e-16


# log transformation of acceleration
modelt2 <- lm(mpg ~ . - name + log(acceleration), data = Auto)
summary(modelt2)

Call:
lm(formula = mpg ~ . - name + log(acceleration), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.7931 -2.0052 -0.1279 1.9299 13.1085

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.552e+01 1.479e+01 3.077 0.00224 **
cylinders -2.796e-01 3.193e-01 -0.876 0.38172
displacement 8.042e-03 7.805e-03 1.030 0.30344
horsepower -3.434e-02 1.401e-02 -2.450 0.01473 *
weight -5.343e-03 6.854e-04 -7.795 6.15e-14 ***
acceleration 2.167e+00 4.782e-01 4.532 7.82e-06 ***
year 7.560e-01 4.978e-02 15.186 < 2e-16 ***
origin 1.329e+00 2.724e-01 4.877 1.58e-06 ***
log(acceleration) -3.513e+01 7.886e+00 -4.455 1.10e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.249 on 383 degrees of freedom
Multiple R-squared: 0.8303, Adjusted R-squared: 0.8267
F-statistic: 234.2 on 8 and 383 DF, p-value: < 2.2e-16


# square-root transformation of cylinders
modelt3 <- lm(mpg ~ . - name + I(cylinders^0.5), data = Auto)
summary(modelt3)

Call:
lm(formula = mpg ~ . - name + I(cylinders^0.5), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-11.7190 -2.1361 -0.1756 1.7299 12.9229

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.281e+01 1.453e+01 2.258 0.024490 *
cylinders 8.550e+00 2.513e+00 3.402 0.000739 ***
displacement 2.001e-02 7.399e-03 2.704 0.007149 **
horsepower -2.867e-02 1.395e-02 -2.055 0.040585 *
weight -6.365e-03 6.427e-04 -9.905 < 2e-16 ***
acceleration 1.062e-01 9.757e-02 1.088 0.277224
year 7.474e-01 5.019e-02 14.891 < 2e-16 ***
origin 1.255e+00 2.779e-01 4.514 8.46e-06 ***
I(cylinders^0.5) -4.261e+01 1.175e+01 -3.628 0.000325 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.276 on 383 degrees of freedom
Multiple R-squared: 0.8274, Adjusted R-squared: 0.8238
F-statistic: 229.5 on 8 and 383 DF, p-value: < 2.2e-16

By carrying out transformations of the predictor variables, we try to fit a more accurate model. As usual, we use the p-values to evaluate the significance of each transformation. For our study, the conclusions below can be drawn from the output above (a quick side-by-side comparison of the fits is sketched after the list):

  • The squared term of horsepower is highly significant, and adding it also makes horsepower itself significant (it was not significant in the original model).
  • The log transformation of acceleration is significant, although its p-value is slightly larger than that of acceleration in the same model.
  • The square-root transformation of cylinders is slightly more significant than cylinders itself (p-value 0.000325 versus 0.000739).
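To put the fits side by side, here is a minimal sketch that collects the adjusted R-squared values already reported by summary() above (the list labels are my own):

# adjusted R-squared of the base model and the three transformed models
sapply(list(base     = model2,
            hp_sq    = modelt1,
            log_acc  = modelt2,
            sqrt_cyl = modelt3),
       function(m) summary(m)$adj.r.squared)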

The end of question 09. Let’s meet again for another discussion 💕
