ISLR Applied Exercises Chapter 3 — Question 09

Uvini Ranaweera
9 min read · Jan 2, 2023


This article walks through my Kaggle notebook, which answers question 09 of chapter 3 of “An Introduction to Statistical Learning”.

The data set used: the Auto data set from the ISLR2 package.

Language: R

Let's go through the question one part at a time.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

# import libraries
library(ISLR2)   # for the Auto data set
library(GGally)  # for ggpairs()
# scatterplot matrix of every variable except the qualitative name column (column 9)
ggpairs(Auto[, -9])

Note: I have used the GGally package to generate a richer scatterplot matrix (it adds correlations and density plots to the panels). Instead of ggpairs(), you can use the base-R pairs() function to generate a plain scatterplot matrix.
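For reference, the base-R alternative mentioned above is a one-liner (a minimal sketch, assuming ISLR2 is already loaded):

# base-R scatterplot matrix, again excluding the name column
pairs(Auto[, -9])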

Scatterplot Matrix

According to the scatterplot matrix above, the relationship between mpg and acceleration appears more non-linear than that of the other pairs.

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

# correlation matrix, excluding the qualitative name column (column 9)
cor(Auto[, -9])

                      mpg    cylinders displacement   horsepower       weight acceleration         year       origin
mpg             1.0000000   -0.7776175   -0.8051269   -0.7784268   -0.8322442    0.4233285    0.5805410    0.5652088
cylinders      -0.7776175    1.0000000    0.9508233    0.8429834    0.8975273   -0.5046834   -0.3456474   -0.5689316
displacement   -0.8051269    0.9508233    1.0000000    0.8972570    0.9329944   -0.5438005   -0.3698552   -0.6145351
horsepower     -0.7784268    0.8429834    0.8972570    1.0000000    0.8645377   -0.6891955   -0.4163615   -0.4551715
weight         -0.8322442    0.8975273    0.9329944    0.8645377    1.0000000   -0.4168392   -0.3091199   -0.5850054
acceleration    0.4233285   -0.5046834   -0.5438005   -0.6891955   -0.4168392    1.0000000    0.2903161    0.2127458
year            0.5805410   -0.3456474   -0.3698552   -0.4163615   -0.3091199    0.2903161    1.0000000    0.1815277
origin          0.5652088   -0.5689316   -0.6145351   -0.4551715   -0.5850054    0.2127458    0.1815277    1.0000000

Based on the correlation matrix above, the following conclusions can be made (a quick programmatic check is sketched after the list):

  • weight is highly correlated with cylinders.
  • displacement is highly correlated with cylinders, weight and horsepower.
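As a sketch of how such pairs can be listed programmatically (the 0.8 cutoff is my own choice, not from the book):

# list variable pairs whose absolute correlation exceeds 0.8
cors <- cor(Auto[, -9])
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = round(cors[high], 3))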

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

# multiple regression of mpg on all predictors except name
model2 <- lm(mpg ~ . - name, data = Auto)

# view the summary
summary(model2)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Out of the 7 predictor variables, we can see that some have a statistically significant impact on the response variable mpg. This is decided by looking at the p-values.

For this study, let's take the significance level as 0.05. Accordingly, when a coefficient's p-value is less than 0.05, the chance that the respective predictor has no effect on the response is very low, and such predictors are assumed to have a relationship with the response variable. In addition, the overall F-statistic of 252.4 with a p-value below 2.2e-16 indicates that at least one of the predictors is related to mpg.
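As a small sketch, these quantities can be pulled straight out of the fitted model using the standard summary.lm components:

# coefficient table of the fitted model
coefs <- summary(model2)$coefficients
# coefficients whose p-value falls below the 0.05 significance level
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]
# overall F-statistic with its degrees of freedom
summary(model2)$fstatistic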

ii. Which predictors appear to have a statistically significant relationship to the response?

  • displacement
  • weight
  • year
  • origin

iii. What does the coefficient for the year variable suggest?

The coefficient of year is 0.75. This implies that when year increases by one unit (one model year), while all the other predictors are held constant, mpg increases by about 0.75 miles per gallon on average.
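For reference, a minimal sketch that reads this coefficient off the fitted model:

# expected change in mpg for a one-year increase, other predictors held fixed
coef(model2)["year"]   # roughly 0.75 mpg per model year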

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

# arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model2, col = "blue", pch = 20)
Diagnostic Plots

The residuals-versus-fitted plot shows the residuals (prediction errors) of the model plotted against the fitted values. If the plot shows a pattern, it may indicate that the model is not adequately capturing the underlying relationships in the data. Since the residual plot above shows a funnel-shaped pattern, the homoscedasticity (constant-variance) assumption is violated.

The normal Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the residuals are normally distributed, the points on the plot should lie approximately on a straight line. In our study, the Q-Q plot follows an approximately straight line, with only a few points deviating (observations 326, 327, and 328). Thus, the residuals approximately meet the normality assumption.

The scale-location plot shows the square root of the standardized residuals plotted against the fitted values and can help identify outliers. As a rule of thumb, observations with standardized residuals outside the range (-3, 3) are treated as outliers. Since no data points fall beyond that range, the data set has no outliers.

The residuals-versus-leverage plot shows the influence of each observation on the fit of the model. Observations with high leverage can have a disproportionate influence on the fitted coefficients, and the dashed red Cook's distance contours flag observations with unusually high influence. In the plot above, no data point falls beyond the dashed red line, so the data set has no problematic high-leverage points.
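As a complement to reading the plots, here is a minimal numeric cross-check (the cutoffs are the usual rules of thumb, chosen by me):

# standardized residuals beyond +/- 3 flag possible outliers
which(abs(rstandard(model2)) > 3)
# leverage above twice the average leverage flags possible high-leverage points
hv <- hatvalues(model2)
which(hv > 2 * mean(hv))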

Conclusion: Since the assumption of homoscedasticity is not met, in further studies, we need to adjust the model (for example, by transforming the response or the predictors).
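One possible adjustment, purely as my own hedged sketch (it is not part of the book's question), is to refit the model on a log-transformed response and re-check the diagnostics:

# copy the data and log-transform the response, which often tames a funnel-shaped residual pattern
Auto_log <- Auto
Auto_log$mpg <- log(Auto_log$mpg)
model2_log <- lm(mpg ~ . - name, data = Auto_log)
# redraw the diagnostic plots for the adjusted model
par(mfrow = c(2, 2))
plot(model2_log, col = "blue", pch = 20)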

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# interaction between horsepower and displacement
modeli1 <- lm(mpg ~ . - name + horsepower * displacement, data = Auto)
summary(modeli1)

Call:
lm(formula = mpg ~ . - name + horsepower * displacement, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-8.7010 -1.6009 -0.0967 1.4119 12.6734

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.894e+00 4.302e+00 -0.440 0.66007
cylinders 6.466e-01 3.017e-01 2.143 0.03275 *
displacement -7.487e-02 1.092e-02 -6.859 2.80e-11 ***
horsepower -1.975e-01 2.052e-02 -9.624 < 2e-16 ***
weight -3.147e-03 6.475e-04 -4.861 1.71e-06 ***
acceleration -2.131e-01 9.062e-02 -2.351 0.01921 *
year 7.379e-01 4.463e-02 16.534 < 2e-16 ***
origin 6.891e-01 2.527e-01 2.727 0.00668 **
displacement:horsepower 5.236e-04 4.813e-05 10.878 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.912 on 383 degrees of freedom
Multiple R-squared: 0.8636, Adjusted R-squared: 0.8608
F-statistic: 303.1 on 8 and 383 DF, p-value: < 2.2e-16


# interaction between horsepower and origin
modeli2 <- lm(mpg ~ . - name + horsepower * origin, data = Auto)
summary(modeli2)

Call:
lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.277 -1.875 -0.225 1.570 12.080

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
displacement -1.486e-03 7.607e-03 -0.195 0.8452
horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
acceleration -1.124e-01 9.617e-02 -1.168 0.2434
year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.116 on 383 degrees of freedom
Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16

# interaction between origin and cylinders
modeli5 <- lm(mpg ~ . - name + origin * cylinders, data = Auto)
summary(modeli5)

Call:
lm(formula = mpg ~ . - name + origin * cylinders, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.7908 -2.1679 -0.1234 1.9043 13.0306

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792e+01 5.169e+00 -3.466 0.000587 ***
cylinders -3.572e-01 5.459e-01 -0.654 0.513327
displacement 1.919e-02 7.865e-03 2.439 0.015169 *
horsepower -1.656e-02 1.386e-02 -1.195 0.232926
weight -6.462e-03 6.541e-04 -9.879 < 2e-16 ***
acceleration 8.068e-02 9.896e-02 0.815 0.415449
year 7.527e-01 5.141e-02 14.640 < 2e-16 ***
origin 1.838e+00 1.358e+00 1.353 0.176836
cylinders:origin -9.985e-02 3.223e-01 -0.310 0.756898
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.332 on 383 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8178
F-statistic: 220.4 on 8 and 383 DF, p-value: < 2.2e-16

As you can see, the interaction terms between horsepower and displacement, and between horsepower and origin, are statistically significant. This means the effect of one of these variables on mpg depends on the value of the other variable it interacts with.

In contrast, the interaction between origin and cylinders is not significant, so there is no evidence of an interaction effect between those two variables.

Likewise, we can try out several other pairs to find further significant interaction terms, as sketched below.
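As a hedged sketch of how that screening could look (weight:year is just a hypothetical extra pair picked for illustration, and modeli_wy is a name of my own):

# add one more interaction to the base model and test it with a nested-model F-test
modeli_wy <- update(model2, . ~ . + weight:year)
anova(model2, modeli_wy)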

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

# squared transformation of horsepower
modelt1 <- lm(mpg ~ . - name + I(horsepower^2), data = Auto)
summary(modelt1)

Call:
lm(formula = mpg ~ . - name + I(horsepower^2), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-8.5497 -1.7311 -0.2236 1.5877 11.9955

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3236564 4.6247696 0.286 0.774872
cylinders 0.3489063 0.3048310 1.145 0.253094
displacement -0.0075649 0.0073733 -1.026 0.305550
horsepower -0.3194633 0.0343447 -9.302 < 2e-16 ***
weight -0.0032712 0.0006787 -4.820 2.07e-06 ***
acceleration -0.3305981 0.0991849 -3.333 0.000942 ***
year 0.7353414 0.0459918 15.989 < 2e-16 ***
origin 1.0144130 0.2545545 3.985 8.08e-05 ***
I(horsepower^2) 0.0010060 0.0001065 9.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.001 on 383 degrees of freedom
Multiple R-squared: 0.8552, Adjusted R-squared: 0.8522
F-statistic: 282.8 on 8 and 383 DF, p-value: < 2.2e-16


# log transformation of acceleration
modelt2 <- lm(mpg ~ . - name + log(acceleration), data = Auto)
summary(modelt2)

Call:
lm(formula = mpg ~ . - name + log(acceleration), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-9.7931 -2.0052 -0.1279 1.9299 13.1085

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.552e+01 1.479e+01 3.077 0.00224 **
cylinders -2.796e-01 3.193e-01 -0.876 0.38172
displacement 8.042e-03 7.805e-03 1.030 0.30344
horsepower -3.434e-02 1.401e-02 -2.450 0.01473 *
weight -5.343e-03 6.854e-04 -7.795 6.15e-14 ***
acceleration 2.167e+00 4.782e-01 4.532 7.82e-06 ***
year 7.560e-01 4.978e-02 15.186 < 2e-16 ***
origin 1.329e+00 2.724e-01 4.877 1.58e-06 ***
log(acceleration) -3.513e+01 7.886e+00 -4.455 1.10e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.249 on 383 degrees of freedom
Multiple R-squared: 0.8303, Adjusted R-squared: 0.8267
F-statistic: 234.2 on 8 and 383 DF, p-value: < 2.2e-16


# square-root transformation of cylinders
modelt3 <- lm(mpg ~ . - name + I(cylinders^0.5), data = Auto)
summary(modelt3)

Call:
lm(formula = mpg ~ . - name + I(cylinders^0.5), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-11.7190 -2.1361 -0.1756 1.7299 12.9229

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.281e+01 1.453e+01 2.258 0.024490 *
cylinders 8.550e+00 2.513e+00 3.402 0.000739 ***
displacement 2.001e-02 7.399e-03 2.704 0.007149 **
horsepower -2.867e-02 1.395e-02 -2.055 0.040585 *
weight -6.365e-03 6.427e-04 -9.905 < 2e-16 ***
acceleration 1.062e-01 9.757e-02 1.088 0.277224
year 7.474e-01 5.019e-02 14.891 < 2e-16 ***
origin 1.255e+00 2.779e-01 4.514 8.46e-06 ***
I(cylinders^0.5) -4.261e+01 1.175e+01 -3.628 0.000325 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.276 on 383 degrees of freedom
Multiple R-squared: 0.8274, Adjusted R-squared: 0.8238
F-statistic: 229.5 on 8 and 383 DF, p-value: < 2.2e-16

By carrying out transformations of the predictor variables, we try to fit a more accurate model. As usual, we use the p-values to evaluate the significance of each transformation. For our study, the conclusions below can be drawn from the output above (a quick side-by-side comparison of the fits is sketched after the list):

  • The squared term of horsepower is highly significant, and adding it also makes horsepower itself significant (it was not significant in the original model).
  • The log transformation of acceleration is significant, although its p-value is slightly larger than that of acceleration in the same model.
  • The square-root transformation of cylinders is slightly more significant than cylinders itself (p-value 0.000325 versus 0.000739).
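To put the fits side by side, here is a minimal sketch that collects the adjusted R-squared values already reported by summary() above (the list labels are my own):

# adjusted R-squared of the base model and the three transformed models
sapply(list(base     = model2,
            hp_sq    = modelt1,
            log_acc  = modelt2,
            sqrt_cyl = modelt3),
       function(m) summary(m)$adj.r.squared)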

The end of question 09. Let’s meet again for another discussion 💕
