## Introduction

Chapter 7 talks about using regression as a prediction tool (i.e., for predictive modeling). The following examples help illustrate a few items that were discussed. Follow all instructions to complete Chapter 7.

## Prediction with Linear Regression

1. Let’s start by loading the tidyverse package (you can ignore the notes that you see below that it gives you once you load it), the furniture package, the educ7610 package, and the caret package. The caret package provides us with almost everything you need to do predictive modeling with regression (or many other approaches).
library(tidyverse)
library(furniture)
library(caret)
library(educ7610)
1. Import the GSS data set.
data(gss)
1. Let’s start predicting income06 with several of the variables in the data set. First thing we need to do is deal with the missing values in the data that will be used for the prediction. There are several ways to handle the missing data but we will take the easiest route by just removing them (often not recommended–see “multiple imputation”).
gss_no_na <- gss %>%
select(income06, age, educ, degree, sex, race, hompop,
marital, partyid, natspac,
tvhours) %>%
filter(complete.cases(income06, age, educ, degree, sex, race, hompop,
marital, partyid, natspac,
tvhours))
1. Because we are using a random aspect of our analysis (cross-validation), let’s set our random seed so we can replicate our results.
set.seed(843)
1. Now we can set up our cross-validation using trainControl().
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)
1. Using that, we can how fit our cross-validated regression models. The method = "lm" uses linear regression just like the regression that we’ve discussed but is not so invested in coefficients but more about rpediction.
fit <- train(income06 ~ .,
data = gss_no_na,
method = "lm",
trControl = fitControl)
fit
## Linear Regression
##
## 1147 samples
##   10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1032, 1032, 1033, 1033, 1032, 1032, ...
## Resampling results:
##
##   RMSE     Rsquared   MAE
##   36751.8  0.3264584  28397.88
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
1. We are given some pieces of information here. It tells us our sample size, how many predictors we’ve included, that we’ve used cross-validation, and then gives us an RMSE (root mean squared error), $$R^2$$, and MAE (mean absolute error). These are all measures of accuracy of prediction. How much of the variation in income06 are we explaining with our 12 predictors?

## Selection Approaches

Notably, these selection approaches cannot be used for causal interpretation of the individual coefficients or predictors.

1. There are several predictor selection approaches. Among these, the Lasso (least absolute shrinkage and selector operator) is one of the most useful. The Lasso is built on linear regression but integrates a penalty term that allows it to select important (important in terms of prediction) variables. Let’s use the method = "glmnet" option with caret that uses a type of blended lasso approach.
fit2 <- train(income06 ~ .,
data = gss_no_na,
method = "glmnet",
trControl = fitControl)
fit2
## glmnet
##
## 1147 samples
##   10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1031, 1033, 1033, 1031, 1033, 1033, ...
## Resampling results across tuning parameters:
##
##   alpha  lambda      RMSE      Rsquared   MAE
##   0.10     33.55241  36859.97  0.3236738  28413.97
##   0.10    335.52415  36897.76  0.3231371  28420.77
##   0.10   3355.24149  37159.79  0.3186155  28589.38
##   0.55     33.55241  36855.72  0.3237939  28410.43
##   0.55    335.52415  37040.66  0.3214669  28436.22
##   0.55   3355.24149  37720.23  0.3098664  29047.82
##   1.00     33.55241  36865.86  0.3236732  28410.30
##   1.00    335.52415  37220.33  0.3193338  28471.59
##   1.00   3355.24149  38601.18  0.2961069  29659.55
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda
##  = 33.55241.
1. We are given several pieces of information, including the levels of the tuning parameters (aspects of the model that can be adjusted beyond that of linear regression) and their corresponding accuracy levels. How accurate is the best model?
2. We can further investigate this by seeing how many/which predictors were selected for the best model. We can do that by using coef() while using the s argument where we give the best lambda value from above (33.55241).
coef(fit2\$finalModel,
s = 33.55241)
## 24 x 1 sparse Matrix of class "dgCMatrix"
##                                       1
## (Intercept)                  1352.54663
## age                            81.86052
## educ                          533.86702
## degreeAssociates            24250.62997
## degreeBachelors             36842.35921
## degreeHS                    14364.83145
## sexMale                      6192.98267
## raceOther                    4758.33285
## raceWhite                    5906.64658
## hompop                       4904.90164
## maritalMarried              19823.91849
## maritalNever Married         -368.35274
## maritalSeparated            -5864.23024
## maritalWidowed               1273.19522
## partyidIndependent            167.82669
## partyidIndependent Lean Rep  4746.63508
## partyidIndepndent Lean Dem   1241.41418
## partyidOther                -2082.30738
## partyidRep                  10061.13614
## partyidStrong Dem           -4079.07348
## partyidStrong Rep           13124.39061
## natspac                      -329.41913
## tvhours                     -2264.22235
1. Almost all the variables in the model were important in this case. Which variable was not selected?
2. Another predictor selection approach is stepwise (as discussed in the book). It has proven to not work super well in accurately selecting predictors but it can be helpful in prediction. We’ll do this below with method = "leapSeq".
fit3 <- train(income06 ~ .,
data = gss_no_na,
method = "leapSeq",
trControl = fitControl)
fit3
## Linear Regression with Stepwise Selection
##
## 1147 samples
##   10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1033, 1031, 1033, 1032, 1032, 1032, ...
## Resampling results across tuning parameters:
##
##   nvmax  RMSE      Rsquared   MAE
##   2      41199.62  0.2249142  30667.48
##   3      39735.39  0.2627051  29875.04
##   4      39471.02  0.2726218  29596.88
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 4.
1. Here, nvmax is the number of predictors selected. How many predictors of the 2 - 4 provided the best model?
2. How much of the variability can we explain with just 4 predictors in the model?

## Predictor Configurations

1. To help illustrate these various configurations, we will work with the following four data sets:
• df1
• df2
• df3
• df4
2. What configuration is df1? How can you tell?
df1 %>%
furniture::tableC()
##
## ──────────────────────────────────────────
##        [1]            [2]           [3]
##  [1]x1 1.00
##  [2]x2 -0.042 (0.678) 1.00
##  [3]y  0.707 (<.001)  0.677 (<.001) 1.00
## ──────────────────────────────────────────
df1 %>%
lm(y ~ x1 + x2,
data = .) %>%
summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -7.311e-15 -2.210e-17  6.510e-17  1.908e-16  8.873e-16
##
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)
## (Intercept) -7.772e-17  8.132e-17 -9.560e-01    0.342
## x1           1.000e+00  7.979e-17  1.253e+16   <2e-16 ***
## x2           1.000e+00  8.305e-17  1.204e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.097e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1
## F-statistic: 1.449e+32 on 2 and 97 DF,  p-value: < 2.2e-16
1. What configuration is df2? How can you tell?
df2 %>%
furniture::tableC()
##
## ─────────────────────────────────────────
##        [1]           [2]           [3]
##  [1]x1 1.00
##  [2]x2 0.655 (<.001) 1.00
##  [3]y  0.883 (<.001) 0.933 (<.001) 1.00
## ─────────────────────────────────────────
df2 %>%
lm(y ~ x1 + x2,
data = .) %>%
summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -1.014e-14 -1.423e-16  3.610e-17  2.193e-16  7.012e-15
##
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)
## (Intercept) -2.109e-16  1.290e-16 -1.635e+00    0.105
## x1           1.000e+00  1.737e-16  5.758e+15   <2e-16 ***
## x2           1.000e+00  1.335e-16  7.490e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.288e-15 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1
## F-statistic: 1.275e+32 on 2 and 97 DF,  p-value: < 2.2e-16
1. What configuration is df3? How can you tell?
df3 %>%
furniture::tableC()
##
## ──────────────────────────────────────────
##        [1]            [2]           [3]
##  [1]x1 1.00
##  [2]x2 -0.761 (<.001) 1.00
##  [3]y  -0.107 (0.288) 0.727 (<.001) 1.00
## ──────────────────────────────────────────
df3 %>%
lm(y ~ x1 + x2,
data = .) %>%
summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -6.753e-16 -9.747e-17 -9.920e-18  4.417e-17  3.144e-15
##
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)
## (Intercept) -3.331e-17  3.683e-17 -9.040e-01    0.368
## x1           1.000e+00  5.734e-17  1.744e+16   <2e-16 ***
## x2           1.000e+00  3.960e-17  2.525e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.665e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1
## F-statistic: 3.225e+32 on 2 and 97 DF,  p-value: < 2.2e-16
1. What configuration is df4? How can you tell?
df4 %>%
furniture::tableC()
##
## ──────────────────────────────────────────
##        [1]            [2]           [3]
##  [1]x1 1.00
##  [2]x2 -0.995 (<.001) 1.00
##  [3]y  -0.034 (0.739) 0.135 (0.181) 1.00
## ──────────────────────────────────────────
df4 %>%
lm(y ~ x1 + x2,
data = .) %>%
summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -2.075e-15 -2.265e-17  1.953e-17  6.881e-17  7.759e-16
##
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept) 1.110e-16  2.518e-17 4.409e+00 2.69e-05 ***
## x1          1.000e+00  2.445e-16 4.090e+15  < 2e-16 ***
## x2          1.000e+00  2.424e-16 4.126e+15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.462e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1
## F-statistic: 8.521e+30 on 2 and 97 DF,  p-value: < 2.2e-16

## Conclusion

Prediction is an important piece of research. Often, we don’t need to know the mechanism, but we want to be able to predict behavior, outcomes, or precursers. In these situations, regression is a great starting point when attempting to pursue this goal.