Introduction

Chapter 7 discusses using regression as a prediction tool (i.e., for predictive modeling). The following examples illustrate several of the items that were discussed there. Follow all of the instructions to complete Chapter 7.

Prediction with Linear Regression

  1. Let’s start by loading the tidyverse package (you can ignore the startup messages it prints below), the furniture package, and the caret package. The caret package provides almost everything we need to do predictive modeling with regression (or with many other approaches).
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.5     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.5     
## ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(furniture)
## ── furniture 1.7.12 ────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
##    Consider using `furniture::` for each function call.
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
  2. Download the GSS_reduced_example.csv data set from Canvas or tysonbarrett.com/teaching to your computer. Save it in a directory you can access fairly easily.
  3. Open RStudio and start a new R script or RMarkdown document.
  4. Import the data set into R (a note on file paths follows the code).
gss <- read.csv("GSS_reduced_example.csv")
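If R cannot find the file, the working directory is probably not the folder where you saved it. A minimal sketch (the path shown is hypothetical; replace it with your own folder):
# Hypothetical path: point this at the folder where you saved the file
setwd("C:/Users/yourname/Documents")
gss <- read.csv("GSS_reduced_example.csv")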
  5. Let’s start predicting income06 with several of the variables in the data set. The first thing we need to do is deal with the missing values in the variables that will be used for the prediction. There are several ways to handle missing data, but we will take the easiest route and simply remove the incomplete rows (often not recommended; see “multiple imputation”). A quick check of how many rows this drops follows the code below.
gss_no_na <- gss %>%
  select(income06, age, educ, degree, sex, race, hompop, 
         marital, partyid, natspac, natenvir, natheal, 
         tvhours) %>%
  filter(complete.cases(income06, age, educ, degree, sex, race, hompop, 
         marital, partyid, natspac, natenvir, natheal, 
         tvhours))
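As a quick check (not one of the required steps; it assumes the gss and gss_no_na objects created above), we can see how many rows the listwise deletion removed and how much missingness each variable had:
# Number of rows dropped by keeping only complete cases
nrow(gss) - nrow(gss_no_na)
# Number of missing values in each variable used above
colSums(is.na(gss[, c("income06", "age", "educ", "degree", "sex", "race",
                      "hompop", "marital", "partyid", "natspac",
                      "natenvir", "natheal", "tvhours")]))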
  6. Because our analysis has a random component (the cross-validation), let’s set the random seed so we can replicate our results.
set.seed(843)
  7. Now we can set up our cross-validation using trainControl(); a simpler variant is sketched after the code.
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
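The method, number, and repeats arguments control the resampling scheme. For comparison (an illustration rather than one of the required steps), a single 10-fold cross-validation without repeats would look like this:
# Plain 10-fold cross-validation, run once (faster, but noisier estimates)
fitControl_simple <- trainControl(method = "cv",
                                  number = 10)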
  8. Using that, we can now fit our cross-validated regression models. The method = "lm" option uses linear regression just like the regression we’ve discussed, but here we are less interested in the coefficients and more interested in prediction.
fit <- train(income06 ~ ., 
             data = gss_no_na, 
             method = "lm", 
             trControl = fitControl)
fit
## Linear Regression 
## 
## 1147 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1032, 1032, 1033, 1033, 1032, 1032, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   36722.2  0.3274048  28377.17
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
  9. We are given several pieces of information here. It tells us our sample size, how many predictors we’ve included, that we’ve used cross-validation, and then gives us the RMSE (root mean squared error), \(R^2\), and MAE (mean absolute error). These are all measures of prediction accuracy (a sketch of how RMSE and MAE are computed follows below). How much of the variation in income06 are we explaining with our 12 predictors?
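To make these accuracy measures concrete, here is a rough sketch (assuming the fit object above) of how RMSE and MAE are computed from a set of predictions. These in-sample values will differ somewhat from the resampled values caret reports, because caret computes them on held-out folds:
# Predictions from the final model on the full data set
preds <- predict(fit, newdata = gss_no_na)
# Root mean squared error: the typical size of a prediction error
sqrt(mean((gss_no_na$income06 - preds)^2))
# Mean absolute error: the average absolute prediction error
mean(abs(gss_no_na$income06 - preds))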

Selection Approaches

Notably, these selection approaches cannot be used for causal interpretation of the individual coefficients or predictors.

  1. There are several predictor selection approaches. Among these, the lasso (least absolute shrinkage and selection operator) is one of the most useful. The lasso is built on linear regression but adds a penalty term that allows it to select important variables (important in terms of prediction). Let’s use the method = "glmnet" option with caret, which fits the elastic net, a blend of the lasso and ridge regression.
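For reference (this formula is shown for illustration and follows the notation in the glmnet documentation), the elastic net estimates the coefficients by minimizing a penalized sum of squares,

\[
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda\left(\alpha \sum_{j}|\beta_j| + \frac{1-\alpha}{2}\sum_{j}\beta_j^2\right),
\]

where \(\lambda\) controls how strongly the coefficients are shrunk toward zero and \(\alpha\) blends the lasso penalty (\(\alpha = 1\)) with the ridge penalty (\(\alpha = 0\)). These correspond to the lambda and alpha tuning parameters in the output below.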
fit2 <- train(income06 ~ ., 
              data = gss_no_na, 
              method = "glmnet", 
              trControl = fitControl)
fit2
## glmnet 
## 
## 1147 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1031, 1033, 1033, 1031, 1033, 1033, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   MAE     
##   0.10     33.55241  36862.79  0.3234613  28405.93
##   0.10    335.52415  36899.92  0.3229535  28408.48
##   0.10   3355.24149  37168.58  0.3183004  28578.09
##   0.55     33.55241  36858.50  0.3235899  28402.50
##   0.55    335.52415  37037.77  0.3214988  28416.67
##   0.55   3355.24149  37723.91  0.3097081  29049.52
##   1.00     33.55241  36867.40  0.3235060  28400.77
##   1.00    335.52415  37226.53  0.3192054  28456.09
##   1.00   3355.24149  38601.18  0.2961069  29659.55
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda
##  = 33.55241.
  2. We are given several pieces of information, including the values of the tuning parameters (aspects of the model that can be adjusted beyond ordinary linear regression) and their corresponding accuracy measures. How accurate is the best model?
  3. We can further investigate this by seeing how many (and which) predictors were selected for the best model. We can do that with coef(), using the s argument to supply the best lambda value from above (33.55241); a note after the output shows how to pull that value out automatically.
coef(fit2$finalModel,
     s = 33.55241)
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                       1
## (Intercept)                   366.97990
## age                            94.76392
## educ                          511.68853
## degreeAssociates            24398.22692
## degreeBachelors             37118.59748
## degreeGraduate              50389.94833
## degreeHS                    14526.66081
## sexMale                      6355.65856
## raceOther                    5243.18109
## raceWhite                    5868.02874
## hompop                       4993.14110
## maritalMarried              19773.42111
## maritalNever Married         -348.58583
## maritalSeparated            -5715.94478
## maritalWidowed               1189.51691
## partyidIndependent            734.31164
## partyidIndependent Lean Rep  5244.70901
## partyidIndepndent Lean Dem   1469.97158
## partyidOther                -1890.84396
## partyidRep                  10347.62533
## partyidStrong Dem           -3618.97885
## partyidStrong Rep           13928.01572
## natspac                         .      
## natenvir                    -2637.44406
## natheal                      2025.79673
## tvhours                     -2233.05486
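Rather than typing the lambda value in by hand, you can pull it out of the fitted object (a small convenience, assuming the fit2 object above); caret stores the winning tuning-parameter values in fit2$bestTune:
# Same coefficients, using the stored best lambda instead of a typed value
coef(fit2$finalModel,
     s = fit2$bestTune$lambda)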
  4. Almost all of the variables in the model were important in this case. Which variable was not selected?
  5. Another predictor selection approach is stepwise selection (as discussed in the book). It has not proven to work well for accurately selecting predictors, but it can still be helpful for prediction. We’ll do this below with method = "leapSeq".
fit3 <- train(income06 ~ ., 
              data = gss_no_na, 
              method = "leapSeq", 
              trControl = fitControl)
fit3
## Linear Regression with Stepwise Selection 
## 
## 1147 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1033, 1031, 1033, 1032, 1032, 1032, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   2      41199.62  0.2249142  30667.48
##   3      39735.39  0.2627051  29875.04
##   4      39471.02  0.2726218  29596.88
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 4.
  6. Here, nvmax is the number of predictors selected. By default, caret only tried subsets of 2 to 4 predictors (a sketch below shows how to widen that search). How many predictors provided the best model?
  7. How much of the variability can we explain with just 4 predictors in the model?
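If you wanted the stepwise search to consider larger subsets, caret’s tuneGrid argument can supply the nvmax values to try. A minimal sketch (it assumes the same data and fitControl as above; fit3_wide is a name used only for this illustration):
# Let the stepwise selection consider subsets of 2 to 12 predictors
fit3_wide <- train(income06 ~ ., 
                   data = gss_no_na, 
                   method = "leapSeq", 
                   tuneGrid = data.frame(nvmax = 2:12),
                   trControl = fitControl)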

Predictor Configurations

  1. To help illustrate the various predictor configurations discussed in the chapter, we will work with the following four data sets (a sketch at the end of this section shows how you could simulate data like these yourself):
    • df1
    • df2
    • df3
    • df4
  2. What configuration is df1? How can you tell?
df1 %>%
  furniture::tableC()
## N = 100
## Note: pearson correlation (p-value).
## 
## ──────────────────────────────────────────
##        [1]            [2]           [3]  
##  [1]x1 1.00                              
##  [2]x2 -0.042 (0.678) 1.00               
##  [3]y  0.707 (<.001)  0.677 (<.001) 1.00 
## ──────────────────────────────────────────
df1 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
## Warning in summary.lm(.): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x1 + x2, data = .)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.311e-15 -2.210e-17  6.510e-17  1.908e-16  8.873e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -7.772e-17  8.132e-17 -9.560e-01    0.342    
## x1           1.000e+00  7.979e-17  1.253e+16   <2e-16 ***
## x2           1.000e+00  8.305e-17  1.204e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.097e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.449e+32 on 2 and 97 DF,  p-value: < 2.2e-16
  3. What configuration is df2? How can you tell?
df2 %>%
  furniture::tableC()
## N = 100
## Note: pearson correlation (p-value).
## 
## ─────────────────────────────────────────
##        [1]           [2]           [3]  
##  [1]x1 1.00                             
##  [2]x2 0.655 (<.001) 1.00               
##  [3]y  0.883 (<.001) 0.933 (<.001) 1.00 
## ─────────────────────────────────────────
df2 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
## Warning in summary.lm(.): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x1 + x2, data = .)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.014e-14 -1.423e-16  3.610e-17  2.193e-16  7.012e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -2.109e-16  1.290e-16 -1.635e+00    0.105    
## x1           1.000e+00  1.737e-16  5.758e+15   <2e-16 ***
## x2           1.000e+00  1.335e-16  7.490e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.288e-15 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.275e+32 on 2 and 97 DF,  p-value: < 2.2e-16
  4. What configuration is df3? How can you tell?
df3 %>%
  furniture::tableC()
## N = 100
## Note: pearson correlation (p-value).
## 
## ──────────────────────────────────────────
##        [1]            [2]           [3]  
##  [1]x1 1.00                              
##  [2]x2 -0.761 (<.001) 1.00               
##  [3]y  -0.107 (0.288) 0.727 (<.001) 1.00 
## ──────────────────────────────────────────
df3 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
## Warning in summary.lm(.): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x1 + x2, data = .)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.753e-16 -9.747e-17 -9.920e-18  4.417e-17  3.144e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -3.331e-17  3.683e-17 -9.040e-01    0.368    
## x1           1.000e+00  5.734e-17  1.744e+16   <2e-16 ***
## x2           1.000e+00  3.960e-17  2.525e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.665e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.225e+32 on 2 and 97 DF,  p-value: < 2.2e-16
  5. What configuration is df4? How can you tell?
df4 %>%
  furniture::tableC()
## N = 100
## Note: pearson correlation (p-value).
## 
## ──────────────────────────────────────────
##        [1]            [2]           [3]  
##  [1]x1 1.00                              
##  [2]x2 -0.995 (<.001) 1.00               
##  [3]y  -0.034 (0.739) 0.135 (0.181) 1.00 
## ──────────────────────────────────────────
df4 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = y ~ x1 + x2, data = .)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.075e-15 -2.265e-17  1.953e-17  6.881e-17  7.759e-16 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.110e-16  2.518e-17 4.409e+00 2.69e-05 ***
## x1          1.000e+00  2.445e-16 4.090e+15  < 2e-16 ***
## x2          1.000e+00  2.424e-16 4.126e+15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.462e-16 on 97 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.521e+30 on 2 and 97 DF,  p-value: < 2.2e-16
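If you would like to experiment with configurations of your own, you can simulate two predictors with a chosen correlation, build an outcome from them, and examine the result with the same tools used above. A minimal sketch (it assumes the MASS package is installed; df_sim is a name used only for this illustration):
# Simulate two predictors with correlation 0.6 and an outcome built from them.
# MASS::mvrnorm() is called with :: so that MASS does not mask dplyr::select().
set.seed(1)
X <- MASS::mvrnorm(100, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
df_sim <- data.frame(x1 = X[, 1], x2 = X[, 2])
df_sim$y <- df_sim$x1 + df_sim$x2 + rnorm(100, sd = 0.5)
# Examine the correlations among the variables, as in the examples above
furniture::tableC(df_sim)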

Conclusion

Prediction is an important piece of research. Often, we don’t need to know the mechanism; we simply want to be able to predict behavior, outcomes, or precursors. In these situations, regression is a great starting point for pursuing that goal.