Chapter 7 discusses using regression as a prediction tool (i.e., for predictive modeling). The following examples illustrate a few of the items discussed there. Follow all instructions to complete Chapter 7.
For this chapter, we'll use the tidyverse package (you can ignore the notes it prints once you load it), the furniture package, the educ7610 package, and the caret package. The caret package provides almost everything you need to do predictive modeling with regression (or many other approaches).

library(tidyverse)
library(furniture)
library(caret)
library(educ7610)
data(gss)
For this example, we are going to predict income06 with several of the other variables in the data set. The first thing we need to do is deal with the missing values in the variables that will be used for the prediction. There are several ways to handle missing data, but we will take the easiest route and simply remove the incomplete rows (often not recommended in practice; see "multiple imputation").

gss_no_na <- gss %>%
  select(income06, age, educ, degree, sex, race, hompop,
         marital, partyid, natspac, tvhours) %>%
  filter(complete.cases(income06, age, educ, degree, sex, race, hompop,
                        marital, partyid, natspac, tvhours))
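To see how many rows the listwise deletion removed (a quick check, not part of the original code), compare the row counts before and after:

# number of rows dropped by removing incomplete cases
nrow(gss) - nrow(gss_no_na)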
Next, we set a seed so the results are reproducible, and we set up the resampling scheme with trainControl().

set.seed(843)

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
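Repeated cross-validation is computationally expensive. As a cheaper alternative (an option, not what this chapter uses), a single pass of 10-fold CV would look like:

# one pass of 10-fold cross-validation, no repeats
fitControl_quick <- trainControl(method = "cv",
                                 number = 10)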
Using method = "lm" fits a linear regression just like the regression we've discussed, but the emphasis here is less on the coefficients and more on prediction.

fit <- train(income06 ~ .,
             data = gss_no_na,
             method = "lm",
             trControl = fitControl)
fit
## Linear Regression
##
## 1147 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1032, 1032, 1033, 1033, 1032, 1032, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 36751.8 0.3264584 28397.88
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
How much of the variance in income06 are we explaining with our 10 predictors? Notably, these selection approaches cannot be used for causal interpretation of the individual coefficients or predictors.
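Although the chapter doesn't show it, you can generate predictions from the fitted model with predict() and summarize their accuracy with caret's postResample(). A minimal sketch (evaluating on the training data here; a held-out test set would be better):

# predictions from the cross-validated lm fit
preds <- predict(fit, newdata = gss_no_na)
# RMSE, R-squared, and MAE comparing predictions to observed income
postResample(pred = preds, obs = gss_no_na$income06)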
Next, we'll use the method = "glmnet" option with caret, which uses a type of blended lasso approach (the elastic net).

fit2 <- train(income06 ~ .,
              data = gss_no_na,
              method = "glmnet",
              trControl = fitControl)
fit2
## glmnet
##
## 1147 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1031, 1033, 1033, 1031, 1033, 1033, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 33.55241 36859.97 0.3236738 28413.97
## 0.10 335.52415 36897.76 0.3231371 28420.77
## 0.10 3355.24149 37159.79 0.3186155 28589.38
## 0.55 33.55241 36855.72 0.3237939 28410.43
## 0.55 335.52415 37040.66 0.3214669 28436.22
## 0.55 3355.24149 37720.23 0.3098664 29047.82
## 1.00 33.55241 36865.86 0.3236732 28410.30
## 1.00 335.52415 37220.33 0.3193338 28471.59
## 1.00 3355.24149 38601.18 0.2961069 29659.55
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda
## = 33.55241.
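Rather than copying the winning tuning values by hand, caret stores them in the bestTune element of the fitted object (a small convenience, not shown above):

fit2$bestTune                        # data frame with the selected alpha and lambda
best_lambda <- fit2$bestTune$lambda  # store lambda for reuse below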
We can extract the final model's coefficients with coef(), using the s argument to supply the best lambda value from above (33.55241).

coef(fit2$finalModel,
     s = 33.55241)
## 24 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1352.54663
## age 81.86052
## educ 533.86702
## degreeAssociates 24250.62997
## degreeBachelors 36842.35921
## degreeGraduate 50525.41581
## degreeHS 14364.83145
## sexMale 6192.98267
## raceOther 4758.33285
## raceWhite 5906.64658
## hompop 4904.90164
## maritalMarried 19823.91849
## maritalNever Married -368.35274
## maritalSeparated -5864.23024
## maritalWidowed 1273.19522
## partyidIndependent 167.82669
## partyidIndependent Lean Rep 4746.63508
## partyidIndepndent Lean Dem 1241.41418
## partyidOther -2082.30738
## partyidRep 10061.13614
## partyidStrong Dem -4079.07348
## partyidStrong Rep 13124.39061
## natspac -329.41913
## tvhours -2264.22235
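To see the elastic net's shrinkage in action, you could re-extract the coefficients at the largest lambda in the tuning grid above (an illustrative extra, not part of the original exercise). With a heavier penalty, more coefficients are pulled toward, or all the way to, zero:

# same final model, much stronger penalty
coef(fit2$finalModel, s = 3355.24149)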
We can also do stepwise selection with method = "leapSeq".

fit3 <- train(income06 ~ .,
              data = gss_no_na,
              method = "leapSeq",
              trControl = fitControl)
fit3
## Linear Regression with Stepwise Selection
##
## 1147 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1033, 1031, 1033, 1032, 1032, 1032, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 2 41199.62 0.2249142 30667.48
## 3 39735.39 0.2627051 29875.04
## 4 39471.02 0.2726218 29596.88
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 4.
Here, nvmax is the number of predictors selected. Of the 2 to 4 predictors tried, how many provided the best model?
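To see which predictors the stepwise procedure kept, you can query the underlying leaps object directly (a sketch, assuming the usual leaps interface; fit3$finalModel is a regsubsets object):

# coefficients of the best 4-predictor model found by the stepwise search
coef(fit3$finalModel, id = 4)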
Next, we assess multicollinearity in four small data sets: df1, df2, df3, and df4. Is multicollinearity a problem in df1? How can you tell?

df1 %>%
  furniture::tableC()
##
## ──────────────────────────────────────────
## [1] [2] [3]
## [1]x1 1.00
## [2]x2 -0.042 (0.678) 1.00
## [3]y 0.707 (<.001) 0.677 (<.001) 1.00
## ──────────────────────────────────────────
df1 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.311e-15 -2.210e-17 6.510e-17 1.908e-16 8.873e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.772e-17 8.132e-17 -9.560e-01 0.342
## x1 1.000e+00 7.979e-17 1.253e+16 <2e-16 ***
## x2 1.000e+00 8.305e-17 1.204e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.097e-16 on 97 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.449e+32 on 2 and 97 DF, p-value: < 2.2e-16
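Pairwise correlations are only part of the picture. The variance inflation factor from the car package (an additional check, assuming car is installed; it is not used in the original exercise) quantifies how much each coefficient's variance is inflated by the other predictors:

library(car)

# VIFs near 1 indicate little multicollinearity; values above ~10 are worrisome
df1 %>%
  lm(y ~ x1 + x2, data = .) %>%
  vif()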
Is multicollinearity a problem in df2? How can you tell?

df2 %>%
  furniture::tableC()
##
## ─────────────────────────────────────────
## [1] [2] [3]
## [1]x1 1.00
## [2]x2 0.655 (<.001) 1.00
## [3]y 0.883 (<.001) 0.933 (<.001) 1.00
## ─────────────────────────────────────────
df2 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.014e-14 -1.423e-16 3.610e-17 2.193e-16 7.012e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.109e-16 1.290e-16 -1.635e+00 0.105
## x1 1.000e+00 1.737e-16 5.758e+15 <2e-16 ***
## x2 1.000e+00 1.335e-16 7.490e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.288e-15 on 97 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.275e+32 on 2 and 97 DF, p-value: < 2.2e-16
Is multicollinearity a problem in df3? How can you tell?

df3 %>%
  furniture::tableC()
##
## ──────────────────────────────────────────
## [1] [2] [3]
## [1]x1 1.00
## [2]x2 -0.761 (<.001) 1.00
## [3]y -0.107 (0.288) 0.727 (<.001) 1.00
## ──────────────────────────────────────────
df3 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.753e-16 -9.747e-17 -9.920e-18 4.417e-17 3.144e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.331e-17 3.683e-17 -9.040e-01 0.368
## x1 1.000e+00 5.734e-17 1.744e+16 <2e-16 ***
## x2 1.000e+00 3.960e-17 2.525e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.665e-16 on 97 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.225e+32 on 2 and 97 DF, p-value: < 2.2e-16
Is multicollinearity a problem in df4? How can you tell?

df4 %>%
  furniture::tableC()
##
## ──────────────────────────────────────────
## [1] [2] [3]
## [1]x1 1.00
## [2]x2 -0.995 (<.001) 1.00
## [3]y -0.034 (0.739) 0.135 (0.181) 1.00
## ──────────────────────────────────────────
df4 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.075e-15 -2.265e-17 1.953e-17 6.881e-17 7.759e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.110e-16 2.518e-17 4.409e+00 2.69e-05 ***
## x1 1.000e+00 2.445e-16 4.090e+15 < 2e-16 ***
## x2 1.000e+00 2.424e-16 4.126e+15 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.462e-16 on 97 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.521e+30 on 2 and 97 DF, p-value: < 2.2e-16
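The df data sets come ready-made, but it may help to see how data like df4 could be simulated. This is a hypothetical sketch (the actual generating code isn't shown in the chapter):

set.seed(1)
x1 <- rnorm(100)
x2 <- -x1 + rnorm(100, sd = 0.1)   # near-perfect negative correlation with x1
y  <- x1 + x2                      # y is the exact sum, so R-squared is 1
df4_sim <- data.frame(x1, x2, y)
cor(df4_sim$x1, df4_sim$x2)        # close to -1, mirroring the printout above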
Prediction is an important piece of research. Often, we don't need to know the mechanism; we just want to be able to predict behavior, outcomes, or precursors. In these situations, regression is a great starting point for pursuing that goal.