```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Introduction
Chapter 7 talks about using regression as a prediction tool (i.e., for predictive modeling). The following examples help illustrate a few items that were discussed. Follow all instructions to complete Chapter 7.
## Prediction with Linear Regression
1. Let's start by loading the `tidyverse` package (you can ignore the startup messages it prints when it loads), the `furniture` package, the `educ7610` package, and the `caret` package. The `caret` package provides almost everything we need to do predictive modeling with regression (or many other approaches).
```{r}
library(tidyverse)
library(furniture)
library(caret)
library(educ7610)
```
2. Import the GSS data set.
```{r}
data(gss)
```
3. Let's start predicting `income06` with several of the variables in the data set. The first thing we need to do is deal with the missing values in the variables that will be used for the prediction. There are several ways to handle missing data, but we will take the easiest route and simply remove the incomplete rows (often not recommended; see "multiple imputation").
```{r}
gss_no_na <- gss %>%
  select(income06, age, educ, degree, sex, race, hompop,
         marital, partyid, natspac,
         tvhours) %>%
  filter(complete.cases(income06, age, educ, degree, sex, race, hompop,
                        marital, partyid, natspac,
                        tvhours))
```
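To see what this kind of listwise deletion does, here is a minimal sketch on a made-up data frame. The `toy` data and its column names are hypothetical and are not part of the GSS data.

```{r}
# Hypothetical toy data with a few missing values (not the GSS data)
toy <- data.frame(
  income = c(10, NA, 30, 40),
  age    = c(25, 32, NA, 51),
  educ   = c(12, 16, 14, 18)
)

# complete.cases() returns TRUE for rows with no missing values
keep <- complete.cases(toy)
keep

# Listwise deletion keeps only the complete rows
toy_no_na <- toy[keep, ]
nrow(toy)        # rows before removal
nrow(toy_no_na)  # rows after removal
```

Comparing the row counts before and after is a quick way to check how much data listwise deletion costs.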
4. Because part of our analysis is random (cross-validation splits the data at random), let's set our random seed so we can replicate our results.
```{r}
set.seed(843)
```
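As a small illustration of why the seed matters, the sketch below draws random numbers twice from the same seed and checks that the draws match. The object names `first_draw` and `second_draw` are made up for this example.

```{r}
# Draw three random numbers after setting the seed
set.seed(843)
first_draw <- rnorm(3)

# Setting the same seed again reproduces exactly the same numbers
set.seed(843)
second_draw <- rnorm(3)

identical(first_draw, second_draw)

# Reset the seed so the steps that follow start from the same state
set.seed(843)
```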
5. Now we can set up our cross-validation using `trainControl()`.
```{r}
fitControl <- trainControl(
  ## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10)
```
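To make the idea behind cross-validation concrete, here is a hand-rolled sketch of a single pass of 5-fold cross-validation using the built-in `mtcars` data. It is an illustration under simplifying assumptions (one pass, five folds, an arbitrary model), not a description of how `caret` works internally; the names `folds` and `rmse_by_fold` are made up for this example.

```{r}
set.seed(843)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, predict the held-out fold
rmse_by_fold <- sapply(1:k, function(i) {
  train_rows <- mtcars[folds != i, ]
  test_rows  <- mtcars[folds == i, ]
  model <- lm(mpg ~ wt + hp, data = train_rows)
  pred  <- predict(model, newdata = test_rows)
  sqrt(mean((test_rows$mpg - pred)^2))
})

rmse_by_fold        # out-of-sample RMSE for each fold
mean(rmse_by_fold)  # average across folds

# Reset the seed so the caret results below are unaffected by this sketch
set.seed(843)
```

Repeated cross-validation, as requested with `method = "repeatedcv"`, runs this kind of procedure several times with different random splits and averages the results.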
6. Using that, we can now fit our cross-validated regression models. The `method = "lm"` option uses linear regression just like the regression that we've discussed, but the focus is on prediction rather than on the coefficients.
```{r}
fit <- train(income06 ~ .,
             data = gss_no_na,
             method = "lm",
             trControl = fitControl)
fit
```
7. We are given some pieces of information here. It tells us our sample size, how many predictors we've included, that we've used cross-validation, and then gives us an RMSE (root mean squared error), $R^2$, and MAE (mean absolute error). These are all measures of the accuracy of prediction. How much of the variation in `income06` are we explaining with our 10 predictor variables?
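The three accuracy measures can be computed by hand. The sketch below uses made-up `observed` and `predicted` vectors (hypothetical values, not GSS results) and takes $R^2$ to be the squared correlation between observed and predicted values, which is one common definition.

```{r}
# Hypothetical observed and predicted values (not from the GSS model)
observed  <- c(3, 5, 7, 9)
predicted <- c(2.5, 5.5, 6, 10)

errors <- observed - predicted

rmse <- sqrt(mean(errors^2))         # root mean squared error
mae  <- mean(abs(errors))            # mean absolute error
r2   <- cor(observed, predicted)^2   # squared correlation

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

RMSE and MAE are in the units of the outcome, so smaller is better; $R^2$ is unitless, so larger is better.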
## Selection Approaches
Notably, these selection approaches cannot be used for causal interpretation of the individual coefficients or predictors.
8. There are several predictor selection approaches. Among these, the Lasso (least absolute shrinkage and selection operator) is one of the most useful. The Lasso is built on linear regression but adds a penalty term that allows it to select the variables that are important for prediction. Let's use the `method = "glmnet"` option with `caret`, which fits a blend of the lasso and ridge penalties (the elastic net).
```{r}
fit2 <- train(income06 ~ .,
              data = gss_no_na,
              method = "glmnet",
              trControl = fitControl)
fit2
```
9. We are given several pieces of information, including the levels of the tuning parameters (aspects of the model that can be adjusted beyond that of linear regression) and their corresponding accuracy levels. How accurate is the best model?
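For reference, the two tuning parameters are `alpha` and `lambda`. As a sketch of what they control, the penalized least-squares objective documented for `glmnet` can be written as follows, where the notation ($N$ observations, intercept $\beta_0$, coefficient vector $\beta$) is introduced here for illustration:

$$
\min_{\beta_0,\, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^\top \beta \right)^2
+ \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$

Here `lambda` sets the overall strength of the penalty, and `alpha` mixes the two penalty types: `alpha = 1` gives the lasso and `alpha = 0` gives ridge regression.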
10. We can further investigate this by seeing how many/which predictors were selected for the best model. We can do that by using `coef()` while using the `s` argument where we give the best lambda value from above (`33.55241`).
```{r}
coef(fit2$finalModel,
     s = 33.55241)
```
11. Almost all the variables in the model were important in this case. Which variable was not selected?
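Why can the lasso set a coefficient to exactly zero? In the special case of orthonormal predictors, the lasso estimate is the least-squares estimate pushed toward zero by a fixed amount and truncated at zero (soft-thresholding). The sketch below illustrates that special case only; the function `soft_threshold()` and the coefficient values are hypothetical, and this is not how `glmnet` is called in practice.

```{r}
# Soft-thresholding: shrink toward zero by lambda, truncating at zero
soft_threshold <- function(b, lambda) {
  sign(b) * pmax(abs(b) - lambda, 0)
}

# Hypothetical least-squares coefficients
ols <- c(x1 = 2.0, x2 = -0.3, x3 = 0.8)

# Coefficients smaller in size than lambda are set to exactly zero
shrunk <- soft_threshold(ols, lambda = 0.5)
shrunk
```

A predictor whose coefficient is shrunk all the way to zero is effectively dropped from the model, which is the sense in which the lasso "selects" variables.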
12. Another predictor selection approach is stepwise selection (as discussed in the book). It has been shown to be unreliable at identifying the correct predictors, but it can still be helpful for prediction. We'll do this below with `method = "leapSeq"`.
```{r}
fit3 <- train(income06 ~ .,
              data = gss_no_na,
              method = "leapSeq",
              trControl = fitControl)
fit3
```
13. Here, `nvmax` is the maximum number of predictors allowed in the model. Which of the values tried (2 to 4) gave the best model?
14. How much of the variability can we explain with just 4 predictors in the model?
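For a feel for how stepwise selection proceeds, here is a minimal sketch using base R's `step()` on the built-in `mtcars` data. Note that this is a swapped-in technique: `step()` adds and drops predictors based on AIC, whereas `leapSeq` above is tuned by cross-validation, so the two will not necessarily agree. The candidate predictors are an arbitrary choice for illustration.

```{r}
# Start from an intercept-only model
null_model <- lm(mpg ~ 1, data = mtcars)

# Let step() add or drop predictors from a candidate set, judged by AIC
stepped <- step(null_model,
                scope = list(lower = ~ 1,
                             upper = ~ cyl + disp + hp + wt + qsec),
                direction = "both",
                trace = 0)

formula(stepped)                          # the selected model
n_selected <- length(coef(stepped)) - 1   # number of slopes retained
n_selected
```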
## Predictor Configurations
15. To help illustrate these various configurations, we will work with the following four data sets:
- `df1`
- `df2`
- `df3`
- `df4`
```{r, echo = FALSE}
set.seed(843)
## Independence
df1 <- tibble(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = x1 + x2
)
## Partial Redundancy
df2 <- tibble(
  x1 = rnorm(100),
  x2 = x1 + rnorm(100),
  y = x1 + x2
)
## Suppression
df3 <- tibble(
  x1 = rnorm(100),
  x2 = -x1 + rnorm(100),
  y = x1 + x2
)
## Complementarity
df4 <- tibble(
  x1 = rnorm(100),
  x2 = -x1 + .1*rnorm(100),
  y = x1 + x2
)
```
16. What configuration is `df1`? How can you tell?
```{r}
df1 %>%
  furniture::tableC()
df1 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
```
17. What configuration is `df2`? How can you tell?
```{r}
df2 %>%
  furniture::tableC()
df2 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
```
18. What configuration is `df3`? How can you tell?
```{r}
df3 %>%
  furniture::tableC()
df3 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
```
19. What configuration is `df4`? How can you tell?
```{r}
df4 %>%
  furniture::tableC()
df4 %>%
  lm(y ~ x1 + x2,
     data = .) %>%
  summary()
```
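One way to reason about these questions is to compare a predictor's simple (zero-order) slope with its partial slope from the multiple regression, alongside the correlation between the predictors. The sketch below does this on freshly simulated data; the helper `compare_slopes()`, the seed, and the three scenarios are hypothetical illustrations and are not the four data sets above.

```{r}
set.seed(2024)  # arbitrary seed for this illustration only
n  <- 100
x1 <- rnorm(n)

# Hypothetical helper: simple vs. partial slope of x1 when y = x1 + x2
compare_slopes <- function(x1, x2) {
  y <- x1 + x2
  c(r_x1_x2 = cor(x1, x2),
    simple  = unname(coef(lm(y ~ x1))["x1"]),
    partial = unname(coef(lm(y ~ x1 + x2))["x1"]))
}

uncorrelated <- compare_slopes(x1, rnorm(n))        # x2 unrelated to x1
positive     <- compare_slopes(x1, x1 + rnorm(n))   # x2 overlaps with x1
negative     <- compare_slopes(x1, -x1 + rnorm(n))  # x2 opposes x1

round(rbind(uncorrelated, positive, negative), 2)
```

When the simple slope is larger than the partial slope, the predictors share some of their predictive information; when the simple slope is smaller than the partial slope, controlling for the other predictor reveals an effect that the zero-order relationship hides.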
## Conclusion
Prediction is an important piece of research. Often, we don't need to know the mechanism, but we want to be able to predict behavior, outcomes, or precursors. In these situations, regression is a great starting point.