```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Chapter 3 introduces multiple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 3.
## Multiple Regression
1. Download the `GSS_reduced_example.csv` data set from [Canvas](https://login.usu.edu/cas/login?service=https%3a%2f%2fmy.usu.edu%2f) or [tysonbarrett.com/teaching](https://tysonstanley.github.io/teaching) to your computer. Save it in a directory you can access fairly easily.
2. Open RStudio and start a new R script or RMarkdown document.
3. Load the `tidyverse` package (you can ignore the startup messages it prints when it loads).
```{r}
library(tidyverse)
```
4. Import the data set into R.
```{r, eval = FALSE}
gss <- read.csv("GSS_reduced_example.csv")
```
```{r, echo = FALSE}
gss <- read.csv(here::here("GSS_Data/Data/GSS_reduced_example.csv"))
```
5. Use a multiple regression model to assess the effect of the number of years of education (`educ`) on income (`income06`) while controlling for home population (`hompop`).
```{r}
gss %>%
  lm(income06 ~ educ + hompop,
     data = .)
```
6. This output gives you three values: one for the intercept, one for the slope of `educ`, and one for the slope of `hompop`. What does the slope of `educ` mean here (i.e., interpret the value)?
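Printing a fitted `lm` object only shows the coefficients. If you also want standard errors, t-values, and p-values for the model above, one option is to wrap the fit in `summary()` -- a quick sketch:

```{r}
## summary() reports standard errors, t-values, and p-values
## alongside the coefficient estimates shown above.
gss %>%
  lm(income06 ~ educ + hompop,
     data = .) %>%
  summary()
```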
## Partial Effects (partial regression, partial standardized effect, and partial correlation)
To better understand how `educ` and `hompop` each change the other's simple effect into a partial effect, we can first check the correlation between the two. As a reminder, if the correlation between them is non-zero and both are correlated with the outcome, the simple effect and the partial effect will differ (at least by a little).
1. Using the GSS data you already imported above, let's look at the correlation between income, years of education, and home population.
```{r}
gss %>%
  furniture::tableC(income06, educ, hompop,
                    na.rm = TRUE)
```
2. Those are not big correlations, but they aren't equal to zero. What does such a small correlation do to the simple effects of both `educ` and `hompop`? That is, run both simple regressions and compare the results to the multiple regression from earlier.
```{r}
gss %>%
  lm(income06 ~ educ,
     data = .)
gss %>%
  lm(income06 ~ hompop,
     data = .)
```
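To see the shift from simple to partial directly, one option is to pull just the `educ` estimate out of each fitted model with `coef()` -- a quick sketch:

```{r}
## Extract the educ slope from the simple regression and from the
## multiple regression so the two estimates sit side by side.
coef(lm(income06 ~ educ, data = gss))["educ"]           ## simple effect
coef(lm(income06 ~ educ + hompop, data = gss))["educ"]  ## partial effect
```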
3. Let's run a multiple regression with years of education (`educ`) on income (`income06`) while controlling for home population (`hompop`) and age (`age`).
```{r}
gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .)
```
4. How is the effect of education interpreted in this case?
5. Let's compare this to the regression estimates after we standardize the variables with `scale()`. We first have to grab just the complete cases of the variables (again, consider why).
```{r}
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(incomeZ = scale(income06) %>% as.numeric,
         educZ   = scale(educ) %>% as.numeric,
         hompopZ = scale(hompop) %>% as.numeric,
         ageZ    = scale(age) %>% as.numeric) %>%
  lm(incomeZ ~ educZ + hompopZ + ageZ,
     data = .)
```
6. The intercept is essentially zero again. Now, how is the estimate of education interpreted in this case?
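Why is the intercept essentially zero? `scale()` subtracts each variable's mean and divides by its standard deviation, and an OLS regression line always passes through the variable means. A self-contained sketch with simulated data (not the GSS data):

```{r}
## A standardized variable has mean 0 and SD 1, so when every variable
## in the model is standardized, the fitted intercept is forced to ~0.
set.seed(1)
v  <- rnorm(50, mean = 10, sd = 3)
vZ <- as.numeric(scale(v))
mean(vZ)  ## essentially 0
sd(vZ)    ## exactly 1
```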
7. Let's compare this to the other way of standardizing the estimates ($b_{standardized} = b \frac{s_x}{s_y}$).
```{r}
sds <- gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  summarize(s_educ = sd(educ),
            s_hom  = sd(hompop),
            s_age  = sd(age),
            s_inc  = sd(income06))
gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .) %>%
  coef() %>%
  .[-1] * sds[,1:3]/sds[,4]
```
8. Although the computation was a little messier, the estimates come out the same.
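As a self-contained check on that claim (simulated data, not the GSS data), both routes to a standardized slope agree:

```{r}
## Standardizing the variables before fitting, versus rescaling the raw
## slope by s_x / s_y afterward, yields the same standardized estimate.
set.seed(7)
x <- rnorm(100)
z <- rnorm(100)
y <- x + z + rnorm(100)
b_scaled   <- coef(lm(scale(y) ~ scale(x) + scale(z)))[2]
b_rescaled <- coef(lm(y ~ x + z))["x"] * sd(x) / sd(y)
c(b_scaled, b_rescaled)  ## identical up to floating-point error
```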
9. Finally, let's do a partial correlation.
```{r}
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(residincom = lm(income06 ~ hompop + age) %>% resid,
         resideduc  = lm(educ ~ hompop + age) %>% resid) %>%
  furniture::tableC(residincom, resideduc)
```
10. These values differ from the standardized regression coefficients. Why?
11. Consider if we had a variable that we wanted to control for (let's call it $c$) but didn't have access to it. We report on the regression below with $x$ predicting $y$. If we know that the correlation between $x$ and $c$ is positive and the correlation between $y$ and $c$ is positive, will the estimate on $x$ go up, down, or stay the same?
```{r}
## Do not edit this part
set.seed(843)
df <- tibble(
  x = rnorm(100),
  y = 2*x + rnorm(100, 2, 5)
)
df %>%
  lm(y ~ x, data = .)
```
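One way to build intuition here is a self-contained simulation (hypothetical data, separate from the chunk above) in which a confounder -- called `c_var` below to avoid clashing with R's `c()` function -- is positively correlated with both the predictor and the outcome:

```{r}
## Omitted variable bias: when the omitted variable is positively
## correlated with both x and y, the simple slope of x absorbs part of
## the confounder's effect.
set.seed(123)
c_var <- rnorm(200)
x <- 0.5 * c_var + rnorm(200)
y <- 2 * x + 1.5 * c_var + rnorm(200)
coef(lm(y ~ x))["x"]          ## simple effect: above the true slope of 2
coef(lm(y ~ x + c_var))["x"]  ## controlling for c_var: close to 2
```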
## Conclusion
This was an introduction to some of the features of multiple regression that we'll use throughout the class. Although this wasn't much of a workflow on its own, each piece will play a role in larger analyses.