```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Introduction

Chapter 3 introduces multiple linear regression. The following examples illustrate several of the principles discussed in the chapter. Follow all instructions to complete Chapter 3.

## Multiple Regression

1. Download the `GSS_reduced_example.csv` data set from [Canvas](https://login.usu.edu/cas/login?service=https%3a%2f%2fmy.usu.edu%2f) or [tysonbarrett.com/teaching](https://tysonstanley.github.io/teaching) to your computer. Save it in a directory you can access fairly easily.
2. Open RStudio and start a new R script or RMarkdown document.
3. Load the `tidyverse` package.
```{r}
library(tidyverse)
```
4. Import the data set into R.
```{r, eval = FALSE}
gss <- read.csv("GSS_reduced_example.csv")
```
```{r, echo = FALSE}
gss <- read.csv(here::here("GSS_Data/Data/GSS_reduced_example.csv"))
```
5. Use a multiple regression model to assess the effect of the number of years of education (`educ`) on income (`income06`) while controlling for home population (`hompop`).
```{r}
gss %>%
  lm(income06 ~ educ + hompop,
     data = .)
```
6. This output gives you three values: one for the intercept, one for the slope of `educ`, and one for the slope of `hompop`. What does the slope of `educ` mean here (i.e., interpret the value)?
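
One way to make the meaning of a slope in a multiple regression concrete is to compare two predictions that differ by one unit on a single predictor while the other predictor is held at the same value. The sketch below uses simulated data and base R only; the data frame `sim`, its columns `x1`, `x2`, and `y`, and the coefficients used to generate them are made up for illustration and are not part of the GSS file.

```{r}
# Simulated data: all names and values here are hypothetical, not from the GSS file
set.seed(1)
n_sim <- 200
sim   <- data.frame(x1 = rnorm(n_sim, mean = 13, sd = 3),
                    x2 = rpois(n_sim, lambda = 3))
sim$y <- 5 + 2 * sim$x1 - 1 * sim$x2 + rnorm(n_sim)

sim_fit <- lm(y ~ x1 + x2, data = sim)

# Two hypothetical cases: x1 differs by one unit, x2 is held at the same value
new_cases <- data.frame(x1 = c(12, 13), x2 = c(3, 3))
preds     <- predict(sim_fit, newdata = new_cases)

# The difference between the two predictions equals the slope of x1
slope_x1  <- coef(sim_fit)[["x1"]]
pred_diff <- unname(preds[2] - preds[1])
all.equal(pred_diff, slope_x1)
```

The same comparison could be made at any other fixed value of `x2`; because the model has no interaction term, the difference would not change.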

## Partial Effects (partial regression, partial standardized effect, and partial correlation)

To better understand how including `educ` and `hompop` together changes each one's simple effect into a partial effect, we can first check the correlation between the two. As a reminder, if the correlation between them is non-zero and both are correlated with the outcome, the simple effect and the partial effect will differ (at least by a little).
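
The reminder above can be checked on simulated data before turning to the GSS variables. This is a minimal sketch, assuming two made-up predictors (`p1`, `p2`) that are correlated with each other and that both relate to a made-up outcome (`out`); none of these names or values come from the data set. It also checks the algebraic link between the two slopes: the simple slope equals the partial slope plus the other predictor's partial slope times the slope from regressing that predictor on the first.

```{r}
# Simulated data: names and generating values are hypothetical
set.seed(42)
n_obs <- 5000
p1    <- rnorm(n_obs)
p2    <- 0.5 * p1 + rnorm(n_obs)          # p2 is correlated with p1
out   <- 1 * p1 + 2 * p2 + rnorm(n_obs)   # both predictors relate to the outcome

simple_slope  <- coef(lm(out ~ p1))[["p1"]]        # p1 alone
partial_slope <- coef(lm(out ~ p1 + p2))[["p1"]]   # p1 controlling for p2

# Pieces of the link between the simple and the partial slope
b_p2  <- coef(lm(out ~ p1 + p2))[["p2"]]   # partial slope of p2
delta <- coef(lm(p2 ~ p1))[["p1"]]         # slope of p2 regressed on p1

c(simple = simple_slope, partial = partial_slope)
all.equal(simple_slope, partial_slope + b_p2 * delta)
```

If `delta` were exactly zero in the sample, the last term would vanish and the simple and partial slopes would coincide.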

1. Using the GSS data you imported above, let's look at the correlations among income, years of education, and home population.
```{r}
gss %>%
  furniture::tableC(income06, educ, hompop,
                    na.rm = TRUE)
```
2. The correlation between `educ` and `hompop` is not large, but it isn't equal to zero. What does such a small correlation do to the simple effects of both `educ` and `hompop`? To find out, run both simple regressions and compare them to the results from the multiple regression earlier.
```{r}
gss %>%
  lm(income06 ~ educ,
     data = .)
gss %>%
  lm(income06 ~ hompop,
     data = .)
```
3. Let's run a multiple regression to assess the effect of years of education (`educ`) on income (`income06`) while controlling for home population (`hompop`) and age (`age`).
```{r}
gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .)
```
4. How is the effect of education interpreted in this case?
5. Let's compare this to the regression estimates after we standardize all of the variables with `scale()`. We first have to keep only the complete cases of these variables (again, consider why).
```{r}
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(incomeZ = scale(income06) %>% as.numeric,
         educZ   = scale(educ) %>% as.numeric,
         hompopZ = scale(hompop) %>% as.numeric,
         ageZ    = scale(age) %>% as.numeric) %>%
  lm(incomeZ ~ educZ + hompopZ + ageZ,
     data = .)
```
6. The intercept is essentially zero again. Now, how is the estimate of education interpreted in this case?
7. Let's compare this to the other way of standardizing the estimates ($b_{standardized} = b \frac{s_x}{s_y}$).
```{r}
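# Standard deviations computed on the same complete cases that lm() uses
sds <- gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  summarize(s_educ = sd(educ),
            s_hom  = sd(hompop),
            s_age  = sd(age),
            s_inc  = sd(income06))
# Unstandardized slopes (intercept dropped), each rescaled by s_x / s_y
fit <- gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .)
coef(fit)[-1] * unlist(sds[, 1:3]) / sds$s_inc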
```
8. Although the computation was a bit messier, the resulting estimates are the same.
9. Finally, let's do a partial correlation.
```{r}
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(residincom  = lm(income06 ~ hompop + age) %>% resid,
         resideduc   = lm(educ ~ hompop + age) %>% resid) %>%
  furniture::tableC(residincom, resideduc)
```
10. The partial correlation differs from the standardized regression coefficient for `educ`. Why?
11. Suppose there is a variable we want to control for (let's call it $c$) but we don't have access to it. The regression below reports $x$ predicting $y$. If we know that the correlation between $x$ and $c$ is positive and the correlation between $y$ and $c$ is positive, would the estimate on $x$ go up, go down, or stay the same if we could control for $c$?
```{r}
## Do not edit this part
set.seed(843)
df <- tibble(
  x = rnorm(100),
  y = 2*x + rnorm(100, 2, 5)
)

df %>%
  lm(y ~ x, data = .)
```
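
The standardization and partial correlation steps earlier in this section can also be checked on simulated data, where there are no missing values to filter out. The sketch below is an illustration under stated assumptions: the variables `sx`, `sy`, and `sz` and the helper `zs()` are made up and are not part of the GSS file. It confirms that standardizing first and rescaling afterward ($b \frac{s_x}{s_y}$) give the same estimates, and that the residual-based partial correlation matches the usual closed-form expression for a single control variable.

```{r}
# Simulated data: names and generating values are hypothetical
set.seed(7)
n_chk <- 500
sz <- rnorm(n_chk)                               # control variable
sx <- 0.6 * sz + rnorm(n_chk)                    # predictor of interest
sy <- 1.5 * sx + 0.8 * sz + rnorm(n_chk, sd = 2) # outcome

# Hypothetical helper: z-score a vector and drop the matrix attributes
zs <- function(v) as.numeric(scale(v))

# Route 1: standardize every variable, then regress
b_scaled <- coef(lm(zs(sy) ~ zs(sx) + zs(sz)))[-1]

# Route 2: regress on the raw variables, then rescale each slope by s_x / s_y
b_rescaled <- coef(lm(sy ~ sx + sz))[-1] * c(sd(sx), sd(sz)) / sd(sy)

all.equal(unname(b_scaled), unname(b_rescaled))

# Partial correlation of sx and sy controlling for sz, via residuals
r_resid <- cor(resid(lm(sy ~ sz)), resid(lm(sx ~ sz)))

# The same quantity from the pairwise correlations
r_xy <- cor(sx, sy)
r_xz <- cor(sx, sz)
r_yz <- cor(sy, sz)
r_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

all.equal(r_resid, r_formula)
```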


## Conclusion

This was an introduction to some of the features of multiple regression that we'll be using throughout the class. Although there is not much of a workflow here, each piece will play a role in larger analyses.