EDUC/PSY 7610

Chapter 3 introduces multiple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 3.

- Download the
`GSS_reduced_example.csv`

data set from Canvas or tysonbarrett.com/teaching to your computer. Save it in a directory you can access fairly easily. - Open RStudio, start a new R script or RMarkdown document.
- Load the
`tidyverse`

package.

`library(tidyverse)`

- Import it into R.

`gss <- read.csv("GSS_reduced_example.csv")`

- Use a multiple regression model to assess the effect of the number of years of education (
`educ`

) on income (`income06`

) while controlling for home population (`hompop`

).

```
gss %>%
lm(income06 ~ educ + hompop,
data = .)
```

```
##
## Call:
## lm(formula = income06 ~ educ + hompop, data = .)
##
## Coefficients:
## (Intercept) educ hompop
## -18417 4286 7125
```

- This output gives you three values, one for the intercept, one for the slope of
`educ`

, and another for the slope of`hompop`

. What does the slope of`educ`

mean here (i.e., interpret the value)?

To better understand how `educ`

and `hompop`

will change the other’s simple effect to a partial effect, we can first check the correlation between the two. As a reminder, if the correlation between them is non-zero and they both are correlated with the outcome, the simple effect and the partial effect will differ (at least by a little).

- Using the GSS data you already imported above, let’s look at the correlation between income, years of education, and home population.

```
gss %>%
furniture::tableC(income06, educ, hompop,
na.rm = TRUE)
```

```
##
## ────────────────────────────────────────────────
## [1] [2] [3]
## [1]income06 1.00
## [2]educ 0.341 (<.001) 1.00
## [3]hompop 0.207 (<.001) -0.058 (0.015) 1.00
## ────────────────────────────────────────────────
```

- That is not a big correlation but it isn’t equal to zero. What does such a small correlation do to the simple effects of both
`educ`

and`hompop`

? That is, run both simple regressions and compare to the results from the multiple regression earlier.

```
gss %>%
lm(income06 ~ educ,
data = .)
```

```
##
## Call:
## lm(formula = income06 ~ educ, data = .)
##
## Coefficients:
## (Intercept) educ
## 1742 4127
```

```
gss %>%
lm(income06 ~ hompop,
data = .)
```

```
##
## Call:
## lm(formula = income06 ~ hompop, data = .)
##
## Coefficients:
## (Intercept) hompop
## 41206 6486
```

- Let’s run a multiple regression with years of education (
`educ`

) on income (`income06`

) while controlling for home population (`hompop`

) and age (`age`

).

```
gss %>%
lm(income06 ~ educ + hompop + age,
data = .)
```

```
##
## Call:
## lm(formula = income06 ~ educ + hompop + age, data = .)
##
## Coefficients:
## (Intercept) educ hompop age
## -33492.0 4319.8 8222.6 251.2
```

- How is the effect of education interpreted in this case?
- Let’s compare this to the regression value after we standardize both variables with
`scale()`

. We first have to grab just the complete cases of the variables first (again, consider why).

```
gss %>%
filter(complete.cases(income06, educ, hompop, age)) %>%
mutate(incomeZ = scale(income06) %>% as.numeric,
educZ = scale(educ) %>% as.numeric,
hompopZ = scale(hompop) %>% as.numeric,
ageZ = scale(age) %>% as.numeric) %>%
lm(incomeZ ~ educZ + hompopZ + ageZ,
data = .)
```

```
##
## Call:
## lm(formula = incomeZ ~ educZ + hompopZ + ageZ, data = .)
##
## Coefficients:
## (Intercept) educZ hompopZ ageZ
## -1.196e-16 3.568e-01 2.628e-01 9.756e-02
```

- The intercept is essentially zero again. Now, how is the estimate of education interpreted in this case?
- Let’s compare this to the other way of standardizing the estimates (\(b_{standardized} = b \frac{s_x}{s_y}\)).

```
sds <- gss %>%
filter(complete.cases(income06, educ, hompop, age)) %>%
summarize(s_educ = sd(educ),
s_hom = sd(hompop),
s_age = sd(age),
s_inc = sd(income06))
gss %>%
lm(income06 ~ educ + hompop + age,
data = .) %>%
coef() %>%
.[-1] * sds[,1:3]/sds[[4]]
```

```
## s_educ s_hom s_age
## 1 0.3568421 0.2628036 0.0975636
```

- Although it was a little bit more of a messy computation, the results for the estimates are the same.
- Finally, let’s do a partial correlation.

```
gss %>%
filter(complete.cases(income06, educ, hompop, age)) %>%
mutate(residincom = lm(income06 ~ hompop + age) %>% resid,
resideduc = lm(educ ~ hompop + age) %>% resid) %>%
furniture::tableC(residincom, resideduc)
```

```
##
## ───────────────────────────────────
## [1] [2]
## [1]residincom 1.00
## [2]resideduc 0.365 (<.001) 1.00
## ───────────────────────────────────
```

- These values differ from the standardized regression coefficients. Why?
- Consider if we had a variable that we wanted to control for (let’s call it \(c\)) but didn’t have access to it. We report on the regression below with \(x\) predicting \(y\). If we know that the correlation between \(x\) and \(c\) is positive and the correlation between \(y\) and \(c\) is positive, will the estimate on \(x\) go up, down, or stay the same?

```
## Do not edit this part
set.seed(843)
df <- data_frame(
x = rnorm(100),
y = 2*x + rnorm(100, 2, 5)
)
df %>%
lm(y ~ x, data = .)
```

```
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 1.912 1.798
```

This was an introduction to some of the features of multiple regression that we’ll be using throughout the class. Although not much of a workflow here, each piece will play a role in larger analyses.