Introduction

Chapter 3 introduces multiple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 3.

Multiple Regression

  1. Download the GSS_reduced_example.csv data set from Canvas or tysonbarrett.com/teaching to your computer. Save it in a directory you can access fairly easily.
  2. Open RStudio, start a new R script or RMarkdown document.
  3. Load the tidyverse package (you can ignore the notes that you see below that it gives you once you load it).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 3.0.0.9000      ✔ purrr   0.2.5      
## ✔ tibble  1.4.99.9004     ✔ dplyr   0.7.99.9000
## ✔ tidyr   0.8.1           ✔ stringr 1.3.1      
## ✔ readr   1.2.0           ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
  1. Import it into R.
gss <- read.csv("GSS_reduced_example.csv")
  1. Use a multiple regression model to assess the effect of the number of years of education (educ) on income (income06) while controlling for home population (hompop).
gss %>%
  lm(income06 ~ educ + hompop,
     data = .)
## 
## Call:
## lm(formula = income06 ~ educ + hompop, data = .)
## 
## Coefficients:
## (Intercept)         educ       hompop  
##      -18417         4286         7125
  1. This output gives you three values, one for the intercept, one for the slope of educ, and another for the slope of hompop. What does the slope of educ mean here (i.e., interpret the value)?

Partial Effects (partial regression, partial standardized effect, and partial correlation)

To better understand how educ and hompop will change the other’s simple effect to a partial effect, we can first check the correlation between the two. As a reminder, if the correlation between them is non-zero and they both are correlated with the outcome, the simple effect and the partial effect will differ (at least by a little).

  1. Using the GSS data you already imported above, let’s look at the correlation between income, years of education, and home population.
gss %>%
  furniture::tableC(income06, educ, hompop,
                    na.rm = TRUE)
## N = 1774
## Note: pearson correlation (p-value).
## 
## ────────────────────────────────────────────────
##              [1]           [2]            [3]  
##  [1]income06 1.00                              
##  [2]educ     0.341 (<.001) 1.00                
##  [3]hompop   0.207 (<.001) -0.058 (0.015) 1.00 
## ────────────────────────────────────────────────
  1. That is not a big correlation but it isn’t equal to zero. What does such a small correlation do to the simple effects of both educ and hompop? That is, run both simple regressions and compare to the results from the multiple regression earlier.
gss %>%
  lm(income06 ~ educ,
     data = .)
## 
## Call:
## lm(formula = income06 ~ educ, data = .)
## 
## Coefficients:
## (Intercept)         educ  
##        1742         4127
gss %>%
  lm(income06 ~ hompop,
     data = .)
## 
## Call:
## lm(formula = income06 ~ hompop, data = .)
## 
## Coefficients:
## (Intercept)       hompop  
##       41206         6486
  1. Let’s run a multiple regression with years of education (educ) on income (income06) while controlling for home population (hompop) and age (age).
gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .)
## 
## Call:
## lm(formula = income06 ~ educ + hompop + age, data = .)
## 
## Coefficients:
## (Intercept)         educ       hompop          age  
##    -33492.0       4319.8       8222.6        251.2
  1. How is the effect of education interpreted in this case?
  2. Let’s compare this to the regression value after we standardize both variables with scale(). We first have to grab just the complete cases of the variables first (again, consider why).
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(incomeZ = scale(income06) %>% as.numeric,
         educZ   = scale(educ) %>% as.numeric,
         hompopZ = scale(hompop) %>% as.numeric,
         ageZ    = scale(age) %>% as.numeric) %>%
  lm(incomeZ ~ educZ + hompopZ + ageZ,
     data = .)
## 
## Call:
## lm(formula = incomeZ ~ educZ + hompopZ + ageZ, data = .)
## 
## Coefficients:
## (Intercept)        educZ      hompopZ         ageZ  
##  -1.196e-16    3.568e-01    2.628e-01    9.756e-02
  1. The intercept is essentially zero again. Now, how is the estimate of education interpreted in this case?
  2. Let’s compare this to the other way of standardizing the estimates (\(b_{standardized} = b \frac{s_x}{s_y}\)).
sds <- gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  summarize(s_educ = sd(educ),
            s_hom  = sd(hompop),
            s_age  = sd(age),
            s_inc  = sd(income06))
gss %>%
  lm(income06 ~ educ + hompop + age,
     data = .) %>%
  coef() %>%
  .[-1] * sds[,1:3]/sds[[4]]
##      s_educ     s_hom     s_age
## 1 0.3568421 0.2628036 0.0975636
  1. Although it was a little bit more of a messy computation, the results for the estimates are the same.
  2. Finally, let’s do a partial correlation.
gss %>%
  filter(complete.cases(income06, educ, hompop, age)) %>%
  mutate(residincom  = lm(income06 ~ hompop + age) %>% resid,
         resideduc   = lm(educ ~ hompop + age) %>% resid) %>%
  furniture::tableC(residincom, resideduc)
## N = 1774
## Note: pearson correlation (p-value).
## 
## ───────────────────────────────────
##                [1]           [2]  
##  [1]residincom 1.00               
##  [2]resideduc  0.365 (<.001) 1.00 
## ───────────────────────────────────
  1. These values differ from the standardized regression coefficients. Why?
  2. Consider if we had a variable that we wanted to control for (let’s call it \(c\)) but didn’t have access to it. We report on the regression below with \(x\) predicting \(y\). If we know that the correlation between \(x\) and \(c\) is positive and the correlation between \(y\) and \(c\) is positive, will the estimate on \(x\) go up, down, or stay the same?
## Do not edit this part
set.seed(843)
df <- data_frame(
  x = rnorm(100),
  y = 2*x + rnorm(100, 2, 5)
)

df %>%
  lm(y ~ x, data = .)
## 
## Call:
## lm(formula = y ~ x, data = .)
## 
## Coefficients:
## (Intercept)            x  
##       1.912        1.798

Conclusion

This was an introduction to some of the features of multiple regression that we’ll be using throughout the class. Although not much of a workflow here, each piece will play a role in larger analyses.