Introduction

Chapter 9 covers multicategorical predictors: how we can leverage the information they contain, and how we can avoid common problems with these types of variables.

Multicategorical Variables

  1. Let’s start by loading the car package, the tidyverse package, the furniture package, and the rlm package.
library(car)
## Loading required package: carData
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000      ✔ purrr   0.2.5      
## ✔ tibble  1.4.2.9004      ✔ dplyr   0.7.99.9000
## ✔ tidyr   0.8.1           ✔ stringr 1.3.1      
## ✔ readr   1.2.0           ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
library(furniture)
## ── furniture 1.7.12 ────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
##    Consider using `furniture::` for each function call.
library(rlm)
## ── rlm 0.1.0 ───────────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ rlm attached
## ✔ No potential conflicts found
  2. Import gss into R.
data(gss)
  3. There are two multicategorical variables we want to investigate: educational degree (degree) and race (race). Let’s look at the distribution of each of these variables using furniture::tableF().
gss %>%
  furniture::tableF(degree)
## 
## ─────────────────────────────────────────────────────────
##  degree     Freq CumFreq Percent CumPerc Valid  CumValid
##  <HS        297  297     14.68%  14.68%  14.69% 14.69%  
##  Associates 173  470     8.55%   23.23%  8.56%  23.24%  
##  Bachelors  355  825     17.55%  40.78%  17.56% 40.80%  
##  Graduate   194  1019    9.59%   50.37%  9.59%  50.40%  
##  HS         1003 2022    49.58%  99.95%  49.60% 100.00% 
##  Missing    1    2023    0.05%   100.00%                
## ─────────────────────────────────────────────────────────
gss %>%
  furniture::tableF(race)
## 
## ────────────────────────────────────
##  race  Freq CumFreq Percent CumPerc
##  Black 281  281     13.89%  13.89% 
##  Other 183  464     9.05%   22.94% 
##  White 1559 2023    77.06%  100.00%
## ────────────────────────────────────
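The same distribution check can be sketched in base R without the furniture package. The block below uses the built-in mtcars data purely for illustration (the gss data and tableF()’s "Valid" columns are not reproduced here):

```r
# Frequency, cumulative frequency, and percent for a categorical
# variable, using the built-in mtcars data for illustration.
cyl_factor <- factor(mtcars$cyl, labels = c("Four", "Six", "Eight"))

freq    <- table(cyl_factor)          # counts per level
cumfreq <- cumsum(freq)               # running total of counts
percent <- prop.table(freq) * 100     # percent of all observations

cbind(Freq = freq, CumFreq = cumfreq, Percent = round(percent, 2))
```

Note that tableF() additionally reports "Valid" percentages, which are computed after dropping missing values.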

Multicategorical Predictors

  4. Let’s fit two simple regressions: one with degree predicting income (income06) and another with race predicting income. The lm() function automatically dummy codes factors for us, selecting a reference group. What is the reference group for degree? What about race?
gss %>%
  lm(income06 ~ degree, data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ degree, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91660 -26463  -8713  20072 133037 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         26963       2496  10.801   <2e-16 ***
## degreeAssociates    34661       4050   8.558   <2e-16 ***
## degreeBachelors     52965       3347  15.823   <2e-16 ***
## degreeGraduate      66697       3917  17.026   <2e-16 ***
## degreeHS            23562       2834   8.314   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39710 on 1769 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.193,  Adjusted R-squared:  0.1911 
## F-statistic: 105.7 on 4 and 1769 DF,  p-value: < 2.2e-16
gss %>%
    lm(income06 ~ race, data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ race, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61161 -34161  -6661  20839 122668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    37332       2748  13.587  < 2e-16 ***
## raceOther      17020       4410   3.860 0.000118 ***
## raceWhite      24329       2988   8.144 7.16e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43360 on 1771 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.0366, Adjusted R-squared:  0.03552 
## F-statistic: 33.64 on 2 and 1771 DF,  p-value: 4.565e-15
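Under the hood, lm() expands each factor into indicator ("dummy") columns and drops the first level as the reference. A minimal base-R sketch of this expansion, using a toy factor rather than the gss data:

```r
# lm() dummy codes factors via the design matrix: the first level
# (alphabetical by default) is the reference and gets no column.
race_toy <- factor(c("Black", "White", "Other", "White", "Black"))
levels(race_toy)    # "Black" "Other" "White" -- "Black" is the reference

# model.matrix() shows the design matrix lm() actually fits:
# an intercept column plus one indicator per non-reference level.
model.matrix(~ race_toy)
```

This is why the summaries above show coefficients only for the non-reference levels; the reference level is absorbed into the intercept.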
  5. Are there significant mean differences in income between the non-reference levels and the reference level for either degree or race? What do the estimates mean here?
  6. Let’s fit a multiple regression using both degree and race, and assign the result to fit.
fit <- gss %>%
  lm(income06 ~ degree + race, data = .) 
summary(fit)
## 
## Call:
## lm(formula = income06 ~ degree + race, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92873 -26329  -8415  18738 129171 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         14561       3294   4.420 1.05e-05 ***
## degreeAssociates    33412       4032   8.288 2.26e-16 ***
## degreeBachelors     50433       3351  15.052  < 2e-16 ***
## degreeGraduate      64045       3915  16.360  < 2e-16 ***
## degreeHS            22586       2830   7.980 2.61e-15 ***
## raceOther           13125       4025   3.261  0.00113 ** 
## raceWhite           16267       2743   5.931 3.61e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39340 on 1767 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.2087, Adjusted R-squared:  0.206 
## F-statistic: 77.68 on 6 and 1767 DF,  p-value: < 2.2e-16

Changing the Reference Level

  7. It is often good to control the reference level for optimal interpretability. For example, here we may want “White” to be the reference group for race, since it is the most common level in the data set. To do so, we will use relevel() inside the model formula.
gss %>%
  lm(income06 ~ degree + relevel(race, ref = "White"), data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ degree + relevel(race, ref = "White"), 
##     data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92873 -26329  -8415  18738 129171 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          30829       2612  11.802  < 2e-16 ***
## degreeAssociates                     33412       4032   8.288 2.26e-16 ***
## degreeBachelors                      50433       3351  15.052  < 2e-16 ***
## degreeGraduate                       64045       3915  16.360  < 2e-16 ***
## degreeHS                             22586       2830   7.980 2.61e-15 ***
## relevel(race, ref = "White")Black   -16267       2743  -5.931 3.61e-09 ***
## relevel(race, ref = "White")Other    -3143       3331  -0.943    0.346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39340 on 1767 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.2087, Adjusted R-squared:  0.206 
## F-statistic: 77.68 on 6 and 1767 DF,  p-value: < 2.2e-16
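Instead of calling relevel() inline in the formula, the factor can also be releveled once up front; a small sketch with a toy factor:

```r
# Changing the reference level of a factor once, before modeling:
race_toy <- factor(c("Black", "White", "Other"))
levels(race_toy)                      # "Black" "Other" "White"

race_toy <- relevel(race_toy, ref = "White")
levels(race_toy)                      # "White" "Black" "Other"
# A model fit with this releveled factor reports Black and Other
# relative to White, and the coefficient names stay short
# (e.g., raceBlack rather than relevel(race, ref = "White")Black).
```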

Contrasts and Adjusted Means

Simple Contrasts (Comparing Two Levels)

  8. We are also interested in comparing the non-reference levels with each other. We can use the linearHypothesis() function from the car package to test these comparisons, referring to the coefficient names from fit as shown below. In this case, let’s compare Bachelors and Graduate degree incomes. Are they significantly different?
car::linearHypothesis(fit, "degreeBachelors = degreeGraduate")
## Linear hypothesis test
## 
## Hypothesis:
## degreeBachelors - degreeGraduate = 0
## 
## Model 1: restricted model
## Model 2: income06 ~ degree + race
## 
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   1768 2.7553e+12                                   
## 2   1767 2.7345e+12  1 2.0733e+10 13.397 0.0002594 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
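What linearHypothesis() automates can also be computed by hand from the coefficients and their covariance matrix. A base-R sketch of the Wald test of b1 - b2 = 0, on a toy mtcars model rather than fit:

```r
# Testing equality of two non-reference coefficients by hand:
# t = (b1 - b2) / SE(b1 - b2), with the variance of the difference
# taken from the coefficient covariance matrix.
fit_toy <- lm(mpg ~ factor(cyl), data = mtcars)  # levels: 4 (ref), 6, 8

b <- coef(fit_toy)
V <- vcov(fit_toy)

est <- b["factor(cyl)6"] - b["factor(cyl)8"]
se  <- sqrt(V["factor(cyl)6", "factor(cyl)6"] +
            V["factor(cyl)8", "factor(cyl)8"] -
            2 * V["factor(cyl)6", "factor(cyl)8"])

t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit_toy), lower.tail = FALSE)
c(estimate = est, t = t_stat, p = p_val)
```

For a single contrast like this, the F statistic reported by linearHypothesis() is this t statistic squared.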
  9. What about the “White” race and the “Other” race? Was this comparison any different from when we compared them earlier with “White” as the reference group (in the releveled model above)?
car::linearHypothesis(fit, "raceWhite = raceOther")
## Linear hypothesis test
## 
## Hypothesis:
## - raceOther  + raceWhite = 0
## 
## Model 1: restricted model
## Model 2: income06 ~ degree + race
## 
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1768 2.7359e+12                            
## 2   1767 2.7345e+12  1 1377459213 0.8901 0.3456

Adjusted Means

  10. To finish, let’s fit a model that includes both degree and race, in addition to age (age) and household size (hompop).
gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ degree + race + age + hompop, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85351 -25713  -7068  17518 136581 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -15761.16    4828.44  -3.264  0.00112 ** 
## degreeAssociates  34197.72    3876.13   8.823  < 2e-16 ***
## degreeBachelors   52592.75    3230.16  16.282  < 2e-16 ***
## degreeGraduate    66241.39    3764.17  17.598  < 2e-16 ***
## degreeHS          23629.55    2726.91   8.665  < 2e-16 ***
## raceOther          9801.61    3880.86   2.526  0.01164 *  
## raceWhite         16015.50    2653.17   6.036 1.92e-09 ***
## age                 180.59      57.04   3.166  0.00157 ** 
## hompop             8356.00     684.18  12.213  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37780 on 1765 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.2709, Adjusted R-squared:  0.2676 
## F-statistic: 81.96 on 8 and 1765 DF,  p-value: < 2.2e-16
  11. Go through each estimate and interpret it, including the intercept, the levels of degree, the levels of race, age, and hompop.
  12. Why is the intercept here not all that useful? How might we make it useful again?
  13. Let’s check out the adjusted means of degree. We can use the emmeans() function from the emmeans package for this.
gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  emmeans::emmeans(specs = "degree")
##  degree       emmean       SE   df lower.CL upper.CL
##  <HS        22473.01 2472.960 1765 17622.77 27323.25
##  Associates 56670.74 3216.628 1765 50361.94 62979.54
##  Bachelors  75065.76 2405.331 1765 70348.16 79783.36
##  Graduate   88714.41 3091.042 1765 82651.92 94776.90
##  HS         46102.56 1644.806 1765 42876.59 49328.54
## 
## Results are averaged over the levels of: race 
## Confidence level used: 0.95
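Conceptually, these adjusted means are model predictions for each level of the factor of interest with the other predictors held fixed (numeric covariates at their means). A base-R sketch of this idea, using a toy mtcars model rather than the gss fit:

```r
# Adjusted (model-based) means by hand: predict each level of the
# factor of interest while holding numeric covariates at their means.
fit_toy <- lm(mpg ~ factor(cyl) + wt + hp, data = mtcars)

grid <- data.frame(cyl = c(4, 6, 8),
                   wt  = mean(mtcars$wt),   # covariates fixed at their
                   hp  = mean(mtcars$hp))   # means, as in emmeans' grid

adj_means <- predict(fit_toy, newdata = grid)
names(adj_means) <- paste0("cyl", grid$cyl)
adj_means
```

With other factors in the model (like race above), emmeans() additionally averages the predictions over those factor levels, which is why its output notes “Results are averaged over the levels of: race”.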
  14. What do these “emmeans” mean? It may help in interpreting them to see the values that emmeans() averaged over; we can view these with emmeans::ref_grid(). Does this help us more fully interpret the emmeans from before?
gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  emmeans::ref_grid()
## 'emmGrid' object with variables:
##     degree = <HS, Associates, Bachelors, Graduate, HS
##     race = Black, Other, White
##     age = 47.137
##     hompop = 2.5271

Conclusion

Multicategorical predictors are common in the social sciences. Understanding how to analyze and interpret them in a regression model is essential to using regression fully.