Chapter 9 covers multicategorical predictors: how we can leverage the information they contain and how we can avoid common problems with these types of variables.
For this chapter, we'll use the tidyverse package, the furniture package, the car package, and the educ7610 package.

library(car)
library(tidyverse)
library(furniture)
library(educ7610)
First, let's load the gss data into R.

data(gss)
We'll work with two multicategorical variables: degree (degree) and race (race). Let's look at the distribution of both of these variables using furniture::tableF().

gss %>%
  furniture::tableF(degree)
##
## ─────────────────────────────────────────────────────────
## degree Freq CumFreq Percent CumPerc Valid CumValid
## <HS 297 297 14.68% 14.68% 14.69% 14.69%
## Associates 173 470 8.55% 23.23% 8.56% 23.24%
## Bachelors 355 825 17.55% 40.78% 17.56% 40.80%
## Graduate 194 1019 9.59% 50.37% 9.59% 50.40%
## HS 1003 2022 49.58% 99.95% 49.60% 100.00%
## Missing 1 2023 0.05% 100.00%
## ─────────────────────────────────────────────────────────
gss %>%
furniture::tableF(race)
##
## ────────────────────────────────────
## race Freq CumFreq Percent CumPerc
## Black 281 281 13.89% 13.89%
## Other 183 464 9.05% 22.94%
## White 1559 2023 77.06% 100.00%
## ────────────────────────────────────
Let's fit a regression with degree predicting income (income06) and another with race predicting income. The lm() function automatically does dummy coding for us, selecting a reference group for each variable. What is the reference group for degree? What about race?

gss %>%
  lm(income06 ~ degree, data = .) %>%
  summary()
##
## Call:
## lm(formula = income06 ~ degree, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91660 -26463 -8713 20072 133037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26963 2496 10.801 <2e-16 ***
## degreeAssociates 34661 4050 8.558 <2e-16 ***
## degreeBachelors 52965 3347 15.823 <2e-16 ***
## degreeGraduate 66697 3917 17.026 <2e-16 ***
## degreeHS 23562 2834 8.314 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39710 on 1769 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.193, Adjusted R-squared: 0.1911
## F-statistic: 105.7 on 4 and 1769 DF, p-value: < 2.2e-16
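To see what lm()'s automatic dummy coding is doing under the hood, here is a self-contained sketch on a toy factor (the group names and data here are made up for illustration, not from the gss data): by default, the first level in sort order becomes the reference group, the intercept estimates that group's mean, and each dummy coefficient estimates a group's difference from the reference.

```r
# Toy factor: by default, the first level in sort order is the reference.
deg <- factor(c("a", "b", "b", "c", "a", "c"))
contrasts(deg)  # reference level "a" gets a row of zeros

# Consequence for the coefficients: the intercept is the reference group's
# mean, and each dummy coefficient is that group's mean minus it.
set.seed(3)
d <- data.frame(g = factor(rep(c("a", "b"), each = 10)), y = rnorm(20))
fit_toy <- lm(y ~ g, data = d)
all.equal(unname(coef(fit_toy)["(Intercept)"]), mean(d$y[d$g == "a"]))  # TRUE
all.equal(unname(coef(fit_toy)["gb"]),
          mean(d$y[d$g == "b"]) - mean(d$y[d$g == "a"]))                # TRUE
```

This is why the intercept above is the mean income of the <HS group and degreeHS is the HS group's mean income minus the <HS group's mean income.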
gss %>%
lm(income06 ~ race, data = .) %>%
summary()
##
## Call:
## lm(formula = income06 ~ race, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61161 -34161 -6661 20839 122668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37332 2748 13.587 < 2e-16 ***
## raceOther 17020 4410 3.860 0.000118 ***
## raceWhite 24329 2988 8.144 7.16e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43360 on 1771 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.0366, Adjusted R-squared: 0.03552
## F-statistic: 33.64 on 2 and 1771 DF, p-value: 4.565e-15
fit <- gss %>%
lm(income06 ~ degree + race, data = .)
summary(fit)
##
## Call:
## lm(formula = income06 ~ degree + race, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92873 -26329 -8415 18738 129171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14561 3294 4.420 1.05e-05 ***
## degreeAssociates 33412 4032 8.288 2.26e-16 ***
## degreeBachelors 50433 3351 15.052 < 2e-16 ***
## degreeGraduate 64045 3915 16.360 < 2e-16 ***
## degreeHS 22586 2830 7.980 2.61e-15 ***
## raceOther 13125 4025 3.261 0.00113 **
## raceWhite 16267 2743 5.931 3.61e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39340 on 1767 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.2087, Adjusted R-squared: 0.206
## F-statistic: 77.68 on 6 and 1767 DF, p-value: < 2.2e-16
To change the reference group, we can use the relevel() function within the model formula. Here, we make White the reference group for race.

gss %>%
  lm(income06 ~ degree + relevel(race, ref = "White"), data = .) %>%
  summary()
##
## Call:
## lm(formula = income06 ~ degree + relevel(race, ref = "White"),
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92873 -26329 -8415 18738 129171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30829 2612 11.802 < 2e-16 ***
## degreeAssociates 33412 4032 8.288 2.26e-16 ***
## degreeBachelors 50433 3351 15.052 < 2e-16 ***
## degreeGraduate 64045 3915 16.360 < 2e-16 ***
## degreeHS 22586 2830 7.980 2.61e-15 ***
## relevel(race, ref = "White")Black -16267 2743 -5.931 3.61e-09 ***
## relevel(race, ref = "White")Other -3143 3331 -0.943 0.346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39340 on 1767 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.2087, Adjusted R-squared: 0.206
## F-statistic: 77.68 on 6 and 1767 DF, p-value: < 2.2e-16
What if we want to compare two groups when neither is the reference group? We can use the car package's linearHypothesis() function to test these comparisons, using the coefficient names from fit as shown below. In this case, let's compare Bachelors and Graduate degree incomes. Are they significantly different?

car::linearHypothesis(fit, "degreeBachelors = degreeGraduate")
## Linear hypothesis test
##
## Hypothesis:
## degreeBachelors - degreeGraduate = 0
##
## Model 1: restricted model
## Model 2: income06 ~ degree + race
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1768 2.7553e+12
## 2 1767 2.7345e+12 1 2.0733e+10 13.397 0.0002594 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::linearHypothesis(fit, "raceWhite = raceOther")
## Linear hypothesis test
##
## Hypothesis:
## - raceOther + raceWhite = 0
##
## Model 1: restricted model
## Model 2: income06 ~ degree + race
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1768 2.7359e+12
## 2 1767 2.7345e+12 1 1377459213 0.8901 0.3456
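Under the hood, linearHypothesis() compares a restricted model, in which the hypothesized equality is forced to hold, against the full model with an F-test. A self-contained sketch of that logic on made-up data (the variable names here are illustrative, not from the gss data), using base R's anova():

```r
# Toy data with a three-level factor.
set.seed(1)
d <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 20)),
  y = rnorm(60)
)
full <- lm(y ~ g, data = d)

# Forcing groups "b" and "c" to share one coefficient is the restricted
# model implied by the hypothesis gb = gc:
restricted <- lm(y ~ I(g != "a"), data = d)

# The F-test comparing the two models matches
# car::linearHypothesis(full, "gb = gc"):
anova(restricted, full)
```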
We can also adjust for covariates, such as age (age) and home population (hompop).

gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  summary()
##
## Call:
## lm(formula = income06 ~ degree + race + age + hompop, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85351 -25713 -7068 17518 136581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15761.16 4828.44 -3.264 0.00112 **
## degreeAssociates 34197.72 3876.13 8.823 < 2e-16 ***
## degreeBachelors 52592.75 3230.16 16.282 < 2e-16 ***
## degreeGraduate 66241.39 3764.17 17.598 < 2e-16 ***
## degreeHS 23629.55 2726.91 8.665 < 2e-16 ***
## raceOther 9801.61 3880.86 2.526 0.01164 *
## raceWhite 16015.50 2653.17 6.036 1.92e-09 ***
## age 180.59 57.04 3.166 0.00157 **
## hompop 8356.00 684.18 12.213 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37780 on 1765 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.2709, Adjusted R-squared: 0.2676
## F-statistic: 81.96 on 8 and 1765 DF, p-value: < 2.2e-16
With covariates in the model, we can estimate the adjusted means of income for each degree group. We'll use the emmeans package's emmeans() function for this.

gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  emmeans::emmeans(specs = "degree")
## degree emmean SE df lower.CL upper.CL
## <HS 22473.01 2472.960 1765 17622.77 27323.25
## Associates 56670.74 3216.628 1765 50361.94 62979.54
## Bachelors 75065.76 2405.331 1765 70348.16 79783.36
## Graduate 88714.41 3091.042 1765 82651.92 94776.90
## HS 46102.56 1644.806 1765 42876.59 49328.54
##
## Results are averaged over the levels of: race
## Confidence level used: 0.95
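An estimated marginal mean is just the average of model predictions over a reference grid: every level of the other factors, with numeric covariates held at their means. A self-contained sketch on made-up data (base R only, so it runs without the emmeans package; the variable names are illustrative):

```r
# Toy data: a focal factor g, another factor z, and a covariate x.
set.seed(2)
d <- data.frame(
  g = factor(rep(c("a", "b"), each = 25)),
  z = factor(rep(c("u", "v"), times = 25)),
  x = rnorm(50),
  y = rnorm(50)
)
fit_toy <- lm(y ~ g + z + x, data = d)

# The marginal mean for g = "a": predict at every level of z with x held
# at its mean, then average the predictions.
grid <- expand.grid(g = "a", z = c("u", "v"), x = mean(d$x))
mean(predict(fit_toy, newdata = grid))
```

With emmeans installed, emmeans::emmeans(fit_toy, specs = "g") reports this same value for level "a", since emmeans weights the levels of z equally by default.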
To see which values emmeans() averaged over, we can use emmeans::ref_grid(). Does this help us more fully interpret the emmeans from before?

gss %>%
  lm(income06 ~ degree + race + age + hompop, data = .) %>%
  emmeans::ref_grid()
## 'emmGrid' object with variables:
## degree = <HS, Associates, Bachelors, Graduate, HS
## race = Black, Other, White
## age = 47.137
## hompop = 2.5271
Multicategorical predictors are common in the social sciences. Understanding how to analyze and interpret them in a regression model is essential to using regression fully.