```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Chapter 9 talks about multicategorical predictors, how we can leverage the information contained therein, and ways we can avoid common problems with these types of variables.
## Multicategorical Variables
1. Let's start by loading the `tidyverse` package, the `furniture` package, the `car` package, and the `rlm` package.
```{r}
library(car)
library(tidyverse)
library(furniture)
library(rlm)
```
2. Import `gss` into R.
```{r, eval = FALSE}
data(gss)
```
3. There are two multi-categorical variables we want to investigate: educational degree (`degree`) and race (`race`). Let's look at the distribution of both of these variables using `furniture::tableF()`.
```{r}
gss %>%
furniture::tableF(degree)
gss %>%
furniture::tableF(race)
```
## Multi-Categorical Predictors
4. Let's do two simple regressions; one with degree predicting income (`income06`) and another with race predicting income. The `lm()` function automatically does dummy coding for us, selecting a reference group for you. What is the reference group for degree? What about race?
```{r}
gss %>%
lm(income06 ~ degree, data = .) %>%
summary()
gss %>%
lm(income06 ~ race, data = .) %>%
summary()
```
5. Are there significant mean differences between the non-reference levels and the reference level regarding income for either degree or race? What do the estimates mean here?
6. Let's do a multiple regression using both degree and race and assign that to "fit".
```{r}
fit <- gss %>%
lm(income06 ~ degree + race, data = .)
summary(fit)
```
## Changing the Reference Level
7. It is often good to control the reference level for optimal interpretability. For example, here, we may want the "White" race to be the reference group since it is the most common in the data set. To do so, we will use
```{r}
gss %>%
lm(income06 ~ degree + relevel(race, ref = "White"), data = .) %>%
summary()
```
## Contrasts and Adjusted Means
#### Simple Contrasts (Comparing Two Levels)
8. We are also interested in comparing the non-reference levels as well. We can use the `car` package with the `linearHypothesis()` function to test these. To make these comparisons, we use the coefficient names from `fit` as shown below. In this case, let's compare Bachelors and Graduate degree incomes. Are they significantly different?
```{r}
car::linearHypothesis(fit, "degreeBachelors = degreeGraduate")
```
9. What about the "White" race and the "Other" race? Was this comparison any different than when we compared them earlier with the "White" as the reference group (see question 8)?
```{r}
car::linearHypothesis(fit, "raceWhite = raceOther")
```
#### Adjusted Means
10. To finish, let's create a model that include both degree and race in addition to age (`age`) and home population (`hompop`).
```{r}
gss %>%
lm(income06 ~ degree + race + age + hompop, data = .) %>%
summary()
```
11. Go through each estimate and interpret it, including the intercept, the levels of degree, the levels of race, age and hompop.
12. Why is the intercept here not all that useful? How might we make it useful again?
13. Let's check out the adjusted means of degree. We can use the `emmeans` function for this.
```{r}
gss %>%
lm(income06 ~ degree + race + age + hompop, data = .) %>%
emmeans::emmeans(specs = "degree")
```
14. What do these "emmeans" mean? It may help in the interpretation to see the values that `emmeans()` averaged over, we can use `emmeans::ref_grid()`. Does this help us more fully interpret the emmeans from before?
```{r}
gss %>%
lm(income06 ~ degree + race + age + hompop, data = .) %>%
emmeans::ref_grid()
```
## Conclusion
Multicategorical predictors are common in the social sciences. Understanding how to analyze and interpret them in a regression model is essential to using regression fully.