```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Chapter 5 extends much of what we have already learned to dichotomous variables and a few other statistical principles. The following examples illustrate dummy variables, regression to the mean, and tests of sets of variables. Follow all instructions to complete Chapter 5.
## Dummy Variables
1. Load the `tidyverse` package (you can ignore the startup messages it prints when it loads), the `furniture` package, and the `rlm` package.
```{r}
library(tidyverse)
library(furniture)
library(rlm)
```
2. The `rlm` package contains the GSS data set so we can import it more easily with `data(gss)` once the package is attached.
3. Import it into R.
```{r}
data(gss)
```
4. `R` is very helpful in creating dummy variables since it does all the work for you. All you need to do is let `R` know that the variable is a factor. Let's see if `R` already knows the variable `sex` is a factor.
```{r}
gss %>%
  select(sex) %>%
  as_tibble()
```
5. At the top of the column it says `<fct>`, which tells us `R` already knows that it is a factor; otherwise, we'd use the `factor()` function to tell `R` that it is a factor. Let's now see if there are differences in `income06` between the sexes.
```{r}
gss %>%
  lm(income06 ~ sex,
     data = .) %>%
  summary()
```
6. The coefficient table lists `sexMale`, which means female is the reference category and the estimate compares males to females. With that in mind, what is the interpretation of the estimate? Is it significant?
7. Let's include education and home population into the model as well.
```{r}
gss %>%
  lm(income06 ~ sex + educ + hompop,
     data = .) %>%
  summary()
```
8. Now that we've controlled for education and home population, is there a difference between the sexes? If so, what is the interpretation of the estimate?
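The dummy coding described in this section can be sketched on a small made-up data set. This is only an illustration: `toy`, `grp`, and `y` are hypothetical names and values, not part of the GSS data. It shows how `factor()` declares a variable as categorical, what 0/1 column `lm()` builds from it, and how `relevel()` switches the reference category.

```{r}
# Hypothetical toy data (not the GSS): a character variable and an outcome
toy <- data.frame(
  grp = c("Female", "Male", "Female", "Male", "Female", "Male"),
  y   = c(10, 14, 11, 15, 12, 16)
)

# Tell R the variable is a factor; levels are sorted alphabetically by default,
# so "Female" becomes the reference category
toy$grp <- factor(toy$grp)
levels(toy$grp)

# The dummy (0/1) column that lm() builds behind the scenes
model.matrix(~ grp, data = toy)

# The coefficient on grpMale is the Male mean minus the Female mean
toy_fit <- lm(y ~ grp, data = toy)
coef(toy_fit)

# relevel() changes the reference category; the coefficient flips sign
toy$grp_m <- relevel(toy$grp, ref = "Male")
toy_fit_m <- lm(y ~ grp_m, data = toy)
coef(toy_fit_m)
```

In both fits the intercept is the mean of the reference group, and the dummy coefficient is the difference between the other group's mean and that reference mean.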
## Regression to the Mean
9. The "regression to the mean" phenomenon states that extreme values on one variable tend to be paired with values closer to the mean on another, imperfectly correlated variable. If this is happening, what do we expect at time 2 for an individual who has extremely low anxiety at time 1?
10. Consider the following example: We have two time points--time 1 and time 2. There was an intervention between the time points. We are curious if the individual's age is related to the effect of the intervention. We are most curious about the change scores ($Y_2 - Y_1$) and how age predicts them. How should we specify the regression model?
11. Considering the previous example about the intervention, what is the difference between these two models? Is there a meaningful difference between the estimates on `age`?
```{r, echo = FALSE}
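set.seed(843)
# Simulated data: y1 depends on age, and y2 depends on y1
# (tibble() replaces the deprecated data_frame())
df <- tibble(
  age = runif(100, 10, 65),
  y1  = age + rnorm(100),
  y2  = y1 + rnorm(100)
)
# Model 1: change score as the outcome, controlling for y1
df %>%
  lm(y2 - y1 ~ age + y1, data = .) %>%
  summary()
# Model 2: time 2 score as the outcome, controlling for y1
df %>%
  lm(y2 ~ age + y1, data = .) %>%
  summary()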
```
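Regression to the mean, as described in this section, can be seen in a small simulation. The sketch below is an assumption-laden illustration, not part of the chapter's data: the names `true_score`, `t1`, and `t2` are hypothetical, and each observed score is taken to be a stable true score plus independent noise at each time point.

```{r}
set.seed(1)
n_sim <- 1000

# Assumed setup: a stable true score measured twice with independent noise
true_score <- rnorm(n_sim)
t1 <- true_score + rnorm(n_sim)
t2 <- true_score + rnorm(n_sim)

# Pick the individuals with the lowest 10% of time 1 scores
low <- t1 < quantile(t1, 0.10)

# Compare the group's average at time 1 and at time 2
mean_low_t1 <- mean(t1[low])
mean_low_t2 <- mean(t2[low])
c(time1 = mean_low_t1, time2 = mean_low_t2, overall = mean(t2))

# The slope of t2 on t1 is below 1, which is the same idea in regression form
rtm_slope <- coef(lm(t2 ~ t1))[["t1"]]
rtm_slope
```

The group selected for extreme time 1 scores is, on average, less extreme at time 2 even though nothing happened between the measurements.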
## Multidimensional Sets
12. Let's bring in a set of demographic variables and test their significance against a model that contains only the number of years of education (`educ`). The variables in this demographic set are `sex`, `age`, `race`, and `hompop`. To test the set, we save both model objects and use `anova()` to compare the two models (which gives us the significance of the set of variables that we added to the model).
```{r}
fit1 <- gss %>%
  lm(income06 ~ educ,
     data = .)
fit2 <- gss %>%
  lm(income06 ~ educ + sex + age + race + hompop,
     data = .)
anova(fit1, fit2)
```
13. Does this demographic set significantly predict income?
14. Why might we want to test a set of variables instead of each one individually?
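To make the set test in this section more concrete, here is a minimal sketch on simulated data showing what `anova()` computes when it compares two nested models. The data frame `sim` and its columns `x`, `z1`, `z2`, and `y` are hypothetical and stand in for a baseline predictor plus an added set.

```{r}
set.seed(2)
n_obs <- 200

# Hypothetical data: x is the baseline predictor, z1 and z2 form the added set
sim <- data.frame(
  x  = rnorm(n_obs),
  z1 = rnorm(n_obs),
  z2 = factor(sample(c("a", "b", "c"), n_obs, replace = TRUE))
)
sim$y <- 1 + 0.5 * sim$x + 0.4 * sim$z1 + rnorm(n_obs)

# Nested models: the larger one adds the whole set at once
small <- lm(y ~ x, data = sim)
big   <- lm(y ~ x + z1 + z2, data = sim)
cmp   <- anova(small, big)
cmp

# The same F statistic by hand, from the residual sums of squares
rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
df_added  <- df.residual(small) - df.residual(big)  # z1 plus two dummies for z2
f_by_hand <- ((rss_small - rss_big) / df_added) / (rss_big / df.residual(big))
all.equal(f_by_hand, cmp$F[2])
```

Note that `anova()` requires both models to be fit to the same observations. If the added variables contain missing values that the baseline variable does not, one option is to drop incomplete rows before fitting both models.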
## Conclusion
Regression can handle categorical and continuous variables simultaneously. It also naturally handles the phenomenon known as "regression to the mean".