EDUC/PSY 7610

Chapter 5 extends much of what we have already learned to dichotomous variables and a few other statistical principles. The following examples help illustrate a number of items that were discussed. Follow all instructions to complete Chapter 5.

- Load the
`tidyverse`

package (you can ignore the notes that you see below that it gives you once you load it), the`furniture`

package, and the`rlm`

package.

`library(tidyverse)`

`## ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──`

```
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
```

`## Warning: package 'dplyr' was built under R version 3.5.1`

```
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

`library(furniture)`

```
## ── furniture 1.7.12 ────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
## Consider using `furniture::` for each function call.
```

`library(rlm)`

```
## ── rlm 0.1.0 ───────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ rlm attached
## ✔ No potential conflicts found
```

- The
`rlm`

package contains the GSS data set so we can import it more easily with`data(gss)`

once the package is attached. - Import it into R.

`data(gss)`

`R`

is very helpful in creating dummy variables since it does all the work for you. All you need to do is let`R`

know that the variable is a factor. Let’s see if`R`

already knows the variable`sex`

is a factor.

```
gss %>%
select(sex) %>%
as_tibble()
```

```
## # A tibble: 2,023 x 1
## sex
## <fct>
## 1 Male
## 2 Male
## 3 Male
## 4 Male
## 5 Female
## 6 Male
## 7 Female
## 8 Female
## 9 Female
## 10 Female
## # ... with 2,013 more rows
```

- At the top of the column it says says
`<fct>`

which tells us`R`

already knows that it is a factor, otherwise we’d use the`factor()`

function to tell`R`

that it was a factor. Let’s now see if there are differences in`income06`

across the sexes.

```
gss %>%
lm(income06 ~ sex,
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = income06 ~ sex, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61424 -34424 -8764 20576 106236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53764 1434 37.506 < 2e-16 ***
## sexMale 8160 2092 3.901 9.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43970 on 1772 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.008513, Adjusted R-squared: 0.007954
## F-statistic: 15.22 on 1 and 1772 DF, p-value: 9.951e-05
```

- The output for the coefficients says
`sexMale`

meaning female is the reference category and this is in comparison to that. With that in mind, what is the interpretation of the estimate? Is it significant? - Let’s include education and home population into the model as well.

```
gss %>%
lm(income06 ~ sex + educ + hompop,
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = income06 ~ sex + educ + hompop, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -384792 -27404 -8231 20773 124269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23036.1 4239.6 -5.434 6.29e-08 ***
## sexMale 9005.1 1906.7 4.723 2.51e-06 ***
## educ 4294.5 261.3 16.436 < 2e-16 ***
## hompop 7233.6 675.7 10.706 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40060 on 1770 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.1783, Adjusted R-squared: 0.1769
## F-statistic: 128 on 3 and 1770 DF, p-value: < 2.2e-16
```

- Now that we’ve controlled for education and home population, is there a difference among the sexes? If so, what is the interpretation of the estimate?

- The “regression to the mean” phenomenon states that extreme values of one variable are more associated with values closer to the mean on another variable. If this is happening, what do we expect at time 2 for an individual that has extremely low anxiety at time 1?
- Consider the following example: We have two time points–time 1 and time 2. There was an intervention between the time points. We are curious if the individual’s age is related to the effect of the intervention. We are most curious about the change scores (\(Y_2 - Y_1\)) and how age predicts them. How should we specify the regression model?
- Considering the previous example about the intervention, what is the difference between these two models? Is there a meaningful difference between the estimates on
`age`

?

```
##
## Call:
## lm(formula = y2 - y1 ~ age + y1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0257 -0.7399 0.0253 0.8320 1.8836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34028 0.24640 -1.381 0.170
## age -0.06133 0.11379 -0.539 0.591
## y1 0.07028 0.11429 0.615 0.540
##
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared: 0.02371, Adjusted R-squared: 0.003576
## F-statistic: 1.178 on 2 and 97 DF, p-value: 0.3124
```

```
##
## Call:
## lm(formula = y2 ~ age + y1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0257 -0.7399 0.0253 0.8320 1.8836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34028 0.24640 -1.381 0.170
## age -0.06133 0.11379 -0.539 0.591
## y1 1.07028 0.11429 9.364 3.16e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared: 0.9965, Adjusted R-squared: 0.9964
## F-statistic: 1.368e+04 on 2 and 97 DF, p-value: < 2.2e-16
```

- Let’s bring in a set of demographic variables and test their significance versus a model with the number of years of education (
`educ`

). The variables of this demographic set include`sex`

,`age`

,`race`

, and`hompop`

. To test for the set, we save both model objects and use`anova()`

to test for differences among the models (which gives us the significance of the set of variables that we added to the model).

```
fit1 <- gss %>%
lm(income06 ~ educ,
data = .)
fit2 <- gss %>%
lm(income06 ~ educ + sex + age + race + hompop,
data = .)
anova(fit1, fit2)
```

```
## Analysis of Variance Table
##
## Model 1: income06 ~ educ
## Model 2: income06 ~ educ + sex + age + race + hompop
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1772 3.0542e+12
## 2 1767 2.7295e+12 5 3.2479e+11 42.053 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- Does this demographic set significantly predict income?
- Why might we want to test a set of variables instead of each one individually?

Regression can handle categorical and continuous variables, simultaneously. It also naturally handles the phonemonon known as “regression to the mean”.