Introduction

Chapter 5 extends much of what we have already learned to dichotomous variables and introduces a few other statistical principles. The following examples illustrate several of the ideas discussed there. Follow all instructions to complete Chapter 5.

Dummy Variables

  1. Load the tidyverse package (you can ignore the startup messages it prints when you load it), the furniture package, and the educ7610 package.
library(tidyverse)
library(furniture)
library(educ7610)
  2. The educ7610 package contains the GSS data set, so once the package is attached we can import it easily with data(gss).
  3. Import it into R.
data(gss)
  4. R is very helpful in creating dummy variables since it does all the work for you. All you need to do is let R know that the variable is a factor. Let’s see if R already knows the variable sex is a factor.
gss %>%
  select(sex) %>%
  as_tibble()
## # A tibble: 2,023 x 1
##    sex   
##    <fct> 
##  1 Male  
##  2 Male  
##  3 Male  
##  4 Male  
##  5 Female
##  6 Male  
##  7 Female
##  8 Female
##  9 Female
## 10 Female
## # … with 2,013 more rows
  5. At the top of the column it says <fct>, which tells us R already knows that it is a factor; otherwise, we’d use the factor() function to tell R that it is a factor. Let’s now see if there are differences in income06 across the sexes.
gss %>%
  lm(income06 ~ sex,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ sex, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61424 -34424  -8764  20576 106236 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    53764       1434  37.506  < 2e-16 ***
## sexMale         8160       2092   3.901 9.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43970 on 1772 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.008513,   Adjusted R-squared:  0.007954 
## F-statistic: 15.22 on 1 and 1772 DF,  p-value: 9.951e-05
  6. The coefficient output says sexMale, meaning Female is the reference category and the estimate compares males to that reference. With that in mind, what is the interpretation of the estimate? Is it significant?
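Because the reference category is simply the first level of the factor, we can inspect it and, if we like, change it. A minimal sketch, assuming we wanted Male as the reference instead:
levels(gss$sex)   # the first level is the reference category

gss %>%
  mutate(sex = relevel(sex, ref = "Male")) %>%  # make Male the reference
  lm(income06 ~ sex,
     data = .) %>%
  summary()
(If sex had arrived as plain text rather than a factor, mutate(sex = factor(sex)) would convert it first.)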
  7. Let’s also include education (educ) and household size (hompop) in the model.
gss %>%
  lm(income06 ~ sex + educ + hompop,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = income06 ~ sex + educ + hompop, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -384792  -27404   -8231   20773  124269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23036.1     4239.6  -5.434 6.29e-08 ***
## sexMale       9005.1     1906.7   4.723 2.51e-06 ***
## educ          4294.5      261.3  16.436  < 2e-16 ***
## hompop        7233.6      675.7  10.706  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40060 on 1770 degrees of freedom
##   (249 observations deleted due to missingness)
## Multiple R-squared:  0.1783, Adjusted R-squared:  0.1769 
## F-statistic:   128 on 3 and 1770 DF,  p-value: < 2.2e-16
  8. Now that we’ve controlled for education and household size, is there a difference between the sexes? If so, what is the interpretation of the estimate?
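To help with that interpretation, a confidence interval around the adjusted sex difference is often more informative than the p-value alone. A minimal sketch that refits the model above and applies base R’s confint() (the object name fit is just for illustration):
fit <- gss %>%
  lm(income06 ~ sex + educ + hompop,
     data = .)
confint(fit)   # 95% intervals for every coefficient, including sexMale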

Regression to the Mean

  1. The “regression to the mean” phenomenon describes how extreme values on one variable tend to be paired with less extreme values, closer to the mean, on another imperfectly correlated variable. If this is happening, what do we expect at time 2 for an individual that has extremely low anxiety at time 1?
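A small simulated sketch of the phenomenon (the data are entirely made up, and there is no intervention here): with two scores correlated at about 0.5, extremely low time-1 scores are followed, on average, by time-2 scores about half as far from the mean.
set.seed(7)
n  <- 10000
t1 <- rnorm(n)                              # anxiety at time 1
t2 <- 0.5 * t1 + rnorm(n, sd = sqrt(0.75))  # time 2, cor(t1, t2) ≈ 0.5

mean(t1[t1 < -2])   # extreme low group at time 1
mean(t2[t1 < -2])   # noticeably closer to the overall mean of 0 at time 2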
  2. Consider the following example: we have two time points, time 1 and time 2, with an intervention between them. We are curious whether the individual’s age is related to the effect of the intervention; that is, we care most about the change scores (\(Y_2 - Y_1\)) and how age predicts them. How should we specify the regression model?
  3. Considering the previous example about the intervention, what is the difference between these two models? Is there a meaningful difference between the estimates for age?
## 
## Call:
## lm(formula = y2 - y1 ~ age + y1, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0257 -0.7399  0.0253  0.8320  1.8836 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34028    0.24640  -1.381    0.170
## age         -0.06133    0.11379  -0.539    0.591
## y1           0.07028    0.11429   0.615    0.540
## 
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared:  0.02371,    Adjusted R-squared:  0.003576 
## F-statistic: 1.178 on 2 and 97 DF,  p-value: 0.3124
## 
## Call:
## lm(formula = y2 ~ age + y1, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0257 -0.7399  0.0253  0.8320  1.8836 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.34028    0.24640  -1.381    0.170    
## age         -0.06133    0.11379  -0.539    0.591    
## y1           1.07028    0.11429   9.364 3.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared:  0.9965, Adjusted R-squared:  0.9964 
## F-statistic: 1.368e+04 on 2 and 97 DF,  p-value: < 2.2e-16
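Notice that the two outputs above are identical except in the y1 row: rewriting \(Y_2 - Y_1 = b_0 + b_1 Age + b_2 Y_1 + e\) as \(Y_2 = b_0 + b_1 Age + (b_2 + 1) Y_1 + e\) shows they are the same model, so the y1 estimate simply shifts up by exactly 1 while every other number stays the same. A quick check with simulated data (made up purely for illustration):
set.seed(42)
n   <- 100
age <- rnorm(n)
y1  <- rnorm(n)
y2  <- y1 + rnorm(n)   # time 2 builds on time 1

coef(lm(y2 - y1 ~ age + y1))  # change-score model
coef(lm(y2 ~ age + y1))       # y1 estimate is larger by exactly 1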

Multidimensional Sets

  1. Let’s bring in a set of demographic variables and test its significance against a model containing only the number of years of education (educ). The demographic set includes sex, age, race, and hompop. To test the set, we save both model objects and compare them with anova(), which tests for differences between the models (and therefore gives us the significance of the set of variables we added).
fit1 <- gss %>%
  lm(income06 ~ educ,
     data = .)
fit2 <- gss %>%
  lm(income06 ~ educ + sex + age + race + hompop,
     data = .)
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: income06 ~ educ
## Model 2: income06 ~ educ + sex + age + race + hompop
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   1772 3.0542e+12                                   
## 2   1767 2.7295e+12  5 3.2479e+11 42.053 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  2. Does this demographic set significantly predict income?
  3. Why might we want to test a set of variables instead of each one individually?
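For intuition, the F statistic in the table above can be reproduced by hand from the two residual sums of squares. A minimal sketch using the fit1 and fit2 objects saved earlier:
rss1 <- sum(residuals(fit1)^2)                   # RSS of the reduced model
rss2 <- sum(residuals(fit2)^2)                   # RSS of the full model
df_set <- fit1$df.residual - fit2$df.residual    # df of the added set (5)

((rss1 - rss2) / df_set) / (rss2 / fit2$df.residual)  # matches anova()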

Conclusion

Regression can handle categorical and continuous variables simultaneously. It also naturally handles the phenomenon known as “regression to the mean” when the baseline score is included as a covariate.