Chapter 5 extends much of what we have already learned to dichotomous variables and a few other statistical principles. The following examples illustrate a number of the ideas discussed. Follow all instructions to complete Chapter 5.
For this chapter we will use the tidyverse package (you can ignore the notes it prints once you load it), the furniture package, and the educ7610 package.

library(tidyverse)
library(furniture)
library(educ7610)
The educ7610 package contains the GSS data set, so we can import it easily with data(gss) once the package is attached.

data(gss)
R is very helpful in creating dummy variables since it does all the work for you. All you need to do is let R know that the variable is a factor. Let's see if R already knows that the variable sex is a factor.

gss %>%
  select(sex) %>%
  as_tibble()
## # A tibble: 2,023 x 1
## sex
## <fct>
## 1 Male
## 2 Male
## 3 Male
## 4 Male
## 5 Female
## 6 Male
## 7 Female
## 8 Female
## 9 Female
## 10 Female
## # … with 2,013 more rows
We see <fct> in the output, which tells us R already knows that it is a factor; otherwise we'd use the factor() function to tell R that it is one. Let's now see if there are differences in income06 across the sexes.

gss %>%
  lm(income06 ~ sex,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = income06 ~ sex, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61424 -34424 -8764 20576 106236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53764 1434 37.506 < 2e-16 ***
## sexMale 8160 2092 3.901 9.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43970 on 1772 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.008513, Adjusted R-squared: 0.007954
## F-statistic: 15.22 on 1 and 1772 DF, p-value: 9.951e-05
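The sexMale term in the output above is a dummy variable that R created automatically from the factor. A toy sketch of what happens behind the scenes (the vector here is made up, not the gss data):

```r
# Toy sketch (assumed data, not from gss): how R expands a factor
# into dummy variables behind the scenes
sex <- factor(c("Male", "Male", "Female", "Female"))
levels(sex)          # "Female" "Male" -- alphabetical, so Female is the reference
model.matrix(~ sex)  # an intercept column plus a single sexMale dummy (1 = Male)
```

Because the levels sort alphabetically, "Female" comes first and becomes the reference category, which is why the coefficient is labeled sexMale.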
The coefficient is labeled sexMale, meaning female is the reference category and the estimate compares males to females. With that in mind, what is the interpretation of the estimate? Is it significant?

Next, let's add educ and hompop to the model.

gss %>%
  lm(income06 ~ sex + educ + hompop,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = income06 ~ sex + educ + hompop, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -384792 -27404 -8231 20773 124269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23036.1 4239.6 -5.434 6.29e-08 ***
## sexMale 9005.1 1906.7 4.723 2.51e-06 ***
## educ 4294.5 261.3 16.436 < 2e-16 ***
## hompop 7233.6 675.7 10.706 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40060 on 1770 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.1783, Adjusted R-squared: 0.1769
## F-statistic: 128 on 3 and 1770 DF, p-value: < 2.2e-16
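In both models female is the reference category because the factor levels sort alphabetically. If we wanted male as the reference instead, relevel() handles it; a toy sketch (hypothetical vector, not the gss data):

```r
# Toy sketch: changing the reference category with relevel()
# (hypothetical vector, not the gss data)
sex <- factor(c("Male", "Female", "Female", "Male"))
sex2 <- relevel(sex, ref = "Male")
levels(sex2)  # "Male" "Female" -- Male is now the reference category
```

Refitting with the releveled factor would flip the sign of the dummy coefficient without changing the model fit.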
What about change over time? Suppose we have a baseline score y1 and a follow-up score y2, and we want to know whether change depends on age. The first model below regresses the change score (y2 - y1) on age and the baseline y1; the second regresses the follow-up score y2 on the same predictors. Compare the residuals and the coefficients across the two outputs.

##
## Call:
## lm(formula = y2 - y1 ~ age + y1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0257 -0.7399 0.0253 0.8320 1.8836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34028 0.24640 -1.381 0.170
## age -0.06133 0.11379 -0.539 0.591
## y1 0.07028 0.11429 0.615 0.540
##
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared: 0.02371, Adjusted R-squared: 0.003576
## F-statistic: 1.178 on 2 and 97 DF, p-value: 0.3124
##
## Call:
## lm(formula = y2 ~ age + y1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0257 -0.7399 0.0253 0.8320 1.8836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34028 0.24640 -1.381 0.170
## age -0.06133 0.11379 -0.539 0.591
## y1 1.07028 0.11429 9.364 3.16e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared: 0.9965, Adjusted R-squared: 0.9964
## F-statistic: 1.368e+04 on 2 and 97 DF, p-value: < 2.2e-16
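The two outputs above are re-parameterizations of the same model: the age coefficients and the residuals are identical, and the y1 slopes differ by exactly 1 (0.07028 vs. 1.07028). This equivalence can be checked on simulated data (a sketch; y1, y2, and age here are made up, not the data behind the outputs above):

```r
# Simulated sketch: y1, y2, and age are made up, not the data
# behind the outputs above
set.seed(42)
n   <- 100
age <- rnorm(n)
y1  <- rnorm(n)
y2  <- y1 + rnorm(n)  # follow-up = baseline + noise

fit_change <- lm(y2 - y1 ~ age + y1)  # change-score model
fit_level  <- lm(y2 ~ age + y1)       # follow-up-score model

# Identical age coefficients and residuals; y1 slopes differ by exactly 1
coef(fit_change)[["age"]] - coef(fit_level)[["age"]]  # 0
coef(fit_level)[["y1"]] - coef(fit_change)[["y1"]]    # 1
all.equal(resid(fit_change), resid(fit_level))        # TRUE
```

Only the y1 slope shifts, because subtracting y1 from the outcome simply subtracts 1 from its coefficient while leaving everything else untouched.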
We can also test a set of variables over and above education (educ). The variables of this demographic set include sex, age, race, and hompop. To test the set, we save both model objects and use anova() to test for differences among the models (which gives us the significance of the set of variables that we added to the model).

fit1 <- gss %>%
  lm(income06 ~ educ,
     data = .)
fit2 <- gss %>%
  lm(income06 ~ educ + sex + age + race + hompop,
     data = .)
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: income06 ~ educ
## Model 2: income06 ~ educ + sex + age + race + hompop
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1772 3.0542e+12
## 2 1767 2.7295e+12 5 3.2479e+11 42.053 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
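The F statistic for the set can be recomputed by hand from the printed table (the printed values are rounded, so the result matches only approximately). Note the 5 numerator degrees of freedom come from the five coefficients added; race, being a multi-category factor, contributes more than one dummy.

```r
# Recompute the set-test F from the printed ANOVA table
rss_reduced <- 3.0542e12  # RSS, Model 1 (educ only)
rss_full    <- 2.7295e12  # RSS, Model 2 (educ + demographic set)
df_set      <- 5          # coefficients added by the set
df_resid    <- 1767       # residual df of the full model

f_set <- ((rss_reduced - rss_full) / df_set) / (rss_full / df_resid)
f_set  # about 42, matching the table up to rounding
```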
Regression can handle categorical and continuous variables simultaneously. It also naturally handles the phenomenon known as "regression to the mean".