Chapter 4 introduces statistical inference with regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 4.
GSS_reduced_example.csv
data set. Turns out that it is already saved for you in an R object in the educ7610
package. We can install this package (since it is on GitHub and not CRAN) with some wizardry.# Run if you haven't installed remotes:
# install.packages("remotes")
remotes::install_github("tysonstanley/educ7610")
tidyverse
package (you can ignore the notes that you see below that it gives you once you load it) and the furniture
package.library(tidyverse)
library(furniture)
library(educ7610)
data(gss)
educ
) on income (income06
). Using summary()
obtain the F-statistic of the model with it’s accompanying p-value, as well as the standard error, t-value, and p-value of the estimate itself.gss %>%
lm(income06 ~ educ,
data = .) %>%
summary()
##
## Call:
## lm(formula = income06 ~ educ, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -378667 -30013 -10390 20363 121117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1741.9 3789.5 0.46 0.646
## educ 4126.8 270.4 15.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41520 on 1772 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.1162, Adjusted R-squared: 0.1157
## F-statistic: 233 on 1 and 1772 DF, p-value: < 2.2e-16
educ
) on income (income06
) while controlling for home population (hompop
) and age (age
).gss %>%
lm(income06 ~ educ + hompop + age,
data = .) %>%
summary()
##
## Call:
## lm(formula = income06 ~ educ + hompop + age, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398142 -27672 -8912 19800 121471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33491.96 5460.46 -6.134 1.06e-09 ***
## educ 4319.85 261.73 16.505 < 2e-16 ***
## hompop 8222.59 724.41 11.351 < 2e-16 ***
## age 251.22 59.52 4.221 2.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40110 on 1770 degrees of freedom
## (249 observations deleted due to missingness)
## Multiple R-squared: 0.1762, Adjusted R-squared: 0.1748
## F-statistic: 126.2 on 3 and 1770 DF, p-value: < 2.2e-16
scale()
. We first have to grab just the complete cases of the variables first.gss %>%
filter(complete.cases(income06, educ, hompop, age)) %>%
mutate(incomeZ = scale(income06) %>% as.numeric,
educZ = scale(educ) %>% as.numeric,
hompopZ = scale(hompop) %>% as.numeric,
ageZ = scale(age) %>% as.numeric) %>%
lm(incomeZ ~ educZ + hompopZ + ageZ,
data = .) %>%
summary()
##
## Call:
## lm(formula = incomeZ ~ educZ + hompopZ + ageZ, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0181 -0.6268 -0.2019 0.4485 2.7514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.196e-16 2.157e-02 0.000 1
## educZ 3.568e-01 2.162e-02 16.505 < 2e-16 ***
## hompopZ 2.628e-01 2.315e-02 11.351 < 2e-16 ***
## ageZ 9.756e-02 2.311e-02 4.221 2.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9084 on 1770 degrees of freedom
## Multiple R-squared: 0.1762, Adjusted R-squared: 0.1748
## F-statistic: 126.2 on 3 and 1770 DF, p-value: < 2.2e-16
educ
to the null model (a model with just an intercept). We’ll use anova()
to compare the two models. Does the result match what we saw earlier?fit_null <- gss %>%
lm(income06 ~ 1,
data = .)
fit_educ <- gss %>%
lm(income06 ~ educ,
data = .)
anova(fit_null, fit_educ)
## Analysis of Variance Table
##
## Model 1: income06 ~ 1
## Model 2: income06 ~ educ
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1773 3.4558e+12
## 2 1772 3.0542e+12 1 4.016e+11 233 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For inference to be meaningful at all, we need to check if our assumptions are holding. As a reminder, for a linear regression model we have 4 main assumptions:
income06
and educ
is linear. This one we can take a look at via a scatterplot with a smoothed line showing the relationship. (Note: due to overplotting–points on top of points–we used geom_count()
instead of geom_point()
.)gss %>%
ggplot(aes(educ, income06)) +
geom_count() +
geom_smooth()
plot()
function with our model object we get four plots. The first one let’s us assess our residuals for homoscedasticity. Does it look homoscedastic?fit_educ <- gss %>%
lm(income06 ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ)
income06
and educ
is linear. This one we can take a look at via a scatterplot. Does it look linear? (Note: due to overplotting–points on top of points–we used geom_count()
instead of geom_point()
.)gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
ggplot(aes(educ, income06)) +
geom_count() +
geom_smooth()
plot()
function with our model object we get four plots. The first one let’s us assess our residuals for homoscedasticity. Does it look homoscedastic? Is it better than it was before?fit_educ2 <- gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(income06 ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ2)
20. Again, the second plot above shows us whether our residuals are approximately normal. How does it look? 21. As an aside, income is giving us some troubles because of how it was collected. To adjust for this we could transform it using what is called a monotonic function (e.g., square root, natural log, etc.). In our case, let’s try the square root transformation to see if that helps with our assumptions. Notably, the interpretation changes to interpreting income to interpreting the square root of income. We’ll come back to this at a later date. Does it help with the assumptions?
fit_educ3 <- gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(sqrt(income06) ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ3)
Notably, real-life data do not fit the assumptions perfectly. We will talk about several approaches to dealing with this, such as transformation as we just showed, sandwhich estimators/robust standard errors, generalized linear models, and others.
educ
and hompop
. Let’s say we want to get a confidence interval around the point where someone has a 15 years of education (educ == 15
) and a medium home population (hompop == 4
). What is the standard error of this conditional mean?gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
mutate(educ15 = educ - 15, ## center educ to 15
hompop4 = hompop - 4) %>% ## center hompop to 4
lm(income06 ~ educ15 + hompop4,
data = .) %>%
summary()
##
## Call:
## lm(formula = income06 ~ educ15 + hompop4, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86858 -27195 -6937 20525 135525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77938.1 1428.2 54.57 <2e-16 ***
## educ15 6288.3 302.9 20.76 <2e-16 ***
## hompop4 7340.5 653.7 11.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38760 on 1770 degrees of freedom
## (250 observations deleted due to missingness)
## Multiple R-squared: 0.2304, Adjusted R-squared: 0.2295
## F-statistic: 264.9 on 2 and 1770 DF, p-value: < 2.2e-16
educ
and hompop
predicting income06
? (For a reminder, the original one is printed below.)gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(income06 ~ educ + hompop,
data = .) %>%
summary()
##
## Call:
## lm(formula = income06 ~ educ + hompop, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86858 -27195 -6937 20525 135525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45748.1 4595.2 -9.956 <2e-16 ***
## educ 6288.3 302.9 20.762 <2e-16 ***
## hompop 7340.5 653.7 11.230 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38760 on 1770 degrees of freedom
## (250 observations deleted due to missingness)
## Multiple R-squared: 0.2304, Adjusted R-squared: 0.2295
## F-statistic: 264.9 on 2 and 1770 DF, p-value: < 2.2e-16
educ15
and hompop4
? What does the intercept mean for the original case?confint()
.gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
mutate(educ15 = educ - 15, ## center educ to 15
hompop4 = hompop - 4) %>% ## center hompop to 4
lm(income06 ~ educ15 + hompop4,
data = .) %>%
confint()
## 2.5 % 97.5 %
## (Intercept) 75136.926 80739.374
## educ15 5694.251 6882.296
## hompop4 6058.475 8622.606
This was an introduction to many of the features of statistical inference in regression. These principles will be used extensively throughout the class.