```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Introduction
Chapter 4 introduces statistical inference with regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 4.
## Statistical Inference
1. Download the `GSS_reduced_example.csv` data set. Turns out that it is already saved for you in an R object in the `educ7610` package. We can install this package (since it is on GitHub and not CRAN) with some wizardry.
```{r, eval = FALSE}
# Run if you haven't installed remotes:
# install.packages("remotes")
remotes::install_github("tysonstanley/educ7610")
```
2. Open RStudio, start a new R script or RMarkdown document.
3. Load the `tidyverse` package (you can ignore the notes that you see below that it gives you once you load it) and the `furniture` package.
```{r}
library(tidyverse)
library(furniture)
library(educ7610)
```
3. Import it into R.
```{r}
data(gss)
```
4. Use a simple regression model to assess the effect of the number of years of education (`educ`) on income (`income06`). Using `summary()` obtain the F-statistic of the model with it's accompanying p-value, as well as the standard error, t-value, and p-value of the estimate itself.
```{r}
gss %>%
lm(income06 ~ educ,
data = .) %>%
summary()
```
5. This output gives you the estimate, the standard error of the estimate, the t-value of the estimate, and the p-value of the estimate (where the null is that there is no relationship). Is the relationship statistically significant at the $\alpha = .05$ level?
6. From that same output above, we see *a lot more* information. What pieces do you recognize? Which pieces are confusing?
7. Let's run a multiple regression with years of education (`educ`) on income (`income06`) while controlling for home population (`hompop`) and age (`age`).
```{r}
gss %>%
lm(income06 ~ educ + hompop + age,
data = .) %>%
summary()
```
8. How is the effect of education interpreted in this case? Is it statistically significant?
9. Let's compare this to the regression value after we standardize both variables with `scale()`. We first have to grab just the complete cases of the variables first.
```{r}
gss %>%
filter(complete.cases(income06, educ, hompop, age)) %>%
mutate(incomeZ = scale(income06) %>% as.numeric,
educZ = scale(educ) %>% as.numeric,
hompopZ = scale(hompop) %>% as.numeric,
ageZ = scale(age) %>% as.numeric) %>%
lm(incomeZ ~ educZ + hompopZ + ageZ,
data = .) %>%
summary()
```
10. Is the standardized estimates more/less/same significant as the unstandardized? Is this surprising?
11. We can also test the significance of a predictor using model comparisons. Let's look at adding `educ` to the null model (a model with just an intercept). We'll use `anova()` to compare the two models. Does the result match what we saw earlier?
```{r}
fit_null <- gss %>%
lm(income06 ~ 1,
data = .)
fit_educ <- gss %>%
lm(income06 ~ educ,
data = .)
anova(fit_null, fit_educ)
```
## Check Some Assumptions
For inference to be meaningful at all, we need to check if our assumptions are holding. As a reminder, for a linear regression model we have 4 main assumptions:
- Linearity
- Homoscedasticity
- Conditional Distribution of Y is normal and centered at $\hat{Y}$
- Independent Sampling
12. Let's test these assumptions using plots as we discussed in class. First, let's see if the relationship between `income06` and `educ` is linear. This one we can take a look at via a scatterplot with a smoothed line showing the relationship. (Note: due to overplotting--points on top of points--we used `geom_count()` instead of `geom_point()`.)
```{r}
gss %>%
ggplot(aes(educ, income06)) +
geom_count() +
geom_smooth()
```
13. That scatterplot does not look good because there is an outlier there. This is an example of one of the many reasons we want to visualize our data before modeling it. That point is likely problematic (it is a high leverage point -- we'll be talking about this later on).
14. Next let's look at homoscedasticity. This we will look at using our model object. Using the `plot()` function with our model object we get four plots. The first one let's us assess our residuals for homoscedasticity. Does it look homoscedastic?
```{r}
fit_educ <- gss %>%
lm(income06 ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ)
```
15. The second plot above shows us whether our residuals are approximately normal. The others we'll get to later. Do the residuals follow the line or do they deviate?
16. Finally, independent sampling is not an assumption that we can check. This is something that has to be taken care of during data collection; not during the analysis. So for this one, we will assume the data were collected in this fashion.
17. Because of the outlier, everything was somewhat thrown off. Let's take that outlier out and try these again.
18. Let's test these assumptions using plots as we discussed in class. First, let's see if the relationship between `income06` and `educ` is linear. This one we can take a look at via a scatterplot. Does it look linear? (Note: due to overplotting--points on top of points--we used `geom_count()` instead of `geom_point()`.)
```{r}
gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
ggplot(aes(educ, income06)) +
geom_count() +
geom_smooth()
```
19. Next let's look at homoscedasticity. This we will look at using our model object. Using the `plot()` function with our model object we get four plots. The first one let's us assess our residuals for homoscedasticity. Does it look homoscedastic? Is it better than it was before?
```{r}
fit_educ2 <- gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(income06 ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ2)
```
20. Again, the second plot above shows us whether our residuals are approximately normal. How does it look?
21. As an aside, income is giving us some troubles because of how it was collected. To adjust for this we could transform it using what is called a monotonic function (e.g., square root, natural log, etc.). In our case, let's try the square root transformation to see if that helps with our assumptions. Notably, the interpretation changes to interpreting income to interpreting the square root of income. We'll come back to this at a later date. Does it help with the assumptions?
```{r}
fit_educ3 <- gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(sqrt(income06) ~ educ,
data = .)
par(mfrow = c(2,2))
plot(fit_educ3)
```
Notably, real-life data do not fit the assumptions perfectly. We will talk about several approaches to dealing with this, such as transformation as we just showed, sandwhich estimators/robust standard errors, generalized linear models, and others.
## Inference about a Conditional Mean
22. Remember that if we want to make inference about a conditional mean (essentially a predicted point when we select the values of the X's), we can use a simple trick: subtrack the values from the original variables and then the information on the intercept is the information for the conditional mean at that point. Let's try this below using both `educ` and `hompop`. Let's say we want to get a confidence interval around the point where someone has a 15 years of education (`educ == 15`) and a medium home population (`hompop == 4`). What is the standard error of this conditional mean?
```{r}
gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
mutate(educ15 = educ - 15, ## center educ to 15
hompop4 = hompop - 4) %>% ## center hompop to 4
lm(income06 ~ educ15 + hompop4,
data = .) %>%
summary()
```
23. Did anything in the results change from the original regression model with `educ` and `hompop` predicting `income06`? (For a reminder, the original one is printed below.)
```{r}
gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
lm(income06 ~ educ + hompop,
data = .) %>%
summary()
```
24. What is the interpretation for the conditional mean intercept (the one with `educ15` and `hompop4`? What does the intercept mean for the original case?
25. Did the regression coefficients change on the variables? Why or why not?
26. Finally, we can get the confidence interval of this conditional mean using `confint()`.
```{r}
gss %>%
mutate(educ = furniture::washer(educ, 99, 98)) %>% ## removes all 99's from educ
mutate(educ15 = educ - 15, ## center educ to 15
hompop4 = hompop - 4) %>% ## center hompop to 4
lm(income06 ~ educ15 + hompop4,
data = .) %>%
confint()
```
27. What does this 95% confidence interval mean?
## Conclusion
This was an introduction to many of the features of statistical inference in regression. These principles will be used extensively throughout the class.