```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Chapter 16 covers dealing with irregularities in the data, including violations of the model's assumptions. Chapter 17 discusses some miscellaneous topics, which we cover here as well.
## Irregularities
1. Let's start by loading the `tidyverse` package, the `furniture` package, and the `rlm` package.
```{r}
library(tidyverse)
library(furniture)
library(rlm)
```
2. Import the `poverty` data set found in `rlm` into R.
```{r}
data(poverty)
poverty
```
## Extreme Cases
3. Let's look at some simple descriptives to see if there are any apparent extreme cases. Note that the output will be split into two sections, but you can connect the lines by matching the row numbers. *Note that for `diagnostics()` to work, we have to put the actual data in the `data` argument of the `lm()` function, as we did below.*
```{r}
lm(TeenBirth ~ ViolentCrime + poverty_pct,
data = poverty) %>%
rlm::diagnostics() %>%
round(3)
```
4. Of these, are there any that look problematic in their residuals, Cook's distances, or hat values? Which ones? (Match the row number to the state in the data printed earlier.)
5. There's one point that is particularly horrible in both its hat value and cooks distance. Which state is it?
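Rather than eyeballing the printed table, we can pull the largest values directly with base R's `cooks.distance()` and `hatvalues()` functions. This is a quick sketch that assumes the `poverty` data loaded above:

```r
fit_check <- lm(TeenBirth ~ ViolentCrime + poverty_pct,
                data = poverty)

## Which state has the largest Cook's distance?
poverty$State[which.max(cooks.distance(fit_check))]

## Which state has the largest hat value?
poverty$State[which.max(hatvalues(fit_check))]
```

If both lines print the same state, that point is extreme in both leverage and influence.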
6. Let's look at the results with and without this point and see how much it changes the estimates and conclusions.
```{r}
## With the extreme value
poverty %>%
lm(TeenBirth ~ ViolentCrime + poverty_pct,
data = .) %>%
summary()
## Without the extreme value
poverty %>%
filter(State != "District_of_Columbia") %>%
lm(TeenBirth ~ ViolentCrime + poverty_pct,
data = .) %>%
summary()
```
7. What are the differences between the estimates of the full sample and the sample without the extreme value?
8. Is this surprising after the seeing the diagnostics?
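To see the differences side by side, one option (a sketch, refitting the same two models as above) is to bind the coefficients together into one table:

```r
fit_with    <- lm(TeenBirth ~ ViolentCrime + poverty_pct,
                  data = poverty)
fit_without <- lm(TeenBirth ~ ViolentCrime + poverty_pct,
                  data = filter(poverty, State != "District_of_Columbia"))

## Side-by-side coefficients, rounded for readability
round(cbind("With DC"    = coef(fit_with),
            "Without DC" = coef(fit_without)), 3)
```

This makes it easy to see which estimates shift (and by how much) once the extreme point is removed.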
## Assumptions
9. There are a few ways to check normality and homoscedasticity. In `R` the easiest way is to use the `plot()` function. Let's first look at the full sample and see if the extreme point hurts us with regard to our assumptions.
```{r}
fit_full <- poverty %>%
lm(TeenBirth ~ ViolentCrime + poverty_pct,
data = .)
par(mfrow = c(2,2))
plot(fit_full)
```
10. In three of the four plots, point "9" is specifically labeled due to its extremeness. Let's talk about each plot and why point "9" keeps showing up.
11. The "Residuals vs Fitted" plot shows us a few things. First, we are looking to see if the red line runs roughly along the dotted line at 0. This one isn't too bad until the last point (which, it turns out, is point "9"). Second, we look to see whether the spread looks fairly random around the dotted line. Other than the extreme value, this is also pretty good. This suggests homoscedasticity holds and the residuals seem to be independent.
12. The "Normal Q-Q" plot tells us that, if the points fall along the dotted line, we have normality in our residuals. For the most part, the points are very much along the line. Point "9" may be a problem here as well, but it definitely doesn't seem to be much of a problem.
13. The "Scale-Location" plot is similar to the "Residuals vs Fitted" plot but shows the residuals in a slightly different way (the square root of the standardized residual). Point "9" is, again, an extreme value here.
14. Finally, the "Residuals vs Leverage" plot shows us Cook's Distance. Most points have very little leverage. But guess which point has a lot of leverage. Yep, point "9".
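Beyond the visual checks, base R offers some numeric complements. As a sketch (assuming `fit_full` from above), a Shapiro-Wilk test gives a formal check of residual normality, and `which = 4` pulls out a dedicated Cook's distance panel:

```r
## Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(fit_full))

## Cook's distance on its own panel (the fourth diagnostic plot)
plot(fit_full, which = 4)
```

A small p-value from the Shapiro-Wilk test suggests non-normal residuals, though with small samples it should be read alongside the Q-Q plot rather than in place of it.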
15. Let's look at these diagnostic plots after removing this problematic point again.
```{r}
fit <- poverty %>%
filter(State != "District_of_Columbia") %>%
lm(TeenBirth ~ ViolentCrime + poverty_pct,
data = .)
par(mfrow = c(2,2))
plot(fit)
```
16. Are these better than in the full sample? Why or why not?
17. After running all of these models and diagnostics, it is clear that something is extreme about the District of Columbia. Why might this be?
## Miscellaneous
Below, let's go over a few examples of both measurement and specification error.
### Measurement Error
18. Below, we look at the relationship between `TeenBirth` and `ViolentCrime` when one, the other, or both have more measurement error, labeled `TeenBirth_ME` and `ViolentCrime_ME`.
```{r}
set.seed(843)
poverty_ME <- poverty %>%
filter(State != "District_of_Columbia") %>%
mutate(TeenBirth_ME = TeenBirth + runif(50, 0, 20),
ViolentCrime_ME = ViolentCrime + runif(50, 0, 10))
```
19. First the original model with little measurement error.
```{r}
no_me <- poverty_ME %>%
lm(TeenBirth ~ ViolentCrime, data = .)
```
20. Now the model with high measurement error in the outcome.
```{r}
me_out <- poverty_ME %>%
lm(TeenBirth_ME ~ ViolentCrime, data = .)
```
21. Now the model with high measurement error in the predictor.
```{r}
me_pred <- poverty_ME %>%
lm(TeenBirth ~ ViolentCrime_ME, data = .)
```
22. Finally, the model with both having high measurement error.
```{r}
me_both <- poverty_ME %>%
lm(TeenBirth_ME ~ ViolentCrime_ME, data = .)
```
23. Let's put the results together using `texreg::screenreg()`.
```{r}
texreg::screenreg(list(no_me, me_out, me_pred, me_both),
custom.model.names = c("No ME", "Outcome ME", "Predictor ME", "Both ME"))
```
24. Compare the results of the models. What does measurement error do to the estimates/standard errors/$R^2$?
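Classical measurement error in a predictor attenuates its slope toward zero by the factor $\text{var}(x) / (\text{var}(x) + \text{var}(\text{error}))$. A small self-contained simulation (separate from the `poverty` example) makes this concrete:

```r
set.seed(1)
n <- 10000
x    <- rnorm(n, sd = 2)        # true predictor, variance = 4
y    <- 2 * x + rnorm(n)        # true slope = 2
x_me <- x + rnorm(n, sd = 2)    # measurement error with variance = 4

## Observed slope should shrink toward 2 * 4 / (4 + 4) = 1
coef(lm(y ~ x_me))["x_me"]
```

With a large `n`, the estimated slope lands close to 1 rather than the true 2, which mirrors the attenuation you should see in the `Predictor ME` column above.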
### Specification Errors
25. Use the following data frame for this demonstration regarding specification errors. We make a fictitious data set here to show what happens if we leave out an important variable and what happens if we include an irrelevant one. Since in reality we can rarely know this, it is best to make a data set where we *do* know these things so we can get a sense for their effects.
```{r}
## set a seed so the simulated data are reproducible
set.seed(84)
df <- tibble(
  x1 = rnorm(100),
  x2 = .1*x1 + rnorm(100),
  x3 = rnorm(100),
  y  = x1 + x2 + rnorm(100)
)
```
26. The "true model" is where both `x1` and `x2` predict `y` at the same weight of 1. Let's see what happens if we don't leave anything out.
```{r}
df %>%
lm(y ~ x1 + x2, data = .) %>%
summary()
```
27. Now, what happens if we don't include `x2` (an omitted variable)? What happens to the relationship between `x1` and `y`?
```{r}
df %>%
lm(y ~ x1, data = .) %>%
summary()
```
28. Why is this an omitted variable?
29. If we include `x3`, what does it do to the model?
```{r}
df %>%
lm(y ~ x1 + x2 + x3, data = .) %>%
summary()
```
30. Is having an omitted variable or an extra variable more problematic?
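To see why the omitted variable matters while the irrelevant one mostly doesn't, here is a self-contained simulation (using the same data-generating process as `df` above) that repeats each fit many times and averages the estimates for `x1`:

```r
set.seed(42)
est <- replicate(500, {
  x1 <- rnorm(100)
  x2 <- .1 * x1 + rnorm(100)
  x3 <- rnorm(100)
  y  <- x1 + x2 + rnorm(100)
  c(omitted = coef(lm(y ~ x1))["x1"],
    full    = coef(lm(y ~ x1 + x2 + x3))["x1"])
})

## The omitted-variable fit is biased toward 1.1 (= 1 + .1 * 1);
## the model including the irrelevant x3 stays centered at the true 1.
rowMeans(est)
```

The omitted variable produces systematic bias, while the irrelevant variable mainly costs a little precision (a slightly larger standard error).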
## Conclusion
These topics are consistently important in any regression analysis. We always want to assess the data: check whether anything looks odd, whether any points are influential on the results, and whether other errors (measurement, specification) are distorting the conclusions.