Introduction

Chapter 16 covers dealing with irregularities in the data, including violations of the model’s assumptions. Chapter 17 covers some miscellaneous topics, which we also work through here.

Irregularities

  1. Let’s start by loading the tidyverse package, the furniture package, and the educ7610 package.
library(tidyverse)
library(furniture)
library(educ7610)
  2. Import the poverty data set found in educ7610 into R.
data(poverty)
poverty
## # A tibble: 51 x 6
##    State poverty_pct `birthrate_15-1… `birthrate_18-1… ViolentCrime
##    <chr>       <dbl>            <dbl>            <dbl>        <dbl>
##  1 Alab…        20.1             31.5             88.7         11.2
##  2 Alas…         7.1             18.9             73.7          9.1
##  3 Ariz…        16.1             35              102.          10.4
##  4 Arka…        14.9             31.6            102.          10.4
##  5 Cali…        16.7             22.6             69.1         11.2
##  6 Colo…         8.8             26.2             79.1          5.8
##  7 Conn…         9.7             14.1             45.1          4.6
##  8 Dela…        10.3             24.7             77.8          3.5
##  9 Dist…        22               44.8            102.          65  
## 10 Flor…        16.2             23.2             78.4          7.3
## # … with 41 more rows, and 1 more variable: TeenBirth <dbl>
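Before digging into the model, a quick descriptive table can give a feel for the variables. This is a minimal sketch using furniture::table1() from the furniture package loaded above; the variables listed are just a choice for illustration.

poverty %>%
  furniture::table1(poverty_pct, ViolentCrime, TeenBirth)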

Extreme Cases

  1. Let’s look at some simple case-level diagnostics to see whether there are any apparent extreme cases. The output will be split into two sections, but you can connect the lines by matching the row numbers. Note that for diagnostics() to work, we have to supply the actual data to the data argument of lm(), as we do below.
lm(TeenBirth ~ ViolentCrime + poverty_pct,
           data = poverty) %>%
  educ7610::diagnostics() %>%
  round(3)
##    TeenBirth ViolentCrime poverty_pct   pred   resid dfb.1_ dfb.VlnC
## 1       54.5         11.2        20.1 54.978  -0.478  0.011    0.004
## 2       39.5          9.1         7.1 32.934   6.566  0.215    0.109
## 3       61.2         10.4        16.1 48.133  13.067 -0.084   -0.011
## 4       59.9         10.4        14.9 46.177  13.723 -0.017    0.025
## 5       41.1         11.2        16.7 49.434  -8.334  0.071    0.003
## 6       47.0          5.8         8.8 34.374  12.626  0.294    0.063
## 7       25.8          4.6         9.7 35.357  -9.557 -0.176   -0.002
## 8       46.3          3.5        10.3 35.891  10.409  0.160   -0.037
## 9       69.1         65.0        22.0 79.793 -10.693  0.693   -9.397
## 10      44.5          7.3        16.2 47.045  -2.545  0.020    0.020
## 11      55.7          9.5        12.1 41.248  14.452  0.150    0.086
## 12      38.2          4.7        10.3 36.376   1.824  0.029   -0.002
## 13      39.1          4.1        14.5 42.981  -3.881  0.009    0.044
## 14      42.2         10.3        12.4 42.060   0.140  0.001    0.001
## 15      44.6          8.0         9.6 36.566   8.034  0.163    0.065
## 16      32.5          1.8        12.2 38.303  -5.803 -0.038    0.066
## 17      43.0          6.2        10.8 37.796   5.204  0.074    0.007
## 18      51.0          7.2        14.7 44.559   6.441 -0.011   -0.031
## 19      58.1         17.0        19.7 56.667   1.433 -0.028    0.009
## 20      25.4          2.0        11.2 36.753 -11.353 -0.124    0.101
## 21      35.4         11.8        10.1 38.916  -3.516 -0.070   -0.054
## 22      23.3          3.6        11.0 37.073 -13.773 -0.173    0.068
## 23      34.8          8.5        12.2 41.007  -6.207 -0.058   -0.021
## 24      27.5          3.9         9.2 34.259  -6.759 -0.137    0.002
## 25      64.7         12.9        23.5 61.208   3.492 -0.144   -0.045
## 26      44.1          8.8         9.4 36.563   7.537  0.163    0.078
## 27      36.4          3.0        15.3 43.842  -7.442  0.046    0.117
## 28      37.0          2.9         9.6 34.508   2.492  0.045   -0.008
## 29      53.9         10.7        11.1 40.102  13.798  0.211    0.151
## 30      20.0          1.8         5.3 27.053  -7.053 -0.268   -0.027
## 31      26.8          5.1         7.8 32.461  -5.661 -0.153   -0.031
## 32      62.4          8.8        25.3 62.487  -0.087  0.005    0.003
## 33      29.5          8.5        16.5 48.018 -18.518  0.172    0.115
## 34      52.2          9.4        12.6 42.023  10.177  0.082    0.046
## 35      27.2          0.9        12.0 37.613 -10.413 -0.073    0.137
## 36      39.5          5.4        11.5 38.615   0.885  0.010   -0.002
## 37      58.0         12.2        17.1 50.490   7.510 -0.074    0.008
## 38      36.8          4.1        11.2 37.601  -0.801 -0.009    0.003
## 39      31.6          6.3        12.2 40.119  -8.519 -0.071    0.012
## 40      35.6          3.3        10.6 36.300  -0.700 -0.010    0.003
## 41      53.0          7.9        19.9 53.320  -0.320  0.008    0.005
## 42      38.0          1.8        14.5 42.053  -4.053  0.013    0.067
## 43      54.3         10.6        15.5 47.236   7.064 -0.026    0.006
## 44      64.4          9.0        17.4 49.688  14.712 -0.190   -0.103
## 45      36.8          3.9         8.4 32.955   3.845  0.091    0.006
## 46      24.2          2.2        10.3 35.366 -11.166 -0.166    0.073
## 47      37.6          7.6        10.2 37.383   0.217  0.004    0.001
## 48      33.0          5.1        12.5 40.124  -7.124 -0.047    0.034
## 49      45.5          4.9        16.7 46.891  -1.391  0.016    0.020
## 50      32.3          4.3         8.5 33.279  -0.979 -0.023   -0.002
## 51      39.9          2.1        12.2 38.424   1.476  0.010   -0.016
##    dfb.pvr_   dffit cov.r cook.d   hat
## 1    -0.014  -0.017 1.154  0.000 0.077
## 2    -0.200   0.234 1.105  0.018 0.076
## 3     0.148   0.282 0.933  0.026 0.029
## 4     0.078   0.262 0.912  0.022 0.023
## 5    -0.110  -0.190 1.032  0.012 0.034
## 6    -0.232   0.327 0.953  0.035 0.042
## 7     0.120  -0.214 1.010  0.015 0.032
## 8    -0.089   0.221 0.990  0.016 0.029
## 9     1.614 -10.032 3.272 25.618 0.864
## 10   -0.038  -0.058 1.096  0.001 0.034
## 11   -0.094   0.275 0.893  0.024 0.023
## 12   -0.018   0.038 1.093  0.000 0.028
## 13   -0.040  -0.083 1.083  0.002 0.030
## 14   -0.001   0.003 1.091  0.000 0.023
## 15   -0.133   0.193 1.041  0.012 0.037
## 16   -0.012  -0.122 1.063  0.005 0.029
## 17   -0.046   0.102 1.066  0.004 0.026
## 18    0.051   0.122 1.050  0.005 0.024
## 19    0.031   0.048 1.142  0.001 0.069
## 20    0.032  -0.240 0.971  0.019 0.029
## 21    0.064  -0.093 1.102  0.003 0.045
## 22    0.075  -0.279 0.913  0.025 0.026
## 23    0.030  -0.111 1.050  0.004 0.021
## 24    0.095  -0.160 1.059  0.009 0.036
## 25    0.169   0.186 1.231  0.012 0.146
## 26   -0.139   0.192 1.053  0.012 0.041
## 27   -0.114  -0.187 1.054  0.012 0.041
## 28   -0.028   0.056 1.096  0.001 0.034
## 29   -0.174   0.310 0.916  0.031 0.032
## 30    0.224  -0.273 1.111  0.025 0.087
## 31    0.125  -0.163 1.090  0.009 0.052
## 32   -0.006  -0.006 1.367  0.000 0.221
## 33   -0.292  -0.447 0.786  0.061 0.034
## 34   -0.041   0.183 0.988  0.011 0.021
## 35   -0.022  -0.233 0.993  0.018 0.032
## 36   -0.004   0.016 1.089  0.000 0.023
## 37    0.105   0.180 1.049  0.011 0.037
## 38    0.004  -0.015 1.092  0.000 0.025
## 39    0.022  -0.150 1.017  0.008 0.021
## 40    0.005  -0.014 1.095  0.000 0.028
## 41   -0.011  -0.012 1.163  0.000 0.084
## 42   -0.052  -0.100 1.092  0.003 0.039
## 43    0.057   0.140 1.044  0.007 0.026
## 44    0.285   0.391 0.901  0.048 0.043
## 45   -0.069   0.101 1.099  0.003 0.044
## 46    0.081  -0.246 0.977  0.020 0.031
## 47   -0.003   0.005 1.099  0.000 0.031
## 48    0.000  -0.128 1.039  0.005 0.022
## 49   -0.028  -0.038 1.116  0.000 0.047
## 50    0.017  -0.025 1.112  0.000 0.043
## 51    0.003   0.030 1.094  0.000 0.028
  2. Of these, do any look problematic in terms of their residuals, Cook’s distances, or hat values? Which ones? (Match the row number here to the corresponding state in the data printed earlier.)
  3. There’s one point that is particularly bad on both its hat value and its Cook’s distance. Which state is it? (A sketch for flagging such cases programmatically follows below.)
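Rather than scanning all 51 rows by eye, we can flag candidate cases programmatically. This is a minimal sketch, assuming the diagnostics() output can be treated as a regular data frame (as the printout suggests), using the common rules of thumb of Cook’s distance above 4/n and hat values above twice the average hat value:

n <- nrow(poverty)
k <- 2   # number of predictors in the model
lm(TeenBirth ~ ViolentCrime + poverty_pct, data = poverty) %>%
  educ7610::diagnostics() %>%
  mutate(row = row_number()) %>%
  filter(cook.d > 4 / n | hat > 2 * (k + 1) / n) %>%
  select(row, resid, cook.d, hat)

The row numbers that come out can be matched back to the states in the data printed earlier.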
  4. Let’s look at the results with and without this point and see how much it changes the estimates and conclusions.
## With the extreme value
poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5183  -6.4833  -0.3196   6.5035  14.7125 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.6848     3.8686   4.571 3.41e-05 ***
## ViolentCrime   0.4037     0.1497   2.697  0.00962 ** 
## poverty_pct    1.6304     0.3119   5.227 3.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.329 on 48 degrees of freedom
## Multiple R-squared:  0.5611, Adjusted R-squared:  0.5428 
## F-statistic: 30.68 on 2 and 48 DF,  p-value: 2.608e-09
## Without the extreme value
poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.363  -5.052   0.958   4.745  13.649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.3419     3.4315   4.471  4.9e-05 ***
## ViolentCrime   1.6327     0.3352   4.871  1.3e-05 ***
## poverty_pct    1.1905     0.2941   4.048 0.000192 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.278 on 47 degrees of freedom
## Multiple R-squared:  0.6366, Adjusted R-squared:  0.6211 
## F-statistic: 41.17 on 2 and 47 DF,  p-value: 4.668e-11
  5. What are the differences between the estimates from the full sample and from the sample without the extreme value? (A side-by-side sketch of the coefficients follows below.)
  6. Is this surprising after seeing the diagnostics?
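To see the shift concretely, here is a minimal sketch that lines the two sets of coefficients up side by side (fit_with and fit_without are just illustrative names):

fit_with    <- lm(TeenBirth ~ ViolentCrime + poverty_pct, data = poverty)
fit_without <- update(fit_with,
                      data = filter(poverty, State != "District_of_Columbia"))
cbind(with_DC = coef(fit_with), without_DC = coef(fit_without))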

Assumptions

  1. There are a few ways to check normality and homoscedasticity. In R, the easiest is to call plot() on the fitted model object (a couple of formal tests are also sketched after the plot discussion below). Let’s first look at the full sample and see whether the extreme point hurts us with regard to our assumptions.
fit_full <- poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) 

par(mfrow = c(2,2))
plot(fit_full)

  2. In three of the four plots, point “9” is specifically labeled because of how extreme it is. Let’s talk about each plot and why point “9” keeps showing up.
  3. The “Residuals vs Fitted” plot shows us a few things. First, we are looking to see whether the red line runs roughly along the dotted line at 0. This one isn’t too bad until the last point (which turns out to be point “9”). Second, we look at whether the spread around the dotted line looks fairly random. Other than the extreme value, this also looks pretty good, suggesting homoscedasticity holds and the residuals appear to be independent.
  4. The “Normal Q-Q” plot tells us that, if the points fall along the dotted line, the residuals are approximately normal. For the most part, the points sit very close to the line. Point “9” may be a slight issue here as well, but it doesn’t seem to be much of a problem.
  5. The “Scale-Location” plot is similar to the “Residuals vs Fitted” plot but shows the residuals in a slightly different way (the square root of the absolute standardized residuals). Point “9” is, again, an extreme value here.
  6. Finally, the “Residuals vs Leverage” plot shows leverage along with contours of Cook’s distance. Most points have very little leverage. But guess which point has a lot of leverage. Yep, point “9”.
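Beyond the visual checks, a couple of formal tests can back up what the plots suggest. A minimal sketch, assuming the lmtest package is installed (shapiro.test() is base R); small p-values flag potential violations:

shapiro.test(residuals(fit_full))  # Shapiro-Wilk test of residual normality
lmtest::bptest(fit_full)           # Breusch-Pagan test for heteroscedasticity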
  7. Let’s look at these diagnostic plots again after removing this problematic point.
fit <- poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) 

par(mfrow = c(2,2))
plot(fit)

  8. Are these better than in the full sample? Why or why not?
  9. After running all of these models and diagnostics, it is clear that something is extreme about the District of Columbia. Why might this be? (A quick comparison of DC with the other states is sketched below.)
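One way to see how different DC is: compare its values with the averages for the other states. A minimal sketch (is_dc is just an illustrative name):

poverty %>%
  group_by(is_dc = State == "District_of_Columbia") %>%
  summarise(mean_crime   = mean(ViolentCrime),
            mean_poverty = mean(poverty_pct),
            mean_teen    = mean(TeenBirth))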

Miscellaneous

Below, let’s go over a few examples of both measurement and specification error.

Measurement Error

  1. Below, we look at the relationship between TeenBirth and ViolentCrime when one, the other, or both are measured with more error. The noisier versions, created by adding random uniform noise, are labeled TeenBirth_ME and ViolentCrime_ME.
set.seed(843)
poverty_ME <- poverty %>%
  filter(State != "District_of_Columbia") %>%
  mutate(TeenBirth_ME = TeenBirth + runif(50, 0, 20),
         ViolentCrime_ME = ViolentCrime + runif(50, 0, 10))
  2. First, the original model with little measurement error.
no_me <- poverty_ME %>%
  lm(TeenBirth ~ ViolentCrime, data = .)
  3. Now the model with high measurement error in the outcome.
me_out <- poverty_ME %>%
  lm(TeenBirth_ME ~ ViolentCrime, data = .)
  4. Now the model with high measurement error in the predictor.
me_pred <- poverty_ME %>%
  lm(TeenBirth ~ ViolentCrime_ME, data = .)
  5. Finally, the model with high measurement error in both.
me_both <- poverty_ME %>%
  lm(TeenBirth_ME ~ ViolentCrime_ME, data = .)
  6. Let’s put the results together using texreg::screenreg().
texreg::screenreg(list(no_me, me_out, me_pred, me_both),
                  custom.model.names = c("No ME", "Outcome ME", "Predictor ME", "Both ME"))
## 
## ===============================================================
##                  No ME      Outcome ME  Predictor ME  Both ME  
## ---------------------------------------------------------------
## (Intercept)      26.05 ***  37.43 ***   28.22 ***     39.10 ***
##                  (2.51)     (2.74)      (4.27)        (4.29)   
## ViolentCrime      2.33 ***   2.00 ***                          
##                  (0.33)     (0.36)                             
## ViolentCrime_ME                          1.14 **       0.99 ** 
##                                         (0.34)        (0.34)   
## ---------------------------------------------------------------
## R^2               0.51       0.39        0.19          0.15    
## Adj. R^2          0.50       0.38        0.18          0.13    
## Num. obs.        50         50          50            50       
## RMSE              8.36       9.14       10.74         10.79    
## ===============================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
  7. Compare the results of the models. What does measurement error do to the estimates, standard errors, and \(R^2\)? (A sketch of the expected attenuation in the predictor’s slope follows below.)
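With classical measurement error in the predictor, the slope is attenuated by roughly the reliability ratio, Var(X) / (Var(X) + Var(noise)). A rough sketch of that expectation for the simulated data above, using the fact that Uniform(0, 10) noise has variance 100/12; the ViolentCrime_ME slope should land in this neighborhood, up to sampling error:

attenuation <- var(poverty_ME$ViolentCrime) /
  (var(poverty_ME$ViolentCrime) + 100 / 12)  # variance of the added Uniform(0, 10) noise
coef(no_me)["ViolentCrime"] * attenuation    # roughly the expected attenuated slope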

Specification Errors

  1. Use the following data frame for this demonstration of specification errors. We make a fictitious data set to show what happens if we leave out an important variable and what happens if we include an irrelevant one. Since in reality we can rarely know these things, it is best to build a data set where we do know them so we can get a sense of their effects.
df <- tibble(
  x1 = rnorm(100),
  x2 = .1*x1 + rnorm(100),
  x3 = rnorm(100),
  y = x1 + x2 + rnorm(100)
)
  2. The “true model” is one where both x1 and x2 predict y, each with a weight of 1. Let’s see what happens if we don’t leave anything out.
df %>%
  lm(y ~ x1 + x2, data = .) %>%
  summary()
## 
## Call:
## lm(formula = y ~ x1 + x2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57720 -0.59077  0.00064  0.61818  1.98121 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16679    0.09778  -1.706   0.0913 .  
## x1           1.06816    0.10966   9.741 4.86e-16 ***
## x2           1.00611    0.09607  10.473  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9755 on 97 degrees of freedom
## Multiple R-squared:  0.7087, Adjusted R-squared:  0.7027 
## F-statistic:   118 on 2 and 97 DF,  p-value: < 2.2e-16
  3. Now, what happens if we don’t include x2 (an omitted variable)? What happens to the relationship between x1 and y?
df %>%
  lm(y ~ x1, data = .) %>%
  summary()
## 
## Call:
## lm(formula = y ~ x1, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9625 -0.7536 -0.0439  0.8467  3.8244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1788     0.1420  -1.259    0.211    
## x1            1.2212     0.1578   7.737 9.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.417 on 98 degrees of freedom
## Multiple R-squared:  0.3792, Adjusted R-squared:  0.3729 
## F-statistic: 59.87 on 1 and 98 DF,  p-value: 9.173e-12
  4. Why is this an omitted variable? (A sketch of the omitted-variable algebra follows below.)
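A minimal sketch of the omitted-variable algebra: when x2 is left out, the x1 slope in the short model equals the x1 slope from the full model plus the x2 slope times delta, where delta is the coefficient from regressing x2 on x1 (delta, full, and short are just illustrative names):

delta <- coef(lm(x2 ~ x1, data = df))["x1"]
full  <- coef(lm(y ~ x1 + x2, data = df))
short <- coef(lm(y ~ x1, data = df))
unname(full["x1"] + full["x2"] * delta)  # matches the x1 slope in the short model
unname(short["x1"])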
  5. If we include x3, what does it do to the model?
df %>%
  lm(y ~ x1 + x2 + x3, data = .) %>%
  summary()
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48582 -0.66864 -0.00192  0.63545  1.99531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16466    0.09791  -1.682   0.0959 .  
## x1           1.07463    0.11001   9.769 4.67e-16 ***
## x2           0.99642    0.09677  10.297  < 2e-16 ***
## x3          -0.08670    0.09699  -0.894   0.3736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9765 on 96 degrees of freedom
## Multiple R-squared:  0.7111, Adjusted R-squared:  0.702 
## F-statistic: 78.75 on 3 and 96 DF,  p-value: < 2.2e-16
  6. Is having an omitted variable or an extra, irrelevant variable more problematic? (A small simulation sketch follows below.)
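One way to weigh this is a small simulation that repeats the data-generating process above many times and tracks the x1 estimate when x2 is omitted versus when the irrelevant x3 is included. A rough sketch (sims is just an illustrative name); with x2 = .1*x1 + noise and both true slopes equal to 1, the omitted-variable estimate should center near 1.1 while the model including x3 should stay near 1:

sims <- replicate(500, {
  d <- data.frame(x1 = rnorm(100))
  d$x2 <- .1 * d$x1 + rnorm(100)
  d$x3 <- rnorm(100)
  d$y  <- d$x1 + d$x2 + rnorm(100)
  c(omit_x2 = coef(lm(y ~ x1, data = d))["x1"],
    with_x3 = coef(lm(y ~ x1 + x2 + x3, data = d))["x1"])
})
rowMeans(sims)  # the omitted-variable estimate is biased upward; the other is not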

Conclusion

These topics matter in any regression analysis. We always want to assess the data, check whether anything looks odd, whether individual points are unduly influencing the results, and whether other sources of error (measurement, specification) are affecting the conclusions.