Introduction

Chapter 16 covers dealing with irregularities in the data, including violations of assumptions of the model. Chapter 17 discusses some miscellaneous topics that we’ll cover in here as well.

Irregularities

  1. Let’s start by loading the tidyverse package, the furniture package, and the rlm package.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.5     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.6     
## ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(furniture)
## ── furniture 1.7.12 ────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
##    Consider using `furniture::` for each function call.
library(rlm)
## ── rlm 0.1.2 ───────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ rlm attached
## ✔ No potential conflicts found
  1. Import the poverty data set found in rlm into R.
data(poverty)
poverty
## # A tibble: 51 x 6
##    State        poverty_pct `birthrate_15-1… `birthrate_18-1… ViolentCrime
##    <chr>              <dbl>            <dbl>            <dbl>        <dbl>
##  1 Alabama             20.1             31.5             88.7         11.2
##  2 Alaska               7.1             18.9             73.7          9.1
##  3 Arizona             16.1             35              102.          10.4
##  4 Arkansas            14.9             31.6            102.          10.4
##  5 California          16.7             22.6             69.1         11.2
##  6 Colorado             8.8             26.2             79.1          5.8
##  7 Connecticut          9.7             14.1             45.1          4.6
##  8 Delaware            10.3             24.7             77.8          3.5
##  9 District_of…        22               44.8            102.          65  
## 10 Florida             16.2             23.2             78.4          7.3
## # ... with 41 more rows, and 1 more variable: TeenBirth <dbl>

Extreme Cases

  1. Let’s look at some simple descriptives to see if there are any apparent extreme cases. Note that the output will be split into two sections but you can connect the lines by looking at the row number. Note that for diagnostics() to work, we have to put the actual data in the data argument in the lm() function as we did below.
lm(TeenBirth ~ ViolentCrime + poverty_pct,
           data = poverty) %>%
  rlm::diagnostics() %>%
  round(3)
##    TeenBirth ViolentCrime poverty_pct   pred   resid dfb.1_ dfb.VlnC
## 1       54.5         11.2        20.1 54.978  -0.478  0.011    0.004
## 2       39.5          9.1         7.1 32.934   6.566  0.215    0.109
## 3       61.2         10.4        16.1 48.133  13.067 -0.084   -0.011
## 4       59.9         10.4        14.9 46.177  13.723 -0.017    0.025
## 5       41.1         11.2        16.7 49.434  -8.334  0.071    0.003
## 6       47.0          5.8         8.8 34.374  12.626  0.294    0.063
## 7       25.8          4.6         9.7 35.357  -9.557 -0.176   -0.002
## 8       46.3          3.5        10.3 35.891  10.409  0.160   -0.037
## 9       69.1         65.0        22.0 79.793 -10.693  0.693   -9.397
## 10      44.5          7.3        16.2 47.045  -2.545  0.020    0.020
## 11      55.7          9.5        12.1 41.248  14.452  0.150    0.086
## 12      38.2          4.7        10.3 36.376   1.824  0.029   -0.002
## 13      39.1          4.1        14.5 42.981  -3.881  0.009    0.044
## 14      42.2         10.3        12.4 42.060   0.140  0.001    0.001
## 15      44.6          8.0         9.6 36.566   8.034  0.163    0.065
## 16      32.5          1.8        12.2 38.303  -5.803 -0.038    0.066
## 17      43.0          6.2        10.8 37.796   5.204  0.074    0.007
## 18      51.0          7.2        14.7 44.559   6.441 -0.011   -0.031
## 19      58.1         17.0        19.7 56.667   1.433 -0.028    0.009
## 20      25.4          2.0        11.2 36.753 -11.353 -0.124    0.101
## 21      35.4         11.8        10.1 38.916  -3.516 -0.070   -0.054
## 22      23.3          3.6        11.0 37.073 -13.773 -0.173    0.068
## 23      34.8          8.5        12.2 41.007  -6.207 -0.058   -0.021
## 24      27.5          3.9         9.2 34.259  -6.759 -0.137    0.002
## 25      64.7         12.9        23.5 61.208   3.492 -0.144   -0.045
## 26      44.1          8.8         9.4 36.563   7.537  0.163    0.078
## 27      36.4          3.0        15.3 43.842  -7.442  0.046    0.117
## 28      37.0          2.9         9.6 34.508   2.492  0.045   -0.008
## 29      53.9         10.7        11.1 40.102  13.798  0.211    0.151
## 30      20.0          1.8         5.3 27.053  -7.053 -0.268   -0.027
## 31      26.8          5.1         7.8 32.461  -5.661 -0.153   -0.031
## 32      62.4          8.8        25.3 62.487  -0.087  0.005    0.003
## 33      29.5          8.5        16.5 48.018 -18.518  0.172    0.115
## 34      52.2          9.4        12.6 42.023  10.177  0.082    0.046
## 35      27.2          0.9        12.0 37.613 -10.413 -0.073    0.137
## 36      39.5          5.4        11.5 38.615   0.885  0.010   -0.002
## 37      58.0         12.2        17.1 50.490   7.510 -0.074    0.008
## 38      36.8          4.1        11.2 37.601  -0.801 -0.009    0.003
## 39      31.6          6.3        12.2 40.119  -8.519 -0.071    0.012
## 40      35.6          3.3        10.6 36.300  -0.700 -0.010    0.003
## 41      53.0          7.9        19.9 53.320  -0.320  0.008    0.005
## 42      38.0          1.8        14.5 42.053  -4.053  0.013    0.067
## 43      54.3         10.6        15.5 47.236   7.064 -0.026    0.006
## 44      64.4          9.0        17.4 49.688  14.712 -0.190   -0.103
## 45      36.8          3.9         8.4 32.955   3.845  0.091    0.006
## 46      24.2          2.2        10.3 35.366 -11.166 -0.166    0.073
## 47      37.6          7.6        10.2 37.383   0.217  0.004    0.001
## 48      33.0          5.1        12.5 40.124  -7.124 -0.047    0.034
## 49      45.5          4.9        16.7 46.891  -1.391  0.016    0.020
## 50      32.3          4.3         8.5 33.279  -0.979 -0.023   -0.002
## 51      39.9          2.1        12.2 38.424   1.476  0.010   -0.016
##    dfb.pvr_   dffit cov.r cook.d   hat
## 1    -0.014  -0.017 1.154  0.000 0.077
## 2    -0.200   0.234 1.105  0.018 0.076
## 3     0.148   0.282 0.933  0.026 0.029
## 4     0.078   0.262 0.912  0.022 0.023
## 5    -0.110  -0.190 1.032  0.012 0.034
## 6    -0.232   0.327 0.953  0.035 0.042
## 7     0.120  -0.214 1.010  0.015 0.032
## 8    -0.089   0.221 0.990  0.016 0.029
## 9     1.614 -10.032 3.272 25.618 0.864
## 10   -0.038  -0.058 1.096  0.001 0.034
## 11   -0.094   0.275 0.893  0.024 0.023
## 12   -0.018   0.038 1.093  0.000 0.028
## 13   -0.040  -0.083 1.083  0.002 0.030
## 14   -0.001   0.003 1.091  0.000 0.023
## 15   -0.133   0.193 1.041  0.012 0.037
## 16   -0.012  -0.122 1.063  0.005 0.029
## 17   -0.046   0.102 1.066  0.004 0.026
## 18    0.051   0.122 1.050  0.005 0.024
## 19    0.031   0.048 1.142  0.001 0.069
## 20    0.032  -0.240 0.971  0.019 0.029
## 21    0.064  -0.093 1.102  0.003 0.045
## 22    0.075  -0.279 0.913  0.025 0.026
## 23    0.030  -0.111 1.050  0.004 0.021
## 24    0.095  -0.160 1.059  0.009 0.036
## 25    0.169   0.186 1.231  0.012 0.146
## 26   -0.139   0.192 1.053  0.012 0.041
## 27   -0.114  -0.187 1.054  0.012 0.041
## 28   -0.028   0.056 1.096  0.001 0.034
## 29   -0.174   0.310 0.916  0.031 0.032
## 30    0.224  -0.273 1.111  0.025 0.087
## 31    0.125  -0.163 1.090  0.009 0.052
## 32   -0.006  -0.006 1.367  0.000 0.221
## 33   -0.292  -0.447 0.786  0.061 0.034
## 34   -0.041   0.183 0.988  0.011 0.021
## 35   -0.022  -0.233 0.993  0.018 0.032
## 36   -0.004   0.016 1.089  0.000 0.023
## 37    0.105   0.180 1.049  0.011 0.037
## 38    0.004  -0.015 1.092  0.000 0.025
## 39    0.022  -0.150 1.017  0.008 0.021
## 40    0.005  -0.014 1.095  0.000 0.028
## 41   -0.011  -0.012 1.163  0.000 0.084
## 42   -0.052  -0.100 1.092  0.003 0.039
## 43    0.057   0.140 1.044  0.007 0.026
## 44    0.285   0.391 0.901  0.048 0.043
## 45   -0.069   0.101 1.099  0.003 0.044
## 46    0.081  -0.246 0.977  0.020 0.031
## 47   -0.003   0.005 1.099  0.000 0.031
## 48    0.000  -0.128 1.039  0.005 0.022
## 49   -0.028  -0.038 1.116  0.000 0.047
## 50    0.017  -0.025 1.112  0.000 0.043
## 51    0.003   0.030 1.094  0.000 0.028
  1. Of these, are there any that look problemmatic with the residuals, cooks, or hat values? Which ones? (Look at the row number and look at which state this is from the printed data before).
  2. There’s one point that is particularly horrible in both its hat value and cooks distance. Which state is it?
  3. Let’s look at the results with and without this point and see how much it changes the estimates and conclusions.
## With the extreme value
poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5183  -6.4833  -0.3196   6.5035  14.7125 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.6848     3.8686   4.571 3.41e-05 ***
## ViolentCrime   0.4037     0.1497   2.697  0.00962 ** 
## poverty_pct    1.6304     0.3119   5.227 3.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.329 on 48 degrees of freedom
## Multiple R-squared:  0.5611, Adjusted R-squared:  0.5428 
## F-statistic: 30.68 on 2 and 48 DF,  p-value: 2.608e-09
## Withut the extreme value
poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.363  -5.052   0.958   4.745  13.649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.3419     3.4315   4.471  4.9e-05 ***
## ViolentCrime   1.6327     0.3352   4.871  1.3e-05 ***
## poverty_pct    1.1905     0.2941   4.048 0.000192 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.278 on 47 degrees of freedom
## Multiple R-squared:  0.6366, Adjusted R-squared:  0.6211 
## F-statistic: 41.17 on 2 and 47 DF,  p-value: 4.668e-11
  1. What are the differences between the estimates of the full sample and the sample without the extreme value?
  2. Is this surprising after the seeing the diagnostics?

Assumptions

  1. There are a few ways to check normality and homoscedasticity. In R the easiest way is to use the plot() function. Let’s first look at the full sample and see if the extreme point hurts us in regards to our assumptions.
fit_full <- poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) 

par(mfrow = c(2,2))
plot(fit_full)

  1. In three of the four plots, point “9” is specifically labeled due to its extremeness. Let’s talk about each plot and why point “9” keeps showing up.
  2. The “Residuals vs Fitted” plot shows us a few things. First, we are looking to see if the red line is roughly along the dotted line at 0. This one isn’t too bad until the last point (which it turns out is point “9”). The second thing we look for is if the spread looks fairly random around the dotted line. Other than the extreme value, this is also pretty good. This suggests homoscedasticity holds and the residuals seem to be indpendent.
  3. The “Normal Q-Q” plot shows us, if the points are along the dotted line, that we have normality in our residuals. For the most part, the points are very much along the line. Point “9” may be a problem here as well but it definitely doesn’t seem to much of a problem here.
  4. The “Scale-Location” plot is similar ot the “Residuals vs Fitted” but shows the residuals in a slightly different way (the square root of the standardized residual). Point “9” is, again, an extreme value here.
  5. Finally, the “Residuals vs Leverage” plot shows us Cook’s Distance. Most points have very little leverage. But guess which point has a lot of leverage. Yep, point “9”.
  6. Let’s look at these diagnostic plots after removing this problematic point again.
fit <- poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) 

par(mfrow = c(2,2))
plot(fit)