Chapter 16 covers dealing with irregularities in the data, including violations of the model's assumptions. Chapter 17 discusses some miscellaneous topics, which we'll also cover here.
For these chapters, we'll use the tidyverse package, the furniture package, and the educ7610 package.

library(tidyverse)
library(furniture)
library(educ7610)
Let's load the poverty data set, found in the educ7610 package, into R.

data(poverty)
poverty
## # A tibble: 51 x 6
## State poverty_pct `birthrate_15-1… `birthrate_18-1… ViolentCrime
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 20.1 31.5 88.7 11.2
## 2 Alas… 7.1 18.9 73.7 9.1
## 3 Ariz… 16.1 35 102. 10.4
## 4 Arka… 14.9 31.6 102. 10.4
## 5 Cali… 16.7 22.6 69.1 11.2
## 6 Colo… 8.8 26.2 79.1 5.8
## 7 Conn… 9.7 14.1 45.1 4.6
## 8 Dela… 10.3 24.7 77.8 3.5
## 9 Dist… 22 44.8 102. 65
## 10 Flor… 16.2 23.2 78.4 7.3
## # … with 41 more rows, and 1 more variable: TeenBirth <dbl>
For diagnostics() to work, we have to put the actual data in the data argument of the lm() function, as we do below.

lm(TeenBirth ~ ViolentCrime + poverty_pct,
   data = poverty) %>%
  educ7610::diagnostics() %>%
  round(3)
## TeenBirth ViolentCrime poverty_pct pred resid dfb.1_ dfb.VlnC
## 1 54.5 11.2 20.1 54.978 -0.478 0.011 0.004
## 2 39.5 9.1 7.1 32.934 6.566 0.215 0.109
## 3 61.2 10.4 16.1 48.133 13.067 -0.084 -0.011
## 4 59.9 10.4 14.9 46.177 13.723 -0.017 0.025
## 5 41.1 11.2 16.7 49.434 -8.334 0.071 0.003
## 6 47.0 5.8 8.8 34.374 12.626 0.294 0.063
## 7 25.8 4.6 9.7 35.357 -9.557 -0.176 -0.002
## 8 46.3 3.5 10.3 35.891 10.409 0.160 -0.037
## 9 69.1 65.0 22.0 79.793 -10.693 0.693 -9.397
## 10 44.5 7.3 16.2 47.045 -2.545 0.020 0.020
## 11 55.7 9.5 12.1 41.248 14.452 0.150 0.086
## 12 38.2 4.7 10.3 36.376 1.824 0.029 -0.002
## 13 39.1 4.1 14.5 42.981 -3.881 0.009 0.044
## 14 42.2 10.3 12.4 42.060 0.140 0.001 0.001
## 15 44.6 8.0 9.6 36.566 8.034 0.163 0.065
## 16 32.5 1.8 12.2 38.303 -5.803 -0.038 0.066
## 17 43.0 6.2 10.8 37.796 5.204 0.074 0.007
## 18 51.0 7.2 14.7 44.559 6.441 -0.011 -0.031
## 19 58.1 17.0 19.7 56.667 1.433 -0.028 0.009
## 20 25.4 2.0 11.2 36.753 -11.353 -0.124 0.101
## 21 35.4 11.8 10.1 38.916 -3.516 -0.070 -0.054
## 22 23.3 3.6 11.0 37.073 -13.773 -0.173 0.068
## 23 34.8 8.5 12.2 41.007 -6.207 -0.058 -0.021
## 24 27.5 3.9 9.2 34.259 -6.759 -0.137 0.002
## 25 64.7 12.9 23.5 61.208 3.492 -0.144 -0.045
## 26 44.1 8.8 9.4 36.563 7.537 0.163 0.078
## 27 36.4 3.0 15.3 43.842 -7.442 0.046 0.117
## 28 37.0 2.9 9.6 34.508 2.492 0.045 -0.008
## 29 53.9 10.7 11.1 40.102 13.798 0.211 0.151
## 30 20.0 1.8 5.3 27.053 -7.053 -0.268 -0.027
## 31 26.8 5.1 7.8 32.461 -5.661 -0.153 -0.031
## 32 62.4 8.8 25.3 62.487 -0.087 0.005 0.003
## 33 29.5 8.5 16.5 48.018 -18.518 0.172 0.115
## 34 52.2 9.4 12.6 42.023 10.177 0.082 0.046
## 35 27.2 0.9 12.0 37.613 -10.413 -0.073 0.137
## 36 39.5 5.4 11.5 38.615 0.885 0.010 -0.002
## 37 58.0 12.2 17.1 50.490 7.510 -0.074 0.008
## 38 36.8 4.1 11.2 37.601 -0.801 -0.009 0.003
## 39 31.6 6.3 12.2 40.119 -8.519 -0.071 0.012
## 40 35.6 3.3 10.6 36.300 -0.700 -0.010 0.003
## 41 53.0 7.9 19.9 53.320 -0.320 0.008 0.005
## 42 38.0 1.8 14.5 42.053 -4.053 0.013 0.067
## 43 54.3 10.6 15.5 47.236 7.064 -0.026 0.006
## 44 64.4 9.0 17.4 49.688 14.712 -0.190 -0.103
## 45 36.8 3.9 8.4 32.955 3.845 0.091 0.006
## 46 24.2 2.2 10.3 35.366 -11.166 -0.166 0.073
## 47 37.6 7.6 10.2 37.383 0.217 0.004 0.001
## 48 33.0 5.1 12.5 40.124 -7.124 -0.047 0.034
## 49 45.5 4.9 16.7 46.891 -1.391 0.016 0.020
## 50 32.3 4.3 8.5 33.279 -0.979 -0.023 -0.002
## 51 39.9 2.1 12.2 38.424 1.476 0.010 -0.016
## dfb.pvr_ dffit cov.r cook.d hat
## 1 -0.014 -0.017 1.154 0.000 0.077
## 2 -0.200 0.234 1.105 0.018 0.076
## 3 0.148 0.282 0.933 0.026 0.029
## 4 0.078 0.262 0.912 0.022 0.023
## 5 -0.110 -0.190 1.032 0.012 0.034
## 6 -0.232 0.327 0.953 0.035 0.042
## 7 0.120 -0.214 1.010 0.015 0.032
## 8 -0.089 0.221 0.990 0.016 0.029
## 9 1.614 -10.032 3.272 25.618 0.864
## 10 -0.038 -0.058 1.096 0.001 0.034
## 11 -0.094 0.275 0.893 0.024 0.023
## 12 -0.018 0.038 1.093 0.000 0.028
## 13 -0.040 -0.083 1.083 0.002 0.030
## 14 -0.001 0.003 1.091 0.000 0.023
## 15 -0.133 0.193 1.041 0.012 0.037
## 16 -0.012 -0.122 1.063 0.005 0.029
## 17 -0.046 0.102 1.066 0.004 0.026
## 18 0.051 0.122 1.050 0.005 0.024
## 19 0.031 0.048 1.142 0.001 0.069
## 20 0.032 -0.240 0.971 0.019 0.029
## 21 0.064 -0.093 1.102 0.003 0.045
## 22 0.075 -0.279 0.913 0.025 0.026
## 23 0.030 -0.111 1.050 0.004 0.021
## 24 0.095 -0.160 1.059 0.009 0.036
## 25 0.169 0.186 1.231 0.012 0.146
## 26 -0.139 0.192 1.053 0.012 0.041
## 27 -0.114 -0.187 1.054 0.012 0.041
## 28 -0.028 0.056 1.096 0.001 0.034
## 29 -0.174 0.310 0.916 0.031 0.032
## 30 0.224 -0.273 1.111 0.025 0.087
## 31 0.125 -0.163 1.090 0.009 0.052
## 32 -0.006 -0.006 1.367 0.000 0.221
## 33 -0.292 -0.447 0.786 0.061 0.034
## 34 -0.041 0.183 0.988 0.011 0.021
## 35 -0.022 -0.233 0.993 0.018 0.032
## 36 -0.004 0.016 1.089 0.000 0.023
## 37 0.105 0.180 1.049 0.011 0.037
## 38 0.004 -0.015 1.092 0.000 0.025
## 39 0.022 -0.150 1.017 0.008 0.021
## 40 0.005 -0.014 1.095 0.000 0.028
## 41 -0.011 -0.012 1.163 0.000 0.084
## 42 -0.052 -0.100 1.092 0.003 0.039
## 43 0.057 0.140 1.044 0.007 0.026
## 44 0.285 0.391 0.901 0.048 0.043
## 45 -0.069 0.101 1.099 0.003 0.044
## 46 0.081 -0.246 0.977 0.020 0.031
## 47 -0.003 0.005 1.099 0.000 0.031
## 48 0.000 -0.128 1.039 0.005 0.022
## 49 -0.028 -0.038 1.116 0.000 0.047
## 50 0.017 -0.025 1.112 0.000 0.043
## 51 0.003 0.030 1.094 0.000 0.028
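The output above includes Cook's distance (cook.d) and hat (leverage) values for every observation. As a sketch of how you might flag the influential rows programmatically, here is one common approach using base R's cooks.distance() and hatvalues(); the cutoffs (4/n for Cook's distance, 2p/n for leverage) are conventional rules of thumb, not values prescribed by this chapter.

```r
# A sketch: flag influential observations using rule-of-thumb cutoffs.
# The cutoffs (4/n and 2p/n) are common conventions, not from this chapter.
library(dplyr)

fit_all <- lm(TeenBirth ~ ViolentCrime + poverty_pct, data = poverty)
n <- nrow(poverty)
p <- length(coef(fit_all))   # number of estimated coefficients (incl. intercept)

poverty %>%
  mutate(cook = cooks.distance(fit_all),
         hat  = hatvalues(fit_all)) %>%
  filter(cook > 4 / n | hat > 2 * p / n) %>%
  select(State, TeenBirth, ViolentCrime, poverty_pct, cook, hat)
```

The District of Columbia row stands far above both cutoffs, matching the extreme cook.d (25.618) and hat (0.864) in the printout above.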
## With the extreme value
poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5183 -6.4833 -0.3196 6.5035 14.7125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.6848 3.8686 4.571 3.41e-05 ***
## ViolentCrime 0.4037 0.1497 2.697 0.00962 **
## poverty_pct 1.6304 0.3119 5.227 3.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.329 on 48 degrees of freedom
## Multiple R-squared: 0.5611, Adjusted R-squared: 0.5428
## F-statistic: 30.68 on 2 and 48 DF, p-value: 2.608e-09
## Without the extreme value
poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .) %>%
  summary()
##
## Call:
## lm(formula = TeenBirth ~ ViolentCrime + poverty_pct, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.363 -5.052 0.958 4.745 13.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3419 3.4315 4.471 4.9e-05 ***
## ViolentCrime 1.6327 0.3352 4.871 1.3e-05 ***
## poverty_pct 1.1905 0.2941 4.048 0.000192 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.278 on 47 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.6211
## F-statistic: 41.17 on 2 and 47 DF, p-value: 4.668e-11
In R, the easiest way to check the assumptions visually is to use the plot() function. Let's first look at the full sample and see if the extreme point hurts us in regard to our assumptions.

fit_full <- poverty %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .)
par(mfrow = c(2,2))
plot(fit_full)
Now let's look at the same diagnostic plots without the extreme value.

fit <- poverty %>%
  filter(State != "District_of_Columbia") %>%
  lm(TeenBirth ~ ViolentCrime + poverty_pct,
     data = .)
par(mfrow = c(2,2))
plot(fit)
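If you only want one of the four diagnostic plots, plot.lm()'s which argument selects it. For example, assuming the fit object defined above, which = 4 gives just the Cook's distance plot:

```r
# Select individual diagnostic plots with plot.lm()'s `which` argument:
#   1 = residuals vs. fitted, 2 = normal Q-Q, 3 = scale-location,
#   4 = Cook's distance, 5 = residuals vs. leverage
plot(fit, which = 4)   # Cook's distance for each observation
```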
Below, let's go over a few examples of both measurement and specification error. First, we'll compare models of TeenBirth and ViolentCrime when one, the other, or both have more measurement error; the noisier versions are labeled TeenBirth_ME and ViolentCrime_ME.

set.seed(843)
poverty_ME <- poverty %>%
  filter(State != "District_of_Columbia") %>%
  mutate(TeenBirth_ME = TeenBirth + runif(50, 0, 20),
         ViolentCrime_ME = ViolentCrime + runif(50, 0, 10))
no_me <- poverty_ME %>%
  lm(TeenBirth ~ ViolentCrime, data = .)
me_out <- poverty_ME %>%
  lm(TeenBirth_ME ~ ViolentCrime, data = .)
me_pred <- poverty_ME %>%
  lm(TeenBirth ~ ViolentCrime_ME, data = .)
me_both <- poverty_ME %>%
  lm(TeenBirth_ME ~ ViolentCrime_ME, data = .)
We can compare the four models side by side with texreg::screenreg().

texreg::screenreg(list(no_me, me_out, me_pred, me_both),
                  custom.model.names = c("No ME", "Outcome ME", "Predictor ME", "Both ME"))
##
## ===============================================================
## No ME Outcome ME Predictor ME Both ME
## ---------------------------------------------------------------
## (Intercept) 26.05 *** 37.43 *** 28.22 *** 39.10 ***
## (2.51) (2.74) (4.27) (4.29)
## ViolentCrime 2.33 *** 2.00 ***
## (0.33) (0.36)
## ViolentCrime_ME 1.14 ** 0.99 **
## (0.34) (0.34)
## ---------------------------------------------------------------
## R^2 0.51 0.39 0.19 0.15
## Adj. R^2 0.50 0.38 0.18 0.13
## Num. obs. 50 50 50 50
## RMSE 8.36 9.14 10.74 10.79
## ===============================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
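The pattern in the table, where the predictor's slope shrinks when it is measured with error (2.33 down to 1.14), is classical attenuation. As a self-contained sketch (not from the chapter, with an assumed seed and a large n so the estimates are stable), the expected shrinkage factor is the reliability ratio var(x) / (var(x) + var(error)):

```r
# Sketch: predictor measurement error attenuates the slope toward zero
# by the reliability ratio var(x) / (var(x) + var(error)).
set.seed(1)                        # hypothetical seed, not from the chapter
n    <- 1e5
x    <- rnorm(n, sd = 2)           # true predictor, variance = 4
y    <- 3 * x + rnorm(n)           # true slope = 3
x_me <- x + rnorm(n, sd = 2)       # measured with error, error variance = 4

reliability <- 4 / (4 + 4)                 # = 0.5
b_att <- coef(lm(y ~ x_me))["x_me"]
b_att                                      # close to 3 * reliability = 1.5
```

Because the outcome's measurement error lands in the residual instead, outcome ME mostly inflates the standard errors rather than biasing the slope, which matches the "Outcome ME" column above.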
Next, let's simulate some data to explore specification error.

df <- tibble(
  x1 = rnorm(100),
  x2 = .1*x1 + rnorm(100),
  x3 = rnorm(100),
  y = x1 + x2 + rnorm(100)
)
In these data, x1 and x2 both predict y with the same weight of 1, while x3 is unrelated to y. Let's see what happens if we don't leave anything out.

df %>%
  lm(y ~ x1 + x2, data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57720 -0.59077 0.00064 0.61818 1.98121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16679 0.09778 -1.706 0.0913 .
## x1 1.06816 0.10966 9.741 4.86e-16 ***
## x2 1.00611 0.09607 10.473 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9755 on 97 degrees of freedom
## Multiple R-squared: 0.7087, Adjusted R-squared: 0.7027
## F-statistic: 118 on 2 and 97 DF, p-value: < 2.2e-16
What if we leave out x2 (an omitted variable)? What happens to the relationship between x1 and y?

df %>%
  lm(y ~ x1, data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9625 -0.7536 -0.0439 0.8467 3.8244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1788 0.1420 -1.259 0.211
## x1 1.2212 0.1578 7.737 9.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.417 on 98 degrees of freedom
## Multiple R-squared: 0.3792, Adjusted R-squared: 0.3729
## F-statistic: 59.87 on 1 and 98 DF, p-value: 9.173e-12
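The jump in the x1 slope (from about 1.07 to 1.22) follows the omitted-variable-bias formula: the bias on x1 equals the true x2 slope times the slope from regressing the omitted x2 on the included x1. Here that is about 1 × 0.1 = 0.1 in expectation, with the rest due to sampling noise. A self-contained sketch (with a hypothetical seed and a larger n so the estimates settle near their expectations):

```r
# Sketch of the omitted-variable-bias formula:
#   bias on x1  =  (true slope of omitted x2) * (slope of x2 regressed on x1)
set.seed(843)                     # seed chosen here for reproducibility only
n  <- 1e5
x1 <- rnorm(n)
x2 <- .1 * x1 + rnorm(n)
y  <- x1 + x2 + rnorm(n)

b_short <- coef(lm(y ~ x1))["x1"]   # biased slope from the short model
b_aux   <- coef(lm(x2 ~ x1))["x1"]  # slope of omitted on included
b_short - 1                          # observed bias, close to ...
1 * b_aux                            # ... the formula's prediction (about 0.1)
```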
If we include the irrelevant variable x3, what does it do to the model?

df %>%
  lm(y ~ x1 + x2 + x3, data = .) %>%
  summary()
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48582 -0.66864 -0.00192 0.63545 1.99531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16466 0.09791 -1.682 0.0959 .
## x1 1.07463 0.11001 9.769 4.67e-16 ***
## x2 0.99642 0.09677 10.297 < 2e-16 ***
## x3 -0.08670 0.09699 -0.894 0.3736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9765 on 96 degrees of freedom
## Multiple R-squared: 0.7111, Adjusted R-squared: 0.702
## F-statistic: 78.75 on 3 and 96 DF, p-value: < 2.2e-16
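Notice that adding x3 barely changes the x1 and x2 estimates or their standard errors; it mainly spends a degree of freedom. A self-contained sketch (with an assumed seed and a larger n, not from the chapter) confirming that an irrelevant predictor leaves the other slopes unbiased:

```r
# Sketch: including an irrelevant predictor (x3) does not bias the other
# slopes; it costs a degree of freedom and at most slightly inflates SEs.
set.seed(1)                       # hypothetical seed, not from the chapter
n  <- 1e5
x1 <- rnorm(n)
x2 <- .1 * x1 + rnorm(n)
x3 <- rnorm(n)                    # unrelated to y
y  <- x1 + x2 + rnorm(n)

b_both <- coef(lm(y ~ x1 + x2 + x3))[c("x1", "x2")]
b_both                            # both close to the true value of 1
```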
These topics are consistently important in any regression analysis. We always want to assess the data: check whether anything looks odd, whether individual points are influential on the results, and whether other errors (measurement or specification) are driving the conclusions.