Introduction

Chapter 8 discusses how to assess the importance of an individual predictor or a set of predictors. The principles discussed here let us test whether two predictors in the same regression model have the same effect on the outcome.

Relative Importance of Predictors (Categorical)

  1. Let’s start by loading the tidyverse, furniture, and rlm packages.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000      ✔ purrr   0.2.5      
## ✔ tibble  1.4.2.9004      ✔ dplyr   0.7.99.9000
## ✔ tidyr   0.8.1           ✔ stringr 1.3.1      
## ✔ readr   1.2.0           ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(furniture)
## ── furniture 1.7.12 ────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
##    Consider using `furniture::` for each function call.
library(rlm)
## ── rlm 0.1.0 ───────────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ rlm attached
## ✔ No potential conflicts found
  1. Import the risk data set into R.
data(risk)
  1. We want to know, first, whether having been in foster care (dfoster == 1) or lacking water access in the home (water == 0) predicts how happy participants rate a negative image (neg3happy).
risk %>%
  lm(neg3happy ~ dfoster + water,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = neg3happy ~ dfoster + water, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34031 -0.17813 -0.05088 -0.05088  1.15520 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.21305    0.05403   3.943 0.000141 ***
## dfosterFoster  0.12726    0.05717   2.226 0.028026 *  
## waterYes      -0.16218    0.06643  -2.441 0.016196 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2753 on 112 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.066,  Adjusted R-squared:  0.04932 
## F-statistic: 3.957 on 2 and 112 DF,  p-value: 0.02185
  1. Because of how the dummy coding was done, the two estimates go in opposite directions, but that does not mean the predictors affect the outcome in opposite ways; it is just an artifact of the coding. To test whether one of these has a bigger effect than the other, we will use car’s linearHypothesis(). Instead of testing whether they are equal: \[ dfoster - water = 0 \] we will test: \[ dfoster - (-water) = 0 \] or, in other words, \[ dfoster + water = 0 \]
risk %>%
    lm(neg3happy ~ dfoster + water,
     data = .) %>%
  car::linearHypothesis("dfosterFoster + waterYes = 0")
## Linear hypothesis test
## 
## Hypothesis:
## dfosterFoster  + waterYes = 0
## 
## Model 1: restricted model
## Model 2: neg3happy ~ dfoster + water
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    113 8.5075                           
## 2    112 8.4881  1  0.019359 0.2554 0.6143
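This kind of single-constraint F test can also be understood as a comparison of two nested models. A minimal sketch (using simulated data with hypothetical variables x1 and x2, not the risk data): the hypothesis b1 + b2 = 0 implies b2 = -b1, so the restricted model replaces the two predictors with their difference, and anova() on the two fits should reproduce the same F statistic that car::linearHypothesis() reports.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 + 0.3 * x1 - 0.3 * x2 + rnorm(n)

full       <- lm(y ~ x1 + x2)
restricted <- lm(y ~ I(x1 - x2))  # imposes coef(x2) == -coef(x1)

# Model-comparison F test of the constraint b1 + b2 = 0
anova(restricted, full)
```

For a single linear constraint in OLS, this model-comparison F is identical to the Wald F that linearHypothesis() computes from the coefficients and their covariance matrix.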
  1. Is the effect of dfoster significantly larger or smaller than that of water?

Relative Importance of Predictors (Continuous)

  1. These types of comparisons are a bit easier with categorical predictors. When it comes to continuous predictors, we need to make sure they are in comparable units. Often, the best way to do this is to standardize the predictors first. To compare these, let’s use the poverty.xlsx data set (it can be downloaded from Canvas or from this site).
poverty <- readxl::read_excel("poverty.xlsx")
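Before standardizing, it may help to see exactly what base R’s scale() does: by default it subtracts the mean and divides by the standard deviation, returning a one-column matrix. A quick sketch with a toy vector:

```r
x <- c(2, 4, 6, 8)
z <- scale(x)  # (x - mean(x)) / sd(x), returned as a matrix

all.equal(as.numeric(z), (x - mean(x)) / sd(x))  # TRUE
mean(z)  # effectively 0
sd(as.numeric(z))  # 1 (up to floating point)
```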
  1. We want to look at whether poverty (poverty_pct) or violent crime (ViolentCrime) predicts the teen birth rate (TeenBirth). Let’s take a look at a multiple regression model.
poverty %>%
  lm(TeenBirth ~ poverty_pct + ViolentCrime,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ poverty_pct + ViolentCrime, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5183  -6.4833  -0.3196   6.5035  14.7125 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.6848     3.8686   4.571 3.41e-05 ***
## poverty_pct    1.6304     0.3119   5.227 3.70e-06 ***
## ViolentCrime   0.4037     0.1497   2.697  0.00962 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.329 on 48 degrees of freedom
## Multiple R-squared:  0.5611, Adjusted R-squared:  0.5428 
## F-statistic: 30.68 on 2 and 48 DF,  p-value: 2.608e-09
  1. It looks like poverty has a larger effect on the teen birth rate than violent crime does. Is the effect of poverty significantly different from the effect of violent crime?
poverty %>%
  lm(TeenBirth ~ poverty_pct + ViolentCrime,
     data = .) %>%
  car::linearHypothesis("poverty_pct = ViolentCrime")
## Linear hypothesis test
## 
## Hypothesis:
## poverty_pct - ViolentCrime = 0
## 
## Model 1: restricted model
## Model 2: TeenBirth ~ poverty_pct + ViolentCrime
## 
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1     49 3968.3                               
## 2     48 3329.9  1    638.44 9.203 0.003892 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Is this sufficient to say they are different? Why or why not?
  2. Let’s standardize the two predictors and see if this changes anything.
poverty %>%
  mutate(poverty_std = scale(poverty_pct),
         violent_std = scale(ViolentCrime)) %>%
  lm(TeenBirth ~ poverty_std + violent_std,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = TeenBirth ~ poverty_std + violent_std, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5183  -6.4833  -0.3196   6.5035  14.7125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.243      1.166  36.220  < 2e-16 ***
## poverty_std    6.974      1.334   5.227  3.7e-06 ***
## violent_std    3.598      1.334   2.697  0.00962 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.329 on 48 degrees of freedom
## Multiple R-squared:  0.5611, Adjusted R-squared:  0.5428 
## F-statistic: 30.68 on 2 and 48 DF,  p-value: 2.608e-09
poverty %>%
  mutate(poverty_std = scale(poverty_pct),
         violent_std = scale(ViolentCrime)) %>%
  lm(TeenBirth ~ poverty_std + violent_std,
     data = .) %>%
  car::linearHypothesis("poverty_std = violent_std")
## Linear hypothesis test
## 
## Hypothesis:
## poverty_std - violent_std = 0
## 
## Model 1: restricted model
## Model 2: TeenBirth ~ poverty_std + violent_std
## 
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     49 3481.0                          
## 2     48 3329.9  1    151.09 2.178 0.1465
  1. Now that the variables are in the same overall units (standard deviation units), are the effects significantly different?
  2. Which one, the unstandardized or the standardized, would be a more trustworthy comparison?
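A useful fact behind this comparison: when only the predictor is standardized (the outcome is left alone), the standardized slope is just the raw slope multiplied by sd(x). Standardization changes the units being compared, not the model fit. A minimal sketch with simulated data (hypothetical x and y, not the poverty data):

```r
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.4 * x + rnorm(100)

raw <- lm(y ~ x)
std <- lm(y ~ scale(x))

b_raw <- coef(raw)["x"]
b_std <- coef(std)[2]

# The standardized slope equals the raw slope times sd(x)
all.equal(unname(b_std), unname(b_raw * sd(x)))  # TRUE
```

This is why the t statistics and p-values in the standardized model above match the unstandardized one exactly; only the comparison between the two slopes changes.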

Dominance Analysis

  1. We can also do dominance analysis using a (still in development) package called dominanceanalysis. Let’s install and load the package.
#devtools::install_github("clbustos/dominanceAnalysis")
library(dominanceanalysis)
  1. We can see which of the two predictors (poverty_pct and ViolentCrime) is more dominant.
dominanceAnalysis(lm(TeenBirth ~ poverty_pct + ViolentCrime,
                     data = poverty))
## $predictors
## [1] "poverty_pct"  "ViolentCrime"
## 
## $constants
## NULL
## 
## $fit.functions
## [1] "r2"
## 
## $fits
## $fit.functions
## [1] "r2"
## 
## $fits
## $fits$r2
##                          poverty_pct ViolentCrime
## 1                          0.4946093   0.31129354
## poverty_pct                       NA   0.06651268
## ViolentCrime               0.2498284           NA
## poverty_pct+ViolentCrime          NA           NA
## 
## 
## $base.fits
##                                 r2
## 1                        0.0000000
## poverty_pct              0.4946093
## ViolentCrime             0.3112935
## poverty_pct+ViolentCrime 0.5611220
## 
## $level
## [1] 0 1 1 2
## 
## attr(,"class")
## [1] "daRawResults"
## 
## $contribution.by.level
## $contribution.by.level$r2
##   level poverty_pct ViolentCrime
## 1     0   0.4946093   0.31129354
## 2     1   0.2498284   0.06651268
## 
## 
## $contribution.average
## $contribution.average$r2
##  poverty_pct ViolentCrime 
##    0.3722189    0.1889031 
## 
## 
## $complete
## $complete$r2
##              poverty_pct ViolentCrime
## poverty_pct          0.5          1.0
## ViolentCrime         0.0          0.5
## 
## 
## $conditional
## $conditional$r2
##              poverty_pct ViolentCrime
## poverty_pct          0.5          1.0
## ViolentCrime         0.0          0.5
## 
## 
## $general
## $general$r2
##              poverty_pct ViolentCrime
## poverty_pct          0.5          1.0
## ViolentCrime         0.0          0.5
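The $contribution.average values in this output can be reproduced by hand. With two predictors, the general dominance weight of x1 is the average of its R² alone (level 0) and its increment to R² when added to x2 (level 1), and the two weights always sum to the full model’s R². A sketch with simulated data (hypothetical x1 and x2, not the poverty data):

```r
set.seed(7)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # correlated predictors, as in real data
y  <- x1 + 0.5 * x2 + rnorm(n)

r2 <- function(fit) summary(fit)$r.squared
r2_1    <- r2(lm(y ~ x1))        # x1 alone
r2_2    <- r2(lm(y ~ x2))        # x2 alone
r2_full <- r2(lm(y ~ x1 + x2))   # both predictors

# General dominance: average the level-0 and level-1 contributions
gen_x1 <- mean(c(r2_1, r2_full - r2_2))
gen_x2 <- mean(c(r2_2, r2_full - r2_1))

c(gen_x1, gen_x2, gen_x1 + gen_x2)  # the weights sum to r2_full
```

In the output above, this corresponds to averaging the two rows of $contribution.by.level$r2 within each column: (0.4946 + 0.2498)/2 ≈ 0.3722 for poverty_pct.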

Conclusion

Comparing predictors within the same regression model is often of substantive interest. Knowing how, and when, to make these comparisons can be a great benefit to your research.