Review Polynomials

To better understand the interpretation of polynomials, let’s consider three plausible examples:

1. Positive linear, negative quadratic

## 
## Call:
## lm(formula = health ~ exercise + I(exercise^2), data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.92735 -0.59522  0.00662  0.74671  2.22454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.13438    0.29274   3.875 0.000194 ***
## exercise       1.84092    0.17757  10.367  < 2e-16 ***
## I(exercise^2) -0.17177    0.02182  -7.873 4.98e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared:  0.6524, Adjusted R-squared:  0.6452 
## F-statistic: 91.02 on 2 and 97 DF,  p-value: < 2.2e-16

2. Negative linear, positive quadratic

## 
## Call:
## lm(formula = fail_exam ~ anxiety + I(anxiety^2), data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9224  -3.8807   0.0049   3.6602  11.9476 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.81460    1.85273   10.15  < 2e-16 ***
## anxiety      -4.97617    0.80918   -6.15 1.73e-08 ***
## I(anxiety^2)  1.02500    0.07529   13.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.08 on 97 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.9205 
## F-statistic: 574.3 on 2 and 97 DF,  p-value: < 2.2e-16

3. Both positive

## 
## Call:
## lm(formula = happiness ~ love_R + I(love_R^2), data = d3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4666  -3.8456  -0.0085   3.7124  14.2427 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.16588    1.43464   2.207   0.0297 *  
## love_R       3.39067    0.68967   4.916 3.59e-06 ***
## I(love_R^2)  0.47874    0.06727   7.117 1.91e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.209 on 97 degrees of freedom
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9583 
## F-statistic:  1139 on 2 and 97 DF,  p-value: < 2.2e-16

Interpretation

To interpret these models, we have three main options:

  1. Present the effect of the predictor.
  2. Present the results in terms of the scatterplot.
  3. Present the min/max point.

To understand these, we’ll apply each option to each example above.

1. Positive linear, negative quadratic

## 
## Call:
## lm(formula = health ~ exercise + I(exercise^2), data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.92735 -0.59522  0.00662  0.74671  2.22454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.13438    0.29274   3.875 0.000194 ***
## exercise       1.84092    0.17757  10.367  < 2e-16 ***
## I(exercise^2) -0.17177    0.02182  -7.873 4.98e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 97 degrees of freedom
## Multiple R-squared:  0.6524, Adjusted R-squared:  0.6452 
## F-statistic: 91.02 on 2 and 97 DF,  p-value: < 2.2e-16

To interpret this one, we can:

  1. State that the effect of exercise on health is 1.84 - 0.17 * 2 * exercise. This suggests that when exercise is low, exercise has a relatively large, positive effect on health. When exercise is already high (say, 7), the effect is small or even negative.
  2. Reference the scatterplot, showing the quadratic effect. In this context, we can say that for low levels of exercise, increasing exercise is very beneficial to health. However, this levels off (around 5 to 6 hours), and the curve has clearly turned downward by the time exercise reaches 8 or 9 hours.
  3. Calculate the value of exercise where health is at its maximum. Here, we solve \(1.84 - 0.17 * 2 * exercise = 0\) for exercise, which gives \(exercise = 5.4\). This can be seen pretty clearly in the figure, and it can be computed directly from the coefficients (see the sketch below).
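
To make options 1 and 3 concrete, here is a minimal sketch that recovers the marginal effect and the maximum from the fitted coefficients (assuming the model above was saved as mod_d1, a hypothetical name):

# A sketch, assuming mod_d1 <- lm(health ~ exercise + I(exercise^2), data = d1)
b <- coef(mod_d1)

# Marginal effect of exercise at a given level of exercise
slope_at <- function(x) b["exercise"] + 2 * b["I(exercise^2)"] * x
slope_at(c(1, 5, 7))                        # large positive, near zero, negative

# Value of exercise at the maximum of the curve
-b["exercise"] / (2 * b["I(exercise^2)"])   # about 5.4 hours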

2. Negative linear, positive quadratic

## 
## Call:
## lm(formula = fail_exam ~ anxiety + I(anxiety^2), data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9224  -3.8807   0.0049   3.6602  11.9476 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.81460    1.85273   10.15  < 2e-16 ***
## anxiety      -4.97617    0.80918   -6.15 1.73e-08 ***
## I(anxiety^2)  1.02500    0.07529   13.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.08 on 97 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.9205 
## F-statistic: 574.3 on 2 and 97 DF,  p-value: < 2.2e-16

To interpret this one, we can:

  1. State that the effect of anxiety on failing the exam is -4.97 + 1.03 * 2 * anxiety. This suggests that when anxiety is low, anxiety has a relatively large, negative effect on failing the exam. When anxiety is already high (say, 7), the effect is large and positive.
  2. Reference the scatterplot, showing the quadratic effect (a sketch of this plot follows the list). In this context, we can say that for low levels of anxiety, anxiety does not really have much of an effect. However, as anxiety increases, it has an increasingly strong, positive effect on failing the exam.
  3. Calculate the value of anxiety where failing the exam is at its minimum. Here, we solve \(-4.97 + 1.03 * 2 * anxiety = 0\) for anxiety, which gives \(anxiety = 2.4\). This can be seen pretty clearly in the figure.
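
For option 2, the scatterplot with the fitted quadratic overlaid can be drawn directly; a minimal ggplot2 sketch, using the d2 data and variable names from the output above:

library(ggplot2)

# Scatterplot with the fitted quadratic curve overlaid
ggplot(d2, aes(anxiety, fail_exam)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))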

3. Both positive

## 
## Call:
## lm(formula = happiness ~ love_R + I(love_R^2), data = d3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4666  -3.8456  -0.0085   3.7124  14.2427 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.16588    1.43464   2.207   0.0297 *  
## love_R       3.39067    0.68967   4.916 3.59e-06 ***
## I(love_R^2)  0.47874    0.06727   7.117 1.91e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.209 on 97 degrees of freedom
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9583 
## F-statistic:  1139 on 2 and 97 DF,  p-value: < 2.2e-16

To interpret this one, we can:

  1. State that the effect of loving R on happiness is 3.39 + .48 * 2 * loving R. This suggests that the effect of loving R only increases as loving R increases.
  2. Reference the scatterplot, showing the quadratic effect. In this context, the effect of loving R increases as loving R increases, with no sign of leveling off.
  3. Calculate the value of loving R where happiness is at its minimum. Here, that value is about -3.5, well below anything we actually measured, so, in this case, it is not very informative (the sketch below shows the computation).
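
A quick check of that out-of-range minimum from the coefficients (a sketch, assuming the model above was saved as mod_d3, a hypothetical name):

# A sketch, assuming mod_d3 <- lm(happiness ~ love_R + I(love_R^2), data = d3)
b <- coef(mod_d3)
-b["love_R"] / (2 * b["I(love_R^2)"])   # about -3.5, below the observed range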

Review Interactions

To better understand the interpretation of interactions, let’s consider four plausible examples:

1. Both dummy variables

## 
## Call:
## lm(formula = health ~ hike * diet, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3915 -0.4668 -0.0226  0.6060  1.9511 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9989     0.1993   5.012 2.46e-06 ***
## hike1        -1.1947     0.2913  -4.101 8.61e-05 ***
## diet1         0.9322     0.2679   3.480 0.000755 ***
## hike1:diet1   1.9081     0.4025   4.741 7.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9965 on 96 degrees of freedom
## Multiple R-squared:  0.519,  Adjusted R-squared:  0.504 
## F-statistic: 34.53 on 3 and 96 DF,  p-value: 3.173e-15

2. Continuous and dummy

## 
## Call:
## lm(formula = health ~ exercise * diet, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09249 -0.61860 -0.04806  0.57464  2.35702 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.12427    0.24837   4.527 1.72e-05 ***
## exercise       -1.02349    0.05904 -17.337  < 2e-16 ***
## diet1          -0.92545    0.35957  -2.574   0.0116 *  
## exercise:diet1  2.01677    0.08832  22.834  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8964 on 96 degrees of freedom
## Multiple R-squared:  0.9468, Adjusted R-squared:  0.9452 
## F-statistic:   570 on 3 and 96 DF,  p-value: < 2.2e-16

3. Both continuous

## 
## Call:
## lm(formula = health ~ exercise * hours_sleep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04429 -0.51506  0.01136  0.53872  2.85592 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.48299    0.64278   2.307   0.0232 *  
## exercise             -8.05161    0.14115 -57.044   <2e-16 ***
## hours_sleep          -1.05469    0.07477 -14.106   <2e-16 ***
## exercise:hours_sleep  2.00446    0.01646 121.764   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8871 on 96 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 2.997e+04 on 3 and 96 DF,  p-value: < 2.2e-16

4. Continuous and multicategorical

## 
## Call:
## lm(formula = health ~ hours_sleep * location, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40304 -0.65048 -0.07422  0.52715  2.19050 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.00735    0.85453  -2.349   0.0210 *  
## hours_sleep                     3.14136    0.09839  31.928  < 2e-16 ***
## locationSubmarine              -2.43410    1.13202  -2.150   0.0342 *  
## locationurban                  -9.58839    1.15442  -8.306 8.30e-13 ***
## locationrural                 -13.38270    0.99426 -13.460  < 2e-16 ***
## hours_sleep:locationSubmarine   0.97182    0.13658   7.115 2.38e-10 ***
## hours_sleep:locationurban       4.94159    0.13435  36.781  < 2e-16 ***
## hours_sleep:locationrural       6.88204    0.11771  58.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9059 on 92 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic:  9460 on 7 and 92 DF,  p-value: < 2.2e-16

Interpretation

To interpret these models, we have two main options:

  1. Present the effect of the predictor at various levels of the interactor (probing the interaction).
  2. Present the results in terms of the scatterplot (highly recommended).

To understand these, we’ll apply each option to each example above.

1. Both dummy variables

## 
## Call:
## lm(formula = health ~ hike * diet, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3915 -0.4668 -0.0226  0.6060  1.9511 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9989     0.1993   5.012 2.46e-06 ***
## hike1        -1.1947     0.2913  -4.101 8.61e-05 ***
## diet1         0.9322     0.2679   3.480 0.000755 ***
## hike1:diet1   1.9081     0.4025   4.741 7.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9965 on 96 degrees of freedom
## Multiple R-squared:  0.519,  Adjusted R-squared:  0.504 
## F-statistic: 34.53 on 3 and 96 DF,  p-value: 3.173e-15

Overall, there is a significant interactive effect between hiking and dieting on health.

  1. With two dummy variables, we can discuss the effect of one variable at each level of the other. Here, when diet = 0, the effect of hike is -1.19. When diet = 1, the effect of hike is (-1.19 + 1.91) = 0.71. Therefore, when the individual does not diet, hiking has a negative effect on health; when the individual does diet, the effect of hiking is positive. Alternatively, we could talk about the effect of diet at both levels of hike (0.93 when hike = 0; 0.93 + 1.91 = 2.84 when hike = 1). The sketch after this list recovers these simple effects from the coefficients.
  2. The plot shows that, overall, diet does seem to help, but it particularly helps when the individual hikes.
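
A minimal sketch of those simple effects, assuming the fitted model above was saved as mod_hd (a hypothetical name):

# Simple effects recovered from the coefficients
b <- coef(mod_hd)
b["hike1"]                      # effect of hike when diet = 0: -1.19
b["hike1"] + b["hike1:diet1"]   # effect of hike when diet = 1:  0.71
b["diet1"]                      # effect of diet when hike = 0:  0.93
b["diet1"] + b["hike1:diet1"]   # effect of diet when hike = 1:  2.84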

2. Continuous and dummy

## 
## Call:
## lm(formula = health ~ exercise * diet, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09249 -0.61860 -0.04806  0.57464  2.35702 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.12427    0.24837   4.527 1.72e-05 ***
## exercise       -1.02349    0.05904 -17.337  < 2e-16 ***
## diet1          -0.92545    0.35957  -2.574   0.0116 *  
## exercise:diet1  2.01677    0.08832  22.834  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8964 on 96 degrees of freedom
## Multiple R-squared:  0.9468, Adjusted R-squared:  0.9452 
## F-statistic:   570 on 3 and 96 DF,  p-value: < 2.2e-16

Overall, there is a significant interaction between exercise and diet on health.

  1. With a continuous and a dummy variable, we usually end up treating the dummy as the “moderator”. With that in mind, when diet (the dummy variable) is at the reference level (no diet), the effect of exercise is -1.02 (for a one-unit increase in exercise, there is an associated 1.02-unit decrease in health). When the individual is on a diet (diet = 1), the effect is -1.02 + 2.01 = .99. This corresponds to the figure, and the same simple slopes can be obtained with a convenience function (see the sketch after this list).
  2. The figure shows that, when the person is not on a diet, there is a negative effect of exercise, whereas when the individual is on a diet, the effect is positive.
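
A sketch using the interactions package to test the slope of exercise within each diet group (assuming the fitted model above was saved as mod_ed, a hypothetical name):

library(interactions)

# Simple slopes of exercise at each level of diet, with tests
sim_slopes(mod_ed, pred = exercise, modx = diet)

# A companion figure with one fitted line per diet group
interact_plot(mod_ed, pred = exercise, modx = diet)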

3. Both continuous

## 
## Call:
## lm(formula = health ~ exercise * hours_sleep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04429 -0.51506  0.01136  0.53872  2.85592 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.48299    0.64278   2.307   0.0232 *  
## exercise             -8.05161    0.14115 -57.044   <2e-16 ***
## hours_sleep          -1.05469    0.07477 -14.106   <2e-16 ***
## exercise:hours_sleep  2.00446    0.01646 121.764   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8871 on 96 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 2.997e+04 on 3 and 96 DF,  p-value: < 2.2e-16

Probing the interaction with the Johnson-Neyman technique (via probemod::jn) gives the conditional effect of exercise at each level of hours_sleep:

## Call:
## probemod::jn(model = mod1, dv = "health", iv = "exercise", mod = "hours_sleep")
## 
## Conditional effects of  exercise  on  health  at values of  hours_sleep 
##  hours_sleep  Effect     se        t      p    llci    ulci
##            1 -6.0471 0.1255 -48.2031 0.0000 -6.2962 -5.7981
##            2 -4.0427 0.1100 -36.7583 0.0000 -4.2610 -3.8243
##            3 -2.0382 0.0948 -21.4904 0.0000 -2.2265 -1.8499
##            4 -0.0338 0.0802  -0.4208 0.6749 -0.1930  0.1255
##            5  1.9707 0.0665  29.6411 0.0000  1.8387  2.1027
##            6  3.9752 0.0543  73.2402 0.0000  3.8674  4.0829
##            7  5.9796 0.0449 133.2640 0.0000  5.8905  6.0687
##            8  7.9841 0.0403 198.1920 0.0000  7.9041  8.0641

The model shows that there is a significant interaction between exercise and sleep on health (p < .001).

  1. Using the Johnson-Neyman output, we can see at which levels of hours of sleep exercise has a positive or negative effect. Here, until one gets about 4 hours of sleep (near where our data actually start), the effect of exercise on health is actually negative. At higher levels of sleep, the effect of exercise is positive. The crossover point can also be computed directly from the coefficients (see the sketch below).
  2. The scatterplot shows that at low levels of sleep, the effect of exercise is small. As hours of sleep increase, exercise has a more positive effect on health.
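
The conditional slope of exercise is -8.05 + 2.00 * hours_sleep, so the crossover is where that expression equals zero; a minimal sketch using mod1 (the model named in the jn() call above):

b <- coef(mod1)

# hours_sleep at which the slope of exercise crosses zero
-b["exercise"] / b["exercise:hours_sleep"]   # about 4.0 hours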

4. Continuous and multicategorical

## 
## Call:
## lm(formula = health ~ hours_sleep * location, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40304 -0.65048 -0.07422  0.52715  2.19050 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.00735    0.85453  -2.349   0.0210 *  
## hours_sleep                     3.14136    0.09839  31.928  < 2e-16 ***
## locationSubmarine              -2.43410    1.13202  -2.150   0.0342 *  
## locationurban                  -9.58839    1.15442  -8.306 8.30e-13 ***
## locationrural                 -13.38270    0.99426 -13.460  < 2e-16 ***
## hours_sleep:locationSubmarine   0.97182    0.13658   7.115 2.38e-10 ***
## hours_sleep:locationurban       4.94159    0.13435  36.781  < 2e-16 ***
## hours_sleep:locationrural       6.88204    0.11771  58.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9059 on 92 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic:  9460 on 7 and 92 DF,  p-value: < 2.2e-16

There is a significant interactive effect, where the effect of sleep on health depends on the location in which the individual lives (all interaction p values < .001).

  1. Because there are multiple levels, the interpretation naturally gets more complex. However, it is very similar to the continuous-and-dummy example above. That is, when the individual lives on the International Space Station (the reference group), the effect of sleep is 3.14 (for every one-hour increase in sleep, health increases by 3.14 units, on average). For those in a submarine, the effect is slightly larger (3.14 + .97 = 4.11). For those in an urban area, the effect is roughly 8 (3.14 + 4.94 = 8.08), and for those in a rural area, the effect is the largest at roughly 10 (3.14 + 6.88 = 10.02). With just this output, we do not know whether the effect of sleep in rural areas is larger than in urban areas, or any of the other comparisons that do not involve the reference group. That would require further post-hoc comparisons (see the sketch after this list).
  2. The figure shows that rural has the largest effect of sleep, followed by urban, then, much further down, submarine and the ISS.
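
Those post-hoc slope comparisons can be obtained with emmeans; a sketch, assuming the fitted model above was saved as mod_sl (a hypothetical name):

library(emmeans)

# Slope of hours_sleep within each location, plus all pairwise comparisons
emtrends(mod_sl, pairwise ~ location, var = "hours_sleep")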

Average Marginal Effects

Whenever a variable has a non-linear effect (polynomial, interaction, or soon-to-be-discussed GLM), average marginal effects (AMEs) can help simplify the interpretation by putting the effect in terms much like a regular coefficient: the effect of the predictor, averaged over the observed data.

For example, on average, the effect of exercise can be shown by doing the following:

library(margins)
library(magrittr)  # provides the %>% pipe used below

mod <- lm(health ~ exercise + I(exercise^2), 
          data = d1)
margins(mod) %>%
  summary()
##    factor    AME     SE       z      p  lower  upper
##  exercise 0.5944 0.0465 12.7789 0.0000 0.5032 0.6856

This means that, on average, for every one unit increase in exercise, there is a .59 unit increase in health (p < .001).
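
As a check on what the AME represents, it can be computed by hand as the average of the observation-level marginal effects (a sketch using the mod and d1 objects above):

b <- coef(mod)

# Average the slope b1 + 2 * b2 * exercise over the observed data
mean(b["exercise"] + 2 * b["I(exercise^2)"] * d1$exercise)   # about 0.59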