Introduction

Chapter 12 introduces the idea of modeling nonlinear relationships using linear regression. This is one of the many ways regression is flexible to many important situations. A nice, but incomplete, list of useful functions can be found here.

Nonlinear Relationships

  1. Let’s start by loading the tidyverse package, the furniture package, and the educ7610 package.
library(tidyverse)
library(furniture)
library(educ7610)
  1. Import the poli data set into R.
data(poli)

Descriptives and Visualizations

  1. Let’s get some descriptive statistics of the variables of interest: political knowledge (pknow), age (age), news use (npnews), and SES (ses).
poli %>%
  table1(pknow, age, npnews, ses,
         type = "condense")
## 
## ──────────────────────────
##         Mean/Count (SD/%)
##         n = 340          
##  pknow  11.3 (4.4)       
##  age    44.9 (14.6)      
##  npnews 3.6 (2.8)        
##  ses    0.0 (0.8)        
## ──────────────────────────
  1. Let’s look at the relationship between political knowledge (pknow) and age (age).
poli %>%
  ggplot(aes(age, pknow)) +
    geom_point() +
    geom_smooth()

Polynomial Regression

  1. Let’s model the relationship that we see in question 5. That is, pknow regressed on age where we also include the quadratic effect of age. To add the quadratic effect directly in the model we can use I(age^2).
poli %>%
  lm(pknow ~ age + I(age^2),
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age + I(age^2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  6.0258076  2.0174926   2.987  0.00303 **
## age          0.2128720  0.0868390   2.451  0.01474 * 
## I(age^2)    -0.0019190  0.0008812  -2.178  0.03011 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. Is there evidence that there is a quadratic effect of age on political knowledge?
  2. We could similarly create a new variable and put that in the model instead as below.
poli %>%
  mutate(age2 = age^2) %>%
  lm(pknow ~ age + age2,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age + age2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  6.0258076  2.0174926   2.987  0.00303 **
## age          0.2128720  0.0868390   2.451  0.01474 * 
## age2        -0.0019190  0.0008812  -2.178  0.03011 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. With these results, what is the peak age for political knowledge? How do we know that this is a peak (max) and not a min?
  2. We haven’t mean-centered age, yet. Let’s do that and re-run the analysis.
poli %>%
  mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
  lm(pknow ~ age_centered + I(age_centered^2),
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.7160277  0.3004971  38.989   <2e-16 ***
## age_centered       0.0404406  0.0172795   2.340   0.0198 *  
## I(age_centered^2) -0.0019190  0.0008812  -2.178   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. What changed from the mean-centered model as compared to the previous one?
  2. Does the peak age change from these results compared to what you computed earlier?
  3. Finally, let’s control for some covariates (SES and news use).
poli %>%
  mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
  lm(pknow ~ age_centered + I(age_centered^2) + ses + npnews,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2) + ses + 
##     npnews, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7681  -2.3222   0.1151   2.4382  10.3364 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.3023034  0.3729136  27.627  < 2e-16 ***
## age_centered       0.0231441  0.0155226   1.491 0.136904    
## I(age_centered^2) -0.0000341  0.0007796  -0.044 0.965139    
## ses                2.4877946  0.2648857   9.392  < 2e-16 ***
## npnews             0.2834595  0.0778497   3.641 0.000314 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.707 on 335 degrees of freedom
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2814 
## F-statistic: 34.19 on 4 and 335 DF,  p-value: < 2.2e-16
  1. Is age or age\(^2\) significant any more? Why would it change so much?
  2. Let’s look at the relationship of age with political knowledge after controlling for SES and news use.
poli %>%
  mutate(pknow_sesnpnews = resid(lm(pknow ~ ses + npnews))) %>%
  ggplot(aes(age, pknow_sesnpnews)) +
    geom_point() +
    geom_smooth()

  1. Does this plot look like what the model is telling us?
  2. Let’s remove the quadratic effect of age since it doesn’t look important once we are controlling for SES and news use.
poli %>%
  mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
  lm(pknow ~ age_centered + ses + npnews,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age_centered + ses + npnews, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7828  -2.3256   0.1127   2.4347  10.3429 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.29587    0.34221  30.086  < 2e-16 ***
## age_centered  0.02294    0.01475   1.555 0.121001    
## ses           2.49079    0.25549   9.749  < 2e-16 ***
## npnews        0.28323    0.07756   3.652 0.000302 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.701 on 336 degrees of freedom
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2835 
## F-statistic: 45.72 on 3 and 336 DF,  p-value: < 2.2e-16
  1. Using this final model, what is the interpretation of this model? Include the interpretation of the covariates as well. (Note: the relationship between pknow and npnews is likely non-linear as well but we won’t look into that here; see spline regression below.)

Spline Regression

  1. Based on the plot below, there may be two joints: one at npnews == 1.5 and npnews == 6. Let’s try these two and see what we find.
poli %>%
  ggplot(aes(npnews, pknow)) +
    geom_point() +
    geom_smooth()