Introduction

Chapter 12 introduces the idea of modeling nonlinear relationships using linear regression. This is one of the many ways regression is flexible to many important situations. A nice, but incomplete, list of useful functions can be found here.

Nonlinear Relationships

  1. Let’s start by loading the tidyverse package, the furniture package, and the rlm package.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.5     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.6     
## ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(furniture)
## ── furniture 1.7.12 ────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ furniture attached
## ✖ The furniture::table1() function has the same name as tidyr::table1 (tbl_df)
##    Consider using `furniture::` for each function call.
library(rlm)
## ── rlm 0.1.0 ───────────────────────────────────────────────────────────────────────── learn more at tysonbarrett.com ──
## ✔ rlm attached
## ✔ No potential conflicts found
  1. Import the poli data set into R.
data(poli)

Descriptives and Visualizations

  1. Let’s get some descriptive statistics of the variables of interest: political knowledge (pknow), age (age), news use (npnews), and SES (ses).
poli %>%
  table1(pknow, age, npnews, ses,
         type = "condense")
## 
## ──────────────────────────
##         Mean/Count (SD/%)
##         n = 340          
##  pknow  11.3 (4.4)       
##  age    44.9 (14.6)      
##  npnews 3.6 (2.8)        
##  ses    0.0 (0.8)        
## ──────────────────────────
  1. Let’s look at the relationship between political knowledge (pknow) and age (age).
poli %>%
  ggplot(aes(age, pknow)) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Polynomial Regression

  1. Let’s model the relationship that we see in question 5. That is, pknow regressed on age where we also include the quadratic effect of age. To add the quadratic effect directly in the model we can use I(age^2).
poli %>%
  lm(pknow ~ age + I(age^2),
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age + I(age^2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  6.0258076  2.0174926   2.987  0.00303 **
## age          0.2128720  0.0868390   2.451  0.01474 * 
## I(age^2)    -0.0019190  0.0008812  -2.178  0.03011 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. Is there evidence that there is a quadratic effect of age on political knowledge?
  2. We could similarly create a new variable and put that in the model instead as below.
poli %>%
  mutate(age2 = age^2) %>%
  lm(pknow ~ age + age2,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age + age2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  6.0258076  2.0174926   2.987  0.00303 **
## age          0.2128720  0.0868390   2.451  0.01474 * 
## age2        -0.0019190  0.0008812  -2.178  0.03011 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. With these results, what is the peak age for political knowledge? How do we know that this is a peak (max) and not a min?
  2. We haven’t mean-centered age, yet. Let’s do that and re-run the analysis.
poli %>%
  mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
  lm(pknow ~ age_centered + I(age_centered^2),
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -3.2661   0.1098   3.1515   9.5183 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.7160277  0.3004971  38.989   <2e-16 ***
## age_centered       0.0404406  0.0172795   2.340   0.0198 *  
## I(age_centered^2) -0.0019190  0.0008812  -2.178   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared:  0.02191,    Adjusted R-squared:  0.01611 
## F-statistic: 3.775 on 2 and 337 DF,  p-value: 0.02392
  1. What changed from the mean-centered model as compared to the previous one?
  2. Does the peak age change from these results compared to what you computed earlier?
  3. Finally, let’s control for some covariates (SES and news use).
poli %>%
  mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
  lm(pknow ~ age_centered + I(age_centered^2) + ses + npnews,
     data = .) %>%
  summary()
## 
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2) + ses + 
##     npnews, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7681  -2.3222   0.1151   2.4382  10.3364 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.3023034  0.3729136  27.627  < 2e-16 ***
## age_centered       0.0231441  0.0155226   1.491 0.136904    
## I(age_centered^2) -0.0000341  0.0007796  -0.044 0.965139    
## ses                2.4877946  0.2648857   9.392  < 2e-16 ***
## npnews             0.2834595  0.0778497   3.641 0.000314 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.707 on 335 degrees of freedom
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2814 
## F-statistic: 34.19 on 4 and 335 DF,  p-value: < 2.2e-16
  1. Is age or age\(^2\) significant any more? Why would it change so much?
  2. Let’s look at the relationship of age with political knowledge after controlling for SES and news use.
poli %>%
  mutate(pknow_sesnpnews = resid(lm(pknow ~ ses + npnews))) %>%
  ggplot(aes(age, pknow_sesnpnews)) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'