EDUC/PSY 7610

Chapter 12 introduces the idea of modeling nonlinear relationships using linear regression. This is one of the many ways regression is flexible to many important situations. A nice, but incomplete, list of useful functions can be found here.

- Let’s start by loading the
`tidyverse`

package, the`furniture`

package, and the`educ7610`

package.

```
library(tidyverse)
library(furniture)
library(educ7610)
```

- Import the
`poli`

data set into R.

`data(poli)`

- Let’s get some descriptive statistics of the variables of interest: political knowledge (
`pknow`

), age (`age`

), news use (`npnews`

), and SES (`ses`

).

```
poli %>%
table1(pknow, age, npnews, ses,
type = "condense")
```

```
##
## ──────────────────────────
## Mean/Count (SD/%)
## n = 340
## pknow 11.3 (4.4)
## age 44.9 (14.6)
## npnews 3.6 (2.8)
## ses 0.0 (0.8)
## ──────────────────────────
```

- Let’s look at the relationship between political knowledge (
`pknow`

) and age (`age`

).

```
poli %>%
ggplot(aes(age, pknow)) +
geom_point() +
geom_smooth()
```

- Let’s model the relationship that we see in question 5. That is,
`pknow`

regressed on`age`

where we also include the quadratic effect of`age`

. To add the quadratic effect directly in the model we can use`I(age^2)`

.

```
poli %>%
lm(pknow ~ age + I(age^2),
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = pknow ~ age + I(age^2), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4090 -3.2661 0.1098 3.1515 9.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0258076 2.0174926 2.987 0.00303 **
## age 0.2128720 0.0868390 2.451 0.01474 *
## I(age^2) -0.0019190 0.0008812 -2.178 0.03011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared: 0.02191, Adjusted R-squared: 0.01611
## F-statistic: 3.775 on 2 and 337 DF, p-value: 0.02392
```

- Is there evidence that there is a quadratic effect of age on political knowledge?
- We could similarly create a new variable and put that in the model instead as below.

```
poli %>%
mutate(age2 = age^2) %>%
lm(pknow ~ age + age2,
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = pknow ~ age + age2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4090 -3.2661 0.1098 3.1515 9.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0258076 2.0174926 2.987 0.00303 **
## age 0.2128720 0.0868390 2.451 0.01474 *
## age2 -0.0019190 0.0008812 -2.178 0.03011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared: 0.02191, Adjusted R-squared: 0.01611
## F-statistic: 3.775 on 2 and 337 DF, p-value: 0.02392
```

- With these results, what is the peak age for political knowledge? How do we know that this is a peak (max) and not a min?
- We haven’t mean-centered age, yet. Let’s do that and re-run the analysis.

```
poli %>%
mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
lm(pknow ~ age_centered + I(age_centered^2),
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4090 -3.2661 0.1098 3.1515 9.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7160277 0.3004971 38.989 <2e-16 ***
## age_centered 0.0404406 0.0172795 2.340 0.0198 *
## I(age_centered^2) -0.0019190 0.0008812 -2.178 0.0301 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.338 on 337 degrees of freedom
## Multiple R-squared: 0.02191, Adjusted R-squared: 0.01611
## F-statistic: 3.775 on 2 and 337 DF, p-value: 0.02392
```

- What changed from the mean-centered model as compared to the previous one?
- Does the peak age change from these results compared to what you computed earlier?
- Finally, let’s control for some covariates (SES and news use).

```
poli %>%
mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
lm(pknow ~ age_centered + I(age_centered^2) + ses + npnews,
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = pknow ~ age_centered + I(age_centered^2) + ses +
## npnews, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7681 -2.3222 0.1151 2.4382 10.3364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3023034 0.3729136 27.627 < 2e-16 ***
## age_centered 0.0231441 0.0155226 1.491 0.136904
## I(age_centered^2) -0.0000341 0.0007796 -0.044 0.965139
## ses 2.4877946 0.2648857 9.392 < 2e-16 ***
## npnews 0.2834595 0.0778497 3.641 0.000314 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.707 on 335 degrees of freedom
## Multiple R-squared: 0.2899, Adjusted R-squared: 0.2814
## F-statistic: 34.19 on 4 and 335 DF, p-value: < 2.2e-16
```

- Is age or age\(^2\) significant any more? Why would it change so much?
- Let’s look at the relationship of age with political knowledge after controlling for SES and news use.

```
poli %>%
mutate(pknow_sesnpnews = resid(lm(pknow ~ ses + npnews))) %>%
ggplot(aes(age, pknow_sesnpnews)) +
geom_point() +
geom_smooth()
```

- Does this plot look like what the model is telling us?
- Let’s remove the quadratic effect of age since it doesn’t look important once we are controlling for SES and news use.

```
poli %>%
mutate(age_centered = age - mean(age, na.rm = TRUE)) %>%
lm(pknow ~ age_centered + ses + npnews,
data = .) %>%
summary()
```

```
##
## Call:
## lm(formula = pknow ~ age_centered + ses + npnews, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7828 -2.3256 0.1127 2.4347 10.3429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.29587 0.34221 30.086 < 2e-16 ***
## age_centered 0.02294 0.01475 1.555 0.121001
## ses 2.49079 0.25549 9.749 < 2e-16 ***
## npnews 0.28323 0.07756 3.652 0.000302 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.701 on 336 degrees of freedom
## Multiple R-squared: 0.2899, Adjusted R-squared: 0.2835
## F-statistic: 45.72 on 3 and 336 DF, p-value: < 2.2e-16
```

- Using this final model, what is the interpretation of this model? Include the interpretation of the covariates as well. (Note: the relationship between
`pknow`

and`npnews`

is likely non-linear as well but we won’t look into that here; see spline regression below.)

- Based on the plot below, there may be two joints: one at
`npnews == 1.5`

and`npnews == 6`

. Let’s try these two and see what we find.

```
poli %>%
ggplot(aes(npnews, pknow)) +
geom_point() +
geom_smooth()
```