Chapter 2 introduces simple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 2.
Download the GSS_reduced_example.csv data set from Canvas or tysonbarrett.com/teaching to your computer. Save it in a directory you can access fairly easily. Then load the tidyverse package (you can ignore the notes that you see below that it gives you once you load it).

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Import the data set with read.csv().

gss <- read.csv("GSS_reduced_example.csv")
Let's start by looking at the effect of education (educ) on income (income06).

gss %>%
  lm(income06 ~ educ,
     data = .)
##
## Call:
## lm(formula = income06 ~ educ, data = .)
##
## Coefficients:
## (Intercept) educ
## 1742 4127
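One way to explore what the slope represents is to compare the model's predictions one year of education apart (a quick sketch that refits the same model; the object name fit_ex is just for illustration):

```r
# Refit the same model and store it (fit_ex is an illustrative name)
fit_ex <- lm(income06 ~ educ, data = gss)

# Predicted income at 12 vs. 13 years of education
preds <- predict(fit_ex, newdata = data.frame(educ = c(12, 13)))

# The difference between adjacent predictions is the slope of educ
diff(preds)
```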
Notice the coefficients for the intercept and educ. What does the slope of educ mean here (i.e., interpret the value)?

Simple regression and correlation are intimately tied. Let's show that below.
gss %>%
  furniture::tableC(income06, educ,
                    na.rm = TRUE)
## N = 1774
## Note: pearson correlation (p-value).
##
## ─────────────────────────────────
## [1] [2]
## [1]income06 1.00
## [2]educ 0.341 (<.001) 1.00
## ─────────────────────────────────
We can also run the regression on standardized variables using scale(). We first have to grab just the complete cases of the variables (consider why).

gss %>%
  filter(complete.cases(income06, educ)) %>%
  mutate(incomeZ = scale(income06) %>% as.numeric,
         educZ = scale(educ) %>% as.numeric) %>%
  lm(incomeZ ~ educZ,
     data = .)
##
## Call:
## lm(formula = incomeZ ~ educZ, data = .)
##
## Coefficients:
## (Intercept) educZ
## -2.466e-16 3.409e-01
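Notice that the standardized slope matches the Pearson correlation reported in the table earlier. A quick sketch of that check (the object name d is just for illustration):

```r
# Complete cases of the two variables (d is an illustrative name)
d <- gss %>%
  filter(complete.cases(income06, educ))

# Pearson correlation should match the standardized slope (~0.341)
cor(d$income06, d$educ)
```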
Residuals can help us understand several things about the model and the relations we are assessing. Most of this we'll talk about later, but here are some ways we'll access the residuals.
First, let's save the model as an object; here we call it fit (or any other name you want to use).

fit <- gss %>%
  lm(income06 ~ educ,
     data = .)
We can use resid() to produce the residuals of the model. (Note that head() is used just to show the first six values instead of all two thousand.)

resid(fit) %>%
  head()
## 1 2 3 4 5 11
## -43763.331 35722.413 -10390.113 -5509.767 -18763.331 88102.759
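As a check on what these numbers are: each residual is the observed value minus the model's fitted (predicted) value, so fitted plus residual recovers the observed incomes.

```r
# fitted() + resid() reproduces the observed (non-missing) income values
head(fitted(fit) + resid(fit))
```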
Finally, we can produce diagnostic plots of the model with plot(). You don't need to know what the four plots mean yet, but know that it uses the residuals from the model to create them.

plot(fit)
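By default, plot() on an lm object shows the four diagnostic plots one at a time; if you'd rather see them all on one screen, you can set the plotting layout first (a base-graphics sketch):

```r
par(mfrow = c(2, 2))  # 2x2 grid: all four diagnostic plots at once
plot(fit)
par(mfrow = c(1, 1))  # reset the layout
```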
This was an introduction to many of the features of regression that we'll use throughout the class. Although there isn't much of a workflow here yet, each piece will play a role in larger analyses.