class: center, middle, inverse, title-slide

# Linear Regression
## Cohen Chapter 10
.small[EDUC/PSY 6600]

---
class: center, middle

## Fit the analysis to the data,
## *not* the data to the analysis.

#### - Statistical Maxim

---
## Motivating Example

.large[
> Dr. Ramsey conducts a .bluer[**non-experimental (observational) study**] to evaluate what she refers to as the 'strength-injury hypothesis.'
]

It states that overall body strength in elderly women determines the number and severity of accidents that cause bodily injury. If the results support her hypothesis, she plans to conduct an .bluer[**experimental (randomized) study**] to assess whether weight training reduces injuries in elderly women.

--

- Data from .nicegreen[100 women] who range in age from 60 to 70 years old are collected.

- The women initially undergo a series of measures that assess upper and lower body strength. These measures are summarized into an .nicegreen[**overall index of body strength**].

- Over the next 5 years, the women record each time they have an accident that results in a bodily injury and describe fully the extent of the injury. On the basis of these data, Dr. Ramsey calculates an .nicegreen[**overall injury index**] for each woman.

--

.large[
> A .dcoral[**simple regression analysis**] is conducted with the .nicegreen[overall index of body strength] as the **predictor (independent) variable** and the .nicegreen[overall injury index] as the **outcome (dependent) variable**.
]

---

.pull-left[

## .nicegreen[Correlation]

> Chapter 9

**Pairwise Relationship**

- Variable order doesn't matter
- No DV or IV, just a pair of variables

**Symbols**

- Population Parameter = `\(\rho\)`
- Sample Statistic = `\(r\)`

**Language**

- *Evidence that `\(X_1\)` and `\(X_2\)` are .nicegreen[associated]*
- *There is a .nicegreen[correlation] between `\(X_1\)` and `\(X_2\)`*
]

--

.pull-right[

## .dcoral[Regression]

> Chapter 10

**Directional Prediction**

- Y = Outcome, Dependent Variable (DV)
- X = Predictor, Independent Variable (IV)

**Symbols**

- Population Parameter = `\(\beta_1\)`
- Sample Statistic = `\(b_1\)`

**Language**

- *.dcoral[Predicting] Y from X*
- *.dcoral[Regressing] Y on X*
]

---

<!-- 365 Data Science: The Differences Between Correlation and Regression (5 min)-->

<iframe width="1000" height="750" src="https://www.youtube.com/embed/qA-x72ju2kM?controls=0&start=0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
## The Line of Best Fit

- The simple, linear relationship is usually not perfect *(points off the line)*
- Regression coefficients .nicegreen[( `\(b_0\)` , `\(b_1\)` )] are computed to **minimize error** as much as possible

--

.pull-left[

**Residuals = Error**

- The difference between observed `\(\large Y\)` and predicted `\(\large \hat{Y}\)`
- Residuals: `\(\large e_i = Y_i - \hat{Y}_i\)`

**Technique**

- Ordinary Least Squares (OLS) regression

**Goal**

- Minimize `\(\large SS_{error}\)` ( `\(\large SS_{residuals}\)` )

`$$\large SS_{residuals} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$`
]

--

.pull-right[
<img src="ch10_reg_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

## The Line of Best Fit

- The simple, linear relationship is usually not perfect *(points off the line)*
- Regression coefficients .nicegreen[( `\(b_0\)` , `\(b_1\)` )] are computed to **minimize error** as much as possible

.pull-left[

**Residuals = Error**

- The difference between observed `\(\large Y\)` and predicted `\(\large \hat{Y}\)`
- Residuals: `\(\large e_i = Y_i - \hat{Y}_i\)`

**Technique**

- Ordinary Least Squares (OLS) regression

**Goal**

- Minimize `\(\large SS_{error}\)` ( `\(\large SS_{residuals}\)` )

`$$\large SS_{residuals} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$`
]

.pull-right[
<img src="ch10_reg_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" />
]
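---
## R Code: Residuals, a Sketch

As a minimal illustration of the quantity OLS minimizes, the residuals from a fitted line can be pulled out with `resid()` and squared-and-summed by hand. This sketch assumes a data frame `df` with columns `x` and `y`, the same object used in the R code later in these slides:

```r
fit <- df %>% lm(y ~ x, data = .)   # fit the least-squares line

e <- resid(fit)    # e_i = Y_i - Y-hat_i, one residual per observation
sum(e^2)           # SS_residuals: the value OLS makes as small as possible
```

No other straight line through the scatterplot yields a smaller sum of squared residuals than the one `lm()` returns.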
---

<!-- Sketchy EBM: Simple Regression (8 min)-->

<iframe width="1000" height="750" src="https://www.youtube.com/embed/SYY_BPciXPw?controls=0&start=12" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
## The Regression Equation

.coral[
`$$\LARGE \hat{Y}_i = b_0 + b_1 X_i$$`
]

--

.center[
`\(\Large X_i\)` = value of the predictor for a given case i

`\(\Large \hat{Y}_i\)` = predicted (unobserved) value of Y for a given case i
]

--

.pull-left[
.bluer[
.large[Y-Intercept]

- Predicted `\(\hat{Y}\)` when X = 0
- Interpreted only if `\(X = 0\)` is meaningful

`$$\LARGE b_0 = \bar{Y} - b_1 \times \bar{X}$$`
]]

--

.pull-right[
.nicegreen[
.large[Slope of the Regression Line]

- Rate of change, "rise over run"
- Predicted change in Y for every 1-unit change in X

`$$\LARGE b_1 = r \times \frac{s_Y}{s_X}$$`
]]

---

.pull-leftsmall[

### Example 1)

**Predictor (IV)**

*M* = 4.093
*SD* = 1.31

**Outcome (DV)**

*M* = 14.290
*SD* = 1.66

**Correlation**

*r* = .764
]

.pull-rightbig[
<img src="ch10_reg_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-left[
Scatterplot:

* 1 point per observation (person)
]

---
count: false

<img src="ch10_reg_files/figure-html/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" />

.pull-left[
Scatterplot:

* 1 point per observation (person)
* Dashed lines = mean values
* Cross at the "Point of Averages"
]

---
count: false

<img src="ch10_reg_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" />

.pull-left[
- Correlation = 0.764
- Slope = `\(b_1 = r \frac{s_Y}{s_X} = .764 \frac{1.66}{1.31} = .968\)`
- Intercept = `\(b_0 = \bar{Y} - b_1 \bar{X} = 14.290 - (.968 \times 4.093) = 10.328\)`
]

---
count: false

<img src="ch10_reg_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" />

.pull-left[
- Correlation = 0.764
- Slope = `\(b_1 = r \frac{s_Y}{s_X} = .764 \frac{1.66}{1.31} = .968\)`
- Intercept = `\(b_0 = \bar{Y} - b_1 \bar{X} = 14.290 - (.968 \times 4.093) = 10.328\)`
]

.pull-right[
`\(\LARGE SS_{total}\)`
]

---
count: false

<img src="ch10_reg_files/figure-html/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" />

.pull-left[
- Correlation = 0.764
- Slope = `\(b_1 = r \frac{s_Y}{s_X} = .764 \frac{1.66}{1.31} = .968\)`
- Intercept = `\(b_0 = \bar{Y} - b_1 \bar{X} = 14.290 - (.968 \times 4.093) = 10.328\)`
]

.pull-right[
`\(\LARGE SS_{total} = SS_{explained}\)`
]

---
count: false

<img src="ch10_reg_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" />

.pull-left[
- Correlation = 0.764
- Slope = `\(b_1 = r \frac{s_Y}{s_X} = .764 \frac{1.66}{1.31} = .968\)`
- Intercept = `\(b_0 = \bar{Y} - b_1 \bar{X} = 14.290 - (.968 \times 4.093) = 10.328\)`
]

.pull-right[
`\(\LARGE SS_{total} = SS_{explained} + SS_{unexplained}\)`
]
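---
## R Code: Example 1 by Hand

A quick sketch of the Example 1 arithmetic, using only the summary statistics from the slide (the means, SDs, and *r* are taken from the example; no raw data are needed):

```r
r     <- 0.764    # correlation between X and Y
s_x   <- 1.31     # SD of the predictor
s_y   <- 1.66     # SD of the outcome
x_bar <- 4.093    # mean of the predictor
y_bar <- 14.290   # mean of the outcome

(b_1 <- r * s_y / s_x)          # slope:     0.968
(b_0 <- y_bar - b_1 * x_bar)    # intercept: 10.328
```

Wrapping an assignment in parentheses prints the result, so `b_1` and `b_0` appear on screen as they are created.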
---
# Explaining Variance

`$$\huge SS_{total} = SS_{explained} + SS_{unexplained}$$`

Synonyms:

* Explained = Regression
* Unexplained = Residual or Error

--

Coefficient of Determination:

`\(\LARGE r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SS_{regression}}{SS_{total}}\)`

--

- Computed to determine how well the regression equation predicts `Y` from `X`

--

- Ranges from 0 to 1

--

- Each SS divided by its corresponding df gives the Mean Square (Regression or Error)

--

- .nicegreen[The proportion of variance in the outcome "accounted for" or "attributable to" or "predictable from" or "explained by" the predictor]

---
# Again, Always **Visualize** Data First

### Scatterplots

.pull-left[

```r
library(tidyverse)

df %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm")
```
]

--

.pull-right[
<img src="ch10_reg_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />
]

---
## R Code: Regression

```r
df %>%
  lm(y ~ x, data = .) %>%
  summary()
```

--

```
Call:
lm(formula = y ~ x, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.10376 -0.56125  0.05069  0.65004  2.15932 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01762    0.09888  -0.178    0.859    
x            0.95964    0.09696   9.897   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9849 on 98 degrees of freedom
Multiple R-squared:  0.4999,  Adjusted R-squared:  0.4948 
F-statistic: 97.95 on 1 and 98 DF,  p-value: < 2.2e-16
```

---
## R Code: Regression

```r
df %>%
  lm(y ~ x, data = .) %>%
  confint()
```

```
                 2.5 %    97.5 %
(Intercept) -0.2138558 0.1786119
x            0.7672237 1.1520547
```

---
## R Code: Regression

```r
df %>%
  lm(y ~ x, data = .) %>%
  coef()
```

```
(Intercept)           x 
-0.01762194  0.95963917 
```

---
## R Code: Regression

```r
coef1 <- df %>%
  lm(y ~ x, data = .) %>%
  coef()

confint1 <- df %>%
  lm(y ~ x, data = .) %>%
  confint()

cbind(coef1, confint1)
```

```
                  coef1      2.5 %    97.5 %
(Intercept) -0.01762194 -0.2138558 0.1786119
x            0.95963917  0.7672237 1.1520547
```

---
## R Code: Predicted Values

```r
df %>%
  lm(y ~ x, data = .) %>%
  predict()
```

```
          1           2           3           4           5           6 
-1.66331253 -1.58805266 -0.37685641 -0.36001934 -1.82554446  1.96902590 
          7           8           9          10          11          12 
-1.44361263 -2.20795037 -1.52382088  0.13823564  0.40028777  1.32040382 
         13          14          15          16          17          18 
 1.44610197  1.17018122 -1.18462186 -0.31876293  0.14390364 -0.85728422 
         19          20          21          22          23          24 
 0.83163117 -1.23725243 -0.44710577  0.31680345  0.02232455  0.52088462 
         25          26          27          28          29          30 
 0.58236193 -0.26353990 -0.42729936 -0.75393890  0.77690375  0.51344384 
         31          32          33          34          35          36 
-0.06357724 -0.45745486 -1.74608438 -2.49312908  0.33677392  0.78885811 
         37          38          39          40          41          42 
 0.71086918  1.21521941  0.51198239  1.54369860 -0.12583856 -0.53196921 
         43          44          45          46          47          48 
-0.47371349  0.78368856 -0.23333494  0.69249078 -0.58503655  1.15183741 
         49          50          51          52          53          54 
-0.37269614  1.31926133  1.83331799  1.12312817 -0.73687320 -0.83905733 
         55          56          57          58          59          60 
-0.86453074  1.00706828  0.31675496 -0.02004292  0.25398427  0.18068270 
         61          62          63          64          65          66 
 0.14073354 -0.54997360 -0.92320440  0.64703366 -1.18437957 -0.95163084 
         67          68          69          70          71          72 
-0.34907933 -1.41066739 -0.35410481 -1.73774589 -0.63731976 -1.14844358 
         73          74          75          76          77          78 
-0.03006009  0.60503890 -1.06444906  0.48064443 -0.14387606  0.53813732 
         79          80          81          82          83          84 
 1.16031740 -0.56722153  0.56855506  0.78077100 -0.15258351 -0.90656360 
         85          86          87          88          89          90 
-0.97827569  1.00308499  1.07858284 -0.14627409  0.89076854 -0.72505744 
         91          92          93          94          95          96 
-0.91505384  1.31514942  0.38043146 -0.13967989  1.62183758 -1.96937621 
         97          98          99         100 
 0.06360765 -0.90747048 -0.16306626 -1.66877101 
```
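---
## R Code: Explained vs. Unexplained, a Sketch

The `\(SS_{total} = SS_{explained} + SS_{unexplained}\)` decomposition can be checked directly from a fitted model. A minimal sketch, again assuming the `df` data used above:

```r
fit <- df %>% lm(y ~ x, data = .)

ss_total      <- sum((df$y - mean(df$y))^2)   # total variation in Y
ss_residual   <- sum(resid(fit)^2)            # unexplained (error) variation
ss_regression <- ss_total - ss_residual       # explained (regression) variation

ss_regression / ss_total   # r-squared; should match "Multiple R-squared: 0.4999"
```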
---
# Assumptions

- .nicegreen[Independence] of observations
    - SRS = simple random sample

--

- .nicegreen[Straight line] best fits the data
    - linear, not curved, etc.

--

- .bluer[Y is **CONDITIONALLY** normally distributed, with constant variance, around the line]
    - Does NOT apply to Y overall
    - DOES apply to Y values, FOR A GIVEN X value
    - Check with residual diagnostics
    - Does NOT apply to predictor variable(s) X --> can be categorical or continuous
    - Sampling .dcoral[distribution of the slope] ( `\(\huge b_1\)` ) is assumed normally distributed

---
background-image: url(figures/fig_assumptions_reg.png)
background-position: 50% 50%
background-size: 1200px

# Assumptions

---
# R Code: Assumption - Normal Residuals

```r
df %>%
  lm(y ~ x, data = .) %>%
  plot(which = 2)
```

<img src="ch10_reg_files/figure-html/unnamed-chunk-18-1.png" width="50%" style="display: block; margin: auto;" />

---
# R Code: Assumptions - Normal Residuals

```r
df %>%
  lm(y ~ x, data = .) %>%
  resid() %>%
  hist()
```

<img src="ch10_reg_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Let's Apply This to the Cancer Dataset

---
# Read in the Data

```r
library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(furniture)    # for tableC()
```

```r
cancer_raw <- haven::read_spss("cancer.sav")
```

### And Clean It

```r
cancer_clean <- cancer_raw %>%
  dplyr::rename_all(tolower) %>%
  dplyr::mutate(id = factor(id)) %>%
  dplyr::mutate(trt = factor(trt,
                             labels = c("Placebo",
                                        "Aloe Juice"))) %>%
  dplyr::mutate(stage = factor(stage))
```

---
## R Code: Regression

.pull-left[

```r
cancer_clean %>%
  lm(totalcin ~ age, data = .) %>%
  summary()
```
]

.pull-right[

```
Call:
lm(formula = totalcin ~ age, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0463 -0.6825 -0.4097  0.6510  5.2266 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  4.71197    1.45471   3.239  0.00362 **
age          0.03032    0.02386   1.271  0.21657   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.512 on 23 degrees of freedom
Multiple R-squared:  0.06559,  Adjusted R-squared:  0.02496 
F-statistic: 1.614 on 1 and 23 DF,  p-value: 0.2166
```
]

---
### R Code: Standardized

```r
cancer_clean %>%
  mutate(totalcinZ = scale(totalcin),
         ageZ      = scale(age)) %>%
  lm(totalcinZ ~ ageZ, data = .) %>%
  summary()
```

```
Call:
lm(formula = totalcinZ ~ ageZ, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3367 -0.4458 -0.2676  0.4253  3.4143 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.442e-16  1.975e-01   0.000    1.000
ageZ        2.561e-01  2.016e-01   1.271    0.217

Residual standard error: 0.9874 on 23 degrees of freedom
Multiple R-squared:  0.06559,  Adjusted R-squared:  0.02496 
F-statistic: 1.614 on 1 and 23 DF,  p-value: 0.2166
```

---
# R Code: Correlation vs. Standardized

.pull-left[

```r
cancer_clean %>%
  cor.test(~ totalcin + age, data = .)
```
]

.pull-right[

```r
cancer_clean %>%
  mutate(totalcinZ = scale(totalcin),
         ageZ      = scale(age)) %>%
  lm(totalcinZ ~ ageZ, data = .) %>%
  summary()
```
]

---
# R Code: Correlation vs. Standardized

.pull-left[

```
Pearson's product-moment correlation

data:  totalcin and age
t = 1.2706, df = 23, p-value = 0.2166
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1546769  0.5913913
sample estimates:
      cor 
0.2561066 
```
]

.pull-right[

```
Call:
lm(formula = totalcinZ ~ ageZ, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3367 -0.4458 -0.2676  0.4253  3.4143 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.442e-16  1.975e-01   0.000    1.000
ageZ        2.561e-01  2.016e-01   1.271    0.217

Residual standard error: 0.9874 on 23 degrees of freedom
Multiple R-squared:  0.06559,  Adjusted R-squared:  0.02496 
F-statistic: 1.614 on 1 and 23 DF,  p-value: 0.2166
```
]
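---
## R Code: Checking the Cancer Model's Residuals

The residual diagnostics shown earlier for `df` apply to the cancer model as well. A brief sketch (`which = 1` requests the residuals-vs-fitted plot, used to eyeball the straight-line and constant-variance assumptions):

```r
cancer_clean %>%
  lm(totalcin ~ age, data = .) %>%
  plot(which = 1)    # residuals vs. fitted values
```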
---
class: inverse, center, middle

# Questions?

---
class: inverse, center, middle

# Next Topic

### Matched T-Test

---
class: inverse, center, middle

# Supplemental Slides

---
## Accuracy of Prediction

- All points do not fall on the regression line
- Prediction works for most, but not all, cases in the sample

--

- Without knowledge of X, the best prediction of Y is the mean `\(\Large \bar{Y}\)`
    - `\(\Large s_Y\)` : best measure of prediction error

--

- With knowledge of X, the best prediction of Y comes from the equation: `\(\Large \hat{Y}\)`
    - Standard error of estimate (SEE or `\(s_{Y \cdot X}\)` ): best measure of prediction error
    - Estimated SD of the residuals in the population

---
# Accuracy of Prediction

.pull-left[

### .nicegreen[Standard Error of Estimate]

<br>

`$$\large s_{Y \cdot X} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{df}}$$`
]

--

.pull-right[

### .dcoral[Residual or Error Variance <br> or Mean Square Error]

`$$\large s_{Y \cdot X}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2} = \frac{SS_{residual}}{df}$$`
]

--

<br>

.large[
`\(df = N - 2\)` (2 df are lost in estimating the regression coefficients)

We seek the smallest `\(\LARGE s_{Y \cdot X}\)`, as it is a measure of the variation of observations around the regression line
]

---
# Standardized Coefficients (Beta weights)

- A 1 SD-unit change in X predicts a `\(\beta\)` SD change in `Y`

--

- The intercept = 0 and is not reported when using `\(\beta\)`

--

- For simple regression only --> `\(r = \beta\)` and `\(r^2 = \beta^2\)`
    - When raw scores are transformed into z-scores: `\(r = b = \beta\)`

--

- Useful for variables with an abstract unit of measure
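---
## R Code: Standard Error of Estimate, a Sketch

The SEE can be computed by hand from the residuals or pulled from a fitted model. A minimal sketch, assuming the `df` data from the earlier slides (`sigma()` is the base-R accessor for the residual standard error of an `lm` fit):

```r
fit <- df %>% lm(y ~ x, data = .)

sqrt(sum(resid(fit)^2) / df.residual(fit))   # by hand: sqrt(SS_residual / (N - 2))
sigma(fit)                                   # same value, reported by summary() as
                                             # "Residual standard error: 0.9849"
```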