class: center, middle, inverse, title-slide

# Correlation
## Cohen Chapter 9
.small[EDUC/PSY 6600]

---
class: center, middle

## "Statistics is not a discipline like physics, chemistry, or biology where we study a subject to solve problems in the same subject. <br> We study statistics with the main aim of solving problems in other disciplines."

### -- C.R. Rao, Ph.D.

---
# Motivating Example

.large[
> Dr. Mortimer is interested in knowing whether people who have a **positive view of themselves in one aspect** of their lives .dcoral[**also tend to**] have a **positive view of themselves in other aspects** of their lives.
]

--

- He has .bluer[**80 men**] complete a **self-concept inventory** that contains .dcoral[**5 scales**].

--

- **Four** scales involve questions about how competent respondents feel in the areas of:
    + .nicegreen[**intimate relationships**],
    + .nicegreen[**relationships with friends**],
    + .nicegreen[**common sense reasoning and everyday knowledge**], and
    + .nicegreen[**academic reasoning and scholarly knowledge**].

--

- The **5th scale** includes items about how competent a person feels .nicegreen[**in general**].

--

.large[
> .dcoral[**10 pairwise correlations**] are computed between all possible pairs of variables.
]

---
# Correlation

.large[
> Interested in the .dcoral[**degree** of covariation] or .dcoral[**co-relation**] between .dcoral[**TWO** variables] measured on the **SAME** objects/participants
]

--

- Not interested in group differences, per se
- Data can be in raw or standardized format
- Correlation coefficient is .nicegreen[scale-invariant]

--

**Type** of variable makes a difference (see the sketch below):

- Interval/Ratio Continuous: Correlation (Pearson's product-moment correlation)
- Ordinal: Correlation (Spearman, Tetrachoric/Polychoric, Kendall's Tau, ...)
- Nominal: Association or Dependence (e.g., Cramer's V)

--

Can test for **significance** of the correlation

- `\(\Large H_0\)`: population correlation coefficient = 0
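--

A minimal sketch of how the `method` argument to base R's `cor()` selects among these coefficients (the vectors `x` and `y` here are hypothetical, made up for illustration):

```r
set.seed(42)                     # hypothetical data, for illustration only
x <- rnorm(25)
y <- 0.5 * x + rnorm(25)

cor(x, y, method = "pearson")    # interval/ratio (the default)
cor(x, y, method = "spearman")   # ordinal: rank-based
cor(x, y, method = "kendall")    # ordinal: Kendall's tau
```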
---
## Always **Visualize** Data First

### Scatterplots

.pull-left[
*AKA: scatter diagrams, scattergrams*

Notes:

1. Each subject is represented by 1 dot (an x and y coordinate)
2. A fit line can indicate the .dcoral[nature] and .dcoral[degree of the relationship] (regression or prediction lines)
3. Can **stratify** scatterplots by .nicegreen[subgroups]

```r
df %>%
  ggplot(aes(`x`, `y`)) +
  geom_point() +
  geom_smooth(method = "lm")
```
]

.pull-right[
<img src="ch9_cor_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
## Correlation: Direction

.pull-left[.center[
### .nicegreen[Positive Association]

**High values** of one variable tend to occur with **High values** of the other

<img src="ch9_cor_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]]

--

.pull-right[.center[
### .bluer[Negative Association]

**High values** of one variable tend to occur with **Low values** of the other

<img src="ch9_cor_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]]

---
## Correlation: Strength / Predictability

How much **variation** or **scatter** is there around the main form?

.pull-left[.center[
### STRONG

X provides a GOOD estimate of Y

<img src="ch9_cor_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]]

--

.pull-right[.center[
### WEAK

X is associated with a wide range of Y

<img src="ch9_cor_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]]

---
## Scatterplot: Pattern/Form

.pull-left[
<img src="ch9_cor_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
<img src="ch9_cor_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="ch9_cor_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
<img src="ch9_cor_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

---
## Scatterplot: Scale

.pull-left[
<img src="ch9_cor_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
<img src="ch9_cor_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="ch9_cor_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
<img src="ch9_cor_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

--

> Note: all four panels show the same data! Also, `ggplot2`'s defaults are usually pretty good.

???

- Using an inappropriate scale for a scatterplot can give an incorrect impression.
- Both variables should be given a similar amount of space:
    - Plot roughly square
    - Points should occupy all the plot space (no blank space)

---
## Scatterplot: Bivariate Outliers/Leverage

.pull-left[
- An .dcoral[outlier] is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).
- In a scatterplot, BIVARIATE outliers are points that fall outside of the **overall pattern** of the relationship.
- Not all extreme values are outliers.
- No data should be "thrown out" unless there is a good reason: a data-entry error, etc.
]

.pull-right[
<img src="ch9_cor_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

---
## Pearson "Product Moment" Correlation Coefficient

> Building block for many other statistical methods

.pull-left[
Used as a measure of:

- **Magnitude** (strength)
- **Direction** of relationship
- Between **2 CONTINUOUS** variables
- Around a **STRAIGHT** regression line

Applications: Validity & Reliability

- Test-retest
- Alternative forms
- Split-half reliability
]

--

.pull-right[
**Symbols:**

Population: `\(\large \rho\)`

Sample: `r`

**Properties:**

- `x` & `y` are interchangeable
- Has no units
- Ranges from `-1` through `+1`
- `r` = 0 is no correlation
- Influenced by outliers
]

---
## Correlation: Calculating Formula

$$\LARGE r = \frac{1}{n - 1} \sum^n_{i = 1} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

### Anyone want to do this by hand??

.large[.dcoral[
Let's use R to do this for us
]]
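--

A minimal sketch of this formula in R (with hypothetical vectors `x` and `y`, made up for illustration), confirming it matches the built-in `cor()`:

```r
set.seed(1)                       # hypothetical data, for illustration only
x <- rnorm(30)
y <- 0.4 * x + rnorm(30)
n <- length(x)

# Average the products of the z-scores (dividing by n - 1)
r_by_hand <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

r_by_hand
cor(x, y)                         # same value
```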
---
## Correlation: Let R do the work!

.pull-left[
.dcoral[Correlation Matrix]

```r
df %>%
  furniture::tableC(`x_var`, `y_var`)
```

<br>

```
------------------------------
          [1]            [2] 
[1]x_var  1.00               
[2]y_var  0.178 (0.077)  1.00
------------------------------
```

<br>

.nicegreen[
*r* = .178, *p* = .077
]
]

--

.pull-right[
.dcoral[Correlation, CI, and p-value]

```r
df %>%
  cor.test(~ `x_var` + `y_var`,
           data = .)
```

<br>

```
	Pearson's product-moment correlation

data:  x_var and y_var
t = 1.7859, df = 98, p-value = 0.07721
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01956568  0.36135211
sample estimates:
      cor 
0.1775347 
```
]

---
.huge[Correlations ONLY describe .dcoral[**LINEAR**] relationships]

.pull-left[.center[
### Linear
<img src="ch9_cor_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]]

.pull-right[.center[
### Non-linear
<img src="ch9_cor_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]]

.large[
Note: You can sometimes *transform* a non-linear association into a linear form, for instance by taking the logarithm.
]

---
## Let's see it in action

## [Correlation App](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html)

.pull-left[
.large[
- Influential Points
- Eye-ball the correlation
- Draw the line of best fit
]]

.pull-right[
.large[
Why are correlations not resistant to outliers?

When do outliers have more *leverage*?
]]

---
background-image: url(figures/fig_bivariate_normal.png)
background-position: 80% 50%
background-size: 400px

## Assumptions of Pearson's Correlation

.pull-left[
.large[
1. Random sample
2. Relationship is linear (check the scatterplot; use transformations)
3. Bivariate normal distribution
    - Each variable should be normally distributed in the population
    - Joint distribution should be bivariate normal
    - Curvilinear relationships = violation
    - Less important as N increases
]]

---
## Sampling Distribution of `\(\rho\)`

.large[
- Normal distribution about 0
- Becomes non-normal as `\(\Large \rho\)` gets larger and deviates from the `\(\Large H_0\)` value of 0 in the population
    - Negatively skewed with a large, positive null-hypothesized `\(\rho\)`
    - Positively skewed with a large, negative null-hypothesized `\(\rho\)`
- Leads to
    - Inaccurate p-values
    - No longer testing `\(\Large H_0\)` that `\(\Large \rho = 0\)`
- Fisher's solution: transform sample `r` coefficients to yield a normal sampling distribution, regardless of `\(\rho\)`

*We will let the computer worry about the details...*
]

---
background-image: url(figures/fig_t_table2.png)
background-position: 80% 50%
background-size: 400px

## Hypothesis testing for 1-sample `r`

.pull-left[
.dcoral[
.large[
$$\LARGE H_0: \rho = 0$$
$$\LARGE H_A: \rho \neq 0$$
]]

.center[`r` is converted to a t-statistic]

$$\LARGE t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

- Compare to a t-distribution with .dcoral[`\(df = N - 2\)`]
- Rejection = statistical evidence of a relationship
- Or look up critical values of `r`
]
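---
## R Code: `r` to `t` by Hand

A minimal sketch of the conversion, using a hypothetical `r` = .30 with `N` = 40 (values made up for illustration); Fisher's z interval from the previous slide is included for comparison:

```r
r <- 0.30   # hypothetical sample correlation
N <- 40     # hypothetical sample size

# Convert r to a t-statistic with df = N - 2
t_stat <- r * sqrt(N - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = N - 2)  # two-tailed p-value

c(t = t_stat, p = p_val)

# Fisher's z transformation: approximate 95% CI for rho
z  <- atanh(r)              # same as 0.5 * log((1 + r) / (1 - r))
se <- 1 / sqrt(N - 3)
tanh(z + c(-1, 1) * 1.96 * se)
```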
---
## Example: Mood & Recall

> Researcher wishes to correlate scores from 2 tests:
> .nicegreen[current mood state] and .nicegreen[verbal recall memory]

.pull-leftsmall[

```
# A tibble: 7 x 2
   Mood Recall
  <dbl>  <dbl>
1    45     48
2    34     39
3    41     48
4    25     27
5    38     42
6    20     29
7    45     30
```

.dcoral[
*r* = .644, *p* = .119

95% CI [-.212, .941]
]
]

.pull-rightbig[

```r
df %>%
  cor.test(~ `Mood` + `Recall`,
           data = .)
```

```
	Pearson's product-moment correlation

data:  Mood and Recall
t = 1.8815, df = 5, p-value = 0.1186
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2120199  0.9407669
sample estimates:
      cor 
0.6438351 
```
]

---
## Factors Affecting the Validity of `r`

- .dcoral[Range restriction] (restricted variance of X and/or Y)
    - `r` can be inflated or deflated
    - May be related to small N
- .dcoral[Bivariate outliers]
    - `r` can be heavily influenced
- Use of .dcoral[heterogeneous sub-samples]
    - Combining data from heterogeneous groups can **inflate** the correlation coefficient or yield **spurious results** by stretching out the data

---
background-image: url(figures/fig_spurious.jpeg)
background-position: 50% 50%
background-size: 1200px

.footnote[http://www.tylervigen.com/spurious-correlations]

---
<!-- DecisionSkills: How Ice Cream Kills! Correlation vs. Causation (11-1) (5 min)-->

<iframe width="1000" height="750" src="https://www.youtube.com/embed/VMUQSMFGBDo?controls=0&start=2" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
## Interpretation and Communication

.large[.dcoral[
**Correlation `\(\Large \neq\)` Causation**, *in an observational study*
]]

But, correlation can be causation...

--

- Can infer **strength** and **direction** from `r`; not form or prediction
- Can say that prediction will be **better** with a **large `r`**, but cannot predict actual values
- Statistical significance
    - The p-value is heavily influenced by **Sample Size**
    - Need to .dcoral[interpret the size of the r-statistic, more than the p-value]

---
## APA: Reporting in Text

.large[
> Example 1. For .bluer[nine] students, the scores on the first quiz .nicegreen[(*M* = 7.00, *SD* = 1.23)] and the first exam .nicegreen[(*M* = 80.89, *SD* = 6.90)] were strongly and significantly correlated, .dcoral[*r* = .701, *p* = .038].
]

--

<br><br>

.large[
> Example 2. A Pearson product-moment correlation coefficient was computed to assess the relationship between the amount of water consumption and skin elasticity. A scatterplot summarizes the results .nicegreen[(see Figure 1)]. Overall, there was a strong, positive correlation between the amount of water consumed and skin elasticity, .dcoral[*r* = .985, *p* = .002].
]

---
## APA: Correlation Table

<img src="figures/fig_corr_table.jpg" width="85%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Let's Apply This to the Cancer Dataset

---
# Read in the Data

```r
library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(furniture)    # for tableC()
```

```r
cancer_raw <- haven::read_spss("cancer.sav")
```

### And Clean It

```r
cancer_clean <- cancer_raw %>%
  dplyr::rename_all(tolower) %>%
  dplyr::mutate(id = factor(id)) %>%
  dplyr::mutate(trt = factor(trt,
                             labels = c("Placebo",
                                        "Aloe Juice"))) %>%
  dplyr::mutate(stage = factor(stage))
```
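--

As a quick sanity check (a sketch; `dplyr::glimpse()` and `dplyr::count()` just print the structure and the factor labels), it can help to confirm the recoding worked:

```r
cancer_clean %>%
  dplyr::glimpse()          # check variable types after recoding

cancer_clean %>%
  dplyr::count(trt, stage)  # confirm the factor labels
```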
---
## R Code: Defaults

.pull-left[
.dcoral[**DEFAULT**: Pearson's, 2-sided, 95% CI]

```r
cancer_clean %>%
  cor.test(~ `totalcin` + `totalcw2`,
           data = .)
```

> The code ABOVE and BELOW give the same results

```r
cancer_clean %>%
  cor.test(~ `totalcin` + `totalcw2`,
           data = .,
           alternative = "two.sided",
           method = "pearson",
           conf.level = .95)
```
]

--

.pull-right[

```
	Pearson's product-moment correlation

data:  totalcin and totalcw2
t = 1.5885, df = 23, p-value = 0.1258
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.09215959  0.63114058
sample estimates:
     cor 
0.314421 
```
]

--

.dcoral[**Interpretation**]: Oral condition two weeks into the study was not significantly correlated with condition at intake, .nicegreen[*r* = .314, *p* = .126, 95% *CI* [-.092, .631]].

---
## R Code: Directional Alternatives

.pull-left[
.nicegreen[**NEGATIVE correlation**]

```r
cancer_clean %>%
  cor.test(~ totalcin + totalcw2,
           data = .,
           `alternative = "less"`)
```

.nicegreen[**POSITIVE correlation**]

```r
cancer_clean %>%
  cor.test(~ totalcin + totalcw2,
           data = .,
           `alternative = "greater"`)
```

> .dcoral[NOTE]: **NEVER** use the confidence intervals from a 1-tailed test! You MUST run a 2-tailed test to get a real confidence interval.
]

.pull-right[

```
	Pearson's product-moment correlation

data:  totalcin and totalcw2
t = 1.5885, df = 23, p-value = 0.9371
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
 -1.0000000  0.5889963
sample estimates:
     cor 
0.314421 
```

```
	Pearson's product-moment correlation

data:  totalcin and totalcw2
t = 1.5885, df = 23, p-value = 0.06292
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 -0.02523473  1.00000000
sample estimates:
     cor 
0.314421 
```
]

---
## R Code: Correlation Tables, with Missing Values

.pull-left[

```r
cancer_clean %>%
  `furniture::tableC`(totalcin, totalcw2,
                      totalcw4, totalcw6)
```

```
-----------------------------------------------------
             [1]            [2]            [3]   [4] 
[1]totalcin  1.00                                    
[2]totalcw2  0.314 (0.126)  1.00                     
[3]totalcw4  0.222 (0.287)  0.337 (0.099)  1.00      
[4]totalcw6  NA NA          NA NA          NA NA 1.00
-----------------------------------------------------
```
]

--

.pull-right[
.dcoral[*List-wise Deletion*]: only complete cases

```r
cancer_clean %>%
  `furniture::tableC`(totalcin, totalcw2,
                      totalcw4, totalcw6,
                      `na.rm = TRUE`)
```
]

```
-------------------------------------------------------------
             [1]            [2]            [3]            [4] 
[1]totalcin  1.00                                             
[2]totalcw2  0.282 (0.192)  1.00                              
[3]totalcw4  0.206 (0.346)  0.314 (0.145)  1.00               
[4]totalcw6  0.098 (0.657)  0.378 (0.075)  0.763 (<.001)  1.00
-------------------------------------------------------------
```

---
## R Code: Scatterplot with Regression Line

.pull-left[

```r
cancer_clean %>%
  ggplot(aes(x = totalcin,
             y = totalcw2)) +
  `geom_point()` +
  geom_smooth(method = "lm")
```

<img src="ch9_cor_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
cancer_clean %>%
  ggplot(aes(x = totalcin,
             y = totalcw2)) +
  `geom_count()` +
  geom_smooth(method = "lm")
```

<img src="ch9_cor_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Questions?

---
class: inverse, center, middle

# Next Topic

### Linear Regression