Hypothesis Testing

class: center, middle, inverse, title-slide

# Hypothesis Testing
## Cohen Chapter 5 <br><br> .small[EDUC/PSY 6600]

---

class: center, middle

## "I'm afraid that I rather <br> give myself away  when I explain,"  <br> said he. <br> "Results without causes <br> are much more impressive."

### -- Sherlock Holmes 
*The Stock-Broker's Clerk*

---
# Two Types of Research Questions

.pull-left[
.center[
### Do .dcoral[groups] <br>*significantly* .dcoral[differ] <br> on 1 or more characteristics?
]

Comparing group means, counts, or proportions

.dcoral[
- `$t$`-tests
- ANOVA
- `$\chi^2$` tests]
]

.pull-right[
.center[
### Is there a <br> *significant* .nicegreen[relationship] <br> among a set of .nicegreen[variables]?
]

Testing the association or dependence

.nicegreen[
- Correlation
- Regression]
]

---
# Inferential Statistics

.pull-left[
## Descriptive statistics are limited

- Rely only on **raw** data distribution
- Generally describe **one** variable only
- Do not address **accuracy** of estimators or hypothesis testing
- How **precise** is the sample mean, and does it differ from a given value?
- Are there between or within **group differences** or **associations**?
]

.pull-right[
### Goals of inferential statistics

- .dcoral[Hypothesis testing]
  - `$p$`-values
- .dcoral[Parameter estimation]
  - confidence intervals

### Repeated sampling
- Estimators will vary from sample to sample
- Sampling or random error is variability due to chance
]
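
A minimal simulation sketch of repeated sampling. The population below (mean 50, SD 10), the sample size of 25, and the seed are hypothetical values chosen only for illustration; the point is that sample means differ from draw to draw even though the population is fixed.

```python
import random
import statistics

random.seed(6600)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 roughly normal scores (mean 50, SD 10)
population = [random.gauss(50, 10) for _ in range(10_000)]

# Draw 5 random samples of n = 25 and record each sample mean
sample_means = [
    statistics.mean(random.sample(population, 25)) for _ in range(5)
]

for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: M = {m:.2f}")

# The means differ from one another although the population never changed:
# that variability is sampling (random) error
spread = max(sample_means) - min(sample_means)
print(f"range of the sample means = {spread:.2f}")
```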

---
background-image: url(figures/fig_old_cig.png)

# Causality and Statistics:

Hill's Viewpoints and "Diversity of Evidence"

.huge[ .dcoral[Causality] depends on .nicegreen[**evidence**] <br> from outside statistics: ]

- Plausibility/Phenomenological credibility (educational, behavioral, biological) 
- Strength of association, ruling out occurrence by chance alone
- Coherence/Consistency with past research findings
- Temporality
- Dose-response relationship
- Specificity
- Prevention
- Experiment
- Analogy

.large[.dcoral[**Causality** is often a **judgmental** evaluation <br> of **combined** results from **several studies**]]

---
# z-Scores and Statistical Inference

Probabilities of `$z$`-scores are used to determine how **unlikely** or **unusual** a single case is relative to other cases in a sample

## .center[**Small** probabilities <br> .dcoral[*(p-values)*] <br> reflect **unlikely** or **unusual scores**]

We are usually less interested in whether **individual scores** are unusual relative to others than in whether scores from **groups of cases** are unusual.

.nicegreen[**Sample mean**], `$\bar{x}$` *(for formulas)* or `$M$`  *(for APA)*, summarizes .nicegreen[**central tendency**] of a group or sample of subjects
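
As a rough sketch of how a `$z$`-score maps to a probability, the upper-tail area of the standard normal distribution can be computed with Python's standard library. The function name `upper_tail` and the `$z$` values below are illustrative choices, not part of the course materials.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

def upper_tail(z):
    """Probability of a score at least this far above the mean."""
    return 1 - std_normal.cdf(z)

# Larger z-scores are more unusual, so their tail probabilities shrink
for z in (0.5, 1.0, 2.0, 3.0):
    print(f"z = {z:.1f}: P(Z >= z) = {upper_tail(z):.4f}")
```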

---
background-image: url(figures/fig_yellow_box_1.png)

# Steps of a Hypothesis Test

.pull-left[
1. State the .nicegreen[Hypotheses] 
  - Null & Alternative
  <br>
2. Select the .nicegreen[Statistical Test] & .nicegreen[Significance Level] 
  - `$\alpha$` level
  - One vs. Two tails
  <br>
3. Select random sample and collect data
  <br>
4. Find the .nicegreen[Region of Rejection]
  - Based on `$\alpha$` & # of tails
  <br>
5. Calculate the .nicegreen[Test Statistic]
  - Examples include: `$z, t, F, \chi^2$`
  <br>
6. Write the .nicegreen[Conclusion]
  - Statistical decision must be in context!
]

.pull-right[

## Definition of a p-value:

.center[.large[ 
The probability of observing <br> a test statistic <br> .dcoral[**as extreme or more extreme**] <br> .nicegreen[**IF**] <br> the NULL hypothesis is true.
]]]
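
The steps can be sketched for a two-tailed one-sample `$z$` test as follows. This is a minimal illustration under stated assumptions: the function name `one_sample_z` and the summary numbers (sample mean 106, `mu0 = 100`, `sigma = 15`, `n = 36`) are hypothetical and do not come from the course data.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z test from summary statistics."""
    std_normal = NormalDist()
    # Region of rejection: |z| beyond this critical value
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Test statistic: distance from mu0 in standard-error units
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # p-value: chance of a z at least this extreme IF the null is true
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return z, critical, p, p < alpha

# Hypothetical summary statistics, for illustration only
z, critical, p, reject = one_sample_z(sample_mean=106, mu0=100, sigma=15, n=36)
print(f"z = {z:.2f}, critical value = {critical:.2f}, p = {p:.4f}")
print("Reject the null" if reject else "Fail to reject the null")
```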

---
background-image: url(figures/fig_null_hyp.png)

# Stating Hypotheses

Hypotheses are always specified in terms of .dcoral[**population**]
- Use `$\mu$` for the population mean, not `$\bar{x}$` which is for a sample

.pull-left[ 
**If you are comparing TWO population MEANS:**

.large[
.center[
.dcoral[**Null** Hypothesis]
]
]
`$$H_0: \mu_1 = \mu_2$$`
.large[
.center[
.nicegreen[**Research** or Alternative Hypothesis] <br> options...
]
]
$$
H_1: 
\mu_1 \ne \mu_2 \quad \text{ or } \quad  
\mu_1 \lt \mu_2 \quad \text{ or } \quad 
\mu_1 \gt \mu_2
$$
]

---
background-image: url(figures/fig_scale_null.png)

# Innocent Until Proven Guilty

**IF** there is not enough statistical evidence to reject the null hypothesis

Judgment is suspended until further evidence is evaluated:

- "Inconclusive"
- Larger sample?
- Insufficient data?

---
# Rejecting the Null Hypothesis

.pull-left[

.large[**Assumption:**]

The .dcoral[NULL] hypothesis is .dcoral[TRUE] in the .dcoral[POPULATION]

.nicegreen[.large[IF:] The p-value is very **SMALL**]

- How small?
`$p\text{-value} \lt \alpha$`

.nicegreen[.large[THEN:] We have evidence AGAINST the NULL hypothesis]

- It is **UNLIKELY** we would have observed a sample that extreme **JUST DUE TO RANDOM CHANCE**...

]

.pull-right[
.large[**Criteria:**]

May judge by either...
- the p-value `$\lt \alpha$`   
-OR-   
- |test statistic| `$\gt$` Critical Value *(the statistic falls in the region of rejection)*

.large[**Conclusion:**]

We either **REJECT** or **FAIL TO REJECT** the .nicegreen[Null] hypothesis

.center[ .large[ .dcoral[
We NEVER **ACCEPT** <br> the **NULL** hypothesis!!!
]]]

]
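
A small sketch, assuming a two-tailed `$z$` test at `$\alpha = .05$`, of why the two criteria give the same decision. The helper name `decide` and the `$z$` values tried are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05
critical = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for two tails

def decide(z):
    """Return the p-value and the decision under each criterion."""
    p = 2 * (1 - std_normal.cdf(abs(z)))
    reject_by_p = p < alpha              # criterion 1: p-value vs. alpha
    reject_by_cv = abs(z) > critical     # criterion 2: statistic vs. critical value
    return p, reject_by_p, reject_by_cv

for z in (-2.50, -1.00, 0.30, 1.95, 1.97):
    p, by_p, by_cv = decide(z)
    print(f"z = {z:+.2f}  p = {p:.3f}  p < alpha: {by_p}  |z| > critical: {by_cv}")
```

Both columns agree for every value, because the critical value is simply the `$z$` whose p-value equals `$\alpha$`.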

---
background-image: url(figures/fig_1or2_tails.png)

# ONE tail or TWO?

.pull-left[
.large[**2-tailed test**]

`$H_1: \mu_1 \ne \mu_2$`

.large[**1-tailed test**]

.nicegreen[**Suggests a directionality in results!**]

`$H_1: \mu_1 \lt \mu_2$` -OR- `$H_1: \mu_1 \gt \mu_2$`

.large[**NO computational differences**]

`$\text{2-tail } p\text{-value} = \mathbf{2 \times} \; \text{1-tail } p\text{-value}$`

- IF: 1-sided: `$p = .03$`
- THEN: 2-sided: `$p = .06$`

]
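
A hedged sketch of the doubling rule for a `$z$` statistic. The value `z = 1.88` is a hypothetical choice that happens to give a 1-sided p-value near .03; the doubling applies when the observed result falls in the predicted direction and the sampling distribution is symmetric.

```python
from statistics import NormalDist

std_normal = NormalDist()

def p_one_tailed(z):
    """Upper-tail p-value, for an H1 that predicts a positive difference."""
    return 1 - std_normal.cdf(z)

def p_two_tailed(z):
    """Two-tailed p-value: both tails beyond |z|."""
    return 2 * (1 - std_normal.cdf(abs(z)))

z = 1.88  # hypothetical test statistic
p_one = p_one_tailed(z)
p_two = p_two_tailed(z)
print(f"1-sided p = {p_one:.3f}, 2-sided p = {p_two:.3f}")
```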

---
background-image: url(figures/fig_1tail_cv.png)

# ONE tail or TWO?

.large[ .large[ .center[ .dcoral[Some circumstances may warrant a 1-tailed test, BUT... <br>We generally **prefer** and default to a 2-tailed test!!!]]]]

.pull-right[
.large[**More conservative = 2 tails**]<br>

Rejection region is distributed in both tails

- e.g.: `$\alpha = .05$` distributed across both tails 
  - (2.5% in each tail)

- If we already know the outcome, why do the study?
  - Looks suspicious to reviewers
  - "significant results at all costs!"
]
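
To make "more conservative" concrete, here is a short sketch comparing the `$z$` critical values at `$\alpha = .05$`. The test statistic `z = 1.80` is a hypothetical value picked to fall between the two cutoffs.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05

# One tail: all of alpha sits in a single tail
one_tail_cv = std_normal.inv_cdf(1 - alpha)
# Two tails: alpha is split, 2.5% in each tail
two_tail_cv = std_normal.inv_cdf(1 - alpha / 2)

print(f"1-tailed critical value: {one_tail_cv:.3f}")
print(f"2-tailed critical value: {two_tail_cv:.3f}")

z = 1.80  # hypothetical test statistic in the predicted direction
print("1-tailed decision:", "reject" if z > one_tail_cv else "fail to reject")
print("2-tailed decision:", "reject" if abs(z) > two_tail_cv else "fail to reject")
```

The same statistic is significant with one tail but not with two, which is why the 2-tailed test is the stricter default.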

---
background-image: url(figures/fig_err_types.png)

# Choosing Alpha

.pull-left[

.dcoral[**Alpha**  = probability of making a **type I error**]

.large[.dcoral[**type I error**]]
- We reject the NULL when we should not
- The risk of "false positive" results

.large[.dcoral[**type II error**]]
- We FAIL to reject the NULL when we should
- The risk of "false negative" results

]

---
background-image: url(figures/fig_conf_matrix.png)

# Choosing Alpha

.pull-right[

We want `$\alpha$` to be .nicegreen[SMALL], but there is a trade-off: lowering `$\alpha$` raises the type II error rate

.nicegreen[DEFAULT] is `$\alpha = .05$`  **BUT** there is nothing magical about it

Set it to a .nicegreen[LARGER] value, `$\alpha = .10$`, **IF** we'd rather not miss any potential relationship and are okay with some false positives
  - Ex) screening genes, early drug investigation, pilot study
  
Set it .nicegreen[SMALLER], `$\alpha = .01$`, **IF** false positives are costly and we want to be more stringent
  - Ex) changing a national policy, mortgaging the farm

]
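
The trade-off can be sketched numerically for a two-tailed `$z$` test. Everything specific here is an assumption for illustration: the function name `type_2_error` and the true effect `delta = 2.5` (the true mean difference expressed in standard-error units) are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()

def type_2_error(alpha, delta):
    """Type II error rate (beta) for a two-tailed z test.

    delta is the true mean difference in standard-error units,
    a hypothetical value chosen for illustration.
    """
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Power: chance that z lands in either rejection region when H1 is true
    power = (1 - std_normal.cdf(critical - delta)) + std_normal.cdf(-critical - delta)
    return 1 - power

delta = 2.5  # hypothetical true effect
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  ->  beta = {type_2_error(alpha, delta):.3f}")
```

With the effect held fixed, each step down in `$\alpha$` buys fewer false positives at the price of more false negatives.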

---
# Assumptions of a 1-sample z-test

.large[**1. Sample was drawn at .dcoral[RANDOM]** *(at least as representative as possible)*]

- Nothing can be done to fix a NON-representative sample!
- Can .bluer[**NOT**] statistically test

.large[**2. .dcoral[SD] of the sampled population = .dcoral[SD] of the comparison population**]

- Nearly impossible to check, can .bluer[**NOT**] statistically test

.large[**3. Variable has a .dcoral[NORMAL] distribution in the population**]

- .bluer[**NOT**] as important if the sample is large, due to the **Central Limit Theorem**

--
- .bluer[**CAN**] statistically test:

--
  - Visual inspection of a .nicegreen[histogram], .nicegreen[boxplot], and/or .nicegreen[QQ plot] *(straight 45 degree line)*
  
--
  - Calculate the Skewness & Kurtosis... less clear guidelines
  
--
  - Conduct .nicegreen[Shapiro-Wilk] test *(p < .05 suggests non-normality)*
  
--

> For more information see this blogpost [Is This Normal? Shapiro-Wilk Test in R To The Rescue](https://www.programmingr.com/shapiro-wilk-test-in-r/) and this article [Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests](https://www.nrc.gov/docs/ML1714/ML17143A100.pdf), as well as the R help page with `?shapiro.test`.
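
For the skewness and kurtosis check, a minimal sketch using the simple moment-based (population) formulas and only Python's standard library. The scores are the ten anxiety scores from this deck's earthquake example; the helper name `shape_stats` is hypothetical, and statistical packages often apply small-sample corrections, so their values can differ slightly.

```python
from statistics import mean

def shape_stats(data):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3        # 0 for a normal distribution
    return skewness, excess_kurtosis

scores = [72, 59, 54, 56, 48, 52, 57, 51, 64, 67]
skewness, excess_kurtosis = shape_stats(scores)
print(f"skewness = {skewness:.2f}, excess kurtosis = {excess_kurtosis:.2f}")
```

Values near 0 are consistent with normality, though with only ten scores these statistics are very imprecise.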
  
  
  
  
---
# APA: results of a 1-sample z-test

- State the alpha & number of tails in the methods section, prior to the results section

- When used in a sentence, spell out .coral[mean] and .coral[standard deviation]
- When included in a table, figure, or within parentheses, use abbreviations: .coral[*n*, *M*, *SD*]

- Report most values to TWO decimal places *usually*
- Report exact p-values to THREE decimal places *usually*, except for *p* < .001

## Example Sentence:

A .nicegreen[one-sample *z* test] showed that the difference in quiz scores between the current sample .coral[(*N* = 9, *M* = 7.00, *SD* = 1.23)] and the hypothesized value .coral[(6.00)] was statistically significant.bluer[, *z* = 2.45, *p* = .014].

---
# EXAMPLE: 1-sample z-test

After an earthquake hits their town, a random sample of townspeople yields the following anxiety scores:

.center[.nicegreen[72, 59, 54, 56, 48, 52, 57, 51, 64, 67]]
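
A sketch of how these scores would feed a two-tailed one-sample `$z$` test. The sample size and mean come straight from the data above, but `mu0` and `sigma` are HYPOTHETICAL placeholders for the comparison population's mean and standard deviation; substitute the values given for the comparison population before drawing any conclusion.

```python
from math import sqrt
from statistics import NormalDist, mean

scores = [72, 59, 54, 56, 48, 52, 57, 51, 64, 67]
n = len(scores)
sample_mean = mean(scores)

# HYPOTHETICAL placeholders, not taken from the example itself
mu0 = 50     # assumed comparison-population mean
sigma = 10   # assumed comparison-population SD

# Test statistic: distance of the sample mean from mu0 in standard errors
z = (sample_mean - mu0) / (sigma / sqrt(n))
# Two-tailed p-value under the null hypothesis
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"n = {n}, M = {sample_mean:.2f}")
print(f"z = {z:.2f}, two-tailed p = {p:.3f}  (under the placeholder mu0 and sigma)")
```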