Data Visualization

class: center, middle, inverse, title-slide

# Data Visualization
## Cohen Chapter 2 .small[EDUC/PSY 6600]

---

# Always plot your data first!

.center[.Huge["Always." - Severus Snape]]

--
### Why?

.large[
- .dcoral[Outliers] and impossible values

- Determine correct .nicegreen[statistical approach]

- .bluer[Assumptions] and diagnostics

- Discover new .nicegreen[relationships]
]

---
background-image: url(figures/fig_misleading_graph1.png)
background-position: 90% 60%
background-size: 500px

# The Visualization Paradox

.pull-left[.large[
- Often the .dcoral[most informative] aspect of analysis
- .nicegreen[Communicates] the "data story" the best
- Most abused area of quantitative science
- Figures can be *very* .bluer[misleading]
]]

.footnote[[Misleading Graphs](https://venngage.com/blog/misleading-graphs/)]

---
background-image: url(figures/fig_misleading_graph1_fixed.png)
background-position: 50% 80%
background-size: 700px

# Much better

---
# Keys to Good Viz's

.pull-left[.large[
- .dcoral[Graphical method] should match .dcoral[level of measurement]
- Label all axes and include figure caption
- .nicegreen[Simplicity and clarity]
- .bluer[Avoid of ‘chartjunk’]
]]

.pull-right[.large[
- Unless there are 3 or more variables, avoid 3D figures (and even then, avoid it)
- Black & white, grayscale/pattern fine for most simple figures
]]

---
# Data Visualizations

.huge[Takes practice -- try a bunch of stuff]

### Resources

.large[.large[
- [Edward Tufte's books](https://www.edwardtufte.com/tufte/books_vdqi)
- ["R for Data Science"](http://r4ds.had.co.nz/) by Grolemund and Wickham
- ["Data Visualization for Social Science"](http://socviz.co/) by Healy
]]

---
# Frequency Distributions

.pull-left[
.huge[.bluer[Counting] the number of occurrences of unique events]

- Categorical or continuous
- just like with `tableF()` and `table1()`
]

.pull-right[
.large[
Can see .dcoral[central tendency] (continuous data) or .dcoral[most common value] (categorical data)
]

.large[Can see .nicegreen[range and extremes]]
]

```
                    
                    ------------------------------------------------------
                     x       Freq CumFreq Percent CumPerc Valid  CumValid
                     1       248  248     24.80%  24.80%  25.78% 25.78%  
                     2       233  481     23.30%  48.10%  24.22% 50.00%  
                     3       248  729     24.80%  72.90%  25.78% 75.78%  
                     4       233  962     23.30%  96.20%  24.22% 100.00% 
                     Missing 38   1000    3.80%   100.00%                
                    ------------------------------------------------------
```

---
# Frequencies and Viz's Together ❤️

.pull-left[
### Bar Graph

![](ch2_plots_files/figure-html/unnamed-chunk-2-1.png)
]

.pull-right[
### Histogram

![](ch2_plots_files/figure-html/unnamed-chunk-3-1.png)
]

???
What differences do you notice between them?

##### Bar
- Do NOT touch each other
- Begin and terminate at real limits
- Centered on the value
- Height = frequency

##### Histogram
- Touch each other
- Begin and terminate at real limits
- Centered on interval midpoint
- Height = frequency
- Interval size or ‘bin’ determines shape
    - Too narrow or too wide problematic
- Useful for checking distributional assumptions

---
# What does DISTRIBUTION mean?

### The way that the data points are scattered

.pull-left[.large[.large[
**For .dcoral[Continuous]**

- General shape
- Exceptions (outliers)
- Modes (peaks)
- Center & spread (chap 3)
- Histogram
]]]

.pull-right[.large[.large[
**For .nicegreen[Categorical]**

- Counts of each
- Percent or Rate (adjusts for an ‘out of’ to compare)
- Bar chart
- Pie chart - avoid!
]]]

---

## Example: Household Size

Table 2.5, page 29

```

-----------------------------------
 size Freq CumFreq Percent CumPerc
 2    1    1       4.00%   4.00%  
 4    4    5       16.00%  20.00% 
 5    1    6       4.00%   24.00% 
 6    7    13      28.00%  52.00% 
 7    3    16      12.00%  64.00% 
 8    5    21      20.00%  84.00% 
 9    2    23      8.00%   92.00% 
 10   2    25      8.00%   100.00%
-----------------------------------
```

---

.pull-left[
### Bar Chart
#### Discrete: spaces between bars
![](ch2_plots_files/figure-html/unnamed-chunk-8-1.png)

]

.pull-right[
### Histogram
#### Continuous: no space between bars
![](ch2_plots_files/figure-html/unnamed-chunk-9-1.png)

]

---

.pull-left[
### Histogram
![](ch2_plots_files/figure-html/unnamed-chunk-10-1.png)

]

.pull-right[
### Frequency Polygon
![](ch2_plots_files/figure-html/unnamed-chunk-11-1.png)

]

---

.pull-left[
### Frequency Polygon (like histogram)
![](ch2_plots_files/figure-html/unnamed-chunk-12-1.png)

]

.pull-right[
### Cummulative Frequency Polygon (Ogive)
![](ch2_plots_files/figure-html/unnamed-chunk-13-1.png)
]

---

### Ogive

.pull-left[
#### Count
![](ch2_plots_files/figure-html/unnamed-chunk-14-1.png)

]

--
.pull-right[
#### Proportion
![](ch2_plots_files/figure-html/unnamed-chunk-15-1.png)

]

---
## Linear Interpolation

.pull-left[
#### Twentyth percentile 
![](ch2_plots_files/figure-html/unnamed-chunk-16-1.png)

]

.pull-right[
#### Upper Qualtile (75th percentile)
![](ch2_plots_files/figure-html/unnamed-chunk-17-1.png)

]

---

## Real vs Theoretical Distributions

---
class: inverse, center, middle

# Interactive App

[Describing and Exploring Categorical Data](https://istats.shinyapps.io/EDA_categorical/)

---
class: inverse, center, middle

# Let's Apply This To the Inho Dataset

---
background-image: url(figures/fig_inho_data_desc.png)
background-position: 50% 80%
background-size: 800px

# Reminder

---
# Read in the Data & Wrangle it

```r
library(tidyverse)   # the easy button
library(readxl)      # read in Excel files
library(furniture)   # nice tables

data_clean <- readxl::read_excel("Ihno_dataset.xls") %>% 
 dplyr::rename_all(tolower) %>% 
 dplyr::mutate(genderF = factor(gender,
 levels = c(1, 2),
 labels = c("Female", "Male"))) %>% 
 dplyr::mutate(majorF = factor(major,
 levels= c(1, 2, 3, 4, 5),
 labels = c("Psychology", "Premed",
 "Biology", "Sociology",
 "Economics"))) %>%
 dplyr::mutate(coffeeF = factor(coffee,
 levels = c(0, 1),
 labels = c("Not a regular coffee drinker",
 "Regularly drinks coffee")))
```

---
# Frequency Distrubutions

.pull-left[

```r
data_clean %>%                 
  furniture::tableF(majorF)
```

```

-----------------------------------------
 majorF     Freq CumFreq Percent CumPerc
 Psychology 29   29      29.00%  29.00% 
 Premed     25   54      25.00%  54.00% 
 Biology    21   75      21.00%  75.00% 
 Sociology  15   90      15.00%  90.00% 
 Economics  10   100     10.00%  100.00%
-----------------------------------------
```
]

.pull-right[

```r
data_clean %>% 
  furniture::tableF(phobia)
```

```

-------------------------------------
 phobia Freq CumFreq Percent CumPerc
 0      12   12      12.00%  12.00% 
 1      15   27      15.00%  27.00% 
 2      12   39      12.00%  39.00% 
 3      16   55      16.00%  55.00% 
 4      21   76      21.00%  76.00% 
 5      11   87      11.00%  87.00% 
 6      1    88      1.00%   88.00% 
 7      4    92      4.00%   92.00% 
 8      4    96      4.00%   96.00% 
 9      1    97      1.00%   97.00% 
 10     3    100     3.00%   100.00%
-------------------------------------
```
]

---
# Frequency Viz's

.Huge[For viz's, we will use `ggplot2`]

.huge[This provides the most powerful, beautiful framework for data visualizations]

.large[.large[
- It is built on making .dcoral[layers]
- Each plot has a .nicegreen["geom"] function
    - e.g. `geom_bar()` for bar charts, `geom_histogram()` for histograms, etc.
]]

---
# Bar Charts: basic

.pull-left[

```r
data_clean %>% 
  ggplot(aes(coffee)) +
  geom_bar()
```

![](ch2_plots_files/figure-html/unnamed-chunk-22-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(coffeeF)) +
  geom_bar()
```

![](ch2_plots_files/figure-html/unnamed-chunk-23-1.png)
]

---
# Bar Charts: basic

.pull-left[

```r
data_clean %>% 
  ggplot(aes(major)) +
  geom_bar()
```

![](ch2_plots_files/figure-html/unnamed-chunk-24-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(majorF)) +
  geom_bar()
```

![](ch2_plots_files/figure-html/unnamed-chunk-25-1.png)
]

---
# Bar Charts: fancy

.pull-left[

```r
data_clean %>% 
  ggplot(aes(majorF)) +
  geom_bar(color = "black",
           fill  = "blue",
           alpha = .5) +
  labs(x = "Declared Major",
       y = "Number of Students",
       title = "Ihno's Experiment",
       subtitle = "Participant Demographics") +
  theme_bw()
```
]

.pull-right[
![](ch2_plots_files/figure-html/unnamed-chunk-27-1.png)
]

---
# Histograms: default

.pull-left[

```r
data_clean %>% 
  ggplot(aes(phobia)) +
  geom_histogram()
```

![](ch2_plots_files/figure-html/unnamed-chunk-28-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(phobia)) +
  geom_histogram(bins = 8)
```

![](ch2_plots_files/figure-html/unnamed-chunk-29-1.png)
]

---
# Histograms: change number of bins

.pull-left[

```r
data_clean %>% 
  ggplot(aes(phobia)) +
  geom_histogram(bins = 4)
```

![](ch2_plots_files/figure-html/unnamed-chunk-30-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(phobia)) +
  geom_histogram(bins = 10)
```

![](ch2_plots_files/figure-html/unnamed-chunk-31-1.png)
]

---
# Histograms: change bin width

.pull-left[

```r
data_clean %>% 
  ggplot(aes(mathquiz)) +
  geom_histogram(binwidth = 10)
```

![](ch2_plots_files/figure-html/unnamed-chunk-32-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(mathquiz)) +
  geom_histogram(binwidth = 4)
```

![](ch2_plots_files/figure-html/unnamed-chunk-33-1.png)
]

---
# Histograms -by- a Factor

.pull-left[

```r
data_clean %>% 
  ggplot(aes(mathquiz)) +
  geom_histogram(binwidth = 4) +
  facet_grid(. ~ coffeeF)
```

![](ch2_plots_files/figure-html/unnamed-chunk-34-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot(aes(mathquiz)) +
  geom_histogram(binwidth = 4) +
  facet_grid(coffeeF ~ .)
```

![](ch2_plots_files/figure-html/unnamed-chunk-35-1.png)
]

---
# Deciles: break into 10% chunks

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(0, .10, .20, .30, .40, .50, .60, .70, .80, .90))
```

```
 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 
1.0 4.0 6.0 6.0 7.0 7.0 8.0 8.0 8.0 8.1 
```

---
# Deciles - with missing values

```r
data_clean %>% 
  dplyr::pull(mathquiz) %>% 
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
```

`Error in quantile.default(., probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, : missing values and NaN's not allowed if 'na.rm' is FALSE`

---
# Deciles - use optional `na.rm = TRUE`

```r
data_clean %>% 
  dplyr::pull(mathquiz) %>% 
  quantile(probs = c(0, .10, .20, .30, .40, .50, .60, .70, .80, .90),
           na.rm = TRUE)
```

```
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
 9.0 15.0 21.0 25.2 28.0 30.0 32.0 33.8 37.2 41.0 
```

---
# Quartiles: break into 4 chunks

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(0, .25, .50, .75, 1))
```

```
  0%  25%  50%  75% 100% 
   1    6    7    8   10 
```

---
# Percentiles: break at any point(s)

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(.01, .05, .173, .90))
```

```
   1%    5% 17.3%   90% 
 2.98  3.00  5.00  8.10 
```

---
class: inverse, center, middle

# Questions?

---
class: inverse, center, middle

# Next Topic

### Center and Spread