class: center, middle, inverse, title-slide

# Center and Spread
## Cohen Chapter 3
.small[EDUC/PSY 6600]

---
class: center, middle

## "You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. *Individuals vary*, but percentages remain constant. So says the statistician."

### -- Sherlock Holmes, *The Sign of Four*

---
background-image: url(figures/fig_dist_examples.png)
background-position: 50% 90%
background-size: 750px

# Distribution Examples

---
background-image: url(figures/fig_3centers.png)
background-position: 50% 70%
background-size: 1000px

# Three Measures of Center

---
## Areas in Distribution Plot

.pull-left[
Frequency Histogram
<img src="figures/textbook_fig_3.2.PNG" width="833" style="display: block; margin: auto;" />
]

.pull-right[
Frequency Polygon
<img src="figures/textbook_fig_3.3.PNG" width="991" style="display: block; margin: auto;" />
]

---
background-image: url(figures/fulcrum.png)
background-position: 50% 80%
background-size: 850px

# Mean vs. Median

.large[.large[
.nicegreen[Median]: the center point; half of the values fall on each side; not affected by skew; the "typical" value

.dcoral[Mean]: the "balance" point; pulled toward the side of the skew; not necessarily typical

<br><br><br>
]]

--

.large[If the distribution is symmetrical: mean = median]

---
background-image: url(figures/fig_dist_income_2010.png)
background-position: 50% 70%
background-size: 1000px

---
# Distributions and Numbers

.pull-left[
.large[
- The MEDIAN is **resistant** & doesn't change much
- The MEAN is **influenced** & changes more!
- Average does NOT mean typical
- Average moves when we remove the high point
]]

--

.pull-right[
<img src="ch3_center_spread_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# Distributions and Numbers

.pull-left[
.large[
- The MEDIAN is **resistant** & doesn't change much
- The MEAN is **influenced** & changes more!
- Average does NOT mean typical
- Average moves when we remove the high point
- Median doesn't move when we remove the high point
]]

--

.pull-right[
<img src="ch3_center_spread_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
background-image: url(figures/fig_three_spreads.jpg)
background-position: 50% 70%
background-size: 1100px

# Three Measures of Spread

---
# Best Summary of the Data?

.huge[
"... the perfect estimator does not exist."

-- Rand Wilcox, 2001
]

--

.pull-left[
.large[
## .bluer[Median and SIQR]
Skewed data or outliers
]]

.pull-right[
.large[
## .nicegreen[Mean and SD]
Symmetrical and no outliers
]]

--

<br>

.large[.large[
A .dcoral[graph gives the best overall picture of a distribution]
]]

---
## Properties of the Mean and SD

.pull-left[
Add 10 to Every Value
<img src="figures/textbook_fig_3.11.PNG" width="1177" style="display: block; margin: auto;" />
]

.pull-right[
Multiply Every Value by 10
<img src="figures/textbook_fig_3.12.PNG" width="1516" style="display: block; margin: auto;" />
]

<img src="figures/fig_sd_properties-small.jpg" width="70%" style="display: block; margin: auto;" />

---
# Skewness

$$Skewness = \frac{N}{N - 2}\frac{\sum_{i=1}^N (X_i - \bar{X})^3}{(N - 1)s^3}$$

.pull-left[
.large[
- Degree of .dcoral[symmetry]
- Can detect **visually**
- Skewness statistic
    - Based on cubed deviations from the mean
    - Divided by SE of skewness
    - `\(> \pm 2\)` is a sign of skewed data
]]

--

.pull-right[
.large[
- Interpreting skewness statistic
    - Pos value = positive skew
    - Neg value = negative skew
    - Zero = no skew
]]

---
# Skewness

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
# Kurtosis

$$Kurtosis = \frac{N(N+1)}{(N - 2)(N - 3)}\frac{\sum_{i=1}^N (X_i - \bar{X})^4}{(N - 1)s^4} - 3 \frac{(N - 1)^2}{(N - 2)(N - 3)}$$

.pull-left[
.large[
- Degree of .dcoral[flatness] in distribution
- Harder to detect visually
- Kurtosis statistic
    - Based on deviations from the mean (raised to 4th power)
    - Divided by SE of kurtosis
    - `\(> \pm 2\)` is a sign of problems with kurtosis
]]

--

.pull-right[
.large[
- Interpreting kurtosis statistic
    - Pos value = peaked (leptokurtic)
    - Neg value = flat (platykurtic)
    - Zero = normal (mesokurtic)
]]

---
background-image: url(figures/fig_kurtosis.png)
background-position: 50% 70%
background-size: 1000px

# Kurtosis

---
## [Are the Skewness and Kurtosis Useful Statistics?](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)

The .coral[skewness] and .coral[kurtosis] statistics appear to be very dependent on the sample size. In fact, even several hundred data points didn't give very good estimates of the true kurtosis and skewness. Smaller sample sizes can give results that are very misleading.

.large[
.nicegreen[
> "In short, skewness and kurtosis are practically **worthless**."

> "The statistics for skewness and kurtosis simply do not provide any useful information beyond that already given by the measures of location *(center)* and dispersion *(spread)*."
]
]

So, don't put much emphasis on skewness and kurtosis values you may see. And remember, the more data you have, the better you can describe the shape of the distribution. But, in general, it appears there is little reason to pay much attention to skewness and kurtosis statistics.

.coral[
**Just look at the histogram.
It often gives you all the information you need.**
]

---

<!-- Simple Learning Pro: The Five Number Summary, Boxplot, and Outliers (5 min)-->

<iframe width="1000" height="750" src="https://www.youtube.com/embed/AGh66ZPpOSQ?controls=0&start=2" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
background-image: url(figures/fig_5sum_2.png)
background-position: 50% 50%

# Five-Number Summary

---
background-image: url(figures/fig_5sum_3.png)
background-position: 50% 50%

# Five-Number Summary - Median

---
background-image: url(figures/fig_5sum_4.png)
background-position: 50% 50%

# Five-Number Summary - Quartiles

---
background-image: url(figures/fig_5sum_5.png)
background-position: 50% 50%

# Boxplots (Modified) - Lines

---
background-image: url(figures/fig_5sum_6.png)
background-position: 50% 50%

# Boxplots (Modified) - IQR and SIQR

---
background-image: url(figures/fig_boxplot_hist.png)
background-position: 50% 70%
background-size: 1000px

# Boxplot vs. Histogram

---
# Boxplots by Group

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
# Density Plots

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Quantile-Quantile (Q-Q) Plot

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Interactive Apps

[Describing and Exploring Quantitative Variables](https://istats.shinyapps.io/EDA_quantitative/)

[Mean versus Median](https://istats.shinyapps.io/MeanvsMedian/)

---
class: inverse, center, middle

# Let's Apply This To the Cancer Dataset

<br>

(on Canvas)

---
# Read in the Data

```r
library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(furniture)    # Nice tables (by our own Tyson Barrett)
library(psych)        # Lots of nice tidbits
```

```r
cancer_raw <- haven::read_sav("cancer.sav")
```

--

### And Clean It

```r
cancer_clean <- cancer_raw %>%
  dplyr::rename_all(tolower) %>%
  dplyr::mutate(id = factor(id)) %>%
  dplyr::mutate(trt = factor(trt,
                             levels = c(0, 1),
                             labels = c("Placebo", "Aloe Juice"))) %>%
  dplyr::mutate(stage = factor(stage))
```

---
## Frequency Tables with `furniture::tableF()`

.pull-left[

```r
cancer_clean %>%
  furniture::tableF(age, n = 8)
```

```
----------------------------------
 age Freq CumFreq Percent CumPerc
 27  1    1       4.00%   4.00%
 42  1    2       4.00%   8.00%
 44  1    3       4.00%   12.00%
 46  2    5       8.00%   20.00%
 ... ...  ...     ...     ...
 68  1    20      4.00%   80.00%
 69  1    21      4.00%   84.00%
 73  1    22      4.00%   88.00%
 77  2    24      8.00%   96.00%
 86  1    25      4.00%   100.00%
----------------------------------
```
]

--

.pull-right[

```r
cancer_clean %>%
  furniture::tableF(trt)
```

```
-----------------------------------------
 trt        Freq CumFreq Percent CumPerc
 Placebo    14   14      56.00%  56.00%
 Aloe Juice 11   25      44.00%  100.00%
-----------------------------------------
```
]

---
## Extensive Descriptive Stats: `psych::describe()`

```r
cancer_clean %>%
  dplyr::select(age, weighin, totalcin, totalcw2, totalcw4, totalcw6) %>%
  psych::describe()
```

```
         vars  n   mean    sd median trimmed   mad min   max range  skew
age         1 25  59.64 12.93   60.0   59.95 11.86  27  86.0  59.0 -0.31
weighin     2 25 178.28 31.98  172.8  176.57 21.05 124 261.4 137.4  0.73
totalcin    3 25   6.52  1.53    6.0    6.33  0.00   4  12.0   8.0  1.80
totalcw2    4 25   8.28  2.54    8.0    8.10  2.97   4  16.0  12.0  1.01
totalcw4    5 25  10.36  3.47   10.0   10.19  2.97   6  17.0  11.0  0.49
totalcw6    6 23   9.48  3.49    9.0    9.21  2.97   3  19.0  16.0  0.77
         kurtosis   se
age         -0.01 2.59
weighin      0.07 6.40
totalcin     4.30 0.31
totalcw2     1.14 0.51
totalcw4    -1.00 0.69
totalcw6     0.53 0.73
```

---
## Brief Descriptive Stats: `furniture::table1()`

.pull-left[

Defaults

```r
cancer_clean %>%
  furniture::table1(trt, age, weighin)
```

```
---------------------------------
               Mean/Count (SD/%)
               n = 25
 trt
    Placebo    14 (56%)
    Aloe Juice 11 (44%)
 age
               59.6 (12.9)
 weighin
               178.3 (32.0)
---------------------------------
```
]

--

.pull-right[

Add Text

```r
cancer_clean %>%
  furniture::table1("Treatment" = trt,
                    "Age, years" = age,
                    "Weight, lbs" = weighin)
```

```
---------------------------------
                Mean/Count (SD/%)
                n = 25
 Treatment
    Placebo     14 (56%)
    Aloe Juice  11 (44%)
 Age, years
                59.6 (12.9)
 Weight, lbs
                178.3 (32.0)
---------------------------------
```
]

---
## Stratified Stats: `furniture::table1()`

.pull-left[

Defaults, but increase the number of digits

```r
cancer_clean %>%
  dplyr::group_by(trt) %>%
  furniture::table1(age, weighin,
                    digits = 2)
```

```
---------------------------------------
                    trt
         Placebo        Aloe Juice
         n = 14         n = 11
 age     59.79 (8.98)   59.45 (17.22)
 weighin 167.51 (23.01) 191.99 (37.37)
---------------------------------------
```
]

--

.pull-right[

Add Text

```r
cancer_clean %>%
  dplyr::group_by("Treatment" = trt) %>%
  furniture::table1("Age, years" = age,
                    "Weight, lbs" = weighin,
                    total = TRUE)
```

```
----------------------------------------------------
                          Treatment
             Total        Placebo      Aloe Juice
             n = 25       n = 14       n = 11
 Age, years  59.6 (12.9)  59.8 (9.0)   59.5 (17.2)
 Weight, lbs 178.3 (32.0) 167.5 (23.0) 192.0 (37.4)
----------------------------------------------------
```
]

---
## Boxplot, one group: `geom_boxplot()`

```r
cancer_clean %>%
  ggplot(aes(x = "Full Sample",     # x = "quoted text"
             y = age)) +            # y = contin_var (no quotes)
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
## Boxplots, by groups - (1) fill color

```r
cancer_clean %>%
  ggplot(aes(x = "Full Sample",     # x = "quoted text"
             y = age,               # y = contin_var (no quotes)
             fill = trt)) +         # fill = group_var (no quotes)
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
## Boxplots, by groups - (2) x-axis breaks

```r
cancer_clean %>%
  ggplot(aes(x = trt,               # x = group_var (no quotes)
             y = age)) +            # y = contin_var (no quotes)
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
## Boxplots, by groups - (3) separate panels

```r
cancer_clean %>%
  ggplot(aes(x = "Full Sample",     # x = "quoted text"
             y = age)) +            # y = contin_var (no quotes)
  geom_boxplot() +
  facet_grid(. ~ trt)               # . ~ group_var (no quotes)
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---
## Boxplot for a Subset - 1 requirement

```r
cancer_clean %>%
  # Less than 172 pounds at baseline
  dplyr::filter(weighin < 172) %>%
  ggplot(aes(x = "Weight at Baseline < 172",
             y = age)) +
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---
## Boxplot for a Subset - 2 requirements

```r
cancer_clean %>%
  # At least 150 pounds AND not in Aloe group
  dplyr::filter(weighin >= 150 & trt == "Placebo") %>%
  ggplot(aes(x = "Placebo and at least 150 Pounds",
             y = age)) +
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---
## Boxplot for a Subset - 2 requirements (`%in%`)

```r
cancer_clean %>%
  # In Aloe group, but only stages 2-4
  dplyr::filter(trt == "Aloe Juice" & stage %in% c(2, 3, 4)) %>%
  ggplot(aes(x = "On Aloe Juice and Stage 2-4",
             y = weighin)) +
  geom_boxplot()
```

<img src="ch3_center_spread_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---
## Boxplot for Repeated Measures

.pull-left[

```r
cancer_clean %>%
  tidyr::pivot_longer(cols = c(totalcw2, totalcw4, totalcw6),
                      names_to = "week",
                      names_pattern = "totalcw(.)",
                      values_to = "condition") %>%
  ggplot(aes(x = week,
             y = condition)) +
  geom_boxplot()
```
]

.pull-right[
<img src="ch3_center_spread_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

---
## Boxplot: COMPLICATED!

.pull-left[

```r
cancer_clean %>%
  dplyr::filter(weighin > 130 & stage %in% c(2, 4)) %>%
  tidyr::pivot_longer(cols = c(totalcw2, totalcw4, totalcw6),
                      names_to = "week",
                      names_pattern = "totalcw(.)",
                      values_to = "condition") %>%
  ggplot(aes(x = week,
             y = condition,
             fill = stage)) +
  geom_boxplot() +
  facet_grid(. ~ trt)
```
]

.pull-right[
<img src="ch3_center_spread_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---
## Alternative: Violin Plots

.pull-left[

```r
cancer_clean %>%
  ggplot(aes(x = trt,
             y = age)) +
  geom_violin(fill = "gray") +
  geom_boxplot(fill = "white", alpha = .75, width = .25) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  theme_bw() +
  labs(x = NULL,
       y = "Age in Years") +
  theme(legend.position = "none")
```
]

.pull-right[
<img src="ch3_center_spread_files/figure-html/unnamed-chunk-35-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Questions?

---
class: inverse, center, middle

# Next Topic

### Standard and Normal
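
---
## Extra: Checking the Formulas by Hand

The skewness and kurtosis formulas from the earlier slides, and the point that the median resists a high value while the mean does not, can be checked with a few lines of base R. This is only a minimal sketch: the vectors `x_sym` and `x_high` are made-up numbers (not the cancer data), and the helpers `skew_adj()` and `kurt_adj()` are hypothetical names written for this slide, not functions from any package. Other software, including `psych::describe()`, may use a slightly different variant of these formulas, so values need not match exactly.

```r
# Hypothetical helpers (not from any package) that follow the slide formulas.
skew_adj <- function(x) {
  x   <- x[!is.na(x)]
  n   <- length(x)
  dev <- x - mean(x)                       # deviations from the mean
  (n / (n - 2)) * sum(dev^3) / ((n - 1) * sd(x)^3)
}

kurt_adj <- function(x) {
  x   <- x[!is.na(x)]
  n   <- length(x)
  dev <- x - mean(x)
  term1 <- (n * (n + 1)) / ((n - 2) * (n - 3)) *
    sum(dev^4) / ((n - 1) * sd(x)^4)
  term2 <- 3 * (n - 1)^2 / ((n - 2) * (n - 3))
  term1 - term2
}

# Made-up data: the second vector swaps the top value for a very high point
x_sym  <- c(1, 2, 3, 4, 5, 6, 7)
x_high <- c(1, 2, 3, 4, 5, 6, 70)

# Center: the mean moves with the high point, the median does not
c(mean = mean(x_sym),  median = median(x_sym))    # mean 4,  median 4
c(mean = mean(x_high), median = median(x_high))   # mean 13, median 4

# Shape: cubed deviations cancel for symmetric data, so skewness is 0
skew_adj(x_sym)
skew_adj(x_high)   # positive: the high point stretches the right tail

kurt_adj(x_sym)    # about -1.2: evenly spread values are "flat"
```

With only seven values these statistics are very unstable, which is the point made on the "Are the Skewness and Kurtosis Useful Statistics?" slide; a histogram or boxplot of the same vectors tells the story more directly.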