3 Summary Statistics

Using the psych::describe() function

library(psych)           # Lots of good tidbits

The describe() function from the psych package returns an extensive listing of basic summary statistics for every variable in a dataset (Revelle 2020).

vars      the position (row number) of each variable in this table
n         the number of non-missing values
mean      the arithmetic mean (average)
sd        the standard deviation
median    the median (50th percentile, or Q2)
trimmed   the trimmed mean: the mean after removing the top and bottom 10% of values (the default trim)
mad       the median absolute deviation (from the median); this can safely be ignored here
min       the minimum (lowest) value
max       the maximum (highest) value
range     the full range of values, max - min
skew      skewness (no standard error is given)
kurtosis  kurtosis (no standard error is given)
se        the standard error of the MEAN, not of the skewness or kurtosis
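
To make a few of these definitions concrete, here is a minimal base R sketch that computes n, the trimmed mean, the mad, and the standard error of the mean by hand. The vector x is made-up toy data, not part of cancer_clean, and the object names are only illustrative. The sketch assumes the default 10% trim and the default scaling constant of 1.4826 in base R's mad(), which is consistent with the mad values shown in this chapter's output (for example, 2.97 is about 2 x 1.4826).

```r
# Hypothetical toy data (NOT from cancer_clean), with one large value
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

n_x       <- sum(!is.na(x))        # n: number of non-missing values
mean_x    <- mean(x)               # mean: 10.8, pulled upward by the 40
trimmed_x <- mean(x, trim = 0.10)  # trimmed: drops the lowest and highest 10%, giving 8.25
median_x  <- median(x)             # median: 8

# mad: median of |x - median|, multiplied by 1.4826 (base R's default constant)
mad_x     <- mad(x)                # 4 * 1.4826 = 5.9304

# se: standard error of the mean, sd / sqrt(n)
se_x      <- sd(x) / sqrt(n_x)

c(n = n_x, mean = mean_x, trimmed = trimmed_x, median = median_x,
  mad = mad_x, se = se_x)
```

Notice that the trimmed mean (8.25) sits much closer to the median (8) than the ordinary mean (10.8) does, because trimming removes the single extreme value.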

3.1 All Variables in a Dataset

cancer_clean %>% 
  psych::describe()

# A tibble: 9 x 13
   vars     n   mean     sd median trimmed   mad   min   max range   skew
  <int> <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1    25  13     7.36     13    13     8.90     1   25    24   0    
2     2    25   1.44  0.507     1     1.43  0        1    2     1   0.227
3     3    25  59.6  12.9      60    60.0  11.9     27   86    59  -0.307
4     4    25 178.   32.0     173.  177.   21.1    124  261.  137.  0.730
5     5    25   2.88  1.24      2     2.81  1.48     1    5     4   0.726
6     6    25   6.52  1.53      6     6.33  0        4   12     8   1.80 
7     7    25   8.28  2.54      8     8.10  2.97     4   16    12   1.01 
8     8    25  10.4   3.47     10    10.2   2.97     6   17    11   0.487
9     9    23   9.48  3.49      9     9.21  2.97     3   19    16   0.770
# ... with 2 more variables: kurtosis <dbl>, se <dbl>

NOTE: The names of categorical variables (factors) are followed by an asterisk (*). The asterisk indicates that the summary statistics for that variable should not be interpreted, since the variable is not continuous or measured on an interval scale.

3.2 A Subset of Variables in a Dataset

It is better to avoid calculating summary statistics for categorical variables in the first place. Restrict the dataset to the continuous variables of interest with a dplyr::select(var1, var2, ..., var12) step before calling psych::describe().

cancer_clean %>% 
  dplyr::select(age, weighin, totalcin, totalcw2, totalcw4, totalcw6) %>%
  psych::describe()

# A tibble: 6 x 13
   vars     n   mean    sd median trimmed   mad   min   max range   skew
  <int> <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1    25  59.6  12.9     60    60.0  11.9     27   86    59  -0.307
2     2    25 178.   32.0    173.  177.   21.1    124  261.  137.  0.730
3     3    25   6.52  1.53     6     6.33  0        4   12     8   1.80 
4     4    25   8.28  2.54     8     8.10  2.97     4   16    12   1.01 
5     5    25  10.4   3.47    10    10.2   2.97     6   17    11   0.487
6     6    23   9.48  3.49     9     9.21  2.97     3   19    16   0.770
# ... with 2 more variables: kurtosis <dbl>, se <dbl>
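
When a dataset has many columns, listing every continuous variable by name can get tedious. As a rough sketch, and not something this chapter's workflow requires, columns can also be selected by type with where(is.numeric) inside dplyr::select() (this assumes dplyr 1.0.0 or later). The toy data frame and its column names below are hypothetical stand-ins for cancer_clean.

```r
library(dplyr)

# Hypothetical toy data standing in for cancer_clean (values are made up)
toy <- data.frame(
  id    = 1:4,
  trt   = factor(c("placebo", "aloe", "aloe", "placebo")),
  age   = c(52, 77, 60, 61),
  stage = factor(c(2, 1, 4, 1))
)

toy_numeric <- toy %>%
  dplyr::select(where(is.numeric)) %>%  # keep only numeric columns (drops the factors)
  dplyr::select(-id)                    # the ID is numeric but not meaningful, so drop it by name

names(toy_numeric)
```

The result can then be piped into psych::describe() as before. Selecting by type is only a shortcut: identifiers and categorical variables stored as numeric codes still pass is.numeric(), so check the selected columns before interpreting the statistics.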