3 Summary Statistics

Using the psych::describe() function

The describe() function from the psych package returns an extensive listing of basic summary statistics for every variable in a dataset (Revelle 2020).

  • vars the position (column number) of each variable in the dataset
  • n how many non-missing values there are
  • mean the average or arithmetic mean
  • sd the standard deviation
  • median the 50th percentile or Q2
  • trimmed the mean after removing the top and bottom 10% of values
  • mad median absolute deviation (from the median), a robust measure of spread; you can safely ignore it for now
  • min the minimum or lowest value
  • max the maximum or highest value
  • range full range of values, max - min
  • skew skewness (no SE for skewness given)
  • kurtosis kurtosis (no SE for kurtosis given)
  • se the standard error of the mean (not of the skewness or kurtosis)
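
To make the definitions above concrete, here is a minimal base R sketch that computes most of these statistics by hand for a single variable. The vector x holds made-up values (it is not taken from the dataset used in this chapter), and by_hand is just an illustrative name. Skewness and kurtosis are left out because their exact formula depends on the estimator chosen. The results should correspond to the matching columns of describe() under its default settings, but treat that as an assumption to verify on your own data.

```r
# Made-up values for illustration only; NA marks a missing observation
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 42, NA)

# Keep only the non-missing values, since n counts non-missing observations
x_obs <- x[!is.na(x)]

by_hand <- list(
  n       = length(x_obs),                    # number of non-missing values
  mean    = mean(x_obs),                      # arithmetic mean
  sd      = sd(x_obs),                        # standard deviation
  median  = median(x_obs),                    # 50th percentile (Q2)
  trimmed = mean(x_obs, trim = 0.1),          # mean without the top and bottom 10%
  mad     = mad(x_obs),                       # stats::mad() rescales by 1.4826 by default
  min     = min(x_obs),                       # lowest value
  max     = max(x_obs),                       # highest value
  range   = max(x_obs) - min(x_obs),          # max - min
  se      = sd(x_obs) / sqrt(length(x_obs))   # standard error of the mean
)

str(by_hand)
```

Notice how the single large value (42) pulls the mean well above both the median and the trimmed mean, which is why the robust statistics are reported alongside it.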

3.1 All Variables in a Dataset

# A tibble: 9 x 13
   vars     n   mean     sd median trimmed   mad   min   max range   skew
  <int> <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1    25  13     7.36     13    13     8.90     1   25    24   0    
2     2    25   1.44  0.507     1     1.43  0        1    2     1   0.227
3     3    25  59.6  12.9      60    60.0  11.9     27   86    59  -0.307
4     4    25 178.   32.0     173.  177.   21.1    124  261.  137.  0.730
5     5    25   2.88  1.24      2     2.81  1.48     1    5     4   0.726
6     6    25   6.52  1.53      6     6.33  0        4   12     8   1.80 
7     7    25   8.28  2.54      8     8.10  2.97     4   16    12   1.01 
8     8    25  10.4   3.47     10    10.2   2.97     6   17    11   0.487
9     9    23   9.48  3.49      9     9.21  2.97     3   19    16   0.770
# ... with 2 more variables: kurtosis <dbl>, se <dbl>

NOTE The names of categorical variables (factors) are followed by an asterisk (*). The asterisk indicates that the summary statistics in that row should not be interpreted, because the variable is not continuous or measured on an interval scale.
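
As a hedged illustration of the asterisk, the sketch below runs describe() on R's built-in iris data, used here only as a stand-in because it contains one factor (Species). It is not the dataset summarized above, and all_stats is an arbitrary name. The psych package is assumed to be installed.

```r
library(psych)

# iris has four numeric measurements and one factor, Species
all_stats <- psych::describe(iris)

# The full table; the factor's row is labelled "Species*"
all_stats

# The variable names are stored as row names
rownames(all_stats)
```

The statistics in the Species* row are computed from the factor's underlying integer codes, so they say nothing meaningful about the categories themselves.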

3.2 A Subset of Variables in a Dataset

It is better to avoid calculating summary statistics for categorical variables in the first place by restricting the dataset to its continuous variables before calling describe().

To do so, add a dplyr::select(var1, var2, ..., var12) step that keeps only the variables of interest.
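
A minimal sketch of this two-step pattern follows. It again uses the built-in iris data as a stand-in rather than the dataset summarized in this chapter, so the variable names and the object name subset_stats are illustrative only. Both dplyr and psych are assumed to be installed.

```r
library(dplyr)
library(psych)

subset_stats <- iris %>%
  # Keep only the continuous variables; the factor Species is left out
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  # Summarize just the selected variables
  psych::describe()

subset_stats
```

Because the factor never reaches describe(), no row name carries an asterisk and every row in the result can be interpreted.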

# A tibble: 6 x 13
   vars     n   mean    sd median trimmed   mad   min   max range   skew
  <int> <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1    25  59.6  12.9     60    60.0  11.9     27   86    59  -0.307
2     2    25 178.   32.0    173.  177.   21.1    124  261.  137.  0.730
3     3    25   6.52  1.53     6     6.33  0        4   12     8   1.80 
4     4    25   8.28  2.54     8     8.10  2.97     4   16    12   1.01 
5     5    25  10.4   3.47    10    10.2   2.97     6   17    11   0.487
6     6    23   9.48  3.49     9     9.21  2.97     3   19    16   0.770
# ... with 2 more variables: kurtosis <dbl>, se <dbl>