8 Boxplots

Using the ggplot2::geom_boxplot() function

For boxplots, you must specify the variable of interest within the aesthetics as the y variable: ggplot(aes(y = continuous_var)) before adding the geom_boxplot() layer (Wickham et al. 2019).

Reminder: Steps BEFORE the ggplot() are combined with pipes %>%, whereas layers of the plot are combined with the addition symbol +.

8.1 Single continuous variable

If you only want to produce a single boxplot, then the aesthetics must include some quoted text as the x variable, such as x = "text". That text becomes the label under the box.
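A minimal sketch of a single boxplot. The data frame name toy and the label "Full Sample" are assumptions made for illustration; the totalcin values are the first ten rows printed later in this chapter.

```r
library(dplyr)    # provides the %>% pipe
library(ggplot2)

# Hypothetical stand-in data: only the baseline oral condition is needed here
toy <- data.frame(totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6))

p_single <- toy %>%                                # steps BEFORE ggplot() use %>%
  ggplot(aes(x = "Full Sample", y = totalcin)) +   # layers of the plot use +
  geom_boxplot()

p_single
```

The quoted text is arbitrary; any short phrase that describes the sample will do.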

8.2 Single Box -for- a Subset

8.2.1 One Requirement

Whereas the dplyr::select() function specifies which VARIABLES to reduce down to, the dplyr::filter() function specifies which ROWS or PARTICIPANTS to reduce down to.

When using a dplyr::filter() step, make sure to change the x = "text", too, so that the label describes the subset.
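A minimal sketch of a boxplot for a subset defined by one requirement. The data frame toy, the treatment variable name trt, and its values are assumptions made for illustration; only totalcin comes from the data printed later in this chapter.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: the trt labels are invented for illustration
toy <- data.frame(
  trt      = rep(c("Placebo", "Aloe"), times = 5),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6)
)

aloe_only <- toy %>%
  select(trt, totalcin) %>%   # select() keeps VARIABLES (columns)
  filter(trt == "Aloe")       # filter() keeps ROWS (participants)

p_aloe <- aloe_only %>%
  ggplot(aes(x = "Aloe Group", y = totalcin)) +   # label now describes the subset
  geom_boxplot()

p_aloe
```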

8.2.2 Two Requirements

When testing for a match to a value (for equality) use == instead of = in the dplyr::filter() step.

Use the & symbol to require multiple conditions for the subset, but only include one quoted text phrase for x.
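A minimal sketch of a subset that must meet two requirements at once. The data frame toy and the trt and stage values are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: trt and stage values are invented for illustration
toy <- data.frame(
  trt      = rep(c("Placebo", "Aloe"), times = 5),
  stage    = c("1", "1", "2", "1", "2", "1", "4", "2", "3", "4"),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6)
)

aloe_stage1 <- toy %>%
  filter(trt == "Aloe" & stage == "1")   # == tests equality; & requires BOTH conditions

p_aloe_stage1 <- aloe_stage1 %>%
  ggplot(aes(x = "Aloe, Stage 1", y = totalcin)) +   # still just one quoted phrase
  geom_boxplot()

p_aloe_stage1
```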

8.2.3 A Requirement Specified with a List

A helpful operator is %in%, which tests whether each element before it is included in the list of elements that comes after it.
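A minimal sketch of %in% on its own and inside a dplyr::filter() step. The data frame toy and the stage values are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# %in% returns one TRUE/FALSE per element on its left-hand side
c("1", "3") %in% c("1", "2", "4")   # TRUE FALSE

# Hypothetical stand-in data: stage values are invented for illustration
toy <- data.frame(
  stage    = c("1", "1", "2", "1", "2", "1", "4", "2", "3", "4"),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6)
)

stages_124 <- toy %>%
  filter(stage %in% c("1", "2", "4"))   # keep rows whose stage is in the list

p_stages_124 <- stages_124 %>%
  ggplot(aes(x = "Stages 1, 2, or 4", y = totalcin)) +
  geom_boxplot()

p_stages_124
```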

8.3 Multiple Boxes -by- a Factor

There are three ways to include a categorical variable to break the sample into groups. You may specify the factor with:

  1. fill: the variable is denoted with different colors filling the boxes
  2. x: the horizontal axis is marked for all levels
  3. facet_grid(): a panel is created for each level
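The three options listed above can be sketched side by side. The data frame toy, the variable name trt, and its values are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: the trt labels are invented for illustration
toy <- data.frame(
  trt      = rep(c("Placebo", "Aloe"), times = 5),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6)
)

# 1. fill: one box per level, distinguished by fill color
p_fill <- toy %>%
  ggplot(aes(x = "Full Sample", y = totalcin, fill = trt)) +
  geom_boxplot()

# 2. x: one box per level, each marked on the horizontal axis
p_x <- toy %>%
  ggplot(aes(x = trt, y = totalcin)) +
  geom_boxplot()

# 3. facet_grid(): one panel per level, placed side by side
p_facet <- toy %>%
  ggplot(aes(x = "Full Sample", y = totalcin)) +
  geom_boxplot() +
  facet_grid(. ~ trt)

p_fill
p_x
p_facet
```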

8.4 Multiple Boxes -by- 2 Factors

You may combine any of the three previous specifications, using one per categorical (factor) variable.
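A minimal sketch of two possible pairings, assuming a made-up data frame toy with invented trt and stage values. Any other pairing of fill, x, and facet_grid() follows the same pattern.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: trt and stage values are invented for illustration
toy <- data.frame(
  trt      = rep(c("Placebo", "Aloe"), times = 5),
  stage    = c("1", "1", "2", "1", "2", "1", "4", "2", "3", "4"),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6)
)

# stage on the horizontal axis, trt as the fill color
p_x_fill <- toy %>%
  ggplot(aes(x = stage, y = totalcin, fill = trt)) +
  geom_boxplot()

# trt on the horizontal axis, one panel per stage
p_x_facet <- toy %>%
  ggplot(aes(x = trt, y = totalcin)) +
  geom_boxplot() +
  facet_grid(. ~ stage)

p_x_fill
p_x_facet
```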

8.5 Multiple Boxes -for- a Subset AND -by- 2 Factors

The layers may be combined to create more complicated plots.

Here is a plot that includes only participants who were in stage 1, 2, or 4 (n > 1 per stage) and compares the final oral condition within each stage between participants receiving the Aloe treatment and those receiving the placebo.
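A plot of this kind can be sketched as follows. The data frame toy and the trt and stage values are assumptions made for illustration, and totalcw6 (week 6) is taken to be the final oral condition.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: trt and stage values are invented for illustration
toy <- data.frame(
  trt      = rep(c("Placebo", "Aloe"), times = 5),
  stage    = c("1", "1", "2", "1", "2", "1", "4", "2", "3", "4"),
  totalcw6 = c(7, 10, 9, 19, 3, 13, 9, 7, 11, 14)
)

p_combo <- toy %>%
  filter(stage %in% c("1", "2", "4")) %>%              # subset: drop the sparse stage
  ggplot(aes(x = stage, y = totalcw6, fill = trt)) +   # two factors: x and fill
  geom_boxplot()

p_combo
```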

8.6 Multiple Boxes -for- Repeated Measurements

The ordinal data is in what is called “WIDE” format, with the repeated measurements as separate variables, sitting side-by-side. Each participant’s data is contained in a single line (25 lines here).

To sort a dataset, use the dplyr::arrange() function and include the variable or variables you wish to sort by in the parentheses.

# A tibble: 25 x 5
   id    totalcin totalcw2 totalcw4 totalcw6
   <fct>    <dbl>    <dbl>    <dbl>    <dbl>
 1 1            6        6        6        7
 2 2            7       16        9       10
 3 5            9        6       10        9
 4 6            7        9       17       19
 5 9            6        7        9        3
 6 11           6        7       16       13
 7 12           6       10       11        9
 8 14           4        6        8        7
 9 15           6        6        6       11
10 16           6       11       11       14
# ... with 15 more rows
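The sorting step described above can be sketched on three of the printed rows, entered out of order on purpose. The data frame name wide is an assumption made for illustration.

```r
library(dplyr)

# Three rows from the printed data, deliberately entered out of order
wide <- data.frame(
  id       = factor(c(5, 1, 2)),
  totalcin = c(9, 6, 7),
  totalcw2 = c(6, 6, 16),
  totalcw4 = c(10, 6, 9),
  totalcw6 = c(9, 7, 10)
)

wide_sorted <- wide %>%
  arrange(id)   # sort rows by id; add more variables to break ties

wide_sorted
```

Wrapping a variable in desc(), as in arrange(desc(totalcin)), sorts from largest to smallest instead.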

Some data manipulation is needed to “stack” the repeated measurement variables (baseline, week 2, week 4, week 6) into a single variable we will call value. We also need another new variable that denotes the measurement time, which we will call time. The resulting dataset is said to be in “LONG” format. Now there will be one line for each observation time on each participant (4 x 25 = 100 lines here).

This is done with the tidyr::gather(key = "new_time_var", value = "new_value_var", old_var_1, old_var_2, ...) function.

Ignore the message that says “attributes are not identical across measure variables; they will be dropped”.

# A tibble: 100 x 3
   id    time     value
   <fct> <chr>    <dbl>
 1 1     totalcin     6
 2 1     totalcw2     6
 3 1     totalcw4     6
 4 1     totalcw6     7
 5 2     totalcin     7
 6 2     totalcw2    16
 7 2     totalcw4     9
 8 2     totalcw6    10
 9 5     totalcin     9
10 5     totalcw2     6
# ... with 90 more rows
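The reshaping from WIDE to LONG can be sketched on the first three printed participants. The data frame names wide and long are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# First three participants from the printed WIDE data
wide <- data.frame(
  id       = factor(c(1, 2, 5)),
  totalcin = c(6, 7, 9),
  totalcw2 = c(6, 16, 6),
  totalcw4 = c(6, 9, 10),
  totalcw6 = c(7, 10, 9)
)

long <- wide %>%
  gather(key = "time", value = "value",              # names of the two NEW variables
         totalcin, totalcw2, totalcw4, totalcw6) %>% # OLD variables to stack
  arrange(id, time)                                  # group each participant's rows together

long   # 3 participants x 4 time points = 12 rows
```

In newer tidyr releases, pivot_longer() is the recommended replacement for gather(), but gather() still works.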

Once the data is in this format, you can create box plots for each time point.

Ignore the additional message that says “Removed 2 rows containing non-finite values (stat_boxplot)”. This is just alerting you to the fact that two people are missing their week 6 oral condition values.
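A minimal sketch of the boxplots for each time point, using only the ten participants printed earlier. The data frame name wide is an assumption made for illustration. None of these ten rows has a missing value, so the message about removed rows does not appear here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# First ten participants from the printed WIDE data
wide <- data.frame(
  id       = factor(c(1, 2, 5, 6, 9, 11, 12, 14, 15, 16)),
  totalcin = c(6, 7, 9, 7, 6, 6, 6, 4, 6, 6),
  totalcw2 = c(6, 16, 6, 9, 7, 7, 10, 6, 6, 11),
  totalcw4 = c(6, 9, 10, 17, 9, 16, 11, 8, 6, 11),
  totalcw6 = c(7, 10, 9, 19, 3, 13, 9, 7, 11, 14)
)

p_time <- wide %>%
  gather(key = "time", value = "value",
         totalcin, totalcw2, totalcw4, totalcw6) %>%   # WIDE to LONG
  ggplot(aes(x = time, y = value)) +                   # one box per time point
  geom_boxplot()

p_time
```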