17 GEE, Count Outcome: Epilepsy

17.1 Packages

17.1.1 CRAN

library(tidyverse)    # all things tidy
library(pander)       # nice looking genderal tabulations
library(furniture)    # nice table1() descriptives
library(texreg)       # Convert Regression Output to LaTeX or HTML Tables
library(psych)        # contains some useful functions, like headTail

library(lme4)         # Linear, generalized linear, & nonlinear mixed models

library(corrplot)     # Vizualize correlation matrix
library(gee)          # Genderalized Estimation Equation Solver
library(geepack)      # Genderalized Estimation Equation Package 
library(MuMIn)        # Multi-Model Inference (caluclate QIC)

library(HSAUR)        # package with the dataset

17.1.2 GitHub

Helper extract functions for exponentiating parameters form generalized regression models within a texreg table of model parameters.

# remotes::install_github("sarbearschwartz/texreghelpr") # first time

library(texreghelpr)

17.2 Background

This dataset was used as an example in Chapter 11 of “A Handbook of Statistical Analysis using R” by Brian S. Everitt and Torsten Hothorn. The authors include this data set in their HSAUR package on CRAN.

The Background

In this clinical trial, 59 patients suffering from epilepsy were randomized to groups receiving either the anti-epileptic drug “Progabide”” or a placebo in addition to standard chemotherapy. The numbers of seizures suffered in each of four, two-week periods were recorded for each patient along with a baseline seizure count for the 8 weeks prior to being randomized to treatment and age.

The Research Question

The main question of interest is whether taking progabide reduced the number of epileptic seizures compared with placebo.

The Data

  • Indicators

    • subject the patient ID, a factor with levels 1 to 59
    • period treatment period, an ordered factor with levels 1 to 4
  • Outcome or dependent variable

    +seizure.rate the number of seizures (2-weeks)

  • Main predictor or independent variable of interest

    • treatment the treatment group, a factor with levels placebo and Progabide
  • Time-invariant Covariates

    • age the age of the patient
    • base the number of seizures before the trial (8 weeks)

17.2.1 Read in the data

data("epilepsy", package = "HSAUR")

Problem: The outcome (seizure.rate) were counts over a TWO-week period and we would like to interpret the results in terms of effects on the WEEKLY rate.

  • If we divide the values by TWO to get weekly rates, the outcome might be a decimal number
  • The Poisson distribution may only be used for whole numbers (not deciamls)

Solution: We need to include an offset term in the model that indicates the LOG DURATION of each period.

  • Every observation period is exactly 2 weeks in this experiment
  • Create a variable in the original dataset that is equal to the LOG DURATION (per = log(2))
  • To the formula for the glm() or gee(), add: + offset(per)

17.2.2 Long Format

data_long <- epilepsy %>% 
  dplyr::select(subject, age, treatment, base, 
                period, seizure.rate) %>% 
  dplyr::mutate(per = log(2)) %>%                  # new variable to use with the offset
  dplyr::mutate(base_wk = base/8)
  
str(data_long)
'data.frame':   236 obs. of  8 variables:
 $ subject     : Factor w/ 59 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ age         : int  31 31 31 31 30 30 30 30 25 25 ...
 $ treatment   : Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
 $ base        : int  11 11 11 11 11 11 11 11 6 6 ...
 $ period      : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 2 3 4 1 2 3 4 1 2 ...
 $ seizure.rate: int  5 3 3 3 3 5 3 3 2 4 ...
 $ per         : num  0.693 0.693 0.693 0.693 0.693 ...
 $ base_wk     : num  1.38 1.38 1.38 1.38 1.38 ...
psych::headTail(data_long, top = 10, bottom = 6)
    subject age treatment base period seizure.rate  per base_wk
1         1  31   placebo   11      1            5 0.69    1.38
110       1  31   placebo   11      2            3 0.69    1.38
112       1  31   placebo   11      3            3 0.69    1.38
114       1  31   placebo   11      4            3 0.69    1.38
2         2  30   placebo   11      1            3 0.69    1.38
210       2  30   placebo   11      2            5 0.69    1.38
212       2  30   placebo   11      3            3 0.69    1.38
214       2  30   placebo   11      4            3 0.69    1.38
3         3  25   placebo    6      1            2 0.69    0.75
310       3  25   placebo    6      2            4 0.69    0.75
...    <NA> ...      <NA>  ...   <NA>          ...  ...     ...
582      58  36 Progabide   13      3            0 0.69    1.62
583      58  36 Progabide   13      4            0 0.69    1.62
59       59  37 Progabide   12      1            1 0.69     1.5
591      59  37 Progabide   12      2            4 0.69     1.5
592      59  37 Progabide   12      3            3 0.69     1.5
593      59  37 Progabide   12      4            2 0.69     1.5

17.2.3 Wide Format

data_wide <- data_long %>% 
  tidyr::spread(key = period,
                value = seizure.rate,
                sep = "_") %>% 
  dplyr::arrange(subject)

str(data_wide)
'data.frame':   59 obs. of  10 variables:
 $ subject  : Factor w/ 59 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ age      : int  31 30 25 36 22 29 31 42 37 28 ...
 $ treatment: Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
 $ base     : int  11 11 6 8 66 27 12 52 23 10 ...
 $ per      : num  0.693 0.693 0.693 0.693 0.693 ...
 $ base_wk  : num  1.38 1.38 0.75 1 8.25 ...
 $ period_1 : int  5 3 2 4 7 5 6 40 5 14 ...
 $ period_2 : int  3 5 4 4 18 2 4 20 6 13 ...
 $ period_3 : int  3 3 0 1 9 8 0 23 6 6 ...
 $ period_4 : int  3 3 5 4 21 7 2 12 5 0 ...
psych::headTail(data_wide)
    subject age treatment base  per base_wk period_1 period_2 period_3 period_4
1         1  31   placebo   11 0.69    1.38        5        3        3        3
2         2  30   placebo   11 0.69    1.38        3        5        3        3
3         3  25   placebo    6 0.69    0.75        2        4        0        5
4         4  36   placebo    8 0.69       1        4        4        1        4
...    <NA> ...      <NA>  ...  ...     ...      ...      ...      ...      ...
56       56  26 Progabide   22 0.69    2.75        1       23       19        8
57       57  21 Progabide   25 0.69    3.12        2        3        0        1
58       58  36 Progabide   13 0.69    1.62        0        0        0        0
59       59  37 Progabide   12 0.69     1.5        1        4        3        2

17.3 Exploratory Data Analysis

17.3.1 Summarize

17.3.1.1 Demographics and Baseline

data_wide %>% 
  dplyr::group_by(treatment) %>% 
  furniture::table1(age, base, base_wk,
                    total = TRUE,
                    test = TRUE,
                    digits = 2,
                    type = "full",
                    output = "markdown")
Total placebo Progabide Test P-Value
n = 59 n = 28 n = 31
age T-Test: 0.76 0.449
28.34 (6.30) 29.00 (6.00) 27.74 (6.60)
base T-Test: -0.12 0.907
31.22 (26.88) 30.79 (26.10) 31.61 (27.98)
base_wk T-Test: -0.12 0.907
3.90 (3.36) 3.85 (3.26) 3.95 (3.50)

17.3.1.2 Outcome Across Time

Note: The Poisson distribution specifies that the MEAN = VARIANCE

In this dataset, the variance is much larger than the mean, at all time points for both groups. This is evidence of overdispersion and suggest the scale parameter should be greater than one.

data_long %>% 
  dplyr::group_by(treatment, period) %>% 
  dplyr::summarise(N = n(),
                   M = mean(seizure.rate),
                   VAR = var(seizure.rate),
                   SD = sd(seizure.rate)) %>% 
  pander::pander()

summarise() has grouped output by ‘treatment’. You can override using the .groups argument.

treatment period N M VAR SD
placebo 1 28 9.4 103 10.1
placebo 2 28 8.3 67 8.2
placebo 3 28 8.8 215 14.7
placebo 4 28 8.0 58 7.6
Progabide 1 31 8.6 333 18.2
Progabide 2 31 8.4 141 11.9
Progabide 3 31 8.1 193 13.9
Progabide 4 31 6.7 127 11.3

17.3.1.3 Correlation Across Time

Raw Scale

data_long %>% 
  dplyr::select(subject, period, seizure.rate ) %>% 
  tidyr::spread(key = period,
                value = seizure.rate ) %>% 
  dplyr::select(-subject) %>% 
  cor() %>% 
  corrplot::corrplot.mixed()

Log Scale

data_long %>% 
  dplyr::mutate(rate_wk = log(seizure.rate + 1)) %>% 
  dplyr::select(subject, period, rate_wk) %>% 
  tidyr::spread(key = period,
                value = rate_wk) %>% 
  dplyr::select(-subject) %>% 
  cor() %>% 
  corrplot::corrplot.mixed()

17.3.2 Visualize

17.3.2.1 Oucome on the Raw Scale

There appear to be quite a few extreme values or outliers, particularly for the Progabide group during period one.

Since the outcome is truely a COUNT, we will model it with a Poisson distribution combined with a LOG link.

data_long %>% 
  ggplot(aes(x = period,
             y = seizure.rate)) +
  geom_boxplot() +
  theme_bw() +
  facet_grid(.~ treatment)

To investigate possible outliers, we should transform the outcome with the log function first.

Note: Since some participants reported no seizures during a two week period and the log(0) is unndefinded, we must add some amount to the values before transforming. Here we have chosen to add the value of \(1\).

data_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_boxplot() +
  
  theme_bw() +
  facet_grid(.~ treatment)

data_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_line(aes(group = subject)) +
  theme_bw() +
  facet_grid(.~ treatment)

data_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_smooth(aes(group = subject),
              method = "lm",
              se = FALSE) +
  geom_smooth(aes(group = 1),
              color = "red",
              size = 1.5,
              method = "lm",
              se = FALSE) +
  theme_bw() +
  facet_grid(.~ treatment)

17.4 Poisson Regression (GLM)

Note: THIS IS NEVER APPROPRIATE TO CONDUCT A GLM ON REPEATED MEASURES. THIS IS DONE FOR ILLUSTRATION PURPOSES ONLY!

17.4.1 Fit the model

fit_glm <- glm(seizure.rate ~ base + age + treatment + offset(per),
               data = data_long,
               family = poisson(link = "log"))

summary(fit_glm)

Call:
glm(formula = seizure.rate ~ base + age + treatment + offset(per), 
    family = poisson(link = "log"), data = data_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4360  -1.4034  -0.5029   0.4842  12.3223  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.1306156  0.1356191  -0.963   0.3355    
base                0.0226517  0.0005093  44.476  < 2e-16 ***
age                 0.0227401  0.0040240   5.651 1.59e-08 ***
treatmentProgabide -0.1527009  0.0478051  -3.194   0.0014 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2521.75  on 235  degrees of freedom
Residual deviance:  958.46  on 232  degrees of freedom
AIC: 1732.5

Number of Fisher Scoring iterations: 5

17.4.2 Tabulate the Model Parameters

texreg::knitreg(list(fit_glm,
                     texreghelpr::extract_glm_exp(fit_glm,
                                                  include.any = FALSE)),
                custom.model.names = c("b (SE)", "IRR [95 CI]"),
                caption = "Poisson Generalized Linear Regression (GLM)",
                single.row = TRUE,
                digits = 3,
                ci.test = 1)
Poisson Generalized Linear Regression (GLM)
  b (SE) IRR [95 CI]
(Intercept) -0.131 (0.136) 0.878 [0.672; 1.144]
base 0.023 (0.001)*** 1.023 [1.022; 1.024]*
age 0.023 (0.004)*** 1.023 [1.015; 1.031]*
treatmentProgabide -0.153 (0.048)** 0.858 [0.782; 0.943]*
AIC 1732.459  
BIC 1746.314  
Log Likelihood -862.229  
Deviance 958.464  
Num. obs. 236  
***p < 0.001; **p < 0.01; *p < 0.05 (or Null hypothesis value outside the confidence interval).

17.5 Generalized Estimating Equations (GEE)

17.5.1 Match Poisson Regresssion (GLM)

  • correlation structure: independence
  • scale parameter = \(1\)
fit_gee_ind_s1 <- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
                           data = data_long,
                           family = poisson(link = "log"),
                           id = subject,
                           corstr = "independence",
                           scale.fix = TRUE,
                           scale.value = 1)
       (Intercept)               base                age treatmentProgabide 
       -0.13061561         0.02265174         0.02274013        -0.15270095 
summary(fit_gee_ind_s1)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logarithm 
 Variance to Mean Relation: Poisson 
 Correlation Structure:     Independent 

Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per), 
    id = subject, data = data_long, family = poisson(link = "log"), 
    corstr = "independence", scale.fix = TRUE, scale.value = 1)

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-4.9195387  0.1808059  1.7073405  4.8850644 69.9658560 


Coefficients:
                      Estimate   Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)        -0.13061561 0.1356191185 -0.9631062 0.365148155 -0.3577058
base                0.02265174 0.0005093011 44.4761250 0.001235664 18.3316325
age                 0.02274013 0.0040239970  5.6511312 0.011580405  1.9636736
treatmentProgabide -0.15270095 0.0478051054 -3.1942393 0.171108915 -0.8924196

Estimated Scale Parameter:  1
Number of Iterations:  1

Working Correlation
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
  • The estimates and the naive standard errors match the GLM exactly.

  • The naive SE’s are much smaller (half) than the robust (sandwich) SE’s, suggesting a poor fit.

17.5.2 Change Correlation Sturucture

  • correlation structure: exchangeable
  • scale parameter = \(1\)
fit_gee_exc_s1 <- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
                           data = data_long,
                           family = poisson(link = "log"),
                           id = subject,
                           corstr = "exchangeable",
                           scale.fix = TRUE,
                           scale.value = 1)
       (Intercept)               base                age treatmentProgabide 
       -0.13061561         0.02265174         0.02274013        -0.15270095 
summary(fit_gee_exc_s1)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logarithm 
 Variance to Mean Relation: Poisson 
 Correlation Structure:     Exchangeable 

Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per), 
    id = subject, data = data_long, family = poisson(link = "log"), 
    corstr = "exchangeable", scale.fix = TRUE, scale.value = 1)

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-4.9195387  0.1808059  1.7073405  4.8850644 69.9658560 


Coefficients:
                      Estimate   Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)        -0.13061561 0.2004416507 -0.6516391 0.365148155 -0.3577058
base                0.02265174 0.0007527342 30.0926122 0.001235664 18.3316325
age                 0.02274013 0.0059473665  3.8235638 0.011580405  1.9636736
treatmentProgabide -0.15270095 0.0706547450 -2.1612270 0.171108915 -0.8924196

Estimated Scale Parameter:  1
Number of Iterations:  1

Working Correlation
          [,1]      [,2]      [,3]      [,4]
[1,] 1.0000000 0.3948033 0.3948033 0.3948033
[2,] 0.3948033 1.0000000 0.3948033 0.3948033
[3,] 0.3948033 0.3948033 1.0000000 0.3948033
[4,] 0.3948033 0.3948033 0.3948033 1.0000000
  • Although the estimated beta parameters are not much different, the naive SE’s are some closer to the robust SE’s.

17.5.2.1 Interpretation

For determining significance, no p-values are given, however the p-value will be < .05 when the z-score is > 1.96.

“For this example, the estimates of standard errors under the independence are about half of the corresponding robust estimates, and the the situation improves only a little when an exchangeable structure is fitted. Using naive standard errors leads, in particular, to a highly significant treatment effect which disappears when the robust estimates are used.”

17.5.3 Estimate the Additional Scale Parameter

“The problem with the GEE approach here, using either the independence or exchangeable correlation structure lies in constraining the scale parameter to be one. For these data there is overdispersion (variance is larger than the mean value) which has to be accommodated by allowing this parameter to be freely estimated.”

  • correlation structure: exchangeable
  • scale parameter = freely estimated
fit_gee_exc_sf <- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
                           data = data_long,
                           family = poisson(link = "log"),
                           id = subject,
                           corstr = "exchangeable",
                           scale.fix = FALSE)
       (Intercept)               base                age treatmentProgabide 
       -0.13061561         0.02265174         0.02274013        -0.15270095 
summary(fit_gee_exc_sf)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logarithm 
 Variance to Mean Relation: Poisson 
 Correlation Structure:     Exchangeable 

Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per), 
    id = subject, data = data_long, family = poisson(link = "log"), 
    corstr = "exchangeable", scale.fix = FALSE)

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-4.9195387  0.1808059  1.7073405  4.8850644 69.9658560 


Coefficients:
                      Estimate Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)        -0.13061561 0.45219954 -0.2888451 0.365148155 -0.3577058
base                0.02265174 0.00169818 13.3388301 0.001235664 18.3316325
age                 0.02274013 0.01341735  1.6948302 0.011580405  1.9636736
treatmentProgabide -0.15270095 0.15939823 -0.9579840 0.171108915 -0.8924196

Estimated Scale Parameter:  5.089608
Number of Iterations:  1

Working Correlation
          [,1]      [,2]      [,3]      [,4]
[1,] 1.0000000 0.3948033 0.3948033 0.3948033
[2,] 0.3948033 1.0000000 0.3948033 0.3948033
[3,] 0.3948033 0.3948033 1.0000000 0.3948033
[4,] 0.3948033 0.3948033 0.3948033 1.0000000
  • The naive SE’s are much closer inline with the robust SE’s.

  • The scale parameter is estimated to be much larger than \(1\).

“When this is done (scale parameter is freely estimated) it gives the last set of results shown above. THe estimate of \(\phi\) is 5.09 and the naive and robust estimates of the standard errors are now very similar.”

17.5.4 Compare Models

17.5.4.1 Raw Estimates (logit scale)

texreg::knitreg(list(fit_glm, 
                     fit_gee_ind_s1, 
                     fit_gee_exc_s1, 
                     fit_gee_exc_sf),
                custom.model.names = c("GLM",
                                       "GEE-Indep(1)",
                                       "GEE-Exchg(1)",
                                       "GEE-Exchg(free)"),
                caption = "Estimates on Logit Scale",
                digits = 3)
Estimates on Logit Scale
  GLM GEE-Indep(1) GEE-Exchg(1) GEE-Exchg(free)
(Intercept) -0.131 -0.131 -0.131 -0.131
  (0.136) (0.365) (0.365) (0.365)
base 0.023*** 0.023*** 0.023*** 0.023***
  (0.001) (0.001) (0.001) (0.001)
age 0.023*** 0.023* 0.023* 0.023*
  (0.004) (0.012) (0.012) (0.012)
treatmentProgabide -0.153** -0.153 -0.153 -0.153
  (0.048) (0.171) (0.171) (0.171)
AIC 1732.459      
BIC 1746.314      
Log Likelihood -862.229      
Deviance 958.464      
Num. obs. 236 236 236 236
Scale   1.000 1.000 5.090
***p < 0.001; **p < 0.01; *p < 0.05

17.5.4.2 Exponentiate the Estimates (odds ratio scale)

texreg::knitreg(list(extract_glm_exp(fit_glm), 
                     extract_gee_exp(fit_gee_ind_s1), 
                     extract_gee_exp(fit_gee_exc_s1), 
                     extract_gee_exp(fit_gee_exc_sf)),
                custom.model.names = c("GLM",
                                       "GEE-Indep(1)",
                                       "GEE-Exchg(1)",
                                       "GEE-Exchg(free)"),
                digits = 3,
                caption = "Estimates on Odds-Ratio Scale",
                caption.above = TRUE,
                ci.test = 1)
Estimates on Odds-Ratio Scale
  GLM GEE-Indep(1) GEE-Exchg(1) GEE-Exchg(free)
(Intercept) 0.878 0.878 0.878 0.878
  [0.672; 1.144] [0.429; 1.795] [0.429; 1.795] [0.429; 1.795]
base 1.023* 1.023* 1.023* 1.023*
  [1.022; 1.024] [1.020; 1.025] [1.020; 1.025] [1.020; 1.025]
age 1.023* 1.023* 1.023* 1.023*
  [1.015; 1.031] [1.000; 1.046] [1.000; 1.046] [1.000; 1.046]
treatmentProgabide 0.858* 0.858 0.858 0.858
  [0.782; 0.943] [0.614; 1.200] [0.614; 1.200] [0.614; 1.200]
Dispersion   1.000 1.000 5.090
* Null hypothesis value outside the confidence interval.

17.6 Conculsion

The Research Question

The main question of interest is whether taking progabide reduced the number of epileptic seizures compared with placebo.

The Conclusion

There is no evidence that Progabide effects the weekly rate of epileptic seizures differently than placebo.