17 GEE, Count Outcome: Epilepsy
17.1 Packages
17.1.1 CRAN
library(tidyverse) # all things tidy
library(pander) # nice looking genderal tabulations
library(furniture) # nice table1() descriptives
library(texreg) # Convert Regression Output to LaTeX or HTML Tables
library(psych) # contains some useful functions, like headTail
library(lme4) # Linear, generalized linear, & nonlinear mixed models
library(corrplot) # Vizualize correlation matrix
library(gee) # Genderalized Estimation Equation Solver
library(geepack) # Genderalized Estimation Equation Package
library(MuMIn) # Multi-Model Inference (caluclate QIC)
library(HSAUR) # package with the dataset
17.2 Background
This dataset was used as an example in Chapter 11 of “A Handbook of Statistical Analysis using R” by Brian S. Everitt and Torsten Hothorn. The authors include this data set in their HSAUR package on
CRAN
.
The Background
In this clinical trial, 59 patients suffering from epilepsy were randomized to groups receiving either the anti-epileptic drug “Progabide”” or a placebo in addition to standard chemotherapy. The numbers of seizures suffered in each of four, two-week periods were recorded for each patient along with a baseline seizure count for the 8 weeks prior to being randomized to treatment and age.
The Research Question
The main question of interest is whether taking progabide reduced the number of epileptic seizures compared with placebo.
The Data
Indicators
subject
the patient ID, a factor with levels 1 to 59
period
treatment period, an ordered factor with levels 1 to 4
Outcome or dependent variable
+
seizure.rate
the number of seizures (2-weeks)Main predictor or independent variable of interest
treatment
the treatment group, a factor with levels placebo and Progabide
Time-invariant Covariates
age
the age of the patient
base
the number of seizures before the trial (8 weeks)
17.2.1 Read in the data
data("epilepsy", package = "HSAUR")
Problem: The outcome (seizure.rate
) were counts over a TWO-week period and we would like to interpret the results in terms of effects on the WEEKLY rate.
- If we divide the values by TWO to get weekly rates, the outcome might be a decimal number
- The Poisson distribution may only be used for whole numbers (not deciamls)
Solution: We need to include an offset
term in the model that indicates the LOG DURATION of each period.
- Every observation period is exactly 2 weeks in this experiment
- Create a variable in the original dataset that is equal to the LOG DURATION (
per = log(2)
) - To the formula for the
glm()
orgee()
, add:+ offset(per)
17.2.2 Long Format
<- epilepsy %>%
data_long ::select(subject, age, treatment, base,
dplyr%>%
period, seizure.rate) ::mutate(per = log(2)) %>% # new variable to use with the offset
dplyr::mutate(base_wk = base/8)
dplyr
str(data_long)
'data.frame': 236 obs. of 8 variables:
$ subject : Factor w/ 59 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
$ age : int 31 31 31 31 30 30 30 30 25 25 ...
$ treatment : Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
$ base : int 11 11 11 11 11 11 11 11 6 6 ...
$ period : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 2 3 4 1 2 3 4 1 2 ...
$ seizure.rate: int 5 3 3 3 3 5 3 3 2 4 ...
$ per : num 0.693 0.693 0.693 0.693 0.693 ...
$ base_wk : num 1.38 1.38 1.38 1.38 1.38 ...
::headTail(data_long, top = 10, bottom = 6) psych
subject age treatment base period seizure.rate per base_wk
1 1 31 placebo 11 1 5 0.69 1.38
110 1 31 placebo 11 2 3 0.69 1.38
112 1 31 placebo 11 3 3 0.69 1.38
114 1 31 placebo 11 4 3 0.69 1.38
2 2 30 placebo 11 1 3 0.69 1.38
210 2 30 placebo 11 2 5 0.69 1.38
212 2 30 placebo 11 3 3 0.69 1.38
214 2 30 placebo 11 4 3 0.69 1.38
3 3 25 placebo 6 1 2 0.69 0.75
310 3 25 placebo 6 2 4 0.69 0.75
... <NA> ... <NA> ... <NA> ... ... ...
582 58 36 Progabide 13 3 0 0.69 1.62
583 58 36 Progabide 13 4 0 0.69 1.62
59 59 37 Progabide 12 1 1 0.69 1.5
591 59 37 Progabide 12 2 4 0.69 1.5
592 59 37 Progabide 12 3 3 0.69 1.5
593 59 37 Progabide 12 4 2 0.69 1.5
17.2.3 Wide Format
<- data_long %>%
data_wide ::spread(key = period,
tidyrvalue = seizure.rate,
sep = "_") %>%
::arrange(subject)
dplyr
str(data_wide)
'data.frame': 59 obs. of 10 variables:
$ subject : Factor w/ 59 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 31 30 25 36 22 29 31 42 37 28 ...
$ treatment: Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
$ base : int 11 11 6 8 66 27 12 52 23 10 ...
$ per : num 0.693 0.693 0.693 0.693 0.693 ...
$ base_wk : num 1.38 1.38 0.75 1 8.25 ...
$ period_1 : int 5 3 2 4 7 5 6 40 5 14 ...
$ period_2 : int 3 5 4 4 18 2 4 20 6 13 ...
$ period_3 : int 3 3 0 1 9 8 0 23 6 6 ...
$ period_4 : int 3 3 5 4 21 7 2 12 5 0 ...
::headTail(data_wide) psych
subject age treatment base per base_wk period_1 period_2 period_3 period_4
1 1 31 placebo 11 0.69 1.38 5 3 3 3
2 2 30 placebo 11 0.69 1.38 3 5 3 3
3 3 25 placebo 6 0.69 0.75 2 4 0 5
4 4 36 placebo 8 0.69 1 4 4 1 4
... <NA> ... <NA> ... ... ... ... ... ... ...
56 56 26 Progabide 22 0.69 2.75 1 23 19 8
57 57 21 Progabide 25 0.69 3.12 2 3 0 1
58 58 36 Progabide 13 0.69 1.62 0 0 0 0
59 59 37 Progabide 12 0.69 1.5 1 4 3 2
17.3 Exploratory Data Analysis
17.3.1 Summarize
17.3.1.1 Demographics and Baseline
%>%
data_wide ::group_by(treatment) %>%
dplyr::table1(age, base, base_wk,
furnituretotal = TRUE,
test = TRUE,
digits = 2,
type = "full",
output = "markdown")
Total | placebo | Progabide | Test | P-Value | |
---|---|---|---|---|---|
n = 59 | n = 28 | n = 31 | |||
age | T-Test: 0.76 | 0.449 | |||
28.34 (6.30) | 29.00 (6.00) | 27.74 (6.60) | |||
base | T-Test: -0.12 | 0.907 | |||
31.22 (26.88) | 30.79 (26.10) | 31.61 (27.98) | |||
base_wk | T-Test: -0.12 | 0.907 | |||
3.90 (3.36) | 3.85 (3.26) | 3.95 (3.50) |
17.3.1.2 Outcome Across Time
Note: The Poisson distribution specifies that the MEAN = VARIANCE
In this dataset, the variance is much larger than the mean, at all time points for both groups. This is evidence of overdispersion and suggest the scale parameter should be greater than one.
%>%
data_long ::group_by(treatment, period) %>%
dplyr::summarise(N = n(),
dplyrM = mean(seizure.rate),
VAR = var(seizure.rate),
SD = sd(seizure.rate)) %>%
::pander() pander
summarise()
has grouped output by ‘treatment’. You can override using the .groups
argument.
treatment | period | N | M | VAR | SD |
---|---|---|---|---|---|
placebo | 1 | 28 | 9.4 | 103 | 10.1 |
placebo | 2 | 28 | 8.3 | 67 | 8.2 |
placebo | 3 | 28 | 8.8 | 215 | 14.7 |
placebo | 4 | 28 | 8.0 | 58 | 7.6 |
Progabide | 1 | 31 | 8.6 | 333 | 18.2 |
Progabide | 2 | 31 | 8.4 | 141 | 11.9 |
Progabide | 3 | 31 | 8.1 | 193 | 13.9 |
Progabide | 4 | 31 | 6.7 | 127 | 11.3 |
17.3.1.3 Correlation Across Time
Raw Scale
%>%
data_long ::select(subject, period, seizure.rate ) %>%
dplyr::spread(key = period,
tidyrvalue = seizure.rate ) %>%
::select(-subject) %>%
dplyrcor() %>%
::corrplot.mixed() corrplot
Log Scale
%>%
data_long ::mutate(rate_wk = log(seizure.rate + 1)) %>%
dplyr::select(subject, period, rate_wk) %>%
dplyr::spread(key = period,
tidyrvalue = rate_wk) %>%
::select(-subject) %>%
dplyrcor() %>%
::corrplot.mixed() corrplot
17.3.2 Visualize
17.3.2.1 Oucome on the Raw Scale
There appear to be quite a few extreme values or outliers, particularly for the Progabide group during period one.
Since the outcome is truely a COUNT, we will model it with a Poisson distribution combined with a LOG link.
%>%
data_long ggplot(aes(x = period,
y = seizure.rate)) +
geom_boxplot() +
theme_bw() +
facet_grid(.~ treatment)
To investigate possible outliers, we should transform the outcome with the log
function first.
Note: Since some participants reported no seizures during a two week period and the
log(0)
is unndefinded, we must add some amount to the values before transforming. Here we have chosen to add the value of \(1\).
%>%
data_long ggplot(aes(x = period,
y = log(seizure.rate + 1))) +
geom_boxplot() +
theme_bw() +
facet_grid(.~ treatment)
%>%
data_long ggplot(aes(x = period,
y = log(seizure.rate + 1))) +
geom_line(aes(group = subject)) +
theme_bw() +
facet_grid(.~ treatment)
%>%
data_long ggplot(aes(x = period,
y = log(seizure.rate + 1))) +
geom_smooth(aes(group = subject),
method = "lm",
se = FALSE) +
geom_smooth(aes(group = 1),
color = "red",
size = 1.5,
method = "lm",
se = FALSE) +
theme_bw() +
facet_grid(.~ treatment)
17.4 Poisson Regression (GLM)
Note: THIS IS NEVER APPROPRIATE TO CONDUCT A GLM ON REPEATED MEASURES. THIS IS DONE FOR ILLUSTRATION PURPOSES ONLY!
17.4.1 Fit the model
<- glm(seizure.rate ~ base + age + treatment + offset(per),
fit_glm data = data_long,
family = poisson(link = "log"))
summary(fit_glm)
Call:
glm(formula = seizure.rate ~ base + age + treatment + offset(per),
family = poisson(link = "log"), data = data_long)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.4360 -1.4034 -0.5029 0.4842 12.3223
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1306156 0.1356191 -0.963 0.3355
base 0.0226517 0.0005093 44.476 < 2e-16 ***
age 0.0227401 0.0040240 5.651 1.59e-08 ***
treatmentProgabide -0.1527009 0.0478051 -3.194 0.0014 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2521.75 on 235 degrees of freedom
Residual deviance: 958.46 on 232 degrees of freedom
AIC: 1732.5
Number of Fisher Scoring iterations: 5
17.4.2 Tabulate the Model Parameters
::knitreg(list(fit_glm,
texreg::extract_glm_exp(fit_glm,
texreghelprinclude.any = FALSE)),
custom.model.names = c("b (SE)", "IRR [95 CI]"),
caption = "Poisson Generalized Linear Regression (GLM)",
single.row = TRUE,
digits = 3,
ci.test = 1)
b (SE) | IRR [95 CI] | |
---|---|---|
(Intercept) | -0.131 (0.136) | 0.878 [0.672; 1.144] |
base | 0.023 (0.001)*** | 1.023 [1.022; 1.024]* |
age | 0.023 (0.004)*** | 1.023 [1.015; 1.031]* |
treatmentProgabide | -0.153 (0.048)** | 0.858 [0.782; 0.943]* |
AIC | 1732.459 | |
BIC | 1746.314 | |
Log Likelihood | -862.229 | |
Deviance | 958.464 | |
Num. obs. | 236 | |
***p < 0.001; **p < 0.01; *p < 0.05 (or Null hypothesis value outside the confidence interval). |
17.5 Generalized Estimating Equations (GEE)
17.5.1 Match Poisson Regresssion (GLM)
- correlation structure:
independence
- scale parameter = \(1\)
<- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
fit_gee_ind_s1 data = data_long,
family = poisson(link = "log"),
id = subject,
corstr = "independence",
scale.fix = TRUE,
scale.value = 1)
(Intercept) base age treatmentProgabide
-0.13061561 0.02265174 0.02274013 -0.15270095
summary(fit_gee_ind_s1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logarithm
Variance to Mean Relation: Poisson
Correlation Structure: Independent
Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per),
id = subject, data = data_long, family = poisson(link = "log"),
corstr = "independence", scale.fix = TRUE, scale.value = 1)
Summary of Residuals:
Min 1Q Median 3Q Max
-4.9195387 0.1808059 1.7073405 4.8850644 69.9658560
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.13061561 0.1356191185 -0.9631062 0.365148155 -0.3577058
base 0.02265174 0.0005093011 44.4761250 0.001235664 18.3316325
age 0.02274013 0.0040239970 5.6511312 0.011580405 1.9636736
treatmentProgabide -0.15270095 0.0478051054 -3.1942393 0.171108915 -0.8924196
Estimated Scale Parameter: 1
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
The estimates and the naive standard errors match the GLM exactly.
The naive SE’s are much smaller (half) than the robust (sandwich) SE’s, suggesting a poor fit.
17.5.2 Change Correlation Sturucture
- correlation structure:
exchangeable
- scale parameter = \(1\)
<- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
fit_gee_exc_s1 data = data_long,
family = poisson(link = "log"),
id = subject,
corstr = "exchangeable",
scale.fix = TRUE,
scale.value = 1)
(Intercept) base age treatmentProgabide
-0.13061561 0.02265174 0.02274013 -0.15270095
summary(fit_gee_exc_s1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logarithm
Variance to Mean Relation: Poisson
Correlation Structure: Exchangeable
Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per),
id = subject, data = data_long, family = poisson(link = "log"),
corstr = "exchangeable", scale.fix = TRUE, scale.value = 1)
Summary of Residuals:
Min 1Q Median 3Q Max
-4.9195387 0.1808059 1.7073405 4.8850644 69.9658560
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.13061561 0.2004416507 -0.6516391 0.365148155 -0.3577058
base 0.02265174 0.0007527342 30.0926122 0.001235664 18.3316325
age 0.02274013 0.0059473665 3.8235638 0.011580405 1.9636736
treatmentProgabide -0.15270095 0.0706547450 -2.1612270 0.171108915 -0.8924196
Estimated Scale Parameter: 1
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.3948033 0.3948033 0.3948033
[2,] 0.3948033 1.0000000 0.3948033 0.3948033
[3,] 0.3948033 0.3948033 1.0000000 0.3948033
[4,] 0.3948033 0.3948033 0.3948033 1.0000000
- Although the estimated beta parameters are not much different, the naive SE’s are some closer to the robust SE’s.
17.5.2.1 Interpretation
For determining significance, no p-values are given, however the p-value will be < .05 when the z-score is > 1.96.
“For this example, the estimates of standard errors under the independence are about half of the corresponding robust estimates, and the the situation improves only a little when an exchangeable structure is fitted. Using naive standard errors leads, in particular, to a highly significant treatment effect which disappears when the robust estimates are used.”
17.5.3 Estimate the Additional Scale Parameter
“The problem with the GEE approach here, using either the independence or exchangeable correlation structure lies in constraining the scale parameter to be one. For these data there is overdispersion (variance is larger than the mean value) which has to be accommodated by allowing this parameter to be freely estimated.”
- correlation structure:
exchangeable
- scale parameter = freely estimated
<- gee::gee(seizure.rate ~ base + age + treatment + offset(per),
fit_gee_exc_sf data = data_long,
family = poisson(link = "log"),
id = subject,
corstr = "exchangeable",
scale.fix = FALSE)
(Intercept) base age treatmentProgabide
-0.13061561 0.02265174 0.02274013 -0.15270095
summary(fit_gee_exc_sf)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logarithm
Variance to Mean Relation: Poisson
Correlation Structure: Exchangeable
Call:
gee::gee(formula = seizure.rate ~ base + age + treatment + offset(per),
id = subject, data = data_long, family = poisson(link = "log"),
corstr = "exchangeable", scale.fix = FALSE)
Summary of Residuals:
Min 1Q Median 3Q Max
-4.9195387 0.1808059 1.7073405 4.8850644 69.9658560
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -0.13061561 0.45219954 -0.2888451 0.365148155 -0.3577058
base 0.02265174 0.00169818 13.3388301 0.001235664 18.3316325
age 0.02274013 0.01341735 1.6948302 0.011580405 1.9636736
treatmentProgabide -0.15270095 0.15939823 -0.9579840 0.171108915 -0.8924196
Estimated Scale Parameter: 5.089608
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.3948033 0.3948033 0.3948033
[2,] 0.3948033 1.0000000 0.3948033 0.3948033
[3,] 0.3948033 0.3948033 1.0000000 0.3948033
[4,] 0.3948033 0.3948033 0.3948033 1.0000000
The naive SE’s are much closer inline with the robust SE’s.
The scale parameter is estimated to be much larger than \(1\).
“When this is done (scale parameter is freely estimated) it gives the last set of results shown above. THe estimate of \(\phi\) is 5.09 and the naive and robust estimates of the standard errors are now very similar.”
17.5.4 Compare Models
17.5.4.1 Raw Estimates (logit scale)
::knitreg(list(fit_glm,
texreg
fit_gee_ind_s1,
fit_gee_exc_s1,
fit_gee_exc_sf),custom.model.names = c("GLM",
"GEE-Indep(1)",
"GEE-Exchg(1)",
"GEE-Exchg(free)"),
caption = "Estimates on Logit Scale",
digits = 3)
GLM | GEE-Indep(1) | GEE-Exchg(1) | GEE-Exchg(free) | |
---|---|---|---|---|
(Intercept) | -0.131 | -0.131 | -0.131 | -0.131 |
(0.136) | (0.365) | (0.365) | (0.365) | |
base | 0.023*** | 0.023*** | 0.023*** | 0.023*** |
(0.001) | (0.001) | (0.001) | (0.001) | |
age | 0.023*** | 0.023* | 0.023* | 0.023* |
(0.004) | (0.012) | (0.012) | (0.012) | |
treatmentProgabide | -0.153** | -0.153 | -0.153 | -0.153 |
(0.048) | (0.171) | (0.171) | (0.171) | |
AIC | 1732.459 | |||
BIC | 1746.314 | |||
Log Likelihood | -862.229 | |||
Deviance | 958.464 | |||
Num. obs. | 236 | 236 | 236 | 236 |
Scale | 1.000 | 1.000 | 5.090 | |
***p < 0.001; **p < 0.01; *p < 0.05 |
17.5.4.2 Exponentiate the Estimates (odds ratio scale)
::knitreg(list(extract_glm_exp(fit_glm),
texregextract_gee_exp(fit_gee_ind_s1),
extract_gee_exp(fit_gee_exc_s1),
extract_gee_exp(fit_gee_exc_sf)),
custom.model.names = c("GLM",
"GEE-Indep(1)",
"GEE-Exchg(1)",
"GEE-Exchg(free)"),
digits = 3,
caption = "Estimates on Odds-Ratio Scale",
caption.above = TRUE,
ci.test = 1)
GLM | GEE-Indep(1) | GEE-Exchg(1) | GEE-Exchg(free) | |
---|---|---|---|---|
(Intercept) | 0.878 | 0.878 | 0.878 | 0.878 |
[0.672; 1.144] | [0.429; 1.795] | [0.429; 1.795] | [0.429; 1.795] | |
base | 1.023* | 1.023* | 1.023* | 1.023* |
[1.022; 1.024] | [1.020; 1.025] | [1.020; 1.025] | [1.020; 1.025] | |
age | 1.023* | 1.023* | 1.023* | 1.023* |
[1.015; 1.031] | [1.000; 1.046] | [1.000; 1.046] | [1.000; 1.046] | |
treatmentProgabide | 0.858* | 0.858 | 0.858 | 0.858 |
[0.782; 0.943] | [0.614; 1.200] | [0.614; 1.200] | [0.614; 1.200] | |
Dispersion | 1.000 | 1.000 | 5.090 | |
* Null hypothesis value outside the confidence interval. |