8 Logistic Regression - Ex: Bronchopulmonary Dysplasia in Premature Infants

example walk through:

https://stats.idre.ucla.edu/r/dae/logit-regression/

info:

https://onlinecourses.science.psu.edu/stat504/node/216/

sjPlot::tab_model (HTML only)

http://www.strengejacke.de/sjPlot/articles/sjtlm.html#changing-summary-style-and-content

finafit

https://www.r-bloggers.com/elegant-regression-results-tables-and-plots-in-r-the-finalfit-package/

Install a package Dr. Schwartz wrote:

remotes::install_github("sarbearschwartz/texreghelpr")
library(tidyverse)
library(haven)        # read in SPSS dataset
library(furniture)    # nice table1() descriptives
library(stargazer)    # display nice tables: summary & regression
library(texreg)       # Convert Regression Output to LaTeX or HTML tables
library(texreghelpr)  # Dr. Schwartz's helper funtcions for texreg tables
library(psych)        # contains some useful functions, like headTail
library(car)          # Companion to Applied Regression
library(pscl)         # psudo R-squared function
library(interactions) # interaction plots
library(sjPlot)       # various plots
library(performance)  # r-squared values

8.1 Background

Simple example demonstrating basic modeling approach: Data on Bronchopulmonary Dysplasia (BPD) from 223 low birth weight infants (weighing less than 1750 grams).

8.1.1 Source

Data courtesy of Dr. Linda Van Marter.

8.1.2 Reference

Van Marter, L.J., Leviton, A., Kuban, K.C.K., Pagano, M. & Allred, E.N. (1990). Maternal glucocorticoid therapy and reduced risk of bronchopulmonary dysplasia. Pediatrics, 86, 331-336.

The data are from a study of low birth weight infants in a neonatal intensive care unit. The study was designed to examine the development of bronchopulmonary dysplasia (BPD), a chronic lung disease, in a sample of 223 infants weighing less than 1750 grams. The response variable is binary, denoting whether an infant develops BPD by day 28 of life (where BPD is defined by both oxygen requirement and compatible chest radiograph).

8.1.3 Variables

  • bpd 0 = no, 1 = yes
  • brthwght number of grams
  • gestage number of weeks
  • toxemia in mother, 0 = no, 1 = yes
bpd_raw <- read.table("https://raw.githubusercontent.com/CEHS-research/data/master/Regression/VanMarter_%20BPD.txt", 
                      header      = TRUE, 
                      strip.white = TRUE)
n <- nrow(bpd_raw)
n
[1] 223
tibble::glimpse(bpd_raw)
Rows: 223
Columns: 4
$ bpd      <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0~
$ brthwght <int> 850, 1500, 1360, 960, 1560, 1120, 810, 1620, 1000, 700, 1330,~
$ gestage  <int> 27, 33, 32, 35, 33, 29, 28, 32, 30, 26, 31, 31, 31, 29, 33, 3~
$ toxemia  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
head(bpd_raw)
# A tibble: 6 x 4
    bpd brthwght gestage toxemia
  <int>    <int>   <int>   <int>
1     1      850      27       0
2     0     1500      33       0
3     1     1360      32       0
4     0      960      35       1
5     0     1560      33       0
6     0     1120      29       0

Note: For logistic regression, you need to leave the outcome (dependent variable) coded as zeros 0 and ones 1 and NOT apply lables. You do want to apply labels to factors that function as predictors (independent varaibles).

bpd_clean <- bpd_raw %>% 
  dplyr::mutate(toxemia = factor(toxemia,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes")))
summary(bpd_clean)
      bpd            brthwght       gestage      toxemia  
 Min.   :0.0000   Min.   : 450   Min.   :25.00   No :194  
 1st Qu.:0.0000   1st Qu.: 895   1st Qu.:28.00   Yes: 29  
 Median :0.0000   Median :1140   Median :30.00            
 Mean   :0.3408   Mean   :1173   Mean   :30.09            
 3rd Qu.:1.0000   3rd Qu.:1465   3rd Qu.:32.00            
 Max.   :1.0000   Max.   :1730   Max.   :37.00            

8.2 Logistic Regresion: Fit the Model to the data

Instead of using the lm() function from base R, you use glm(). You also need to add an option to specify which generalization you want to use. To do logistic regression for a binary outcome, use family = binomial(link = "logit").

8.2.1 Null Model: no independent variables

fit_glm_0 <- glm(bpd ~ 1, 
                 data = bpd_clean, 
                 family = binomial(link = "logit"))

summary(fit_glm_0)

Call:
glm(formula = bpd ~ 1, family = binomial(link = "logit"), data = bpd_clean)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.913  -0.913  -0.913   1.467   1.467  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6597     0.1413  -4.669 3.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 286.14  on 222  degrees of freedom
Residual deviance: 286.14  on 222  degrees of freedom
AIC: 288.14

Number of Fisher Scoring iterations: 4

8.2.2 Main Effects Model: add 3 predictors

Note: Since the unites of weight are so small, the estimated parameter will be super small. To offset the small units, we can re-scale the weights by dividing the grams by 100 to create “hectograms.”

fit_glm_1 <- glm(bpd ~ I(brthwght/100) + gestage + toxemia, 
                 data = bpd_clean, 
                 family = binomial(link = "logit")) 

summary(fit_glm_1)

Call:
glm(formula = bpd ~ I(brthwght/100) + gestage + toxemia, family = binomial(link = "logit"), 
    data = bpd_clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8400  -0.7029  -0.3352   0.7261   2.9902  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     13.93608    2.98255   4.673 2.98e-06 ***
I(brthwght/100) -0.26436    0.08123  -3.254  0.00114 ** 
gestage         -0.38854    0.11489  -3.382  0.00072 ***
toxemiaYes      -1.34379    0.60750  -2.212  0.02697 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 286.14  on 222  degrees of freedom
Residual deviance: 203.71  on 219  degrees of freedom
AIC: 211.71

Number of Fisher Scoring iterations: 5

8.3 Model Fit

8.3.1 Log Likelihood and Deviance

logLik(fit_glm_0) # Null Model
'log Lik.' -143.07 (df=1)
logLik(fit_glm_1) # Full Model
'log Lik.' -101.8538 (df=4)

Note: Deviance = -2 times the Log Likelihood

deviance(fit_glm_0) # Null Model
[1] 286.14
deviance(fit_glm_1) # Full Model
[1] 203.7075

8.3.2 AIC and BIC

AIC(fit_glm_0, fit_glm_1)  # Full Model
# A tibble: 2 x 2
     df   AIC
  <dbl> <dbl>
1     1  288.
2     4  212.
BIC(fit_glm_0, fit_glm_1)  # Full Model
# A tibble: 2 x 2
     df   BIC
  <dbl> <dbl>
1     1  292.
2     4  225.

8.4 Variance Explained

8.4.1 Many Options

Technically, \(R^2\) cannot be computed the same way in logistic regression as it is in OLS regression. There are several (over 10) alternatives that endever to calculate a similar metric in different ways.

8.4.2 McFadden’s pseud-R^2

McFadden’s \(pseudo-R^2\), in logistic regression, is defined as \(1−\frac{L_1}{L_0}\), where \(L_0\) represents the log likelihood for the “constant-only” or

\[ R^2_{McF} = 1 - \frac{L_1}{L_0} \]

MFR2 <- 1 - (logLik(fit_glm_1)/logLik(fit_glm_0))
MFR2
'log Lik.' 0.2880843 (df=4)
performance::r2_mcfadden(fit_glm_1)
# R2 for Generalized Linear Regression
       R2: -0.996
  adj. R2: -1.003

8.4.3 Cox & Snell

\(l = e^{L}\), since \(L\) is the log of the likelihood and \(l\) is the likelihood…\(log(l) = L\)

\[ R^2_{CS} = 1 - \Bigg( \frac{l_0}{l_1} \Bigg) ^{2 \backslash n} \\ n = \text{sample size} \]

CSR2 <- 1 - (exp(logLik(fit_glm_0))/exp(logLik(fit_glm_1)))^(2/n)
CSR2
'log Lik.' 0.3090253 (df=1)
performance::r2_coxsnell(fit_glm_1)
Cox & Snell's R2 
       0.3090253 

8.4.4 Nagelkerke or Cragg and Uhler’s

\[ R^2_{Nag} = \frac{1 - \Bigg( \frac{l_0}{l_1} \Bigg) ^{2 \backslash n}} {1 - \Big( l_0 \Big) ^{2 \backslash n}} \]

NR2 <- CSR2 / (1 - exp(logLik(fit_glm_0))^(2/n))
NR2 
'log Lik.' 0.4275191 (df=1)
performance::r2_nagelkerke(fit_glm_1)
Nagelkerke's R2 
      0.4275191 

8.4.5 Tjur’s statistic

performance::r2(fit_glm_1)
# R2 for Logistic Regression
  Tjur's R2: 0.346

8.4.6 Several at Once

the pscl::pR2() function

Outputs:

  • llh The log-likelihood from the fitted model
  • llhNull The log-likelihood from the intercept-only restricted model
  • G2 Minus two times the difference in the log-likelihoods
  • McFadden McFadden’s pseudo r-squared
  • r2ML Maximum likelihood pseudo r-squared
  • r2CU Cragg and Uhler’s pseudo r-squared
pscl::pR2(fit_glm_1)
fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-101.8537711 -143.0699809   82.4324196    0.2880843    0.3090253    0.4275191 

8.5 Model Compairisons, Inferential

8.5.1 Likelihood Ratio Test (LRT, aka. Deviance Difference Test)

anova(fit_glm_0, fit_glm_1, test = "LRT")
# A tibble: 2 x 5
  `Resid. Df` `Resid. Dev`    Df Deviance `Pr(>Chi)`
        <dbl>        <dbl> <dbl>    <dbl>      <dbl>
1         222         286.    NA     NA    NA       
2         219         204.     3     82.4   9.23e-18

8.5.2 Bayes Factor and Performance Score

performance::compare_performance(fit_glm_0, fit_glm_1, rank = TRUE)
# A tibble: 2 x 12
  Name      Model   AIC   BIC R2_Tjur  RMSE Sigma Log_loss Score_log Score_spherical
  <chr>     <chr> <dbl> <dbl>   <dbl> <dbl> <dbl>    <dbl>     <dbl>           <dbl>
1 fit_glm_1 glm    212.  225.   0.346 0.380 0.964    0.457     -40.0          0.0196
2 fit_glm_0 glm    288.  292.   0     0.474 1.14     0.642     -30.4          0.0410
# ... with 2 more variables: PCP <dbl>, Performance_Score <dbl>

8.6 Parameter Estimates

8.6.2 Exponentiate: Odds Ratio Scale

fit_glm_1 %>% coef() %>% exp()
    (Intercept) I(brthwght/100)         gestage      toxemiaYes 
   1.128142e+06    7.676988e-01    6.780490e-01    2.608561e-01 
fit_glm_1 %>% confint() %>% exp()
                       2.5 %       97.5 %
(Intercept)     4.402379e+03 5.591330e+08
I(brthwght/100) 6.511832e-01 8.967757e-01
gestage         5.351280e-01 8.414808e-01
toxemiaYes      7.314875e-02 8.078916e-01

8.7 Significance of Terms

8.7.1 Wald’s \(t\)-Test

fit_glm_1 %>% summary() %>% coef()
                  Estimate Std. Error   z value     Pr(>|z|)
(Intercept)     13.9360826 2.98255085  4.672538 2.975003e-06
I(brthwght/100) -0.2643578 0.08123149 -3.254376 1.136419e-03
gestage         -0.3885357 0.11489128 -3.381768 7.202086e-04
toxemiaYes      -1.3437865 0.60750335 -2.211982 2.696791e-02

8.7.2 Single term deletion, \(\chi^2\) LRT

Note: Significance of each variable is assessed by comparing it to the model that drops just that one term (type = 3); order doesn’t matter.

drop1(fit_glm_1, type = 3, test = "LRT")
# A tibble: 4 x 5
     Df Deviance   AIC   LRT `Pr(>Chi)`
  <dbl>    <dbl> <dbl> <dbl>      <dbl>
1    NA     204.  212. NA     NA       
2     1     215.  221. 11.4    0.000744
3     1     217.  223. 13.1    0.000293
4     1     209.  215.  5.52   0.0188  

8.7.3 Sequential addition, \(\chi^2\) LRT

Note: Signifcance of each additional variable at a time; ordered first to last

anova(fit_glm_1, test = "LRT")
# A tibble: 4 x 5
     Df Deviance `Resid. Df` `Resid. Dev` `Pr(>Chi)`
  <int>    <dbl>       <int>        <dbl>      <dbl>
1    NA    NA            222         286.  NA       
2     1    62.4          221         224.   2.78e-15
3     1    14.5          220         209.   1.41e- 4
4     1     5.52         219         204.   1.88e- 2

8.8 Parameter Estimate Tables

8.8.2 Odds-Ratio Scale (exponentiate)

texreg::knitreg(texreghelpr::extract_glm_exp(fit_glm_1),
                single.row = TRUE)
Statistical models
  Model 1
(Intercept) 1128141.99 [4402.38; 559132968.51]*
brthwght/100 0.77 [ 0.65; 0.90]*
gestage 0.68 [ 0.54; 0.84]*
toxemiaYes 0.26 [ 0.07; 0.81]*
* 0 outside the confidence interval.

8.8.3 BOTH: Logit and Odds-Ratio

texreg::knitreg(list(fit_glm_1,
                     texreghelpr::extract_glm_exp(fit_glm_1,
                                                  include.aic = FALSE,
                                                  include.bic = FALSE,
                                                  include.loglik = FALSE,
                                                  include.deviance = FALSE,
                                                  include.nobs = FALSE)),
                custom.model.names = c("b (SE)",
                                       "OR [95% CI]"),
                single.row = TRUE,
                ci.test = 1)
Statistical models
  b (SE) OR [95% CI]
(Intercept) 13.94 (2.98)*** 1128141.99 [4402.38; 559132968.51]*
brthwght/100 -0.26 (0.08)** 0.77 [ 0.65; 0.90]*
gestage -0.39 (0.11)*** 0.68 [ 0.54; 0.84]*
toxemiaYes -1.34 (0.61)* 0.26 [ 0.07; 0.81]*
AIC 211.71  
BIC 225.34  
Log Likelihood -101.85  
Deviance 203.71  
Num. obs. 223  
p < 0.001; p < 0.01; p < 0.05 (or Null hypothesis value outside the confidence interval).

8.9 Marginal or Predicted Values

8.9.1 Across All Predictors

Note: By default it will select 5-6 “nice” values for each continuous variable. All levels of categorical factors will be included.

effects::Effect(focal.predictors = c("brthwght", "gestage", "toxemia"),
                mod = fit_glm_1)

 brthwght*gestage*toxemia effect
, , toxemia = No

        gestage
brthwght        25        28         31         34          37
    450  0.9540464 0.8661657 0.66860145 0.38609882 0.163919741
    770  0.8990883 0.7352702 0.46404255 0.21253946 0.077608513
    1100 0.7883077 0.5372180 0.26571767 0.10137254 0.033971437
    1400 0.6275409 0.3443597 0.14069462 0.04856170 0.015661772
    1700 0.4325654 0.1920105 0.06897089 0.02257203 0.007147496

, , toxemia = Yes

        gestage
brthwght        25         28         31          34          37
    450  0.8441313 0.62800950 0.34481263 0.140937262 0.048654446
    770  0.6991701 0.42012556 0.18424240 0.065775334 0.021476635
    1100 0.4927426 0.23243034 0.08625483 0.028585525 0.009089901
    1400 0.3053170 0.12049913 0.04096070 0.013139235 0.004133317
    1700 0.1658709 0.05837137 0.01895794 0.005987954 0.001874370

8.9.2 Specify Some Predictors

Note: if a predictor is left off thefocal.predictors, the predictions are AVERAGED over that variable.

effects::Effect(focal.predictors = c("brthwght", "gestage"),
                mod = fit_glm_1)

 brthwght*gestage effect
        gestage
brthwght        25        28         31         34          37
    450  0.9457476 0.8445817 0.62880972 0.34558725 0.141352684
    770  0.8820911 0.6998904 0.42096066 0.18475801 0.065986229
    1100 0.7576801 0.4935992 0.23304229 0.08652530 0.028680839
    1400 0.5858727 0.3060443 0.12086278 0.04109553 0.013183745
    1700 0.3902779 0.1663456 0.05856001 0.01902178 0.006008386
effects::Effect(focal.predictors = c("brthwght", "toxemia"),
                mod = fit_glm_1)

 brthwght*toxemia effect
        toxemia
brthwght         No        Yes
    450  0.74150653 0.42801049
    770  0.55178083 0.24307063
    1100 0.33972675 0.11833437
    1400 0.18883689 0.05725008
    1700 0.09529265 0.02674118

8.9.3 Set a constant (fixed) value for a predictor(s)

effects::Effect(focal.predictors = c("brthwght"),
                fixed.predictors = list(gestage = 34, 
                                        toxemia = "no"),
                mod = fit_glm_1)

 brthwght effect
brthwght
       450        770       1100       1400       1700 
0.70662760 0.50827827 0.30168968 0.16351033 0.08125536 

8.9.4 Set values for a continuous predictor

effects::Effect(focal.predictors = c("brthwght", "gestage"),
                fixed.predictors = list(toxemia = "no"),
                xlevels = list(gestage = c(24, 32, 36)),
                mod = fit_glm_1)

 brthwght*gestage effect
        gestage
brthwght        24         32          36
    450  0.9625602 0.53458922 0.195357871
    770  0.9168973 0.33018096 0.094361297
    1100 0.8217923 0.17083126 0.041730765
    1400 0.6760033 0.08526886 0.019322688
    1700 0.4856018 0.04046955 0.008836077

8.9.5 Add SE and 95% Confidence Interval

effects::Effect(focal.predictors = c("brthwght", "gestage", "toxemia"),
                xlevels = list(brthwght = c(500, 1000, 1500),
                               gestage = c(30, 36)),
                mod = fit_glm_1) %>% 
  data.frame()
# A tibble: 12 x 7
   brthwght gestage toxemia     fit      se    lower  upper
      <dbl>   <dbl> <fct>     <dbl>   <dbl>    <dbl>  <dbl>
 1      500      30 No      0.723   0.109   0.474    0.883 
 2     1000      30 No      0.410   0.0523  0.313    0.515 
 3     1500      30 No      0.156   0.0475  0.0839   0.273 
 4      500      36 No      0.202   0.179   0.0281   0.690 
 5     1000      36 No      0.0633  0.0484  0.0135   0.251 
 6     1500      36 No      0.0177  0.0115  0.00492  0.0616
 7      500      30 Yes     0.405   0.171   0.145    0.732 
 8     1000      30 Yes     0.154   0.0756  0.0548   0.362 
 9     1500      30 Yes     0.0461  0.0312  0.0119   0.162 
10      500      36 Yes     0.0620  0.0701  0.00619  0.412 
11     1000      36 Yes     0.0173  0.0168  0.00255  0.109 
12     1500      36 Yes     0.00468 0.00422 0.000794 0.0270

8.10 Marginal Model Plots

8.10.1 Individual Marginal Plots for one IV, individually

The sjPlot::plot_model() function automatically transforms the predictions to the probability score when you include the type = "pred" option.

sjPlot::plot_model(fit_glm_1,
                   type = "pred")
$brthwght


$gestage


$toxemia

8.10.2 Combination Marginal Plots for two-three IVs, all at once

For continuous IV that are not on the x-axis (pred), be default three values will be selected: the mean and plus-or-minus one stadard error for the mean (SEM).

The outcome.scale = "link" option plots the LOGIT scale on the y-axis.

interactions::interact_plot(model = fit_glm_1,
                            pred = brthwght,
                            modx = gestage,
                            mod2 = toxemia,
                            outcome.scale = "link")

Alternatively, you may use the modx.labels option to set specific values at which to plot the moderator.

The outcome.scale = "response" option plots the PROBABILITY scale on the y-axis.

interactions::interact_plot(model = fit_glm_1,
                            pred = brthwght,
                            modx = gestage,
                            modx.labels = c(28, 32, 36),
                            mod2 = toxemia,
                            outcome.scale = "response")

You can always do more work to get to a PUBLISH-ABLE version

interactions::interact_plot(model = fit_glm_1,
                            pred = brthwght,
                            modx = gestage,
                            modx.labels = c(28, 30, 32),
                            mod2 = toxemia,
                            outcome.scale = "response",
                            x.label = "Birthweight, grams",
                            y.label = "Probability of Bronchopulmonary Dysplasia",
                            legend.main = "Gestational Age, weeks",
                            mod2.label = c("Mother does NOT have Toxemia",
                                           "Mother DOES have Toxemia"),
                            colors = rep("black", 3)) +
  geom_hline(yintercept = .5, alpha = .2) +
  theme_bw() +
  theme(legend.background = element_rect(color = "black"),
        legend.position = c(1, 1),
        legend.justification = c(1.1, 1.1),
        legend.key.width = unit(2, "cm"))

8.10.3 Total Control is also available

effects::Effect(focal.predictors = c("brthwght", "toxemia", "gestage"),
                mod = fit_glm_1,
                xlevels = list(brthwght = seq(from = 450, to = 1730, by = 10),
                               gestage = c(28, 32, 36))) %>% 
  data.frame() %>% 
  dplyr::mutate(gestage = factor(gestage)) %>% 
  ggplot(aes(x = brthwght,
             y = fit)) +
  geom_ribbon(aes(ymin = lower,
                  ymax = upper,
                  fill = toxemia),
              alpha = .2) +
  geom_line(aes(linetype = toxemia,
                color = toxemia),
            size = 1) +
  facet_grid(. ~ gestage, labeller = label_both) +
  theme_bw()

effects::Effect(focal.predictors = c("brthwght", "toxemia", "gestage"),
                mod = fit_glm_1,
                xlevels = list(brthwght = seq(from = 450, to = 1730, by = 10),
                               gestage = c(28, 32, 36))) %>% 
  data.frame() %>% 
  dplyr::mutate(gestage = factor(gestage)) %>% 
  ggplot(aes(x = brthwght,
             y = fit)) +
  geom_line(aes(linetype = toxemia,
                color = toxemia),
            size = 1) +
  facet_grid(. ~ gestage, labeller = label_both) +
  theme_bw()