11 Logistic Regression - Ex: volunteering (Hoffman)

library(tidyverse)
library(haven)        # read in SPSS dataset
library(furniture)    # nice table1() descriptives
library(stargazer)    # display nice tables: summary & regression
library(texreg)       # Convert Regression Output to LaTeX or HTML Tables
library(texreghelpr)  # GITHUB: sarbearschartz/texreghelpr
library(psych)        # contains some useful functions, like headTail
library(car)          # Companion to Applied Regression
library(sjPlot)       # Quick plots and tables for models
library(pscl)         # psudo R-squared function
library(glue)         # Interpreted String Literals 
library(interactions) # interaction plots
library(sjPlot)       # various plots
library(performance)  # r-squared values

This dataset comes from John Hoffman’s textbook: Regression Models for Categorical, Count, and Related Variables: An Applied Approach (2004) Amazon link, 2014 edition

Chapter 3: Logistic and Probit Regression Models

Dataset: The following example uses the SPSS data set gss.sav. The dependent variable of interest is labeled volrelig.

"The variable labeled volrelig, which indicates whether or not a respondent volunteered for a religious organization in the previous year is coded 0 = no, 1 = yes. A hypothesis we wish to explore is that females are more likely than males to volunteer for religious organizations. Hence, in this data set, we code gender as 0 = male and 1 = female. In order to preclude the possibility that age and education explain the proposed association between gender and volrelig, we include these variables in the model after transforming them into z-scores. An advantage of this transformation is that it becomes a simple exercise to compute odds or probabilities for males and females at the mean of age and education, because these variables have now been transformed to have a mean of zero.

df_gss <- haven::read_spss("https://raw.githubusercontent.com/CEHS-research/data/master/Hoffmann_datasets/gss.sav") %>% 
  haven::as_factor()

tibble::glimpse(df_gss)
Rows: 2,903
Columns: 20
$ id       <dbl> 402, 1473, 1909, 334, 1751, 456, 292, 2817, 2810, 2232, 2174,~
$ marital  <fct> divorced, widowed, widowed, widowed, married, divorced, never~
$ divorce  <fct> yes, no, no, yes, no, yes, no, yes, no, no, no, no, no, no, y~
$ childs   <fct> 2, 0, 7, 2, 2, 0, 2, 3, 0, 2, 2, 5, 0, 2, 1, 0, 1, 0, 0, 0, 0~
$ age      <dbl> 54, 24, 75, 41, 37, 40, 36, 33, 18, 35, 35, 34, 40, 37, 41, 6~
$ income   <dbl> 10, 2, NA, NA, 12, NA, 9, NA, NA, 6, 12, 11, 12, 10, 12, 12, ~
$ polviews <fct> middle of the road, slight conservative, extreme conservative~
$ fund     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ attend   <fct> NA, NA, NA, NA, NA, NA, nearly every week, several times a ye~
$ spanking <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ totrelig <dbl> NA, NA, NA, NA, NA, NA, NA, 1000, NA, NA, NA, NA, NA, NA, NA,~
$ sei      <dbl> 38.9, 29.0, 29.1, 29.0, 38.1, 39.4, 38.4, 31.3, NA, 39.0, 29.~
$ pasei    <dbl> NA, 48.6, 22.5, 26.7, 38.1, NA, NA, NA, NA, 50.7, 78.5, NA, 7~
$ volteer  <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ female   <fct> female, male, female, female, male, female, female, female, f~
$ nonwhite <fct> non-white, white, non-white, white, white, white, non-white, ~
$ prayer   <fct> daily, several times a week, daily, several times a week, sev~
$ educate  <dbl> 12, 17, 8, 12, 12, NA, 15, 12, 11, 14, 14, 12, 20, 12, 15, 20~
$ volrelig <fct> no, no, no, yes, yes, no, no, no, no, no, no, no, no, no, no,~
$ polview1 <fct> moderate, conservative, conservative, moderate, conservative,~
psych::headTail(df_gss)
# A tibble: 9 x 20
  id    marital       divorce childs age   income polviews fund  attend spanking
  <chr> <fct>         <fct>   <fct>  <chr> <chr>  <fct>    <fct> <fct>  <fct>   
1 402   divorced      yes     2      54    10     middle ~ <NA>  <NA>   <NA>    
2 1473  widowed       no      0      24    2      slight ~ <NA>  <NA>   <NA>    
3 1909  widowed       no      7      75    <NA>   extreme~ <NA>  <NA>   <NA>    
4 334   widowed       yes     2      41    <NA>   middle ~ <NA>  <NA>   <NA>    
5 ...   <NA>          <NA>    <NA>   ...   ...    <NA>     <NA>  <NA>   <NA>    
6 375   never married no      0      32    12     liberal  libe~ about~ strongl~
7 399   never married no      1      43    12     slight ~ fund~ about~ agree   
8 1640  never married no      0      29    12     liberal  fund~ once ~ agree   
9 2344  never married no      0      23    5      slight ~ mode~ once ~ strongl~
# ... with 10 more variables: totrelig <chr>, sei <chr>, pasei <chr>,
#   volteer <chr>, female <fct>, nonwhite <fct>, prayer <fct>, educate <chr>,
#   volrelig <fct>, polview1 <fct>

11.1 Exploratory Data Analysis

11.1.1 Visualization

df_gss %>% 
  ggplot(aes(x = educate,
             y = volrelig)) +
  geom_count() +
  theme_bw() +
  labs(x = "Education in Years",
       y = "Respondent Volunteered for a Religious Organization\nin the Previous Year")

11.1.2 Summary Statistics

df_gss %>% 
  dplyr::group_by(volrelig) %>% 
  furniture::table1(female,
                    age,
                    educate,
                    total = TRUE,
                    test = TRUE,
                    digits = 3)

-------------------------------------------------------------------
                                    volrelig 
           Total           no              yes             P-Value
           n = 2894        n = 2688        n = 206                
 female                                                    0.044  
    male   1283 (44.3%)    1206 (44.9%)    77 (37.4%)             
    female 1611 (55.7%)    1482 (55.1%)    129 (62.6%)            
 age                                                       0.135  
           44.767 (16.850) 44.656 (17.034) 46.218 (14.192)        
 educate                                                   <.001  
           13.363 (2.928)  13.296 (2.922)  14.238 (2.867)         
-------------------------------------------------------------------

11.2 Fit Model

df_gss_model <- df_gss %>% 
  dplyr::mutate(volrelig01 = case_when(volrelig == "no" ~ 0,
                                       volrelig == "yes" ~ 1)) %>% 
  dplyr::mutate(z_age = (age - 44.767)/16.850) %>% 
  dplyr::mutate(z_educ = (educate - 13.363)/2.928) %>% 
  dplyr::select(id, age, z_age, educate, z_educ, female, volrelig, volrelig01)

df_gss_model
# A tibble: 2,903 x 8
      id   age  z_age educate z_educ female volrelig volrelig01
   <dbl> <dbl>  <dbl>   <dbl>  <dbl> <fct>  <fct>         <dbl>
 1   402    54  0.548      12 -0.466 female no                0
 2  1473    24 -1.23       17  1.24  male   no                0
 3  1909    75  1.79        8 -1.83  female no                0
 4   334    41 -0.224      12 -0.466 female yes               1
 5  1751    37 -0.461      12 -0.466 male   yes               1
 6   456    40 -0.283      NA NA     female no                0
 7   292    36 -0.520      15  0.559 female no                0
 8  2817    33 -0.698      12 -0.466 female no                0
 9  2810    18 -1.59       11 -0.807 female no                0
10  2232    35 -0.580      14  0.218 male   no                0
# ... with 2,893 more rows
fit_glm_1 <- glm(volrelig01 ~ female + z_age + z_educ,
                 data = df_gss_model,
                 family = binomial(link = "logit"))

fit_glm_1 %>% summary() 

Call:
glm(formula = volrelig01 ~ female + z_age + z_educ, family = binomial(link = "logit"), 
    data = df_gss_model)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6949  -0.4138  -0.3621  -0.3142   2.6399  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.8319     0.1215 -23.311  < 2e-16 ***
femalefemale   0.3543     0.1503   2.357   0.0184 *  
z_age          0.1418     0.0739   1.919   0.0549 .  
z_educ         0.3562     0.0737   4.833 1.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1485.7  on 2893  degrees of freedom
Residual deviance: 1456.3  on 2890  degrees of freedom
  (9 observations deleted due to missingness)
AIC: 1464.3

Number of Fisher Scoring iterations: 5
fit_glm_1 %>% summary() %>% coef()
               Estimate Std. Error    z value      Pr(>|z|)
(Intercept)  -2.8318640 0.12148235 -23.310908 3.436457e-120
femalefemale  0.3542592 0.15031927   2.356712  1.843756e-02
z_age         0.1418388 0.07390196   1.919284  5.494843e-02
z_educ        0.3561886 0.07369657   4.833177  1.343714e-06
texreg::knitreg(list(fit_glm_1,
                     texreghelpr::extract_glm_exp(fit_glm_1)),
                custom.model.names = c("b (SE)",
                                       "OR [95 CI]"),
                caption = "Hoffman's EXAMPLE 3.4 A Logistic Regression Model of Volunteer Work, bottom of page 53",
                caption.above = TRUE,
                single.row = TRUE,
                digits = 4,
                ci.test = 1)
Hoffman’s EXAMPLE 3.4 A Logistic Regression Model of Volunteer Work, bottom of page 53
  b (SE) OR [95 CI]
(Intercept) -2.8319 (0.1215)*** 0.0589 [0.0460; 0.0742]*
femalefemale 0.3543 (0.1503)* 1.4251 [1.0644; 1.9203]*
z_age 0.1418 (0.0739) 1.1524 [0.9957; 1.3306]
z_educ 0.3562 (0.0737)*** 1.4279 [1.2362; 1.6504]*
AIC 1464.2759  
BIC 1488.1574  
Log Likelihood -728.1379  
Deviance 1456.2759  
Num. obs. 2894  
p < 0.001; p < 0.01; p < 0.05 (or Null hypothesis value outside the confidence interval).

11.3 Interpretation

11.3.1 Odds-ratio Scale

fit_glm_1 %>% coef()
 (Intercept) femalefemale        z_age       z_educ 
  -2.8318640    0.3542592    0.1418388    0.3561886 
exp(-2.831)
[1] 0.05895387
exp(-2.831 + 0.354)
[1] 0.08399483
exp(0.354)
[1] 1.424755
fit_glm_1 %>% coef() %>% exp()
 (Intercept) femalefemale        z_age       z_educ 
  0.05890295   1.42512448   1.15239090   1.42787675 

NOTE: Odds = 1 –> There is a 50-50 change of that thing happening for whom ever we are refering to.

NOTE: an odds-ratio = 1 –> There is the same change of that thing happening for both groups.

Controlling for age and education,

  • the odds of volunteering among MALES is exp(-2.831) = .0589, and

  • the odds of volunteering among FEMALES is exp(-2.831 + 0.354) = exp(-2.48) = .0840.

  • Females have 42% higher odds of having volunteered for a religious organization over the previous Year.

What, then, is the odds ratio?

(0.0840)/(0.0589)
[1] 1.426146

11.3.2 Response Scale (aka. Probability)

fit_glm_1 %>% 
  emmeans::emmeans(~ female,
                   type = "response")
 female   prob      SE  df asymp.LCL asymp.UCL
 male   0.0556 0.00638 Inf    0.0444    0.0695
 female 0.0774 0.00671 Inf    0.0653    0.0917

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

Controlling for age and education,

  • the probability of volunteering among MALES is .0556 and
  • the probability of volunteering among FEMALES is .0774.

Use these probabilities to compute the odds ratio for gender.

(.0774/(1 - .0774))/(.0556/(1 - .0556))
[1] 1.42498

Note that these odds and probabilities are similar. This often occurs when we are dealing with probabilities that are relatively close to zero; in other words, it is a common occurrence for rare events. To see this, simply compute a cross-tabulation of volrelig and gender and compare the odds and probabilities. Then try it out for any rare event you may wish to simulate