11 Logistic Regression - Ex: volunteering (Hoffman)

library(tidyverse)
library(haven)        # read in SPSS dataset
library(furniture)    # nice table1() descriptives
library(stargazer)    # display nice tables: summary & regression
library(texreg)       # Convert Regression Output to LaTeX or HTML Tables
library(texreghelpr)  # GITHUB: sarbearschartz/texreghelpr
library(psych)        # contains some useful functions, like headTail
library(car)          # Companion to Applied Regression
library(sjPlot)       # Quick plots and tables for models
library(pscl)         # psudo R-squared function
library(glue)         # Interpreted String Literals 
library(interactions) # interaction plots
library(sjPlot)       # various plots
library(performance)  # r-squared values

This dataset comes from John Hoffman’s textbook: Regression Models for Categorical, Count, and Related Variables: An Applied Approach (2004) Amazon link, 2014 edition

Chapter 3: Logistic and Probit Regression Models

Dataset: The following example uses the SPSS data set gss.sav. The dependent variable of interest is labeled volrelig.

"The variable labeled volrelig, which indicates whether or not a respondent volunteered for a religious organization in the previous year is coded 0 = no, 1 = yes. A hypothesis we wish to explore is that females are more likely than males to volunteer for religious organizations. Hence, in this data set, we code gender as 0 = male and 1 = female. In order to preclude the possibility that age and education explain the proposed association between gender and volrelig, we include these variables in the model after transforming them into z-scores. An advantage of this transformation is that it becomes a simple exercise to compute odds or probabilities for males and females at the mean of age and education, because these variables have now been transformed to have a mean of zero.

df_gss <- haven::read_spss("https://raw.githubusercontent.com/CEHS-research/data/master/Hoffmann_datasets/gss.sav") %>% 
  haven::as_factor()

tibble::glimpse(df_gss)

Rows: 2,903
Columns: 20
$ id       <dbl> 402, 1473, 1909, 334, 1751, 456, 292, 2817, 2810, 2232, 2174,~
$ marital  <fct> divorced, widowed, widowed, widowed, married, divorced, never~
$ divorce  <fct> yes, no, no, yes, no, yes, no, yes, no, no, no, no, no, no, y~
$ childs   <fct> 2, 0, 7, 2, 2, 0, 2, 3, 0, 2, 2, 5, 0, 2, 1, 0, 1, 0, 0, 0, 0~
$ age      <dbl> 54, 24, 75, 41, 37, 40, 36, 33, 18, 35, 35, 34, 40, 37, 41, 6~
$ income   <dbl> 10, 2, NA, NA, 12, NA, 9, NA, NA, 6, 12, 11, 12, 10, 12, 12, ~
$ polviews <fct> middle of the road, slight conservative, extreme conservative~
$ fund     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ attend   <fct> NA, NA, NA, NA, NA, NA, nearly every week, several times a ye~
$ spanking <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ totrelig <dbl> NA, NA, NA, NA, NA, NA, NA, 1000, NA, NA, NA, NA, NA, NA, NA,~
$ sei      <dbl> 38.9, 29.0, 29.1, 29.0, 38.1, 39.4, 38.4, 31.3, NA, 39.0, 29.~
$ pasei    <dbl> NA, 48.6, 22.5, 26.7, 38.1, NA, NA, NA, NA, 50.7, 78.5, NA, 7~
$ volteer  <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ female   <fct> female, male, female, female, male, female, female, female, f~
$ nonwhite <fct> non-white, white, non-white, white, white, white, non-white, ~
$ prayer   <fct> daily, several times a week, daily, several times a week, sev~
$ educate  <dbl> 12, 17, 8, 12, 12, NA, 15, 12, 11, 14, 14, 12, 20, 12, 15, 20~
$ volrelig <fct> no, no, no, yes, yes, no, no, no, no, no, no, no, no, no, no,~
$ polview1 <fct> moderate, conservative, conservative, moderate, conservative,~

psych::headTail(df_gss)

# A tibble: 9 x 20
  id    marital       divorce childs age   income polviews fund  attend spanking
  <chr> <fct>         <fct>   <fct>  <chr> <chr>  <fct>    <fct> <fct>  <fct>   
1 402   divorced      yes     2      54    10     middle ~ <NA>  <NA>   <NA>    
2 1473  widowed       no      0      24    2      slight ~ <NA>  <NA>   <NA>    
3 1909  widowed       no      7      75    <NA>   extreme~ <NA>  <NA>   <NA>    
4 334   widowed       yes     2      41    <NA>   middle ~ <NA>  <NA>   <NA>    
5 ...   <NA>          <NA>    <NA>   ...   ...    <NA>     <NA>  <NA>   <NA>    
6 375   never married no      0      32    12     liberal  libe~ about~ strongl~
7 399   never married no      1      43    12     slight ~ fund~ about~ agree   
8 1640  never married no      0      29    12     liberal  fund~ once ~ agree   
9 2344  never married no      0      23    5      slight ~ mode~ once ~ strongl~
# ... with 10 more variables: totrelig <chr>, sei <chr>, pasei <chr>,
#   volteer <chr>, female <fct>, nonwhite <fct>, prayer <fct>, educate <chr>,
#   volrelig <fct>, polview1 <fct>

11.1 Exploratory Data Analysis

11.1.1 Visualization

df_gss %>% 
  ggplot(aes(x = educate,
             y = volrelig)) +
  geom_count() +
  theme_bw() +
  labs(x = "Education in Years",
       y = "Respondent Volunteered for a Religious Organization\nin the Previous Year")

11.1.2 Summary Statistics

df_gss %>% 
  dplyr::group_by(volrelig) %>% 
  furniture::table1(female,
                    age,
                    educate,
                    total = TRUE,
                    test = TRUE,
                    digits = 3)


-------------------------------------------------------------------
                                    volrelig 
           Total           no              yes             P-Value
           n = 2894        n = 2688        n = 206                
 female                                                    0.044  
    male   1283 (44.3%)    1206 (44.9%)    77 (37.4%)             
    female 1611 (55.7%)    1482 (55.1%)    129 (62.6%)            
 age                                                       0.135  
           44.767 (16.850) 44.656 (17.034) 46.218 (14.192)        
 educate                                                   <.001  
           13.363 (2.928)  13.296 (2.922)  14.238 (2.867)         
-------------------------------------------------------------------

11.2 Fit Model

df_gss_model <- df_gss %>% 
  dplyr::mutate(volrelig01 = case_when(volrelig == "no" ~ 0,
                                       volrelig == "yes" ~ 1)) %>% 
  dplyr::mutate(z_age = (age - 44.767)/16.850) %>% 
  dplyr::mutate(z_educ = (educate - 13.363)/2.928) %>% 
  dplyr::select(id, age, z_age, educate, z_educ, female, volrelig, volrelig01)

df_gss_model

# A tibble: 2,903 x 8
      id   age  z_age educate z_educ female volrelig volrelig01
   <dbl> <dbl>  <dbl>   <dbl>  <dbl> <fct>  <fct>         <dbl>
 1   402    54  0.548      12 -0.466 female no                0
 2  1473    24 -1.23       17  1.24  male   no                0
 3  1909    75  1.79        8 -1.83  female no                0
 4   334    41 -0.224      12 -0.466 female yes               1
 5  1751    37 -0.461      12 -0.466 male   yes               1
 6   456    40 -0.283      NA NA     female no                0
 7   292    36 -0.520      15  0.559 female no                0
 8  2817    33 -0.698      12 -0.466 female no                0
 9  2810    18 -1.59       11 -0.807 female no                0
10  2232    35 -0.580      14  0.218 male   no                0
# ... with 2,893 more rows

fit_glm_1 <- glm(volrelig01 ~ female + z_age + z_educ,
                 data = df_gss_model,
                 family = binomial(link = "logit"))

fit_glm_1 %>% summary()


Call:
glm(formula = volrelig01 ~ female + z_age + z_educ, family = binomial(link = "logit"), 
    data = df_gss_model)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6949  -0.4138  -0.3621  -0.3142   2.6399  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.8319     0.1215 -23.311  < 2e-16 ***
femalefemale   0.3543     0.1503   2.357   0.0184 *  
z_age          0.1418     0.0739   1.919   0.0549 .  
z_educ         0.3562     0.0737   4.833 1.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1485.7  on 2893  degrees of freedom
Residual deviance: 1456.3  on 2890  degrees of freedom
  (9 observations deleted due to missingness)
AIC: 1464.3

Number of Fisher Scoring iterations: 5

fit_glm_1 %>% summary() %>% coef()

               Estimate Std. Error    z value      Pr(>|z|)
(Intercept)  -2.8318640 0.12148235 -23.310908 3.436457e-120
femalefemale  0.3542592 0.15031927   2.356712  1.843756e-02
z_age         0.1418388 0.07390196   1.919284  5.494843e-02
z_educ        0.3561886 0.07369657   4.833177  1.343714e-06

texreg::knitreg(list(fit_glm_1,
                     texreghelpr::extract_glm_exp(fit_glm_1)),
                custom.model.names = c("b (SE)",
                                       "OR [95 CI]"),
                caption = "Hoffman's EXAMPLE 3.4 A Logistic Regression Model of Volunteer Work, bottom of page 53",
                caption.above = TRUE,
                single.row = TRUE,
                digits = 4,
                ci.test = 1)

Hoffman’s EXAMPLE 3.4 A Logistic Regression Model of Volunteer Work, bottom of page 53
	b (SE)	OR [95 CI]
(Intercept)	-2.8319 (0.1215)^***	0.0589 [0.0460; 0.0742]^*
femalefemale	0.3543 (0.1503)^*	1.4251 [1.0644; 1.9203]^*
z_age	0.1418 (0.0739)	1.1524 [0.9957; 1.3306]
z_educ	0.3562 (0.0737)^***	1.4279 [1.2362; 1.6504]^*
AIC	1464.2759
BIC	1488.1574
Log Likelihood	-728.1379
Deviance	1456.2759
Num. obs.	2894
p < 0.001; p < 0.01; p < 0.05 (or Null hypothesis value outside the confidence interval).

11.3 Interpretation

11.3.1 Odds-ratio Scale

fit_glm_1 %>% coef()

 (Intercept) femalefemale        z_age       z_educ 
  -2.8318640    0.3542592    0.1418388    0.3561886

exp(-2.831)

[1] 0.05895387

exp(-2.831 + 0.354)

[1] 0.08399483

exp(0.354)

[1] 1.424755

fit_glm_1 %>% coef() %>% exp()

 (Intercept) femalefemale        z_age       z_educ 
  0.05890295   1.42512448   1.15239090   1.42787675

NOTE: Odds = 1 –> There is a 50-50 change of that thing happening for whom ever we are refering to.

NOTE: an odds-ratio = 1 –> There is the same change of that thing happening for both groups.

Controlling for age and education,

the odds of volunteering among MALES is exp(-2.831) = .0589, and
the odds of volunteering among FEMALES is exp(-2.831 + 0.354) = exp(-2.48) = .0840.
Females have 42% higher odds of having volunteered for a religious organization over the previous Year.

What, then, is the odds ratio?

(0.0840)/(0.0589)

[1] 1.426146

11.3.2 Response Scale (aka. Probability)

fit_glm_1 %>% 
  emmeans::emmeans(~ female,
                   type = "response")

 female   prob      SE  df asymp.LCL asymp.UCL
 male   0.0556 0.00638 Inf    0.0444    0.0695
 female 0.0774 0.00671 Inf    0.0653    0.0917

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

Controlling for age and education,

the probability of volunteering among MALES is .0556 and
the probability of volunteering among FEMALES is .0774.

Use these probabilities to compute the odds ratio for gender.

(.0774/(1 - .0774))/(.0556/(1 - .0556))

[1] 1.42498

Note that these odds and probabilities are similar. This often occurs when we are dealing with probabilities that are relatively close to zero; in other words, it is a common occurrence for rare events. To see this, simply compute a cross-tabulation of volrelig and gender and compare the odds and probabilities. Then try it out for any rare event you may wish to simulate