11 Logistic Regression - Ex: volunteering (Hoffman)
library(tidyverse)
library(haven) # read in SPSS dataset
library(furniture) # nice table1() descriptives
library(stargazer) # display nice tables: summary & regression
library(texreg) # Convert Regression Output to LaTeX or HTML Tables
library(texreghelpr) # GITHUB: sarbearschartz/texreghelpr
library(psych) # contains some useful functions, like headTail
library(car) # Companion to Applied Regression
library(sjPlot) # Quick plots and tables for models
library(pscl) # psudo R-squared function
library(glue) # Interpreted String Literals
library(interactions) # interaction plots
library(sjPlot) # various plots
library(performance) # r-squared values
This dataset comes from John Hoffman’s textbook: Regression Models for Categorical, Count, and Related Variables: An Applied Approach (2004) Amazon link, 2014 edition
Chapter 3: Logistic and Probit Regression Models
Dataset: The following example uses the SPSS data set gss.sav
. The dependent variable of interest is labeled volrelig
.
"The variable labeled
volrelig
, which indicates whether or not a respondent volunteered for a religious organization in the previous year is coded0
= no,1
= yes. A hypothesis we wish to explore is that females are more likely than males to volunteer for religious organizations. Hence, in this data set, we code gender as0
= male and1
= female. In order to preclude the possibility that age and education explain the proposed association betweengender
andvolrelig
, we include these variables in the model after transforming them into z-scores. An advantage of this transformation is that it becomes a simple exercise to compute odds or probabilities for males and females at the mean of age and education, because these variables have now been transformed to have a mean of zero.
<- haven::read_spss("https://raw.githubusercontent.com/CEHS-research/data/master/Hoffmann_datasets/gss.sav") %>%
df_gss ::as_factor()
haven
::glimpse(df_gss) tibble
Rows: 2,903
Columns: 20
$ id <dbl> 402, 1473, 1909, 334, 1751, 456, 292, 2817, 2810, 2232, 2174,~
$ marital <fct> divorced, widowed, widowed, widowed, married, divorced, never~
$ divorce <fct> yes, no, no, yes, no, yes, no, yes, no, no, no, no, no, no, y~
$ childs <fct> 2, 0, 7, 2, 2, 0, 2, 3, 0, 2, 2, 5, 0, 2, 1, 0, 1, 0, 0, 0, 0~
$ age <dbl> 54, 24, 75, 41, 37, 40, 36, 33, 18, 35, 35, 34, 40, 37, 41, 6~
$ income <dbl> 10, 2, NA, NA, 12, NA, 9, NA, NA, 6, 12, 11, 12, 10, 12, 12, ~
$ polviews <fct> middle of the road, slight conservative, extreme conservative~
$ fund <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ attend <fct> NA, NA, NA, NA, NA, NA, nearly every week, several times a ye~
$ spanking <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ totrelig <dbl> NA, NA, NA, NA, NA, NA, NA, 1000, NA, NA, NA, NA, NA, NA, NA,~
$ sei <dbl> 38.9, 29.0, 29.1, 29.0, 38.1, 39.4, 38.4, 31.3, NA, 39.0, 29.~
$ pasei <dbl> NA, 48.6, 22.5, 26.7, 38.1, NA, NA, NA, NA, 50.7, 78.5, NA, 7~
$ volteer <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ female <fct> female, male, female, female, male, female, female, female, f~
$ nonwhite <fct> non-white, white, non-white, white, white, white, non-white, ~
$ prayer <fct> daily, several times a week, daily, several times a week, sev~
$ educate <dbl> 12, 17, 8, 12, 12, NA, 15, 12, 11, 14, 14, 12, 20, 12, 15, 20~
$ volrelig <fct> no, no, no, yes, yes, no, no, no, no, no, no, no, no, no, no,~
$ polview1 <fct> moderate, conservative, conservative, moderate, conservative,~
::headTail(df_gss) psych
# A tibble: 9 x 20
id marital divorce childs age income polviews fund attend spanking
<chr> <fct> <fct> <fct> <chr> <chr> <fct> <fct> <fct> <fct>
1 402 divorced yes 2 54 10 middle ~ <NA> <NA> <NA>
2 1473 widowed no 0 24 2 slight ~ <NA> <NA> <NA>
3 1909 widowed no 7 75 <NA> extreme~ <NA> <NA> <NA>
4 334 widowed yes 2 41 <NA> middle ~ <NA> <NA> <NA>
5 ... <NA> <NA> <NA> ... ... <NA> <NA> <NA> <NA>
6 375 never married no 0 32 12 liberal libe~ about~ strongl~
7 399 never married no 1 43 12 slight ~ fund~ about~ agree
8 1640 never married no 0 29 12 liberal fund~ once ~ agree
9 2344 never married no 0 23 5 slight ~ mode~ once ~ strongl~
# ... with 10 more variables: totrelig <chr>, sei <chr>, pasei <chr>,
# volteer <chr>, female <fct>, nonwhite <fct>, prayer <fct>, educate <chr>,
# volrelig <fct>, polview1 <fct>
11.1 Exploratory Data Analysis
11.1.1 Visualization
%>%
df_gss ggplot(aes(x = educate,
y = volrelig)) +
geom_count() +
theme_bw() +
labs(x = "Education in Years",
y = "Respondent Volunteered for a Religious Organization\nin the Previous Year")
11.1.2 Summary Statistics
%>%
df_gss ::group_by(volrelig) %>%
dplyr::table1(female,
furniture
age,
educate,total = TRUE,
test = TRUE,
digits = 3)
-------------------------------------------------------------------
volrelig
Total no yes P-Value
n = 2894 n = 2688 n = 206
female 0.044
male 1283 (44.3%) 1206 (44.9%) 77 (37.4%)
female 1611 (55.7%) 1482 (55.1%) 129 (62.6%)
age 0.135
44.767 (16.850) 44.656 (17.034) 46.218 (14.192)
educate <.001
13.363 (2.928) 13.296 (2.922) 14.238 (2.867)
-------------------------------------------------------------------
11.2 Fit Model
<- df_gss %>%
df_gss_model ::mutate(volrelig01 = case_when(volrelig == "no" ~ 0,
dplyr== "yes" ~ 1)) %>%
volrelig ::mutate(z_age = (age - 44.767)/16.850) %>%
dplyr::mutate(z_educ = (educate - 13.363)/2.928) %>%
dplyr::select(id, age, z_age, educate, z_educ, female, volrelig, volrelig01)
dplyr
df_gss_model
# A tibble: 2,903 x 8
id age z_age educate z_educ female volrelig volrelig01
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
1 402 54 0.548 12 -0.466 female no 0
2 1473 24 -1.23 17 1.24 male no 0
3 1909 75 1.79 8 -1.83 female no 0
4 334 41 -0.224 12 -0.466 female yes 1
5 1751 37 -0.461 12 -0.466 male yes 1
6 456 40 -0.283 NA NA female no 0
7 292 36 -0.520 15 0.559 female no 0
8 2817 33 -0.698 12 -0.466 female no 0
9 2810 18 -1.59 11 -0.807 female no 0
10 2232 35 -0.580 14 0.218 male no 0
# ... with 2,893 more rows
<- glm(volrelig01 ~ female + z_age + z_educ,
fit_glm_1 data = df_gss_model,
family = binomial(link = "logit"))
%>% summary() fit_glm_1
Call:
glm(formula = volrelig01 ~ female + z_age + z_educ, family = binomial(link = "logit"),
data = df_gss_model)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6949 -0.4138 -0.3621 -0.3142 2.6399
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8319 0.1215 -23.311 < 2e-16 ***
femalefemale 0.3543 0.1503 2.357 0.0184 *
z_age 0.1418 0.0739 1.919 0.0549 .
z_educ 0.3562 0.0737 4.833 1.34e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1485.7 on 2893 degrees of freedom
Residual deviance: 1456.3 on 2890 degrees of freedom
(9 observations deleted due to missingness)
AIC: 1464.3
Number of Fisher Scoring iterations: 5
%>% summary() %>% coef() fit_glm_1
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8318640 0.12148235 -23.310908 3.436457e-120
femalefemale 0.3542592 0.15031927 2.356712 1.843756e-02
z_age 0.1418388 0.07390196 1.919284 5.494843e-02
z_educ 0.3561886 0.07369657 4.833177 1.343714e-06
::knitreg(list(fit_glm_1,
texreg::extract_glm_exp(fit_glm_1)),
texreghelprcustom.model.names = c("b (SE)",
"OR [95 CI]"),
caption = "Hoffman's EXAMPLE 3.4 A Logistic Regression Model of Volunteer Work, bottom of page 53",
caption.above = TRUE,
single.row = TRUE,
digits = 4,
ci.test = 1)
b (SE) | OR [95 CI] | |
---|---|---|
(Intercept) | -2.8319 (0.1215)*** | 0.0589 [0.0460; 0.0742]* |
femalefemale | 0.3543 (0.1503)* | 1.4251 [1.0644; 1.9203]* |
z_age | 0.1418 (0.0739) | 1.1524 [0.9957; 1.3306] |
z_educ | 0.3562 (0.0737)*** | 1.4279 [1.2362; 1.6504]* |
AIC | 1464.2759 | |
BIC | 1488.1574 | |
Log Likelihood | -728.1379 | |
Deviance | 1456.2759 | |
Num. obs. | 2894 | |
p < 0.001; p < 0.01; p < 0.05 (or Null hypothesis value outside the confidence interval). |
11.3 Interpretation
11.3.1 Odds-ratio Scale
%>% coef() fit_glm_1
(Intercept) femalefemale z_age z_educ
-2.8318640 0.3542592 0.1418388 0.3561886
exp(-2.831)
[1] 0.05895387
exp(-2.831 + 0.354)
[1] 0.08399483
exp(0.354)
[1] 1.424755
%>% coef() %>% exp() fit_glm_1
(Intercept) femalefemale z_age z_educ
0.05890295 1.42512448 1.15239090 1.42787675
NOTE: Odds = 1 –> There is a 50-50 change of that thing happening for whom ever we are refering to.
NOTE: an odds-ratio = 1 –> There is the same change of that thing happening for both groups.
Controlling for age and education,
the odds of volunteering among MALES is
exp(-2.831) = .0589
, andthe odds of volunteering among FEMALES is
exp(-2.831 + 0.354) = exp(-2.48) = .0840
.Females have 42% higher odds of having volunteered for a religious organization over the previous Year.
What, then, is the odds ratio?
0.0840)/(0.0589) (
[1] 1.426146
11.3.2 Response Scale (aka. Probability)
%>%
fit_glm_1 ::emmeans(~ female,
emmeanstype = "response")
female prob SE df asymp.LCL asymp.UCL
male 0.0556 0.00638 Inf 0.0444 0.0695
female 0.0774 0.00671 Inf 0.0653 0.0917
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
Controlling for age and education,
- the probability of volunteering among MALES is
.0556
and - the probability of volunteering among FEMALES is
.0774
.
Use these probabilities to compute the odds ratio for gender.
0774/(1 - .0774))/(.0556/(1 - .0556)) (.
[1] 1.42498
Note that these odds and probabilities are similar. This often occurs when we are dealing with probabilities that are relatively close to zero; in other words, it is a common occurrence for rare events. To see this, simply compute a cross-tabulation of volrelig
and gender
and compare the odds and probabilities. Then try it out for any rare event you may wish to simulate