Chapter 18 covered several extensions of linear regression in the form of GLMs. At the end of that chapter, several additional linear models were discussed, including multilevel models and structural equation models. The following examples illustrate how these approaches can be used in research with R.
For these examples we will use the tidyverse package, the furniture package, and the educ7610 package.

library(tidyverse)
library(furniture)
library(educ7610)
We will also load the comp and gss data sets found in the educ7610 package into R.

data(comp)
data(gss)
Let's work with the comp data set first. Do the size of the company (size) and the leadership of the CEO (ceo) predict whether or not the company had a profit (profit; dichotomous, where 1 = had a profit and 0 = otherwise)? Let's use a logistic regression to understand the relationship.

logit_fit <- glm(profit ~ size + ceo, data = comp, family = binomial(link = "logit"))
summary(logit_fit)
##
## Call:
## glm(formula = profit ~ size + ceo, family = binomial(link = "logit"),
## data = comp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5066 -0.3870 -0.1962 0.3858 1.3746
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1491 3.7994 -2.145 0.03197 *
## size 4.6237 1.7179 2.692 0.00711 **
## ceo 1.7899 0.9624 1.860 0.06292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.104 on 23 degrees of freedom
## Residual deviance: 13.691 on 21 degrees of freedom
## AIC: 19.691
##
## Number of Fisher Scoring iterations: 6
To obtain the odds ratios, we can use the educ7610::oddsratios() function.

oddsratios(logit_fit)
## OR 2.5 % 97.5 %
## (Intercept) 2.890071e-04 1.536655e-08 1.488494e-01
## size 1.018716e+02 6.846841e+00 1.079402e+04
## ceo 5.988990e+00 1.097680e+00 6.327124e+01
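Odds ratios can be hard to intuit, so it can also help to translate the model onto the probability scale with predict(). A minimal sketch, where the ceo value of 3 is just a hypothetical illustrative choice:

# Predicted probability of profit for small vs. large companies,
# holding ceo at an illustrative value of 3
newdat <- data.frame(size = c(0, 1), ceo = 3)
predict(logit_fit, newdata = newdat, type = "response")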
For the average marginal effects, we can use the margins package again.

library(margins)
margins(logit_fit) %>%
  summary()
## factor AME SE z p lower upper
## ceo 0.1509 0.0651 2.3186 0.0204 0.0233 0.2784
## size 0.3897 0.0334 11.6568 0.0000 0.3242 0.4552
Notice how large the odds ratio for size is. Using furniture::tableX(), we can see that size and profit are nearly perfectly related. When this happens, the results from the odds ratios can be a bit crazy.

comp %>%
  tableX(size, profit)
## profit
## size 0 1 Total
## 0 12 2 14
## 1 1 9 10
## Total 13 11 24
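To see why, we can hand-compute the sample (unadjusted) odds ratio for size from this table: with only one size = 1 company that did not profit, the odds are extreme.

# Sample odds of profit: 9/1 when size == 1 versus 2/12 when size == 0
(9 / 1) / (2 / 12)   # unadjusted sample odds ratio = 54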
We can also obtain case-level diagnostics for the model with the diagnostics() function.

diagnostics(logit_fit) %>%
  round(3)
## profit size ceo pred resid dfb.1_ dfb.size dfb.ceo dffit cov.r
## 1 1 1 3.4 2.560 0.386 -0.064 0.141 0.065 0.170 1.248
## 2 0 0 2.7 -3.316 -0.267 -0.081 0.066 0.070 -0.090 1.220
## 3 0 0 3.2 -2.421 -0.413 -0.121 0.117 0.093 -0.162 1.217
## 4 1 1 2.7 1.307 0.692 0.085 0.313 -0.087 0.538 1.312
## 5 1 1 3.4 2.560 0.386 -0.064 0.141 0.065 0.170 1.248
## 6 1 1 3.8 3.276 0.272 -0.056 0.083 0.058 0.097 1.228
## 7 1 1 4.2 3.992 0.191 -0.040 0.047 0.041 0.056 1.211
## 8 1 1 3.4 2.560 0.386 -0.064 0.141 0.065 0.170 1.248
## 9 0 1 3.7 3.097 -2.507 0.710 -1.130 -0.730 -1.321 0.158
## 10 1 1 4.1 3.813 0.209 -0.044 0.054 0.045 0.064 1.215
## 11 1 1 4.5 4.529 0.146 -0.028 0.030 0.029 0.037 1.199
## 12 1 0 4.3 -0.452 1.375 -0.263 -0.256 0.516 1.150 0.832
## 13 1 0 4.8 0.443 0.996 -0.711 0.034 0.947 1.309 1.297
## 14 0 0 3.2 -2.421 -0.413 -0.121 0.117 0.093 -0.162 1.217
## 15 0 0 2.6 -3.495 -0.244 -0.073 0.058 0.064 -0.079 1.218
## 16 0 0 2.9 -2.958 -0.318 -0.098 0.084 0.082 -0.114 1.222
## 17 0 0 3.4 -2.063 -0.489 -0.129 0.141 0.088 -0.204 1.208
## 18 0 0 3.1 -2.600 -0.378 -0.114 0.105 0.091 -0.144 1.219
## 19 0 0 2.4 -3.853 -0.205 -0.058 0.044 0.052 -0.061 1.213
## 20 0 0 2.7 -3.316 -0.267 -0.081 0.066 0.070 -0.090 1.220
## 21 1 1 2.1 0.233 1.080 1.017 0.519 -1.046 1.897 1.301
## 22 0 0 4.3 -0.452 -0.992 0.181 0.176 -0.354 -0.790 1.122
## 23 0 0 2.3 -4.032 -0.188 -0.052 0.039 0.046 -0.054 1.210
## 24 0 0 3.2 -2.421 -0.413 -0.121 0.117 0.093 -0.162 1.217
## cook.d hat
## 1 0.003 0.105
## 2 0.001 0.067
## 3 0.003 0.087
## 4 0.036 0.233
## 5 0.003 0.105
## 6 0.001 0.074
## 7 0.000 0.052
## 8 0.003 0.105
## 9 0.702 0.080
## 10 0.000 0.057
## 11 0.000 0.039
## 12 0.206 0.232
## 13 0.223 0.389
## 14 0.003 0.087
## 15 0.001 0.063
## 16 0.002 0.075
## 17 0.005 0.096
## 18 0.002 0.083
## 19 0.000 0.055
## 20 0.001 0.067
## 21 0.466 0.479
## 22 0.083 0.232
## 23 0.000 0.051
## 24 0.003 0.087
The default diagnostic plots are also available with plot().

par(mfrow = c(2,2))
plot(logit_fit)
Next, let's fit a Poisson regression using tvhours from the gss data set. Does education (educ) predict the number of hours spent watching TV per day when controlling for age and sex?

pois_fit <- glm(tvhours ~ educ + age + sex, data = gss, family = poisson(link = "log"))
summary(pois_fit)
##
## Call:
## glm(formula = tvhours ~ educ + age + sex, family = poisson(link = "log"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0091 -1.0097 -0.3247 0.4255 7.8367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4005132 0.0845090 16.572 < 2e-16 ***
## educ -0.0458439 0.0049655 -9.232 < 2e-16 ***
## age 0.0069276 0.0008975 7.719 1.18e-14 ***
## sexMale -0.0813783 0.0320024 -2.543 0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2487.2 on 1323 degrees of freedom
## Residual deviance: 2317.2 on 1320 degrees of freedom
## (699 observations deleted due to missingness)
## AIC: 5843.5
##
## Number of Fisher Scoring iterations: 5
What is the interpretation of the coefficient for educ? To get the rate ("risk") ratios, we can use educ7610::riskratios(). Note that the risk ratio and odds ratio functions do the exact same thing: they grab the coefficients and confidence intervals and exponentiate them.

riskratios(pois_fit)
## RR 2.5 % 97.5 %
## (Intercept) 4.0572818 3.4376898 4.7871254
## educ 0.9551911 0.9459294 0.9645026
## age 1.0069517 1.0051795 1.0087223
## sexMale 0.9218449 0.8657443 0.9814656
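As a sketch of what riskratios() and oddsratios() presumably do internally, the base R equivalent is below. Note that confint() on a glm uses profile likelihood, so the intervals may differ slightly from Wald-based ones.

exp(coef(pois_fit))     # exponentiated coefficients (the rate ratios)
exp(confint(pois_fit))  # exponentiated profile-likelihood confidence intervals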
Again, margins gives us the average marginal effects.

margins(pois_fit) %>%
  summary()
## factor AME SE z p lower upper
## age 0.0207 0.0027 7.6609 0.0000 0.0154 0.0259
## educ -0.1367 0.0150 -9.1344 0.0000 -0.1660 -0.1074
## sexMale -0.2420 0.0950 -2.5487 0.0108 -0.4282 -0.0559
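For a Poisson model with a log link, the marginal effect of a continuous predictor at a given observation is the coefficient times that observation's fitted count, so the AME is the coefficient times the mean fitted value. A quick sketch reproducing the educ AME by hand:

# coefficient for educ times the mean fitted count
coef(pois_fit)["educ"] * mean(fitted(pois_fit))  # matches the AME for educ above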
For an ordinal logistic regression, let's first cut a continuous income variable (income06) into 3 discrete values.

gss3 <- gss %>%
  mutate(inc_cat = cut_number(income06, 3),
         inc_cat = factor(inc_cat, labels = c("Low", "Mid", "High")))
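A quick sanity check on the resulting categories (a sketch; the counts will include an NA row if income06 has missing values):

gss3 %>%
  count(inc_cat)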
Now let's see if educ and hompop predict this categorical income variable (inc_cat). We will use the glm_olr() function from the educ7610 package. Note that this function is a wrapper around MASS::polr(); see ?MASS::polr for more information on polr().

olr_fit <- glm_olr(inc_cat ~ educ + hompop, data = gss3)
olr_fit
## Ordinal Logistic Regression
## Value Std. Error t value p value
## educ 0.29783 0.017622 16.901 4.4103e-64
## hompop 0.34114 0.033895 10.065 7.9166e-24
## Low|Mid 4.08444 0.264444 15.445 8.1029e-54
## Mid|High 6.01402 0.286829 20.967 1.3060e-97
## ---
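Because glm_olr() wraps MASS::polr(), a roughly equivalent direct call is sketched below (the printed formatting will differ):

# Using MASS::polr() directly; inc_cat's levels are already in order
# (Low < Mid < High). Calling MASS::polr() avoids library(MASS)
# masking dplyr::select().
polr_fit <- MASS::polr(inc_cat ~ educ + hompop, data = gss3, Hess = TRUE)
summary(polr_fit)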
The glm_olr() output above shows both the coefficients (educ and hompop) and the "cut-offs" or "thresholds" of the model (Low|Mid and Mid|High). Generally, we will only interpret the coefficients, while the thresholds are important for the model to work. With this in mind, what is the interpretation of the coefficients here? As before, we can get odds ratios.

oddsratios(olr_fit)
## OR 2.5 % 97.5 %
## educ 1.346929 1.301701 1.394819
## hompop 1.406546 1.316680 1.503844
Based on these odds ratios, what effect does educ have on the ordinal income variable? What about hompop?

We will provide a very brief introduction to the following methods: survival analysis (Cox proportional hazards regression), multilevel modeling, and structural equation modeling. Note that more training should be obtained before using these methods without guidance (i.e., you'll know just enough to be dangerous!).
For the survival analysis examples, we will use the bmt data set with a few minor variable adjustments.

library(survival)
library(OIsurv)
library(ranger)

data(bmt)
bmt2 <- bmt %>%
dplyr::select(-c(t2,d2,d3)) %>%
dplyr::mutate(group = factor(group),
da = factor(da),
dc = factor(dc),
dp = factor(dp),
z3 = factor(z3),
z4 = factor(z4),
z5 = factor(z5),
z6 = factor(z6),
z8 = factor(z8),
z9 = factor(z9),
z10 = factor(z10)
)
First, let's estimate survival curves by group with survfit().

fit <- bmt2 %>%
  filter(complete.cases(t1, d1, group)) %>%
  survfit(Surv(t1, factor(d1)) ~ group, data = .)
The helper function below pulls the estimated survival probabilities out of the survfit object and plots the curves with ggplot2.

surv_curve <- function(survfit){
  # extract the survival probabilities, grouping, and time points
  df <- survfit$pstate %>%
    data.frame() %>%
    setNames(c("left", "remaining")) %>%
    dplyr::mutate(group = rep(seq_along(survfit$n),
                              times = survfit$strata)) %>%
    dplyr::mutate(time = survfit$time)
  # plot the proportion remaining over time for each group
  ggplot(df, aes(time, remaining, group = factor(group), color = factor(group))) +
    geom_line() +
    geom_point(alpha = .2, shape = 3) +
    theme_bw() +
    labs(y = "Survival Rate",
         x = "Time",
         color = "Group") +
    coord_cartesian(ylim = c(0,1))
}
surv_curve(fit)
Next, let's fit a Cox proportional hazards model predicting survival by group.

bmt2 %>%
  coxph(Surv(t1, d1) ~ group, data = .)
## Call:
## coxph(formula = Surv(t1, d1) ~ group, data = .)
##
## coef exp(coef) se(coef) z p
## group2 -0.6552 0.5193 0.2937 -2.231 0.0257
## group3 0.3702 1.4480 0.2674 1.384 0.1662
##
## Likelihood ratio test=15 on 2 df, p=0.0005539
## n= 137, number of events= 81
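Cox regression assumes proportional hazards; survival::cox.zph() gives a standard check of that assumption. A minimal sketch:

cox_fit <- coxph(Surv(t1, d1) ~ group, data = bmt2)
cox.zph(cox_fit)  # non-significant tests are consistent with proportional hazards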
In the Cox model above, both coefficients compare against group 1. To compare groups 2 and 3 directly, we can use car's linearHypothesis().

bmt2 %>%
  coxph(Surv(t1, d1) ~ group, data = .) %>%
  car::linearHypothesis("group2 = group3")
## Linear hypothesis test
##
## Hypothesis:
## group2 - group3 = 0
##
## Model 1: restricted model
## Model 2: Surv(t1, d1) ~ group
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 136
## 2 135 1 14.219 0.0001627 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
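An equivalent way to get this comparison (a sketch) is to refit with group 2 as the reference level, so the group3 coefficient becomes the direct group 2 vs. group 3 contrast:

bmt2 %>%
  dplyr::mutate(group = relevel(group, ref = "2")) %>%
  coxph(Surv(t1, d1) ~ group, data = .)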
For multilevel models, we will use lme4 and lmerTest to estimate these types of models. We'll also use the InstEval data from lme4, where students are nested within classes (specifically, we have the professor, not the class).

library(lme4)
library(lmerTest)

data(InstEval)

InstEval <- InstEval %>%
  mutate(studage = factor(studage, labels = c("Youngest", "Young-Mid", "Old-Mid", "Oldest"),
                          ordered = FALSE),
         lectage = factor(lectage, labels = c("Youngest", "Young", "Mid", "Old-Mid", "Old", "Oldest"),
                          ordered = FALSE))
str(InstEval)
## 'data.frame': 73421 obs. of 7 variables:
## $ s : Factor w/ 2972 levels "1","2","3","4",..: 1 1 1 1 2 2 3 3 3 3 ...
## $ d : Factor w/ 1128 levels "1","6","7","8",..: 525 560 832 1068 62 406 3 6 19 75 ...
## $ studage: Factor w/ 4 levels "Youngest","Young-Mid",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ lectage: Factor w/ 6 levels "Youngest","Young",..: 2 1 2 2 1 1 1 1 1 1 ...
## $ service: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 2 1 1 1 ...
## $ dept : Factor w/ 14 levels "15","5","10",..: 14 5 14 12 2 2 13 3 3 3 ...
## $ y : int 5 2 5 3 2 4 4 5 5 4 ...
Now let's look at the effects of student age (studage), lecturer/professor age (lectage), and whether or not the class was taught within the lecturer's/professor's own department (service) on the rating students gave of the class (y), while controlling for the nesting of students (s), professors (d), and departments (dept). To control for the nesting, we use a new notation, + (1 | var), which controls for var in the simplest way (a random intercept).

fit_mlm <- lmer(y ~ studage + lectage + service + (1 | s) + (1 | d) + (1 | dept),
                data = InstEval)
summary(fit_mlm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: y ~ studage + lectage + service + (1 | s) + (1 | d) + (1 | dept)
## Data: InstEval
##
## REML criterion at convergence: 237615.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.09054 -0.74589 0.04066 0.77162 3.15582
##
## Random effects:
## Groups Name Variance Std.Dev.
## s (Intercept) 0.106724 0.32669
## d (Intercept) 0.260800 0.51069
## dept (Intercept) 0.006925 0.08321
## Residual 1.383442 1.17620
## Number of obs: 73421, groups: s, 2972; d, 1128; dept, 14
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.291e+00 3.325e-02 2.604e+01 98.976 < 2e-16 ***
## studageYoung-Mid 5.197e-02 2.318e-02 4.090e+03 2.242 0.02502 *
## studageOld-Mid 7.216e-02 2.398e-02 4.533e+03 3.010 0.00263 **
## studageOldest 1.363e-01 2.639e-02 5.063e+03 5.166 2.48e-07 ***
## lectageYoung -8.074e-02 1.538e-02 6.834e+04 -5.249 1.53e-07 ***
## lectageMid -1.102e-01 1.672e-02 7.205e+04 -6.588 4.49e-11 ***
## lectageOld-Mid -1.892e-01 1.960e-02 6.845e+04 -9.650 < 2e-16 ***
## lectageOld -1.644e-01 2.141e-02 6.859e+04 -7.681 1.59e-14 ***
## lectageOldest -2.460e-01 2.049e-02 5.807e+04 -12.009 < 2e-16 ***
## service1 -7.278e-02 1.348e-02 3.994e+04 -5.399 6.73e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) stdY-M stdO-M stdgOl lctgYn lctgMd lctO-M lectgOld
## studgYng-Md -0.322
## studgOld-Md -0.353 0.517
## studagOldst -0.346 0.491 0.575
## lectageYong -0.195 -0.009 -0.023 -0.025
## lectageMid -0.034 -0.197 -0.236 -0.259 0.342
## lectgOld-Md -0.057 -0.170 -0.209 -0.234 0.473 0.403
## lectageOld 0.000 -0.136 -0.274 -0.283 0.300 0.417 0.374
## lectagOldst 0.004 -0.144 -0.297 -0.373 0.406 0.447 0.498 0.490
## service1 -0.142 0.017 0.056 0.054 -0.033 -0.044 -0.063 -0.072
## lctgOlds
## studgYng-Md
## studgOld-Md
## studagOldst
## lectageYong
## lectageMid
## lectgOld-Md
## lectageOld
## lectagOldst
## service1 -0.121
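One useful summary of the random effects is the proportion of total variance at each level (the intraclass correlations). A sketch computing these from the variance components:

vc <- as.data.frame(VarCorr(fit_mlm))
vc$icc <- vc$vcov / sum(vc$vcov)  # proportion of total variance per level
vc[, c("grp", "vcov", "icc")]     # the grp column says which row is which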
For structural equation models, we will use lavaan (which stands for LAtent VAriable ANalysis) with the FacialBurns data set provided therein.

library(lavaan)

data(FacialBurns)
str(FacialBurns)
## 'data.frame': 77 obs. of 6 variables:
## $ Selfesteem: int 35 40 38 30 27 35 35 28 29 26 ...
## $ HADS : int 6 2 2 4 11 5 4 13 19 16 ...
## $ Age : int 23 61 34 29 46 18 64 20 47 41 ...
## $ TBSA : num 3.5 6 8 6 19.2 ...
## $ RUM : num 5 4 4 5 13 8 6 8 13 8 ...
## $ Sex : int 1 1 1 1 1 1 1 1 1 1 ...
In lavaan, we create a model specification that is just a long string, as shown below. Here, we look at the effect of age, sex, rumination (RUM), anxiety (HADS), and burn severity (TBSA) on self-esteem.

model <- "
## Just using SEM for a multiple regression
Selfesteem ~ Age + Sex + RUM + HADS + TBSA
"
sem(model, data = FacialBurns) %>% summary()
## lavaan 0.6-3 ended normally after 19 iterations
##
## Optimization method NLMINB
## Number of free parameters 6
##
## Number of observations 77
##
## Estimator ML
## Model Fit Test Statistic 0.000
## Degrees of freedom 0
## Minimum Function Value 0.0000000000000
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## Selfesteem ~
## Age 0.029 0.035 0.824 0.410
## Sex 2.262 1.268 1.784 0.074
## RUM -0.374 0.170 -2.194 0.028
## HADS -0.421 0.074 -5.668 0.000
## TBSA -0.127 0.076 -1.683 0.092
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .Selfesteem 17.004 2.740 6.205 0.000
We can also ask for standardized estimates. In the Std.all column, both the outcome and the predictors have been standardized.

sem(model, data = FacialBurns) %>% summary(standardized = TRUE)
## lavaan 0.6-3 ended normally after 19 iterations
##
## Optimization method NLMINB
## Number of free parameters 6
##
## Number of observations 77
##
## Estimator ML
## Model Fit Test Statistic 0.000
## Degrees of freedom 0
## Minimum Function Value 0.0000000000000
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Selfesteem ~
## Age 0.029 0.035 0.824 0.410 0.029 0.070
## Sex 2.262 1.268 1.784 0.074 2.262 0.165
## RUM -0.374 0.170 -2.194 0.028 -0.374 -0.235
## HADS -0.421 0.074 -5.668 0.000 -0.421 -0.529
## TBSA -0.127 0.076 -1.683 0.092 -0.127 -0.162
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .Selfesteem 17.004 2.740 6.205 0.000 17.004 0.524
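If you'd rather work with the estimates as a data frame than printed output, lavaan's parameterEstimates() returns them in a tidy form (a brief sketch):

sem(model, data = FacialBurns) %>%
  parameterEstimates(standardized = TRUE)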
Notice that these estimates match what we get from lm() (the standard errors differ slightly because lavaan's default maximum likelihood estimator uses N rather than a degrees-of-freedom correction).

lm(Selfesteem ~ Age + Sex + RUM + HADS + TBSA,
   data = FacialBurns) %>%
  summary()
##
## Call:
## lm(formula = Selfesteem ~ Age + Sex + RUM + HADS + TBSA, data = FacialBurns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.276 -2.725 -0.347 2.773 10.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.19221 2.08571 17.352 < 2e-16 ***
## Age 0.02896 0.03659 0.792 0.4313
## Sex 2.26199 1.32033 1.713 0.0910 .
## RUM -0.37365 0.17737 -2.107 0.0387 *
## HADS -0.42051 0.07726 -5.443 7.09e-07 ***
## TBSA -0.12744 0.07886 -1.616 0.1105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.294 on 71 degrees of freedom
## Multiple R-squared: 0.4761, Adjusted R-squared: 0.4392
## F-statistic: 12.91 on 5 and 71 DF, p-value: 6.201e-09
The approaches closely related to linear regression are wide-reaching, allowing you to model all sorts of outcome variables in various ways. Using what you know about linear regression, nearly all of these approaches are similar enough that you can learn whichever of them you need for your research with far less effort than if you did not know linear regression.