```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Chapter 18 was about several extensions of linear regression in the form of GLMs. At the end of the chapter, several additional linear models were discussed, including multilevel modeling and structural equation modeling. The following examples help illustrate how these approaches can be used in research with `R`.
1. Let's start by loading the `tidyverse` package, the `furniture` package, and the `rlm` package.
```{r}
library(tidyverse)
library(furniture)
library(rlm)
```
2. Import the `comp` and `gss` data sets found in `rlm` into R.
```{r}
data(comp)
data(gss)
```
## Generalized Linear Models
### Logistic Regression
3. Let's look at the `comp` data set first. Do the size of the company (`size`) and the leadership of the CEO (`ceo`) predict whether or not the company had a profit (`profit`; dichotomous, where 1 = had a profit, 0 = otherwise)? Let's use a logistic regression to understand the relationship.
```{r}
logit_fit <- glm(profit ~ size + ceo, data = comp, family = binomial(link = "logit"))
summary(logit_fit)
```
4. What do the coefficients from this model mean here? Is this in an intuitive metric?
5. We can get more interpretable effects using Odds Ratios and Average Marginal Effects (AME). First, we'll get the odds ratios using the `rlm::oddsratios()` function.
```{r}
oddsratios(logit_fit)
```
6. Interpret the odds ratios and the confidence intervals of these odds ratios in context of the research question.
7. Now we'll get the AMEs. We'll use the `margins` package again.
```{r}
library(margins)
margins(logit_fit) %>%
  summary()
```
8. Interpret these AMEs in context of the research question.
9. Overall, the AME is a more reasonable estimate here, due to some issues with the data. Specifically, one issue in logistic regression arises when an outcome is perfectly (or nearly perfectly) predicted by a predictor, a problem known as separation. In this case, if we use `furniture::tableX()` we can see that `size` and `profit` are nearly perfectly related. When this happens, the odds ratios can be a bit crazy.
```{r}
comp %>%
  tableX(size, profit)
```
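To see why separation blows up the odds ratios, here is a minimal sketch on a made-up toy vector (an assumption for illustration, not the `comp` data): a predictor that perfectly separates the outcome.

```{r}
# Toy data (hypothetical, not from `comp`): x perfectly separates y,
# so the maximum likelihood estimate does not exist. glm() warns that
# fitted probabilities of 0 or 1 occurred, and the estimates run off
# toward infinity before the algorithm stops.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)
sep_fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(sep_fit)       # the slope estimate is enormous
exp(coef(sep_fit))  # so the "odds ratio" is astronomically large
```

This is the same mechanism producing the crazy odds ratios above, which is why the AME is the safer summary here.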
10. Let's check the diagnostics of this logistic regression model.
```{r}
diagnostics(logit_fit) %>%
  round(3)
par(mfrow = c(2,2))
plot(logit_fit)
```
11. These diagnostics will look different than in linear regression due to the nature of the dichotomous outcome. However, some of the weirdness in these diagnostics (especially "Residuals vs Fitted" and "Scale-Location") has to do with this very small sample size for logistic regression. Generally, we want a much larger sample size to use logistic regression.
12. Overall, we likely have an issue with normality. Point "9" may be problematic based on Cook's distance. This could be more of a problem of misspecification than anything else. What variables could we be omitting here?
### Poisson Regression
13. To see how Poisson regression works, we'll use `tvhours` from the `gss` data set. Does education (`educ`) predict number of hours spent watching TV per day when controlling for `age` and `sex`?
```{r}
pois_fit <- glm(tvhours ~ educ + age + sex, data = gss, family = poisson(link = "log"))
summary(pois_fit)
```
14. What do the coefficients mean here on `educ`?
15. We can also get risk ratios for Poisson regression with `rlm::riskratios()`. *Note that the risk ratio and odds ratio functions do the exact same thing: grab the coefficients and confidence intervals and exponentiate them.*
```{r}
riskratios(pois_fit)
```
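Because these functions (per the note above) just exponentiate the coefficients and their confidence intervals, you can reproduce the idea with base `R`. A minimal sketch on the built-in `mtcars` data (used here for illustration, not the chapter's data), with Wald intervals via `confint.default()`:

```{r}
# Fit a logistic regression on built-in data, then exponentiate the
# coefficients and Wald confidence intervals by hand.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
exp(coef(fit))             # odds ratios
exp(confint.default(fit))  # confidence intervals on the odds-ratio scale
```

The same two lines applied to a Poisson fit would give the risk ratios and their intervals.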
16. Interpret the risk ratios in context of the research question.
17. Let's also get the AMEs.
```{r}
margins(pois_fit) %>%
  summary()
```
18. Interpret these in context of the research question.
### Ordinal Logistic Regression
19. To consider how to use ordinal logistic regression, we are going to split the income variable (`income06`) into 3 ordered categories.
```{r}
gss3 <- gss %>%
  mutate(inc_cat = cut_number(income06, 3),
         inc_cat = factor(inc_cat, labels = c("Low", "Mid", "High")))
```
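To see what `cut_number()` (from `ggplot2`, loaded with the tidyverse) is doing, here is a minimal sketch on a toy vector: it cuts at quantiles, so each bin holds roughly the same number of observations.

```{r}
library(ggplot2)

# Toy vector: 12 values split into 3 equal-frequency bins.
bins <- cut_number(1:12, 3)
table(bins)  # each bin contains 4 observations
```

This is why the resulting `Low`, `Mid`, and `High` categories each contain about a third of the sample.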
20. Using this new variable, let's use an ordinal logistic regression to test if `educ` and `hompop` predict this categorical income variable (`inc_cat`). We will use the `glm_olr()` function from the `rlm` package. Note that this function is a wrapper of `MASS::polr()`. See [this website](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) for more information on `polr()`.
```{r}
olr_fit <- glm_olr(inc_cat ~ educ + hompop, data = gss3)
olr_fit
```
21. This gives us the coefficients of the model (both `educ` and `hompop`) and the "cut-offs" or "thresholds" of the model (`Low|Mid` and `Mid|High`). Generally, we will only be interpreting the coefficients while the thresholds are important for the model to work. With this in mind, what is the interpretation of the coefficients here?
22. We can get the odds ratios of this model.
```{r}
oddsratios(olr_fit)
```
23. What effect does `educ` have on the ordinal income variable? What about `hompop`?
## Other Linear Models
We will provide a very brief introduction to the following methods:
* Survival Analysis
* Time Series Analysis
* Multilevel Modeling (specifically, mixed effects modeling)
* Structural Equation Modeling (SEM)
Note that more training should be obtained for using these methods without guidance (i.e., you'll know just enough to be dangerous!).
### Survival Analysis
24. Survival analysis is for "censored" data: data wherein the outcome is an event that may or may not have happened before the end of the study. Instead of assuming that an event that had not happened by the end of the study would never happen, survival analysis lets us model events without this assumption. To do a survival analysis, we will need the following packages, the `bmt` data set, and a few minor variable adjustments.
```{r}
library(survival)
library(OIsurv)
library(ranger)
data(bmt)
bmt2 <- bmt %>%
  dplyr::select(-c(t2, d2, d3)) %>%
  dplyr::mutate(group = factor(group),
                da = factor(da),
                dc = factor(dc),
                dp = factor(dp),
                z3 = factor(z3),
                z4 = factor(z4),
                z5 = factor(z5),
                z6 = factor(z6),
                z8 = factor(z8),
                z9 = factor(z9),
                z10 = factor(z10))
```
25. To understand survival analysis, we will use two important aspects of it: 1) Kaplan-Meier (KM) plots and 2) the Cox Proportional Hazards model. First, the KM plot: we want to see if there are differences in survival across the groups. There's some data manipulation happening in this chunk, so feel free to ignore that and focus on the output.
```{r}
fit <- bmt2 %>%
  filter(complete.cases(t1, d1, group)) %>%
  survfit(Surv(t1, factor(d1)) ~ group, data = .)
surv_curve <- function(survfit){
  df <- survfit$pstate %>%
    data.frame() %>%
    setNames(c("left", "remaining")) %>%
    dplyr::mutate(group = rep(seq_along(survfit$n),
                              times = survfit$strata)) %>%
    dplyr::mutate(time = survfit$time)
  ggplot(df, aes(time, remaining, group = factor(group), color = factor(group))) +
    geom_line() +
    geom_point(alpha = .2, shape = 3) +
    theme_bw() +
    labs(y = "Survival Rate",
         x = "Time",
         color = "Group") +
    coord_cartesian(ylim = c(0, 1))
}
surv_curve(fit)
```
26. What does this plot tell us about the survival in each group? Which group has the best survival rate?
27. Now let's use a Cox Proportional Hazards model to understand the differences in the risk across the groups.
```{r}
bmt2 %>%
  coxph(Surv(t1, d1) ~ group, data = .)
```
28. Are group 2 or 3 significantly different than group 1 (the reference group) in risk?
29. We can also use `car`'s `linearHypothesis()` to assess differences between groups 2 and 3.
```{r}
bmt2 %>%
  coxph(Surv(t1, d1) ~ group, data = .) %>%
  car::linearHypothesis("group2 = group3")
```
30. Are they significantly different? Is this a surprise from the coefficients from the model?
### Time Series Analysis
31. Because time series is not used extensively in most of our fields, we won't cover an example here. Know, however, that many of the methods used are deeply related to linear regression as we've discussed it.
### Multilevel Modeling
32. Multilevel models are particularly helpful when we have issues of non-independence. For example, kids nested within classes are not independent, especially if the intervention is delivered to the class as a whole. Another example is longitudinal data wherein time points are nested within individuals. In these situations, we have a couple options. We can estimate an effect for each classroom or each individual. But this really eats up degrees of freedom and we are rarely interested in whether classroom 1 is significantly different than classroom 2 or if Susan specifically performed better than Ralphy. But we don't want these differences to affect the estimates we do care about.
33. One way around this is to use what are called *random effects*. These differ from what we normally estimate (*fixed effects*) because they don't report a coefficient for each group, yet they still control the variation produced by the nesting.
34. Let's look at using the `lme4` and `lmerTest` packages to estimate these types of models. We'll also use the `InstEval` data from `lme4`, where students are nested within classes (specifically, we have the professor, not the class).
```{r}
library(lme4)
library(lmerTest)
data(InstEval)
InstEval <- InstEval %>%
  mutate(studage = factor(studage,
                          labels = c("Youngest", "Young-Mid", "Old-Mid", "Oldest"),
                          ordered = FALSE),
         lectage = factor(lectage,
                          labels = c("Youngest", "Young", "Mid", "Old-Mid", "Old", "Oldest"),
                          ordered = FALSE))
str(InstEval)
```
35. Let's look at the effects of student age (`studage`), lecturer/professor age (`lectage`), and whether this was in the lecturer's/professor's department (`service`) on the rating students gave of the class (`y`), while controlling for the grouping of ratings within students (`s`), lecturers/professors (`d`), and departments (`dept`). To control for this grouping we use a new notation, `+ (1 | var)`, which adds a random intercept for each level of `var` (the simplest kind of random effect).
```{r, cache = TRUE}
fit_mlm <- lmer(y ~ studage + lectage + service + (1 | s) + (1 | d) + (1 | dept),
data = InstEval)
summary(fit_mlm)
```
36. Interpretation of the coefficients is similar to linear regression. What is the effect of student age here (hint: look at the Fixed effects)? Is it significant? Note that the p-values here are based on an approximation of the degrees of freedom called the Satterthwaite approximation, provided by `lmerTest`.
### Structural Equation Modeling
37. SEM is a complex set of tools that encompass a great deal of research questions. This is only going to be the briefest of brief introductions here. To do so, we will use the main SEM package `lavaan` (stands for LAtent VAriable ANalysis) with the `FacialBurns` data set provided therein.
```{r}
library(lavaan)
data(FacialBurns)
str(FacialBurns)
```
38. To use `lavaan`, we create a model specification that is just a long string, as shown below. Here, we look at the effect of age (`Age`), sex (`Sex`), rumination (`RUM`), anxiety (`HADS`), and burn severity (`TBSA`) on self-esteem (`Selfesteem`).
```{r}
model <- "
## Just using SEM for a multiple regression
Selfesteem ~ Age + Sex + RUM + HADS + TBSA
"
sem(model, data = FacialBurns) %>% summary()
```
39. What is the effect of anxiety on self-esteem? Is it significant?
40. We can further get standardized effect sizes of these, using the following. The standardized effect is in the `Std.all` column, meaning both the outcome and the predictors were standardized.
```{r}
sem(model, data = FacialBurns) %>% summary(standardized = TRUE)
```
41. What is the standardized effect of anxiety on self-esteem? How would you interpret that?
42. Note that the same results would be obtained using linear regression here:
```{r}
lm(Selfesteem ~ Age + Sex + RUM + HADS + TBSA,
   data = FacialBurns) %>%
  summary()
```
## Conclusion
The approaches closely related to linear regression are wide-reaching, allowing one to model all sorts of outcome variables in various ways. Using what you know about linear regression, nearly all of these approaches are similar enough that you can learn any of them you need for your research with far less effort than if you did not know linear regression.