Introduction

This is a guide for your homework. You can complete it in an R Notebook, an R script with output pasted/printed, or you can simply use this .Rmd file and fill in the code chunks. This assignment follows the examples and therefore the code there should be very helpful for you.

Due: Later

Chapter 7

Chapter 7 talks about using regression as a prediction tool (i.e., for predictive modeling). The following examples help illustrate a few items that were discussed. Follow all instructions to complete Chapter 7.

Prediction with Linear Regression

  1. Open RStudio, start a new R script or RMarkdown document.
  2. Let’s start by loading the tidyverse package (you can ignore the notes that you see below that it gives you once you load it), the furniture package, and the caret package. The caret package provides us with almost everything you need to do predictive modeling with regression (or many other approaches).
library(tidyverse)
  1. Import your data set into R.

  2. Let’s start predicting your outcome of interest with several of the variables in your data set. First thing we need to do is deal with the missing values in the data that will be used for the prediction. There are several ways to handle the missing data but we will take the easiest route by just removing them (often not recommended–see “multiple imputation”). To do so, use filter(complete.cases()) as highlighted in the examples for Chapter 7.

  3. Because we are using a random aspect of our analysis (cross-validation), let’s set our random seed so we can replicate our results with set.seed().

  4. Now we can set up our cross-validation using trainControl().

  5. Using that, we can how fit our cross-validated regression models. The method = "lm" uses linear regression just like the regression that we’ve discussed but is not so invested in coefficients but more about rpediction.

  6. We are given some pieces of information here. It tells us our sample size, how many predictors we’ve included, that we’ve used cross-validation, and then gives us an RMSE (root mean squared error), \(R^2\), and MAE (mean absolute error). These are all measures of accuracy of prediction. How much of the variation in the outcome are we explaining with our predictors?

Selection Approaches

Notably, these selection approaches cannot be used for causal interpretation of the individual coefficients or predictors. Rather, they inform us on the importance of each variable in predicting the outcome.

  1. There are several predictor selection approaches. Among these, the Lasso (least absolute shrinkage and selector operator) is one of the most useful. The Lasso is built on linear regression but integrates a penalty term that allows it to select important (important in terms of prediction) variables. Let’s use the method = "glmnet" option with caret that uses a type of blended lasso approach.

  2. We are given several pieces of information, including the levels of the tuning parameters (aspects of the model that can be adjusted beyond that of linear regression) and their corresponding accuracy levels. How accurate is the best model?
  3. We can further investigate this by seeing how many/which predictors were selected for the best model. We can do that by using coef() while using the s argument where we give the best lambda value from above.

  4. Which variables were selected? Is this list surprising?
  5. Another predictor selection approach is stepwise (as discussed in the book). It has proven to not work super well in accurately selecting predictors but it can be helpful in prediction. We’ll do this below with method = "leapSeq".

  6. Here, nvmax is the number of predictors selected. How many predictors provided the best model?
  7. How much of the variability can we explain with the best model predictors in the model?

Chapter 8

Chapter 8 talks about how we can assess the importance of an individual or a set of predictors. The principles discussed herein help us be able to test whether two predictors in the same regression model have the same effect on the outcome.

Relative Importance of Predictors (Categorical)

  1. Using a categorical predictor in your data set, we want to know, first, fit a multiple regression with two categorical variables as the predictors of interest.

  2. In order to test if one of them has a bigger effect than the other, we will use car’s linearHypothesis(). Note that the syntax for linearHypothesis includes the variable and the level.

  3. Are the size of the effects different?

Relative Importance of Predictors (Continuous)

  1. These types of comparisons are a bit easier with categorical predictors. When it comes to continuous predictors, we need to make sure they are in comparable units. Now let’s use two continuous variables in a multiple regression.

  2. Does it look like the effect of one might be bigger than the other?
  3. Let’s test it with linearHypothesis().

  4. Is this sufficient to say they are different? Why or why not?

  5. Let’s standardize the two predictors and see if this changes anything.

  6. Now that the variables are in the same overall units (standard deviation units), are the effects significantly different?
  7. Which one, the unstandardized or the standardized, would be a more trustworthy comparison in this case?

Dominance Analysis

  1. We can also do dominance analysis using a (still in development) package called dominanceAnalysis. Let’s install and load the package.
#devtools::install_github("clbustos/dominanceAnalysis")
library(dominanceanalysis)
  1. We can see which of the two continuous variables is more dominant.

Chapter 9 and 10

Chapter 9 talks about multicategorical predictors, how we can leverage the information contained therein, and ways we can avoid common problems with these types of variables.

Multicategorical Variables

  1. If you have a multi-categorical variable in your data set, use it for the following analyses. If not, create a ficticious one. Let’s look at the distribution of both of these variables using furniture::tableF().

Multi-Categorical Predictors

  1. Let’s do a simple regression with the multicategorical variable (make sure it is already a factor). The lm() function automatically does dummy coding for us, selecting a reference group for you. What is the reference group for your variable?

  2. Are there significant mean differences between the non-reference levels and the reference level regarding the outcome?

  3. Let’s do a multiple regression using both the multicategorical variable and any covariate of your choosing and assign it to fit.

Changing the Reference Level

  1. It is often good to control the reference level for optimal interpretability. Let’s change the reference level of the multicategorical variable.

Contrasts and Adjusted Means

Simple Contrasts (Comparing Two Levels)

  1. We are also often interested in comparing the non-reference levels as well. We can use the car package with the linearHypothesis() function to test these. To make these comparisons, we use the coefficient names from fit. Compare two of the non-reference groups. Are they significantly different?

Adjusted Means

  1. Using the model in fit, go through each estimate and interpret it, including the intercept, the levels of the multicategorical variable, and the covariate(s).
  2. Why is the intercept here not all that useful? How might we make it useful again?

  3. Let’s check out the adjusted means of the multicategorical variable. We can use the emmeans function for this.

  4. What do these “emmeans” mean? It may help in the interpretation to see the values that emmeans() averaged over, we can use emmeans::ref_grid(). Does this help us more fully interpret the emmeans from before?

Chapter 11

Chapter 11 talks about the problem of multiple tests. These examples will be fairly short but will highlight a few ways to adjust for multiple tests.

Adjusting for Multiple Tests

  1. Using any of your multiple regressions from above that included your multicategorical variable, let’s apply the Bonferroni’s adjustment. To do so, we will grab the p-values from the summary() and feed that to p.adjust().

  2. Are there significant mean differences between the non-reference levels and the reference level regarding the outcome for the multicategorical variable?

  3. Let’s try a different approach; namely, let’s try "fdr" (the false discovery rate). This approach controls the number of false discoveries (as the name inplies). This approach is less conservative than Bonferroni’s.

  4. Do any conclusions change based on using the false discovery rate rather than Bonferroni’s? Which change? Why?