Chapter 7

Prediction with Linear Regression

7.1. Let’s start by loading the tidyverse package (you can ignore the notes that you see below that it gives you once you load it), the furniture package, and the caret package. The caret package provides us with almost everything you need to do predictive modeling with regression (or many other approaches).

7.2. Import your data set into R.

7.3. Let’s start predicting your outcome of interest with several of the variables in the data. First thing we need to do is deal with the missing values in the data that will be used for the prediction. There are several ways to handle the missing data but we will take the easiest route by just removing them (often not recommended–see “multiple imputation”). To do so, use filter(complete.cases()) as highlighted in the examples for Chapter 7.

7.4. Because we are using a random aspect of our analysis (cross-validation), let’s set our random seed so we can replicate our results with set.seed().

7.5. Now we can set up our cross-validation using trainControl().

7.6. Using that, we can how fit our cross-validated regression models. The method = "lm" uses linear regression just like the regression that we’ve discussed but is not so invested in coefficients but more about rpediction.

7.7. We are given some pieces of information here. It tells us our sample size, how many predictors we’ve included, that we’ve used cross-validation, and then gives us an RMSE (root mean squared error), \(R^2\), and MAE (mean absolute error). These are all measures of accuracy of prediction. How much of the variation in the outcome are we explaining with our predictors?

Answer:

Selection Approaches

!!! This section of the homework is optional !!!

There are several predictor selection approaches. Among these, the Lasso (least absolute shrinkage and selector operator) is one of the most useful. The Lasso is built on linear regression but integrates a penalty term that allows it to select important (important in terms of prediction) variables. Let’s use the method = "glmnet" option with caret that uses a type of blended lasso approach.

We are given several pieces of information, including the levels of the tuning parameters (aspects of the model that can be adjusted beyond that of linear regression) and their corresponding accuracy levels. How accurate is the best model?

We can further investigate this by seeing how many/which predictors were selected for the best model. We can do that by using coef() while using the s argument where we give the best lambda value from above.

Which variables were selected? Is this list surprising?

Another predictor selection approach is stepwise (as discussed in the book). It has proven to not work super well in accurately selecting predictors but it can be helpful in prediction. We’ll do this below with method = "leapSeq".

Here, nvmax is the number of predictors selected. How many predictors provided the best model?

How much of the variability can we explain with the best model predictors in the model?

Chapter 8

Relative Importance of Predictors (Categorical)

8.1. Using a categorical predictor in your data set, fit a multiple regression with two categorical variables as the predictors of interest. Interpret the coefficients on both categorical variables.

Answer:

8.2. In order to test if one of them has a bigger effect than the other, we will use car’s linearHypothesis(). Note that the syntax for linearHypothesis includes the variable and the level. Are the size of the effects different?

Answer:

Relative Importance of Predictors (Continuous)

8.3. Now let’s use two continuous variables in a multiple regression. Does it appear like the effect of one might be bigger than the other without testing it?

Answer:

8.4. Let’s test it with linearHypothesis(). Is this sufficient to say they are different? Why or why not?

Answer:

8.5. Let’s standardize the two predictors and see if this changes anything. Now that the variables are in the same overall units (standard deviation units), are the effects significantly different?

Answer:

8.6. Which one, the unstandardized or the standardized, would be a more trustworthy comparison in this case?

Answer:

Dominance Analysis

8.7. We can also do dominance analysis using a (still in development) package called dominanceAnalysis. Let’s install and load the package.

#install.packages("remotes")
#remotes::install_github("clbustos/dominanceAnalysis")
library(dominanceanalysis)

8.8. Run a dominance analysis. Which of the two variables is more “dominant”?

Answer:

Chapter 9 and 10

Multicategorical Variables

9.1. If you have a multi-categorical variable in your data set, use it for the following analyses. If not, create a ficticious one (talk to the instructor or TA about this). Let’s look at the distribution of this variable using furniture::tableF().

Multi-Categorical Predictors

9.2. Let’s do a simple regression with the multicategorical variable (make sure it is already a factor). The lm() function automatically does dummy coding for us, selecting a reference group for you. After running the model, what is the reference group for your variable?

Answer:

9.3. Are there significant mean differences between the non-reference levels and the reference level regarding the outcome?

Answer:

Changing the Reference Level

9.4. It is often good to control the reference level for optimal interpretability. Let’s change the reference level of the multicategorical variable to whichever you think is best.

Contrasts and Adjusted Means

9.5. Let’s do a multiple regression using the multicategorical variable and any covariate of your choosing and assign it to fit.

Simple Contrasts (Comparing Two Levels)

9.6. We are also often interested in comparing the non-reference levels as well. We can use the car package with the linearHypothesis() function to test these. To make these comparisons, we use the coefficient names from fit. Compare two of the non-reference groups. Are they significantly different?

Adjusted Means

9.7. Again, using the model in fit, go through each estimate and interpret it, including the intercept, the levels of the multicategorical variable, and the covariate(s).

Answer:

9.8. Let’s check out the adjusted means of the multicategorical variable. We can use the emmeans function for this. What do these “emmeans” mean? (It may help in the interpretation to see the values that emmeans() averaged over, we can use emmeans::ref_grid(fit).)

Answer:

Chapter 11

Adjusting for Multiple Tests

11.1. Using any of your multiple regressions from above that included your multicategorical variable, let’s apply the Bonferroni’s adjustment. To do so, we will grab the p-values from the summary() and feed that to p.adjust(). These new p-values are adjusted ones so they won’t be exactly the same as the normal p-values that we obtain from summary().

11.2. Are there significant mean differences between the non-reference levels and the reference level regarding the outcome for the multicategorical variable?

Answer:

11.3. Let’s try a different approach; namely, let’s try "fdr" (the false discovery rate). This approach controls the number of false discoveries (as the name inplies). This approach is less conservative than Bonferroni’s.

11.4. Do any conclusions change based on using the false discovery rate rather than Bonferroni’s? Which change? Why?