This is a guide for your homework. You can complete it in an R Notebook, an R script with output pasted/printed, or you can simply use this .Rmd file and fill in the code chunks. This assignment follows the chapter examples, so the code there should be very helpful for you.

*Due: Sep 20th*

Chapter 2 introduces simple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 2.

- Open RStudio and start a new R script or R Markdown document.
- Load the `tidyverse` package (you can ignore the startup messages it prints when you load it).

```
library(tidyverse)
```

Import your data set into R.
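A minimal sketch of importing a data set with base R's `read.csv()` (`readr::read_csv()` from the tidyverse works the same way). The file path here is hypothetical; to keep the sketch self-contained it writes a built-in data set to a temporary CSV and reads it back.

```r
## Hypothetical path -- replace with the location of your own data file.
## For a self-contained demo, write a built-in data set out and read it back.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

df <- read.csv(path)
head(df)
```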

Use a simple regression model to assess the relationship of interest between a continuous outcome and a continuous predictor (you pick).
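A sketch of the model call, using `mtcars` as a stand-in for your data, with `mpg` as the continuous outcome and `wt` as the continuous predictor:

```r
## Simple linear regression: continuous outcome ~ continuous predictor
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # the intercept and the slope of the predictor
```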

This output gives you two values, that for the intercept and that for the slope of your predictor. What does the slope of the predictor mean here (i.e., interpret the value)?

Simple regression and correlation are intimately tied. Let’s show that below.

Using your data you already imported above, let’s look at the correlation between those same variables.
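A sketch with the same stand-in variables as above:

```r
## Correlation between the same two variables used in the regression
cor(mtcars$mpg, mtcars$wt, use = "complete.obs")
```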

Let's compare this to the regression value after we standardize both variables with `scale()`. We first have to grab just the complete cases of the variables. The intercept is essentially zero and the slope is the same as the correlation from before. Is this surprising? Why or why not?
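A sketch of the standardized fit, again with the `mtcars` stand-in variables; the slope reproduces the correlation:

```r
## Keep complete cases, standardize both variables, and refit
dat <- na.omit(mtcars[, c("mpg", "wt")])
dat_z <- data.frame(mpg_z = as.numeric(scale(dat$mpg)),
                    wt_z  = as.numeric(scale(dat$wt)))
fit_z <- lm(mpg_z ~ wt_z, data = dat_z)
coef(fit_z)              # intercept is essentially zero; slope is the correlation
cor(dat$mpg, dat$wt)
```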

Residuals can help us understand several things about the model and the relationships we are assessing. Most of this we'll talk about later, but here are some ways we'll access the residuals.

Assign the first model to `fit` (or any other name you want to use). With that object, use the `resid()` function to produce the residuals of the model.

Often we are going to be looking at the residuals in plots. To do this simply, we'll use `plot()`. You don't need to know what the four plots mean yet, but know that it uses the residuals from the model to create them.
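The steps above can be sketched as follows (still using the `mtcars` stand-in model):

```r
fit <- lm(mpg ~ wt, data = mtcars)  # assign the model to an object
head(resid(fit))                    # the residuals of the model

## The four diagnostic plots, arranged in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```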

Chapter 3 introduces multiple linear regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 3.

Use a multiple regression model to assess the relationship of interest while controlling for another variable.
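A sketch of the model call, with `wt` standing in for the predictor of interest and `hp` for the covariate:

```r
## Multiple regression: outcome ~ predictor of interest + covariate
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit2)  # the intercept plus two slopes
```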

This output gives you three values, one for the intercept, one for the slope of your predictor of interest, and another for the slope of your covariate. What does the slope of your predictor of interest mean here (i.e., interpret the value)?

To better understand how `educ` and `hompop` will change each other's simple effect to a partial effect, we can first check the correlation between the two. As a reminder, if the correlation between them is non-zero and they both are correlated with the outcome, the simple effect and the partial effect will differ (at least by a little).

Using the data you already imported above, let's look at the correlation between these three variables using `furniture::tableC()`.

- Are the covariate and the predictor of interest correlated?
- Now, run both simple regressions and compare to the results from the multiple regression earlier. How does the estimate on the predictor of interest change?
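A sketch of the comparison. `furniture::tableC()` prints a formatted correlation table; base `cor()` shows the same numbers here, with `mpg`, `wt`, and `hp` standing in for your outcome, predictor of interest, and covariate:

```r
## Correlation among the three variables (tableC() would format this nicely)
cor(mtcars[, c("mpg", "wt", "hp")], use = "complete.obs")

## Both simple regressions, then the multiple regression for comparison
coef(lm(mpg ~ wt, data = mtcars))
coef(lm(mpg ~ hp, data = mtcars))
coef(lm(mpg ~ wt + hp, data = mtcars))
```

Because the two predictors are correlated with each other and with the outcome, the estimate on the predictor of interest shifts once the covariate enters the model.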

Let’s run a multiple regression with two covariates now.

- How is the effect of interest interpreted in this case?

Let's compare this to the regression value after we standardize both variables with `scale()`. We first have to grab just the complete cases of the variables (again, consider why). The intercept is essentially zero again. Now, how is the estimate of education interpreted in this case?

Let’s compare this to the other way of standardizing the estimates (\(b_{standardized} = b \frac{s_x}{s_y}\)).

Although the computation was a bit messier, the resulting estimates are the same.
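The formula above can be checked directly. A sketch with the `mtcars` stand-in variables, comparing the hand-computed \(b \frac{s_x}{s_y}\) against the slope from regressing the `scale()`d variables:

```r
## Manual standardization of a slope: b_std = b * s_x / s_y
dat <- na.omit(mtcars[, c("mpg", "wt", "hp")])
fit <- lm(mpg ~ wt + hp, data = dat)
b_std <- coef(fit)["wt"] * sd(dat$wt) / sd(dat$mpg)

## The same number falls out of the regression on scale()d variables
fit_z <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = dat)
c(manual = unname(b_std), scaled = unname(coef(fit_z)[2]))
```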

Finally, let’s do a partial correlation by using the residuals as shown in the example for chapter 3.
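A sketch of the residual approach, controlling for the stand-in covariate `hp`:

```r
## Partial correlation of mpg and wt controlling for hp: correlate the
## residuals of each variable after regressing it on the covariate
r_out  <- resid(lm(mpg ~ hp, data = mtcars))
r_pred <- resid(lm(wt ~ hp, data = mtcars))
cor(r_out, r_pred)
```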

These values differ from the standardized regression coefficients. Why?

Consider if we had a variable that we wanted to control for (let’s call it \(c\)) but didn’t have access to it. We report on the regression below with \(x\) predicting \(y\). If we know that the correlation between \(x\) and \(c\) is positive and the correlation between \(y\) and \(c\) is positive, will the estimate on \(x\) go up, down, or stay the same?

```
## Do not edit this part
set.seed(843)
df <- tibble(
  x = rnorm(100),
  y = 2 * x + rnorm(100, 2, 5)
)
df %>%
  lm(y ~ x, data = .)
```

```
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 1.912 1.798
```

Chapter 4 introduces statistical inference with regression. The following examples help illustrate a number of principles that were discussed. Follow all instructions to complete Chapter 4.

Load the `furniture` package.

Use your simple regression model from Chapter 2 and use `summary()` to obtain the F-statistic of the model with its accompanying p-value, as well as the standard error, t-value, and p-value of the estimate itself.

- This output gives you the estimate, the standard error of the estimate, the t-value of the estimate, and the p-value of the estimate (where the null is that there is no relationship). Is the relationship statistically significant at the \(\alpha = .05\) level?
- From that same output, we see *a lot more* information. What pieces do you recognize? Which pieces are confusing?

Let's run one of the multiple regressions that you used in Chapter 3.
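A sketch of both calls with the `mtcars` stand-in models (`library(furniture)` is omitted here since this sketch does not use it):

```r
fit <- lm(mpg ~ wt, data = mtcars)        # the simple model from Chapter 2
summary(fit)                              # F-statistic, SEs, t-values, p-values

fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # a multiple regression from Chapter 3
summary(fit2)
```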

How is the effect of interest interpreted in this case? Is it statistically significant? Is that the significance/non-significance surprising?

Let's compare this to the regression value after we standardize the variables with `scale()`. We first have to grab just the complete cases of the variables. Are the standardized estimates more, less, or equally significant compared to the unstandardized ones? Is this surprising?

We can also test the significance of a predictor using model comparisons. Use your simple regression model (assign it to `fit1`) and then your multiple regression model (assign it to `fit2`). Compare the models with `anova()`. Does this match the significance of the predictor of interest above in the multiple regression?

For inference to be meaningful at all, we need to check if our assumptions are holding. For a linear regression model we have 4 main assumptions:

- Linearity
- Homoscedasticity
- Conditional Distribution of Y is normal and centered at \(\hat{Y}\)
- Independent Sampling

Let's test these assumptions using plots as we discussed in class. First, let's see if the relationship between your predictor of interest and the outcome is linear. This one we can take a look at via a scatterplot with a smoothed line showing the relationship. (Note: due to overplotting (points on top of points) we used `geom_count()` instead of `geom_point()`.)

Next, let's look at homoscedasticity. This we will look at using our model object: the `plot()` function applied to it gives us four plots. The first one lets us assess our residuals for homoscedasticity. Does it look homoscedastic?

The second plot shows us whether our residuals are approximately normal. The others we'll get to later. Do the residuals follow the line or do they deviate?
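The checks above can be sketched as follows, again with `mpg` and `wt` standing in for your variables:

```r
library(ggplot2)  # part of the tidyverse loaded earlier

## Linearity: scatterplot with a smoothed line; geom_count() sizes
## points by the number of overlapping observations
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_count() +
  geom_smooth()
p

## Homoscedasticity and normality: the first two diagnostic plots
fit <- lm(mpg ~ wt, data = mtcars)
plot(fit, which = 1:2)
```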

Finally, independent sampling is not an assumption that we can check. This is something that has to be taken care of during data collection, not during the analysis. So for this one, we will assume the data were collected in this fashion.

Remember that if we want to make inferences about a conditional mean (essentially a predicted point where we select the values of the X's), we can use a simple trick: subtract the chosen values from the original variables, and then the information on the intercept is the information for the conditional mean at that point. Let's try this below using your multiple regression model from before, centering your predictor of interest and a covariate at values that you pick (they need to be part of the sample). Say we want to get a confidence interval around that point. What is the standard error of this conditional mean?
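A sketch of the centering trick, using the hypothetical in-sample values `wt = 3.44` and `hp = 110` (both occur in `mtcars`):

```r
## Center the predictor and covariate at the chosen values; the intercept
## now estimates the conditional mean at that point
mtcars_c <- transform(mtcars, wt_c = wt - 3.44, hp_c = hp - 110)
fit_c <- lm(mpg ~ wt_c + hp_c, data = mtcars_c)
summary(fit_c)$coefficients["(Intercept)", ]  # estimate and its standard error
```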

Did anything in the results change from the multiple regression model from before?

- What is the interpretation for the conditional mean intercept? What does the intercept mean for the original case?
- Did the regression coefficients change on the variables? Why or why not?

Finally, we can get the confidence interval of this conditional mean using `confint()`. What does this 95% confidence interval mean?
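A sketch using the same hypothetical centering values as above:

```r
## 95% CI for the conditional mean at wt = 3.44, hp = 110, read off
## the centered model's intercept
fit_c <- lm(mpg ~ I(wt - 3.44) + I(hp - 110), data = mtcars)
confint(fit_c)["(Intercept)", ]
```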

Chapter 5 extends much of what we have already learned to dichotomous variables and a few other statistical principles. The following examples help illustrate a number of items that were discussed. Follow all instructions to complete Chapter 5.

`R` is very helpful in creating dummy variables since it does all the work for you. All you need to do is let `R` know that the variable is a factor. Select one of your categorical variables and see if `R` already knows that it should be a factor or not.

- If, at the top of the column, it says `<fct>`, then `R` already knows that it is a factor; otherwise, we need to use the `factor()` function to tell `R` that it is a factor.

Using this categorical variable, let's predict an outcome of interest (obtain the inferential statistics as well with `summary()`). The output for the coefficient now shows the variable name plus one of its levels, meaning the unlisted level is the reference category and the shown level is compared to it. With that in mind, what is the interpretation of the estimate? Is it significant?
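A sketch with `cyl` from `mtcars` standing in for your categorical variable:

```r
## cyl is stored as a number, so we first tell R it is a factor
mtcars$cyl_f <- factor(mtcars$cyl)
fit_d <- lm(mpg ~ cyl_f, data = mtcars)
summary(fit_d)  # cyl_f6 and cyl_f8 are each compared to the reference level (4)
```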

Let’s control for two other covariates (you pick).
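A sketch with two hypothetical covariate picks (`wt` and `hp`) added to the dummy-coded predictor:

```r
## Dummy-coded categorical predictor plus two continuous covariates
fit_d2 <- lm(mpg ~ factor(cyl) + wt + hp, data = mtcars)
coef(fit_d2)
```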

Now that we’ve controlled for these covariates, is there a difference among the categories of your categorical variable? If so, what is the interpretation of the estimate?

Chapter 6 talks about experimental and statistical control. We won't do any of the examples for the homework, but keep in mind the principle that regression can be used with experimental data too.