Problem 1:
Consider a study design in which we have collected multiple response measurements at each value of the predictor. Suppose we have m unique predictor values x_i, i = 1, ..., m, with n_i observed responses at each x_i, and let y_ij denote the j-th observed response, j = 1, ..., n_i, at the i-th value of the predictor. In this situation, it is possible to construct a test of how poorly the regression line captures the relationship between the response and the predictor (a lack-of-fit test).
(a) Consider the traditional variance decomposition of a simple regression model: SST = SSReg + RSS. Show that we can further decompose the residual sum of squares into
- the pure error (i.e. deviations of the individual responses from the average response at each unique value of the predictor), denoted by SSPure
- and the lack of fit error (i.e. deviations of the average response at each x value from the regression line), denoted by SSLack.
(b) Determine the degrees of freedom for the pure error and the lack of fit error.
(c) Determine the expected values of the mean squares of the pure error (MSPure) and the lack of fit error (MSLack). You may assume that model assumptions are satisfied.
(d) The test statistic for this test is
F = MSLack/MSPure
Explain why this should follow an F distribution.
(e) Based on the test statistic in (d) and the expected values in (c), explain why a large value of the test statistic implies that the true regression function is not linear, and thus the fit of our regression model is poor.
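The decomposition described in part (a) can be checked numerically before proving it. The following is a minimal Python sketch, not part of the assignment: the data, the number of groups, and all variable names are hypothetical and chosen only for illustration. It fits a simple linear regression to replicated data and compares RSS with the sum of the pure error and lack of fit sums of squares.

```python
import numpy as np

# Hypothetical data: m = 4 distinct predictor values with replicated responses.
x_levels = np.array([1.0, 2.0, 3.0, 4.0])
responses = [
    np.array([1.1, 0.9, 1.3]),
    np.array([2.8, 3.1]),
    np.array([3.9, 4.4, 4.0, 4.1]),
    np.array([7.2, 6.8, 7.5]),
]

# Expand to one (x, y) pair per observation.
x = np.concatenate([np.full(len(y_i), x_i) for x_i, y_i in zip(x_levels, responses)])
y = np.concatenate(responses)

# Least-squares fit of y = b0 + b1 * x (polyfit returns the slope first).
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x

# Average response at each observation's x value.
group_means = np.concatenate([np.full(len(y_i), y_i.mean()) for y_i in responses])

rss = np.sum((y - fitted) ** 2)                 # residual sum of squares
ss_pure = np.sum((y - group_means) ** 2)        # responses vs. group averages
ss_lack = np.sum((group_means - fitted) ** 2)   # group averages vs. fitted line

print(f"RSS = {rss:.6f}, SSPure + SSLack = {ss_pure + ss_lack:.6f}")
```

A numerical match on one dataset is only a sanity check. Part (a) still requires an algebraic argument that the cross-product term vanishes.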
Problem 2:
A study was run to compare the effect of three different drugs on reducing the pain caused by a particular condition. The drugs are labelled A, B, and C, and the response of interest is a pain scale rating (integer-valued), where higher values imply more pain. The goal of the study was to determine whether there exists a difference in the average pain rating between the three drug treatments. We can answer this question using multiple linear regression methods. The data can be found below:
Drug A | 4 | 5 | 4 | 3 | 2 | 4 | 3 | 4 | 4
Drug B | 6 | 8 | 4 | 5 | 4 | 6 | 5 | 8 | 6
Drug C | 6 | 7 | 6 | 6 | 7 | 5 | 6 | 5 | 5
(a) Show that we can represent the three treatments/drugs in the form of two indicator variables. Why don't we require the use of a third indicator variable?
(b) Find the X'X and X'Y matrices for these data.
(c) Estimate the regression coefficients for a multiple linear regression model relating the pain response Y to the three drugs, X.
(d) Show that the above regression model can be re-expressed as
y_ij = µ + τ_i + ε_ij
where µ is the overall average pain rating, τ_i is the effect of drug i (the deviation of its average pain rating from µ), ε_ij is the random error in the pain rating for individual j and drug i, and y_ij is the pain rating for individual j on drug i.
(e) Perform an appropriate hypothesis test using your model from (c) to determine whether the average pain ratings for each drug are equal (i.e. τ_i = 0 for all i). Use a significance level α = 0.05 and the residual standard error of 1.089.
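The indicator-variable setup in parts (a) to (c) can be sketched in code. This is a minimal Python illustration under one assumed coding, with drug A as the baseline and one indicator each for drugs B and C; the assignment does not prescribe a baseline, and a different choice gives different but equivalent matrices and coefficients. The variable names are hypothetical.

```python
import numpy as np

# Pain ratings from the data table, keyed by drug.
pain = {
    "A": [4, 5, 4, 3, 2, 4, 3, 4, 4],
    "B": [6, 8, 4, 5, 4, 6, 5, 8, 6],
    "C": [6, 7, 6, 6, 7, 5, 6, 5, 5],
}

# One drug label and one response per individual.
drug = np.array([d for d, ys in pain.items() for _ in ys])
y = np.array([v for ys in pain.values() for v in ys], dtype=float)

# Assumed coding: intercept, indicator for drug B, indicator for drug C.
# Drug A is the baseline, so it needs no indicator of its own.
X = np.column_stack([
    np.ones(len(y)),
    (drug == "B").astype(float),
    (drug == "C").astype(float),
])

XtX = X.T @ X                          # the X'X matrix
XtY = X.T @ y                          # the X'Y vector
beta_hat = np.linalg.solve(XtX, XtY)   # solves the normal equations

print(XtX)
print(XtY)
print(beta_hat)
```

Under this coding the intercept estimates the average pain rating for drug A, and the other two coefficients estimate the differences of drugs B and C from drug A.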
Problem 3:
For each of the parts below, please provide a concise (up to three sentences) but detailed explanation for each of the concepts. Make sure you use your own words for your answers.
(a) Suppose we have the following correlations between a response variable and two predictor variables. Explain which predictor the forward selection method would add to the model first. Would the method then add the second predictor variable? Why or why not?
   | Y     | X1    | X2
Y  | 1     | 0.93  | -0.99
X1 | 0.93  | 1     | 0.985
X2 | -0.99 | 0.985 | 1
(b) Explain how violations in the model assumptions affect the ANOVA test of overall significance in simple linear regression.
(c) In the event that condition 1 or 2 fails, explain why we are unable to use the specific patterns seen in the residual plots to tell us in what way the model assumptions are violated.
(d) Explain why, when you have response measurements that are means or medians, using a weight equal to the number of observations used to create that value can correct for violations of constant variance.
Problem 4:
Consider the New York City menu dataset, which can be found on the assignment page on Quercus or attached with this question.
(a) Fit a multiple linear regression model to predict Price from the variables Food, Decor, and East. Extract the residuals from this model and save them. What do they represent in the context of this model?
(b) Fit a multiple linear model to predict Service from Decor, Food and East. Extract the residuals from this model and save them. What do they represent in the context of this model?
(c) What can we say about the predictors based on the model from (b)?
(d) Plot the residuals saved from part (a) against the residuals saved from part (b). Add a line representing the simple linear regression relationship between these two sets of residuals. What relationship do you see between the two sets of residuals?
(e) Compare the relationship in your plot from (d) to a multiple linear model predicting Price from the variables Food, Decor, Service and East. What similarities do you see? What does the plot represent and how does it achieve this?
(f) How else might this plot be used for diagnostic purposes?
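The residual-on-residual construction in parts (a) to (e) can be tried on simulated data. The sketch below is in Python rather than R and does not use the real menu dataset: the simulated values, the coefficients used to generate them, and the helper `ols` are all hypothetical. It regresses each of a response and one predictor on the remaining predictors, then compares the slope between the two sets of residuals with that predictor's coefficient in the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the menu variables (NOT the real dataset).
food = rng.normal(20, 2, n)
decor = rng.normal(18, 3, n)
east = rng.integers(0, 2, n).astype(float)
service = 0.6 * food + 0.3 * decor + rng.normal(0, 1, n)
price = 1.5 * food + 1.9 * decor + 2.0 * east + 0.8 * service + rng.normal(0, 5, n)


def ols(X, y):
    """Return least-squares coefficients and residuals of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


# Design matrix with intercept, Food, Decor and East.
base = np.column_stack([np.ones(n), food, decor, east])

_, res_price = ols(base, price)      # residuals as in part (a)
_, res_service = ols(base, service)  # residuals as in part (b)

# Slope of the simple regression of one set of residuals on the other.
# Both sets have mean zero, so no separate intercept is needed.
slope = (res_service @ res_price) / (res_service @ res_service)

# Coefficient of Service in the model with all four predictors.
full_coef, _ = ols(np.column_stack([base, service]), price)

print(slope, full_coef[-1])
```

The two printed numbers agree up to floating-point error, which is the property that makes this kind of plot (often called an added-variable plot) useful.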
Problem 5:
For this question, you will be using the housing.proper.csv dataset, which can be found on the assignment page on Quercus or attached to this question on Crowdmark. These data consist of the median value of owner-occupied homes (Y) in suburbs of Boston, along with a number of different neighbourhood characteristics. The dataset contains 506 observations on 13 covariates. You are asked by a real estate developer to build the best possible model for predicting the median value of homes in a new subdivision being built, one that is also interpretable so that they can justify its use to shareholders. The possible predictors for this model include:
- X1 = per capita crime rate by town
- X2 = proportion of residential land zoned for lots over 25,000 square feet
- X3 = proportion of non-retail business acres per town
- X4 = Charles River indicator variable (1 = near river, 0 = far from river)
- X5 = nitric oxide concentration (parts per 10 million)
- X6 = average number of rooms per dwelling
- X7 = proportion of owner occupied units built prior to 1940
- X8 = weighted distance to five Boston employment centres
- X9 = index of accessibility to radial highways
- X10 = full-value property-tax rate
- X11 = pupil-teacher ratio by town
- X12 = 1000(B - 0.63)^2, where B is the proportion of African Americans by town
- X13 = a numeric vector of percentage values of lower status population
You may use any technique shown in class to arrive at your final model, but you must justify every decision you make. You will be asked to interpret your final model, explain how you arrived at this model and defend why you think this is the best possible model. You may use up to 5 plots in your explanations and each plot must have a reason for being presented. Please do not include too much R output (ideally fewer than 5 outputs) as all your decisions and model diagnostics should be discussed in the text rather than presented with R output. The discussion of your model should be no longer than 500 words. All R code should be at the end in an appendix so we can verify your final model and the steps you took to arrive there. Your report with plots and output should reasonably be no longer than 3 pages, with the appendix attached after. Do not overload your appendix with code or output that is not relevant to the creation of your final model.