Question 1 -
Here we will repeat parts 4 through 8 from the Midterm. These have been slightly reworded and re-ordered to make the intention clearer.
We will now also consider $x_2$. Using both categorical ($x_1$) and continuous ($x_2$) covariates is often referred to as the Analysis of Covariance (ANCOVA), even if Giles thinks it's all just part of linear regression.
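For this, we will write the average value of $x_2$ among subjects with $x_1 = 0$ as $\bar{x}_{2,0}$ and among subjects with $x_1 = 1$ as $\bar{x}_{2,1}$, and write $\tilde{x}_2$ for $x_2$ with the group mean subtracted:
$$\tilde{x}_{2i} = \begin{cases} x_{2i} - \bar{x}_{2,0} & \text{if } x_{1i} = 0 \\ x_{2i} - \bar{x}_{2,1} & \text{if } x_{1i} = 1 \end{cases}$$
and we will set $X_2 = [\mathbf{1}, x_1, \tilde{x}_2]$.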
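As a quick numerical illustration of this definition, the following Python sketch (NumPy, simulated data) builds $\tilde{x}_2$ by subtracting group means and compares it with the residual from a least-squares regression of $x_2$ on $[\mathbf{1}, x_1]$. The sample sizes, coefficients and variable names are assumptions made for the example and are not part of the assignment.

```python
import numpy as np

# Simulated data: all numbers below are hypothetical choices for illustration.
rng = np.random.default_rng(0)
n0, n1 = 15, 25
n = n0 + n1
x1 = np.repeat([0.0, 1.0], [n0, n1])          # 0/1 group indicator
x2 = 2.0 + 1.5 * x1 + rng.normal(size=n)      # continuous covariate

# Group-centered covariate: subtract the mean of each subject's own group.
group_mean = np.where(x1 == 1, x2[x1 == 1].mean(), x2[x1 == 0].mean())
x2_tilde = x2 - group_mean

# Residual from regressing x2 on [1, x1] by least squares.
A = np.column_stack([np.ones(n), x1])
alpha = np.linalg.lstsq(A, x2, rcond=None)[0]
resid = x2 - A @ alpha

# The two constructions should agree up to floating-point error.
max_diff = np.max(np.abs(x2_tilde - resid))
print("max |x2_tilde - residual| =", max_diff)
```

Comparing the fitted `alpha` with the two group means is one way to check an answer to part a.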
a. Show that $\tilde{x}_2$ can be written as $x_2 - \alpha_1 \mathbf{1} - \alpha_2 x_1$. What are $\alpha_1$ and $\alpha_2$? You may find earlier questions useful.
b. Write out $X_2^T X_2$ for this new model. Show that your estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are unchanged from Question 2 in the midterm.
If we are interested in $\beta_1$, was there any point to adding $x_2$?
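c. By writing out the prediction equation $\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 \tilde{x}_2$ in terms of $x_2 = \tilde{x}_2 + \alpha_1 \mathbf{1} + \alpha_2 x_1$, find $\hat{\beta}_1^*$, the estimate of $\beta_1$ in a model where we used $X_2^* = [\mathbf{1}, x_1, x_2]$ instead of $X_2$.
Why has $\hat{\beta}_2$ not changed? What is the variance of $\hat{\beta}_1^*$?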
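d. Show that the variance of $\hat{\beta}_1^*$ obtained above is equal to the variance of $\hat{\beta}_1$ times the VIF for $x_2$. The following will be helpful:
$$\tilde{x}_2^T H_1 C H_1 \tilde{x}_2 = n_0 (\bar{x}_{2,0} - \bar{x}_2)^2 + n_1 (\bar{x}_{2,1} - \bar{x}_2)^2 = \frac{n_0 n_1}{n} (\bar{x}_{2,1} - \bar{x}_{2,0})^2$$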
e. There is a concern that the slope on $x_2$ might be different between the $x_1 = 1$ group and the $x_1 = 0$ group. For this reason, the researcher considers adding an interaction term to produce a design matrix $X = [\mathbf{1}, x_1, \tilde{x}_2, x_1 \tilde{x}_2]$ where the last column is the element-wise product of $x_1$ and $\tilde{x}_2$.
Define a sum of squares to measure the total contribution of $\tilde{x}_2$ to the model in this case.
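The claims in parts b through d can be checked numerically before attempting the algebra. The sketch below uses simulated data; the sample sizes, coefficients and helper names `ols` and `var_factor` are hypothetical. It also assumes that the midterm model in part b regresses $y$ on $[\mathbf{1}, x_1]$ only. It compares coefficient estimates across the three design matrices, compares the variance factor of the $x_1$ coefficient with the VIF, and evaluates the two expressions for the between-group sum of squares of $x_2$.

```python
import numpy as np

# Simulated data: every constant here is an assumption made for illustration.
rng = np.random.default_rng(1)
n0, n1 = 20, 30
n = n0 + n1
ones = np.ones(n)
x1 = np.repeat([0.0, 1.0], [n0, n1])
x2 = 1.0 + 0.8 * x1 + rng.normal(size=n)
y = 3.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

m0, m1, m = x2[x1 == 0].mean(), x2[x1 == 1].mean(), x2.mean()
x2_tilde = x2 - np.where(x1 == 1, m1, m0)      # group-centered x2

X_small = np.column_stack([ones, x1])             # assumed midterm model
X_centered = np.column_stack([ones, x1, x2_tilde])
X_raw = np.column_stack([ones, x1, x2])

def ols(X, y):
    """Least-squares coefficient estimates."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def var_factor(X, col):
    """Diagonal entry of (X^T X)^{-1}; the variance is sigma^2 times this."""
    return np.linalg.inv(X.T @ X)[col, col]

b_small, b_centered, b_raw = ols(X_small, y), ols(X_centered, y), ols(X_raw, y)

# Part b: the x2_tilde column is orthogonal to the columns 1 and x1.
cross_terms = X_centered[:, :2].T @ x2_tilde

# Part d: variance inflation of the x1 coefficient when x2 is not centered.
vif = 1.0 / (1.0 - np.corrcoef(x1, x2)[0, 1] ** 2)
ratio = var_factor(X_raw, 1) / var_factor(X_centered, 1)

# Between-group sum of squares of x2, written two ways.
between = n0 * (m0 - m) ** 2 + n1 * (m1 - m) ** 2
closed_form = n0 * n1 / n * (m1 - m0) ** 2

print("beta0, beta1 without x2:   ", b_small)
print("beta0, beta1 with x2_tilde:", b_centered[:2])
print("slope on x2_tilde vs x2:   ", b_centered[2], b_raw[2])
print("variance ratio vs VIF:     ", ratio, vif)
print("between-group SS, two ways:", between, closed_form)
```

A numerical match is not a proof, but a mismatch would point to an algebra error in the derivation.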
Bonus - The paper "Two-Stage Algorithm for Fairness-aware Machine Learning" by Junpei Komiyama and Hajime Shimao appeared on arXiv on October 15 this year - Giles read it during the midterm.
Recently, there has been considerable interest in the possibility that machine learning can exacerbate social biases, with examples including face recognition that performs much worse on African Americans and evidence that tools used to predict re-offending in parole hearings give worse scores to disadvantaged groups.
A particular problem is that a tool does not have to explicitly use a protected value like sex or race or age in order to discriminate. It could use something correlated like zip code. One notion of fairness in these circumstances is that the average prediction in classes of a protected value should be the same - men and women should, on average, be treated as having the same probability of committing a crime.
(There are many notions of fairness and this is a topic of very current debate in machine learning.)
Komiyama and Shimao consider using linear regression as a prediction tool in a situation where you have protected variables $Z$, useful covariates $X$ and a response $y$ to predict. They suggest the following:
1. Regress each column of $X$ on $Z$ and take the residuals to get $\tilde{X}$.
2. Now predict $y$ using $\tilde{X}$.
We will look at this in the context of the questions above. Here we think of $x_1$ as a protected variable, and $x_2$ as something we want to use to predict. $\tilde{x}_2$ is the residual after regressing $x_2$ on $x_1$.
Consider using the linear model
$$y = \beta_0 \mathbf{1} + \beta_1 \tilde{x}_2 + \epsilon$$
and show that the average of the fitted values when x1 = 0 is the same as the average of the fitted values when x1 = 1.
Can you generalize this to using a matrix of protected values Z and a matrix of covariates X?
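The two-step procedure described above can be tried out on simulated data. In the sketch below, the protected variable is a 0/1 indicator and there is a single covariate; the data-generating coefficients are hypothetical, and regressing on $Z$ is assumed to include an intercept column. It computes the average fitted value in each protected group so the two can be compared.

```python
import numpy as np

# Simulated data: the coefficients below are assumptions for illustration only.
rng = np.random.default_rng(3)
n0, n1 = 25, 35
n = n0 + n1
x1 = np.repeat([0.0, 1.0], [n0, n1])            # protected variable
x2 = 1.0 + 2.0 * x1 + rng.normal(size=n)        # covariate correlated with x1
y = 0.5 + 1.0 * x1 + 0.7 * x2 + rng.normal(size=n)

# Step 1: regress x2 on the protected variable (with intercept), keep residuals.
Z = np.column_stack([np.ones(n), x1])
x2_tilde = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]

# Step 2: predict y from an intercept and the residualized covariate only.
D = np.column_stack([np.ones(n), x2_tilde])
fitted = D @ np.linalg.lstsq(D, y, rcond=None)[0]

mean0 = fitted[x1 == 0].mean()
mean1 = fitted[x1 == 1].mean()
print("average fitted value, x1 = 0:", mean0)
print("average fitted value, x1 = 1:", mean1)
```

Replacing `x2` with a matrix of several covariates and `x1` with several protected columns gives a starting point for the generalization asked about above.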
Question 2 -
Here we will repeat the analysis above but more generally, with the idea of getting specific about the interpretation of a sequential ANOVA test.
We know that the sum of squares for each covariate is unchanged when the covariates are orthogonal. When they aren't, we need to ask "What is the null hypothesis for this test?"
We describe the test as being "the additional effect of $x_j$ after controlling for $X_{j-1}$", but what does that mean, mathematically?
To do this, we'll break up the covariate matrix $X = [X_{j-1}, X_{-j}]$ where $X_{-j} = [x_j, \ldots, x_p]$ and similarly, the coefficient vector will be broken into $\beta = (\beta_{j-1}^T, \beta_{-j}^T)^T$ so that we can write the linear regression model as $y = X_{j-1} \beta_{j-1} + X_{-j} \beta_{-j} + e$.
We will not assume that $X_{j-1}$ is orthogonal to $X_{-j}$. Note that this can be done for any choice of $j \in \{1, \ldots, p\}$.
a. Consider regressing $y$ on only $X_{j-1}$. Give an expression for the estimated $\hat{\beta}_{j-1}$.
b. Show that the fitted values (written in terms of true coefficients and errors) from the full regression can be re-written using the fitted values from Part a and the matrix of residuals $R_{-j}$ obtained from regressing each column of $X_{-j}$ on $X_{j-1}$.
c. Show that the sum of squares $y^T (H - H_{j-1}) y$ for $X_{-j} | X_{j-1}$ is the same if you replace $X_{-j}$ with $R_{-j}$. (Why must $Hy$ be the same in both cases?)
d. Within the sequential test for $x_j | X_{j-1}$, show that the sum of squares $y^T (H_j - H_{j-1}) y$ is the same whether you use the original $X$ or $X^* = [X_{j-1}, (I - H_{j-1}) x_j, (I - H_j) X_{-(j+1)}]$.
e. Show that, when using $X^*$ (with corresponding coefficients $\beta^*$), the sum of squares $y^T (H_j - H_{j-1}) y$ is only affected by the true value of $\beta_j^*$.
f. Hence give a detailed interpretation of the meaning of rejecting the jth sequential test.
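A small simulation can make the objects in this question concrete. The sketch below uses hypothetical correlated covariates with an intercept as the first column, and a helper `hat` that is not part of the assignment. It builds $X^*$ under the assumption that $x_j$ is residualized on the columns before it and the later columns are residualized on the first $j$ columns together with $x_j$. It then compares the sequential sum of squares and the fitted values under $X$ and $X^*$.

```python
import numpy as np

def hat(X):
    """Projection (hat) matrix onto the column space of X, via reduced QR."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

# Simulated, deliberately correlated covariates (all constants are assumptions).
rng = np.random.default_rng(4)
n, p = 80, 4
base = rng.normal(size=(n, p))
mix = np.eye(p) + 0.5 * np.triu(np.ones((p, p)), k=1)
X = np.column_stack([np.ones(n), base @ mix])     # columns: 1, x1, ..., x4
y = X @ np.array([1.0, 0.5, -1.0, 2.0, 0.3]) + rng.normal(size=n)

j = 2                                  # column index of x_j (0 is the intercept)
I = np.eye(n)
H_prev = hat(X[:, :j])                 # plays the role of H_{j-1}
H_j = hat(X[:, :j + 1])                # plays the role of H_j

# X*: earlier columns unchanged, x_j and later columns residualized.
X_star = np.column_stack([
    X[:, :j],
    (I - H_prev) @ X[:, j],
    (I - H_j) @ X[:, j + 1:],
])

ss = y @ (H_j - H_prev) @ y
ss_star = y @ (hat(X_star[:, :j + 1]) - hat(X_star[:, :j])) @ y
fit_diff = np.max(np.abs(hat(X) @ y - hat(X_star) @ y))

print("sequential SS with X: ", ss)
print("sequential SS with X*:", ss_star)
print("max difference in fitted values:", fit_diff)
```

Forming the full n-by-n hat matrix is wasteful for large data sets; it is used here only because it mirrors the notation of the question.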
Question 3 -
Here we will illustrate the results from Question 2 with a real-world data set. We will use the study from the midterm of mortality in 55 US cities as influenced by the pollutants NOX (nitrogen oxides) and SO2 (sulfur dioxide), while controlling for weather (PRECIP) and sociological variables (EDUC and NONWHITE). In this case we will be interested in the sequential test for EDUC, with the covariates taken in the order in which they appear in the data set.
You can find the data in airpollution.csv on CMS.
a. Create a new data set (referred to as X* below) in which NONWHITE, NOX and SO2 are replaced with the residuals after regressing each of them on PRECIP and EDUC.
b. Show that when producing a model to predict MORT with either the original covariates or the new covariates, you get the same predicted values (use the maximum absolute difference in predictions to show this).
c. Add SO2 to MORT (this increases the coefficient of SO2 in the model by 1) and obtain a sequential ANOVA table (using the function anova) using the new response. Show that this changes the sum of squares for EDUC when using the original data.
d. Do the same thing using the new data set X* and observe that the sum of squares for EDUC does not change.
e. What happens if you add EDUC to MORT (i.e., make its coefficient larger) instead? Are there differences between the two data sets? Why?
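The assignment itself calls for the function anova on airpollution.csv. As a language-neutral dry run of parts c and d, the Python sketch below uses synthetic stand-in columns that only borrow the names of the real variables. It does not read the real file, and its numbers say nothing about the actual data. The helper `seq_ss` is hypothetical, and the residualizing regression is assumed to include an intercept. The sketch computes the sequential sum of squares for EDUC before and after adding the SO2 column to the response, once with the original columns and once with the residualized ones.

```python
import numpy as np

def seq_ss(X, y, j):
    """Sequential SS for column j given columns 0..j-1: y^T (H_j - H_{j-1}) y."""
    Q_prev, _ = np.linalg.qr(X[:, :j])
    Q_j, _ = np.linalg.qr(X[:, :j + 1])
    return np.sum((Q_j.T @ y) ** 2) - np.sum((Q_prev.T @ y) ** 2)

# Synthetic stand-ins for the real columns (NOT the airpollution.csv data).
rng = np.random.default_rng(5)
n = 55
precip = rng.normal(size=n)
educ = 0.3 * precip + rng.normal(size=n)
nonwhite = 0.4 * educ + rng.normal(size=n)
nox = 0.2 * precip + rng.normal(size=n)
so2 = 0.8 * educ + 0.5 * nox + rng.normal(size=n)
mort = (10 + 0.5 * precip + 1.0 * educ + 0.5 * nonwhite
        + 0.2 * nox + 0.3 * so2 + rng.normal(size=n))

X = np.column_stack([np.ones(n), precip, educ, nonwhite, nox, so2])
j = 2                                             # column index of EDUC

# X*: later covariates replaced by residuals on [1, PRECIP, EDUC].
Q, _ = np.linalg.qr(X[:, :j + 1])
X_star = X.copy()
X_star[:, j + 1:] = X[:, j + 1:] - Q @ (Q.T @ X[:, j + 1:])

ss_orig = seq_ss(X, mort, j)
ss_orig_shift = seq_ss(X, mort + X[:, 5], j)            # add original SO2
ss_star = seq_ss(X_star, mort, j)
ss_star_shift = seq_ss(X_star, mort + X_star[:, 5], j)  # add residualized SO2

print("original data, EDUC SS before/after:", ss_orig, ss_orig_shift)
print("X* data,       EDUC SS before/after:", ss_star, ss_star_shift)
```

The same comparison with `X[:, j]` (the EDUC column) added to the response is a direct way to explore part e.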