STATISTICS Homework
1. Load the dataset "PatientSatisfaction.txt" into R. The goal of this analysis is to determine the best subset of predictor variables for predicting patient satisfaction.
(a) Indicate which subset of predictor variables is optimal according to each of the following criteria: AICp, Mallows' Cp, BICp, and PRESSp (i.e., use best subsets variable selection).
(b) Do the four criteria listed above identify the same optimal model? Will this always be the case?
(c) Would forward stepwise regression have any advantages as a screening procedure over best subsets selection?
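As a rough illustration of what best subsets selection involves, the sketch below fits every non-empty subset of three predictors and tabulates the four criteria. It runs on synthetic data: the variable names x1, x2, x3, and y are placeholders, not the columns of PatientSatisfaction.txt. The formulas are the common textbook forms (AICp = n ln SSEp - n ln n + 2p, BICp with ln(n) in place of 2, Cp = SSEp / MSE(full) - (n - 2p), and PRESSp from the leverage shortcut); the versions presented in class may differ by an additive constant.

```r
# Synthetic stand-in data; the real PatientSatisfaction.txt columns are NOT assumed here.
set.seed(1)
n  <- 40
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 + 1.5 * df$x1 - 1.0 * df$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# The full model supplies the MSE used in Mallows' Cp.
full     <- lm(y ~ ., data = df)
mse_full <- sum(resid(full)^2) / full$df.residual

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and compute the four criteria.
criteria <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = df)
  sse <- sum(resid(fit)^2)
  p   <- length(coef(fit))  # number of parameters, including the intercept
  data.frame(
    model = paste(vars, collapse = " + "),
    p     = p,
    AIC   = n * log(sse) - n * log(n) + 2 * p,
    BIC   = n * log(sse) - n * log(n) + log(n) * p,
    Cp    = sse / mse_full - (n - 2 * p),
    PRESS = sum((resid(fit) / (1 - hatvalues(fit)))^2)
  )
}))

print(criteria)

# Model preferred by each criterion (smallest value).
sapply(criteria[c("AIC", "BIC", "Cp", "PRESS")],
       function(v) criteria$model[which.min(v)])
```

For the full model Cp equals p by construction, so in practice Cp is usually judged by how close it is to p as well as by how small it is. The enumeration grows as 2^k in the number of predictors, which is part of the trade-off that part (c) asks about.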
2. Load the dataset "Bears.csv". Data from n = 19 female wild bears of varying ages are used to estimate the relationship between Y = weight and X = neck circumference.
(a) One of the observations takes the value (x, y) = (10.5, 140). Identify this observation in the dataset. Visually, does this observation appear to be an outlier with respect to any of the following: X, Y, or the general linear relationship between Y and X (i.e., Y | X)? Justify with the appropriate plots.
(b) Compute the leverage for the observation (x, y) = (10.5, 140). What is the cutoff for high leverage in this scenario? Using the rule of thumb for leverages presented in class, state whether this point has high leverage.
(c) What is the consequence of including a point with high leverage?
(d) Using the lm() output, calculate each of the following for the observation (x, y) = (10.5, 140). For some, you will also need the leverage that was calculated above.
i. Studentized residual
ii. Studentized deleted residual
iii. Standardized DFFITS value
(e) Using the quantities calculated in the previous part, should (x, y) = (10.5, 140) be flagged as an outlier?
(f) For the observation (x, y) = (10.5, 140), calculate the following and justify whether this point has strong influence on the model fit.
i. DFBETA
ii. Cook's Distance
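The case diagnostics named in parts (b) through (f) can be sketched as follows for a single observation. The data are synthetic, with an unusual point planted at index 1 to echo the one in the question; they are not the Bears.csv values. The hand formulas assume the usual textbook definitions, in which the "studentized residual" is the internally studentized one (R's rstandard()) and the "studentized deleted residual" is the externally studentized one (R's rstudent()). The 2p/n leverage cutoff is one common rule of thumb and may not be the one presented in class.

```r
# Synthetic stand-in for (neck circumference, weight); NOT the Bears.csv values.
set.seed(2)
n <- 19
x <- runif(n, 15, 30)
y <- -200 + 20 * x + rnorm(n, sd = 25)
x[1] <- 10.5; y[1] <- 140   # plant an unusual point at index 1

fit <- lm(y ~ x)
i   <- 1                    # index of the observation of interest
p   <- length(coef(fit))    # parameters, including the intercept

h   <- unname(hatvalues(fit)[i])   # leverage h_ii
e   <- unname(resid(fit)[i])       # ordinary residual e_i
sse <- sum(resid(fit)^2)
mse <- sse / (n - p)

# Hand calculations from the lm() quantities.
r_student <- e / sqrt(mse * (1 - h))                          # studentized residual
t_deleted <- e * sqrt((n - p - 1) / (sse * (1 - h) - e^2))    # studentized deleted residual
dffits_i  <- t_deleted * sqrt(h / (1 - h))                    # DFFITS
cooks_i   <- e^2 * h / (p * mse * (1 - h)^2)                  # Cook's distance

# Built-in equivalents, useful for checking the hand calculations.
c(hand = r_student, builtin = unname(rstandard(fit)[i]))
c(hand = t_deleted, builtin = unname(rstudent(fit)[i]))
c(hand = dffits_i,  builtin = unname(dffits(fit)[i]))
c(hand = cooks_i,   builtin = unname(cooks.distance(fit)[i]))

dfbetas(fit)[i, ]           # standardized change in each coefficient

# One common leverage rule of thumb (an assumption here): flag h_ii > 2p/n.
h > 2 * p / n
```

Comparing each hand calculation with the built-in function is a quick way to catch an error in the algebra before interpreting the values.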
3. Data were collected from n = 51 "states" (including the District of Columbia) on the salaries of public school teachers.
(a) Regress Y = average teacher annual salary on X1 = spending per pupil in dollars, X2 = a dummy indicator (1/0) for region 2, and X3 = a dummy indicator (1/0) for region 3. Plot the standardized residuals versus fitted values.
(b) Plot a histogram of the studentized deleted residuals. Are there any outliers in these data? If so, list their index numbers.
(c) Create a plot of leverages from this model. Are there any outliers with respect to the covariates?
(d) Create plots of Cook's Distances and DFFITS to determine whether any observations have strong influence on the model fit.
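The plots requested in parts (a) through (d) could be produced along the following lines. This is a minimal sketch on synthetic data: the names spend, x2, x3, and salary are hypothetical and do not come from the files in HW_Data.zip. The reference lines use common guidelines (2p/n for leverage, 2 sqrt(p/n) for DFFITS, and a Bonferroni t cutoff for studentized deleted residuals), which are assumptions here and may differ from the rules given in class.

```r
# Synthetic stand-in data; variable names are hypothetical, not from HW_Data.zip.
set.seed(3)
n      <- 51
spend  <- runif(n, 3000, 8000)
region <- rep(1:3, length.out = n)
x2     <- as.numeric(region == 2)   # dummy indicator for region 2
x3     <- as.numeric(region == 3)   # dummy indicator for region 3
salary <- 12000 + 3.3 * spend - 1500 * x2 + 1000 * x3 + rnorm(n, sd = 2000)

fit3  <- lm(salary ~ spend + x2 + x3)
p3    <- length(coef(fit3))
t_del <- rstudent(fit3)             # studentized deleted residuals
lev   <- hatvalues(fit3)            # leverages

op <- par(mfrow = c(2, 3))
plot(fitted(fit3), rstandard(fit3),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
hist(t_del, main = "", xlab = "Studentized deleted residuals")
plot(lev, type = "h", xlab = "Index", ylab = "Leverage")
abline(h = 2 * p3 / n, lty = 2)                      # assumed cutoff: 2p/n
plot(cooks.distance(fit3), type = "h",
     xlab = "Index", ylab = "Cook's distance")
plot(dffits(fit3), type = "h", xlab = "Index", ylab = "DFFITS")
abline(h = c(-1, 1) * 2 * sqrt(p3 / n), lty = 2)     # assumed cutoff: 2*sqrt(p/n)
par(op)

# Index numbers flagged by each guideline.
t_crit <- qt(1 - 0.05 / (2 * n), df = n - p3 - 1)    # Bonferroni, alpha = 0.05
which(abs(t_del) > t_crit)
which(lev > 2 * p3 / n)
```

The which() calls return index numbers, which is the form part (b) asks for; an empty result means no observation exceeds that particular cutoff.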
Attachment: HW_Data.zip