Assignment:
Prepare a short report with relevant output, your comments, and answers to the questions (this does not need to be exhaustive or polished, but should contain enough to show that you completed all tasks and analyses).
1. Consider the prostate dataset in the faraway package. Let lpsa be the outcome and treat all other variables as predictors.
(a) Fit the full model. Comment on the fit and discuss which variables seem to be significant or not.
(b) Use the regsubsets function (leaps package) in R to compute the RSS for the best model of each size. Plot these best RSS values against the number of betas in the model. Comment on what you see.
(c) Use the RSS values to compute the AIC, BIC, adjusted R^2, and Mallows' Cp. Plot these values, comment on the models that they choose, and discuss how similar or different they are.
(d) Use the step function in R to apply stepwise regression with the AIC. Compare with the previous model you selected using AIC.
(e) Refit the model with the lowest BIC. Comment on the fit and discuss which variables seem to be significant. Contrast this with what you saw in (a).
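As a rough illustration of part (c), the criteria can be computed directly from the RSS of each candidate model. The sketch below assumes a Gaussian linear model with n observations and p coefficients (intercept included), drops additive constants from AIC and BIC, and estimates the error variance for Cp from the largest model. It is written in Python only to show the arithmetic; the function name and all numbers are hypothetical and not taken from the prostate data.

```python
import math

def selection_criteria(rss, p, n, tss, sigma2_full):
    """Criteria for a Gaussian linear model with p coefficients (intercept included).

    AIC and BIC are given up to an additive constant that is the same for
    every model fitted to the same data, so it does not affect the ranking.
    """
    return {
        "aic": n * math.log(rss / n) + 2 * p,
        "bic": n * math.log(rss / n) + p * math.log(n),
        "adj_r2": 1 - (rss / (n - p)) / (tss / (n - 1)),
        "cp": rss / sigma2_full + 2 * p - n,
    }

# Hypothetical values for illustration only (NOT the prostate data).
n = 100                                   # number of observations
tss = 120.0                               # total sum of squares
rss_by_p = {2: 60.0, 3: 50.0, 4: 46.0, 5: 45.0}  # best RSS for each model size p

# Error variance estimated from the largest (full) model.
p_full = max(rss_by_p)
sigma2_full = rss_by_p[p_full] / (n - p_full)

results = {
    p: selection_criteria(rss, p, n, tss, sigma2_full)
    for p, rss in rss_by_p.items()
}

best_by_aic = min(results, key=lambda p: results[p]["aic"])
best_by_bic = min(results, key=lambda p: results[p]["bic"])
best_by_adj_r2 = max(results, key=lambda p: results[p]["adj_r2"])

for p in sorted(results):
    print(p, {name: round(value, 3) for name, value in results[p].items()})
```

Two sanity checks follow from the formulas: Cp of the full model equals its number of coefficients, and BIC penalises each extra coefficient more heavily than AIC whenever log(n) > 2, which is why BIC tends to pick smaller models.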
2. Consider the prostate dataset in the faraway package. Let lpsa be the outcome and treat all other variables as predictors.
(a) Fit the full model (as with the previous assignment). You do not need to comment on the fit or parameters.
(b) Compute the ridge estimate which has the lowest GCV. Use 100 equally spaced lambdas between 0 and 10. Comment on any differences between the resulting estimate and the one from (a).
(c) Compute the LASSO estimate using the following steps.
i. Fit the model using glmnet and plot the coefficients as a function of the penalty. Comment on any patterns you see.
ii. Fit the model using cv.glmnet and plot the standard errors as a function of the penalty. Create a second "zoomed in" plot using ylim, so that one can clearly see where the minimum is reached.
iii. Fit the model using glmnet but take λ to be lambda.min from cv.glmnet. Do the same thing, but now using lambda.1se. Compare the estimates with each other as well as with (a) and (b).
iv. Take the predictors selected by lasso with "lambda.1se" as the tuning parameter and refit the model using lm. Comment on any differences with (a).
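To make the GCV search in part (b) concrete, here is a minimal sketch for the simplest possible case: one centered predictor and a centered response, so no intercept is needed. In that case the ridge coefficient is x'y / (x'x + lambda), the effective degrees of freedom are x'x / (x'x + lambda), and GCV is taken as (RSS/n) / (1 - df/n)^2. The data and the helper name are made up for illustration, and an R routine may count degrees of freedom or scale the predictors slightly differently.

```python
def ridge_gcv_1d(x, y, lam):
    """Ridge fit and GCV score for one centered predictor (no intercept).

    Simplifying assumption: x and y are already centered, so the intercept
    is ignored when counting degrees of freedom.
    """
    n = len(x)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    beta = sxy / (sxx + lam)              # ridge coefficient
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    df = sxx / (sxx + lam)                # trace of the ridge hat matrix
    gcv = (rss / n) / (1 - df / n) ** 2
    return beta, gcv

# Made-up centered toy data (NOT the prostate data).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.3, 0.8, 2.0]

# 100 equally spaced lambdas between 0 and 10, as the assignment asks.
lambdas = [10 * i / 99 for i in range(100)]

fits = {lam: ridge_gcv_1d(x, y, lam) for lam in lambdas}
best_lambda = min(fits, key=lambda lam: fits[lam][1])
beta_ols = fits[lambdas[0]][0]            # lambda = 0 reproduces least squares
beta_ridge = fits[best_lambda][0]

print("lambda with lowest GCV:", best_lambda)
print("OLS coefficient:", beta_ols, "ridge coefficient:", beta_ridge)
```

At lambda = 0 the ridge estimate coincides with least squares, and larger penalties shrink the coefficient toward zero without ever setting it exactly to zero. That is the main contrast to look for against the lasso fits in part (c).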
3. Consider the data from playoffs.csv. In this data set, ten years' worth of baseball seasons are summarized (1995-2004). We are interested in the relationship between the number of playoff appearances and the population size of the team's market (in millions).
(a) Compute a logistic regression with playoff appearance as the outcome and population as the predictor. Interpret the beta for population in terms of the odds.
(b) Construct a new categorical predictor with levels "small", "medium", and "large", where "small" denotes a market with a population under 3 million and a "large" market is over 5 million.
(c) Refit your logistic regression, but with the categorical predictor above (don't include the old population variable). Set the baseline level as "large". Summarize the fit of the model.
(d) Using the output above, compute the following:
i. the odds that a team from a large market makes the playoffs,
ii. the odds that a team from a small market makes the playoffs,
iii. the odds ratio between a medium market team and small market team making the playoffs. Interpret the ratio in terms of the odds.
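The calculations in (b) and (d) reduce to a cutoff rule and a few exponentials. With "large" as the baseline level, the intercept is the log-odds for a large market, and each remaining coefficient is the difference in log-odds between that level and "large". The sketch below is in Python only to show the arithmetic; the coefficient values and names are hypothetical, not output from a fit to playoffs.csv.

```python
import math

def market_size(population_millions):
    """Classify a market: under 3 million is small, over 5 million is large."""
    if population_millions < 3:
        return "small"
    if population_millions > 5:
        return "large"
    return "medium"

# Hypothetical logistic regression coefficients with "large" as the baseline.
# These numbers are made up; substitute the estimates from your own fit.
coef = {"intercept": -0.5, "medium": -0.4, "small": -0.9}

# i. odds for a large market: exp(intercept)
odds_large = math.exp(coef["intercept"])

# ii. odds for a small market: exp(intercept + small coefficient)
odds_small = math.exp(coef["intercept"] + coef["small"])

# iii. odds ratio, medium versus small: the intercept cancels
odds_medium = math.exp(coef["intercept"] + coef["medium"])
or_medium_vs_small = math.exp(coef["medium"] - coef["small"])

print("odds (large):", odds_large)
print("odds (small):", odds_small)
print("odds ratio (medium vs small):", or_medium_vs_small)
```

An odds ratio above 1 would mean that the odds of making the playoffs are higher for a medium market than for a small one by that multiplicative factor.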
Attachment:- Playoff Appearances.rar