Assignment
1. Consider the data set "sat and grades.csv", which contains SAT and GPA information from 105 computer science majors at a public university. A more detailed description is listed below:
Variable Name    Description
high_GPA         High School GPA
math_SAT         Score on SAT math section
verb_SAT         Score on SAT verbal section
comp_GPA         GPA in computer science courses
univ_GPA         GPA in all courses
Consider the following model:
univ_GPA = b0 + b1 * high_GPA + e
e ∼ N(0, s)
(a) What is the dependent variable in this model? What is the independent variable?
(b) Use R to run this simple linear regression model. Report the estimated values of b0, b1, s, and R^2.
(c) At what value of high school GPA would a student be predicted to have the exact same university GPA? (Hint: do this calculation by hand using the estimates of b0 and b1 produced by R. Think about the equation of the regression line and how you could use it to find this number.)
(d) If a student had a 3.8 GPA in high school, what would their predicted university GPA be? What is the 95% confidence interval for this prediction? Intuitively, does the predicted value seem reasonable? Does the 95% confidence interval seem reasonable? Why or why not?
(e) If a student had a 0.0 GPA in high school, what would their predicted university GPA be? What is the 95% confidence interval for this prediction? Is it appropriate to plug in a value of 0.0 for a high school GPA (1) in a "real world" context or (2) in a statistical context? Why or why not?
(f) Use R to form a scatter plot of the data. Plot the regression line through the data. Attach the graph to your homework. For full credit, give the plot a title, label the axes clearly, and create a legend.
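For reference, the quantities asked for in parts (b) and (c) can be sketched as follows. This is a minimal illustration in plain Python rather than R, run on made-up numbers that are not taken from the data set. The function names fit_simple_ols and fixed_point are hypothetical, and s is assumed here to mean the residual standard error.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x + e. Returns (b0, b1, s, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    s = math.sqrt(sse / (n - 2))        # residual standard error (assumed meaning of s)
    r2 = 1 - sse / sst                  # share of variation explained
    return b0, b1, s, r2

def fixed_point(b0, b1):
    """Solve x = b0 + b1*x for x, i.e. where the prediction equals the input."""
    return b0 / (1 - b1)

# Made-up illustration data, NOT the assignment data
x = [2.0, 2.5, 3.0, 3.5, 4.0]
y = [2.5, 2.6, 3.1, 3.2, 3.7]

b0, b1, s, r2 = fit_simple_ols(x, y)
x_star = fixed_point(b0, b1)
print(b0, b1, s, r2, x_star)
```

The fixed point only exists when b1 is not equal to 1, and it is only meaningful if it falls inside the range of observed GPAs.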
2. Consider the data set "RGDP 2010.csv", which contains U.S. real GDP data from the first quarter of 2010 through the third quarter of 2016.
(a) Use simple linear regression to fit a time trend to the raw data. What is the estimated value of b1? How would you interpret this coefficient?
(b) We know that RGDP tends to grow exponentially (even though it's hard to notice in this short sample). How could you transform the data so that a linear time trend would be more appropriate?
(c) Use simple linear regression to fit a time trend to the transformed data. Instead of transforming the data directly, set the option "lambda=0" in the tslm() function (this will do the transformation in the background and be useful later). What is the estimated value of b1? How would you interpret this coefficient?
(d) Inspect the residuals produced by the regression in part (c). Does the model from part (c) seem appropriate? Explain.
(e) Produce an eight period ahead forecast for U.S. real GDP using the model from part (c). Plot the observed data, the forecast, and the fitted regression line on one graph. Attach the graph to your homework. For full credit, give the graph an informative title, label all axes, and add a legend.
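The idea behind parts (b), (c), and (e) can be sketched on a synthetic series. The example below is plain Python rather than R and does not use the RGDP data. It assumes a natural-log transformation (which is what a Box-Cox parameter of zero corresponds to), a time index t = 1, 2, ..., and a simple back-transformation of the forecasts with exp(). The helper name fit_trend is hypothetical.

```python
import math

def fit_trend(y):
    """OLS of y_t on t = 1..n. Returns (b0, b1)."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    b0 = y_bar - b1 * t_bar
    return b0, b1

# Synthetic series growing 0.5% per quarter for 27 quarters (NOT real GDP data)
series = [100.0 * 1.005 ** k for k in range(1, 28)]

# Fit a linear trend to the logged series
b0, b1 = fit_trend([math.log(v) for v in series])

# On the log scale, b1 is approximately the growth rate per period
growth_per_quarter = math.exp(b1) - 1

# Eight-period-ahead forecast, back-transformed to the original scale
n = len(series)
horizon = 8
forecast = [math.exp(b0 + b1 * (n + h)) for h in range(1, horizon + 1)]
print(growth_per_quarter, forecast)
```

Because the synthetic series is exactly exponential, the fitted trend recovers the 0.5% growth rate; real data will leave residuals that still need to be inspected, as in part (d).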
3. Suppose you had seasonal quarterly time series data for some variable of interest. Consider a regression model with time trend and seasonal dummies:
Y = b0 + b1 * t + b2 * Q2 + b3 * Q3 + b4 * Q4 + e
e ∼ N(0, s)
(a) When might this model be appropriate?
(b) Why didn't we include a dummy variable for Q1?
(c) What is the interpretation of b0 in this model? Explain thoroughly for full credit.
(d) What is the interpretation of b4 in this model? Explain thoroughly for full credit.
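To make the structure of this model concrete, the regressors for each period can be written out explicitly. The sketch below is plain Python, not R, and assumes the sample starts in a first quarter (t = 1 is Q1); the function name design_row is hypothetical.

```python
def design_row(t):
    """Regressors [1, t, Q2, Q3, Q4] for period t, assuming t = 1 is a first quarter."""
    quarter = (t - 1) % 4 + 1
    return [1, t] + [1 if quarter == q else 0 for q in (2, 3, 4)]

# Two years of quarterly observations
rows = [design_row(t) for t in range(1, 9)]
for row in rows:
    print(row)
```

Each row multiplies the coefficient vector (b0, b1, b2, b3, b4), so comparing the rows for different quarters shows which coefficients enter the prediction in each season.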
4. Suppose you had data with a trend, but you were unsure whether there were seasonal patterns. You ran two separate regressions: Regression 1, containing only a time trend, and Regression 2, containing both a time trend and seasonal dummies. After running the regressions you compute AICc and BIC for each regression. You find:
Model           AICc    BIC
Regression 1    -200    -230
Regression 2    -210    -229
(a) What is an information criterion? Intuitively, what two factors do information criteria try to balance when judging a model?
(b) Which regression is preferred by AICc? By BIC?
(c) What might you be able to infer about seasonality in the data from the AICc and BIC results above? In other words, is there strong evidence that this data set has seasonality? Why or why not?
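As a reminder of how such numbers arise, the sketch below computes AICc and BIC for a regression from its sum of squared errors. It is plain Python, not R, and uses one common convention for regression models, in which T is the number of observations and k is the number of predictors (excluding the intercept); other software may use different constants. The SSE values are made up and are unrelated to the table above, and the function name info_criteria is hypothetical.

```python
import math

def info_criteria(sse, T, k):
    """AICc and BIC for a regression with T observations and k predictors.

    Assumed convention: AIC = T*log(SSE/T) + 2*(k + 2).
    """
    fit_term = T * math.log(sse / T)            # rewards a small SSE
    aic = fit_term + 2 * (k + 2)                # penalty grows with k
    aicc = aic + 2 * (k + 2) * (k + 3) / (T - k - 3)
    bic = fit_term + (k + 2) * math.log(T)      # heavier penalty once log(T) > 2
    return aicc, bic

# Made-up values: trend only (k = 1) vs trend plus three seasonal dummies (k = 4)
aicc_1, bic_1 = info_criteria(sse=2.0, T=40, k=1)
aicc_2, bic_2 = info_criteria(sse=1.5, T=40, k=4)
print(aicc_1, bic_1)
print(aicc_2, bic_2)
```

Both criteria add a penalty for extra parameters to a measure of fit, but BIC's penalty per parameter is larger than AIC's whenever log(T) exceeds 2, so the two can disagree about whether added regressors are worth keeping.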