Question 1. In this question, I'm after intuitive answers rather than mathematical ones. Indeed, some of the underlying mathematical arguments are far beyond the scope of this course.
(a) Without even thinking about the central limit theorem or any other asymptotic arguments, explain why a data set with n = 30 observations is nowhere near large enough to estimate a model with k = 40 parameters using ordinary least squares regression.
(b) To estimate the partial effect of $x_2$ on $y$ while $x_3$ is kept constant, we need to estimate a regression model like $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + u$. Some students find this counterintuitive: if we're keeping $x_3$ constant, why should it be in our model at all? Explain why this is the right thing to do.
(c) Consider again the model with two explanatory variables from part (b). There are two situations in which the partial effect of $x_2$ on $y$ is equal to the total effect of $x_2$ on $y$; what are these situations?
(d) Why should none of $R^2$, $\bar{R}^2$, information criteria, and F tests be used to compare models with $y_t$ on the left hand side to models with $\ln y_t$ instead?
(e) Recall that the coefficient estimators $b_2$ and $b_3$ are random variables. Why does it make sense that these two random variables are usually correlated with each other?
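For part (a), a small numerical sketch may help build intuition; it is not a substitute for the written explanation. It assumes NumPy and uses purely simulated data (the seed and the random regressors are illustrative assumptions, not part of the question). With n = 30 and k = 40, the regressor matrix cannot have rank above 30, so least squares can fit pure noise perfectly.

```python
import numpy as np

# Simulated data only: 30 observations, 40 parameters (intercept + 39 regressors).
rng = np.random.default_rng(0)
n, k = 30, 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)  # y is pure noise, unrelated to X by construction

# The rank of X is at most min(n, k), so X'X (a k-by-k matrix) is singular.
rank = np.linalg.matrix_rank(X)

# lstsq still returns *a* solution (the minimum-norm one), but it is one of
# infinitely many coefficient vectors that fit the data exactly.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b
max_abs_resid = np.max(np.abs(residuals))

print("rank of X:", rank)
print("largest absolute residual:", max_abs_resid)
```

A perfect fit to noise means the residuals carry no information about the error variance, which is one way to see that the estimates are meaningless here.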
Question 2. We have often used the result that TSS = ExpSS + RSS, for example to justify the use of $R^2$ and of F tests. The purpose of this question is to prove that result.
(a) Prove that $(y_i - \bar{y})^2 = (y_i - \hat{y}_i)^2 + (\hat{y}_i - \bar{y})^2 + 2 (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$. (Hint: for this part, the definitions of $\hat{y}_i$ and $\bar{y}$ are completely irrelevant; you may find it easier to just call them a and b or something similar.)
(b) Use the result of part (a) to establish that TSS = ExpSS + RSS + $2 (n - 1) \operatorname{Cov}[e, \hat{y}]$.
(c) Prove that $\operatorname{Cov}[e, \hat{y}] = 0$. You may take it as given that the residual is uncorrelated with each of the regressors.
(d) Complete the proof that TSS = ExpSS + RSS.
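Before attempting the proof, it can be reassuring to check the decomposition numerically. The sketch below assumes NumPy and simulated data (the coefficients 1.0, 2.0 and -0.5 are arbitrary illustrative choices); it verifies the identity from part (b) and the zero covariance from part (c) for one OLS fit with an intercept. A numerical check is not a proof.

```python
import numpy as np

# Simulated data; the true coefficients below are arbitrary assumptions.
rng = np.random.default_rng(1)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 0.5 * x3 + rng.normal(size=n)

# OLS with an intercept.
X = np.column_stack([np.ones(n), x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat  # residuals

tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
exp_ss = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
rss = np.sum(e ** 2)                      # residual sum of squares

# Sample covariance with the (n - 1) divisor, matching part (b).
cov_e_yhat = np.cov(e, y_hat)[0, 1]

print("TSS:", tss)
print("ExpSS + RSS:", exp_ss + rss)
print("Cov[e, y_hat]:", cov_e_yhat)
```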
Question 3. We have collected data on the annual number of cars of twenty different brands sold in Australia (sales, in number of cars), as well as each brand's average retail price (price, in dollars), its annual marketing expenditure (mark, also in dollars), and whether or not it assembles some of its cars domestically (domestic, dummy variable). We wish to investigate how all of these factors influence sales, and we settle on the following regression model.

regress lnsales lnprice lnmark domestic
  Source |       SS       df       MS         Number of obs =      20
---------+------------------------------      F(XXXX, XXXX) =   17.16
   Model |  15.2271807     3   5.0757269      Prob > F      =  0.0000
Residual |  4.73289348    16  .295805843      R-squared     =    XXXX
---------+------------------------------      Adj R-squared =    XXXX
   Total |  19.9600742    19  1.05053022      Root MSE      =  .54388
---------------------------------------------------------------------
 lnsales |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
---------+-----------------------------------------------------------
 lnprice | -1.389525   .2392136    -5.81   0.000   -1.896635  -.8824151
  lnmark |  .1775161    .125631     1.41   0.177   -.0888098    .443842
domestic |  .6156965    .389287     1.58   0.133    -.209555   1.440948
   _cons |  21.35649    3.31581     6.44   0.000    14.32728   28.38569
---------------------------------------------------------------------
(a) I have removed four numbers from this table, indicated by "XXXX". Compute them.
(b) Describe what the coefficient estimate -1.389525 means, in economic terms.
(c) Suppose we wish to test the claim that the Australian car market is completely price-driven, so that marketing and whether production is done domestically are irrelevant. Regressing lnsales only on lnprice gave an RSS of 10.69, whereas regressing lnsales only on lnmark and domestic gave an RSS of 14.72. Which of these two numbers is useful for testing our claim, and why?
(d) Test the claim described in part (c).
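As a reminder of the mechanics behind a test of joint restrictions, here is a minimal sketch of the usual F statistic based on restricted and unrestricted residual sums of squares. The function name `f_statistic` and the numbers in the example call are hypothetical and deliberately unrelated to the car data; choosing the right inputs for part (d) is left to you.

```python
def f_statistic(rss_restricted, rss_unrestricted, num_restrictions,
                n_obs, k_unrestricted):
    """F statistic for testing linear restrictions via two RSS values.

    k_unrestricted counts all parameters of the unrestricted model,
    including the intercept.
    """
    numerator = (rss_restricted - rss_unrestricted) / num_restrictions
    denominator = rss_unrestricted / (n_obs - k_unrestricted)
    return numerator / denominator


# Hypothetical numbers, for illustration only (not the car data):
f_example = f_statistic(rss_restricted=120.0, rss_unrestricted=100.0,
                        num_restrictions=2, n_obs=50, k_unrestricted=4)
print(f_example)
```

The resulting statistic is compared with a critical value from the F distribution with `num_restrictions` and `n_obs - k_unrestricted` degrees of freedom.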
Question 4. The data set education.dta contains data on the years of education of a random sample of 718 Americans, as well as the same information for both of their parents. It is likely that parents' achievements have some predictive power for their children's outcomes, as a result of both a hereditary component of intelligence and the possibility that more highly educated parents do more to stimulate their children to do well at school. Thus, we consider the model $educ_i = \beta_1 + \beta_2 \, meduc_i + \beta_3 \, feduc_i + u_i$.
(a) We will ignore any heteroskedasticity and autocorrelation problems in the remainder of this question. However, discuss whether it is likely that these problems are present in our model.
(b) Estimate this model, and provide interpretations for the three estimated coefficients.
(c) Give 95% confidence intervals for both the conditional mean and the actual value of education for people whose mother has 16 years of education and whose father has 12 years.
(d) Explain what the restriction $\beta_2 = \beta_3$ means, in economic terms.
(e) Test the restriction in part (d), and show that it cannot be rejected. (Note: I want to see that you have estimated both the restricted and the unrestricted model. Feel free to use Stata's test command to check your result, but using only that would be too easy.)
(f) Use the restricted model from part (e) to repeat the prediction exercise in part (c). Intuitively, why are the resulting confidence intervals narrower this time?
(g) Go back to the original model, where $\beta_2$ and $\beta_3$ are allowed to be different. Now, what would the restriction $\beta_2 + \beta_3 = 1$ mean, in economic terms?
(h) Test the restriction in part (g). (The same note as in part (e) applies.)
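For the prediction exercise in parts (c) and (f), the following sketch shows how the two standard errors differ: one for the conditional mean and one for an individual's actual value. It assumes NumPy and uses simulated data as a stand-in for education.dta; the variable values, coefficients, and seed are illustrative assumptions and will not reproduce the real results. You would still need the appropriate t critical value to turn these standard errors into confidence intervals.

```python
import numpy as np

# Simulated stand-in for education.dta (NOT the real data).
rng = np.random.default_rng(2)
n = 718
meduc = rng.integers(8, 19, size=n).astype(float)
feduc = rng.integers(8, 19, size=n).astype(float)
educ = 6.0 + 0.3 * meduc + 0.2 * feduc + rng.normal(scale=2.0, size=n)

# OLS fit of educ on a constant, meduc and feduc.
X = np.column_stack([np.ones(n), meduc, feduc])
b, *_ = np.linalg.lstsq(X, educ, rcond=None)
e = educ - X @ b
k = X.shape[1]
s2 = e @ e / (n - k)  # estimated error variance

# Point of interest: mother has 16 years of education, father has 12.
x0 = np.array([1.0, 16.0, 12.0])
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0

point_prediction = x0 @ b
se_mean = np.sqrt(s2 * h0)          # s.e. for the conditional mean
se_pred = np.sqrt(s2 * (1.0 + h0))  # s.e. for an individual's actual value

print("prediction:", point_prediction)
print("s.e. (conditional mean):", se_mean)
print("s.e. (actual value):", se_pred)
```

The interval for the actual value is always wider, because it adds the variance of the individual's error term to the estimation uncertainty.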