Problem Set - The purpose of this problem set is to reinforce concepts related to models with omitted variables bias, measurement error and multicollinearity.
I. Omitted Variables Bias
Fixed Effects Regressions
1. "The scrap rate for a firm is the number of defective items (i.e., products that must be discarded) out of every 100 produced. Thus, for a given number of items produced, a decrease in the scrap rate reflects higher worker productivity." (Wooldridge 2002). The scrap rate of a firm (scrap) can therefore be used to measure the impact of a job training grant (grant), controlling for the annual sales of the firm (in dollars) and the number of firm employees. However, the scrap rate also depends upon other factors affecting firm productivity (such as workers' education, ability, and experience), which we have termed "firm productivity" below. Thus, we suspect that the true model takes the form:
(1) $\log(\text{scrap}_{it}) = \beta_0 + \beta_1 \text{grant}_{it} + \beta_2 \log(\text{sales}_{it}) + \beta_3 \log(\text{employ}_{it}) + \beta_4 \text{d88}_t + \beta_5\,\text{firm productivity}_i + u_{it}$
where i is the firm, t is the year (1987, 1988 or 1989). Unfortunately, we do not have a measure of firm productivity in our dataset. Our key variable of interest is "grant", equal to 1 if the firm received the job training grant in a particular year (zero otherwise).
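For reference, here is a sketch of the standard omitted variable bias result written in the notation of equation (1). It assumes the course uses the usual textbook formula. The symbols $\tilde{\beta}_1$ and $\tilde{\delta}_1$ are notation introduced here and do not appear elsewhere in this problem set.

```latex
\mathrm{E}\big[\tilde{\beta}_1\big] = \beta_1 + \beta_5\,\tilde{\delta}_1
```

Here $\tilde{\beta}_1$ is the OLS coefficient on grant from the regression that omits firm productivity, and $\tilde{\delta}_1$ is the coefficient on grant from an auxiliary regression of firm productivity on grant, log(sales), log(employ) and d88. The expectation is conditional on the sample values of all regressors, including firm productivity.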
a. Suppose we run the above regression without firm productivity. Under what conditions would the omission of $\text{firm productivity}_i$ from this regression lead to bias in estimating the effect of $\text{grant}_{it}$ on $\log(\text{scrap}_{it})$? Be as specific and precise as possible.
b. Based upon the formula above, would you expect the OLS estimate of the coefficient on grant to be underestimated or overestimated? Be sure to provide support for your answer.
Now we will use actual data from the "JTRAIN.DTA" dataset to estimate this relationship. Since this is panel data, we first need to tell Stata which variable is the cross-sectional variable (i) and which variable is the time variable (t). This can be done using the following command: xtset fcode year.
c. Now we will estimate the above regression only for 1987 and 1988 (excluding the firm productivity variable). [Note: You can tell Stata to ignore the observations for 1989 by using the following command: reg Y X if year!=1989]. Estimate the regression and show the output. What do you conclude about the impact of the training program on the firm's scrap rate? (Note: Be sure to refer to the magnitude and statistical significance of the effect.)
d. Since we have panel data, we can remove the bias associated with "firm productivity" by running a fixed effects regression. This can be done using the following command (using only the data from 1987 and 1988): xtreg Y X if year!=1989, fe
[Note: The Y and X variables are the same as those in part c)]. What do you conclude about the impact of the grant on the scrap rate?
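As a rough sketch of how the commands in parts c) and d) might look once Y and X are filled in: the names lscrap, fcode, year, grant and d88 come from the text above, while lsales and lemploy are assumed names for log(sales) and log(employ), and the file path is assumed to be the working directory.

```stata
* Sketch for parts c) and d); lsales and lemploy are assumed variable names.
use "JTRAIN.DTA", clear
xtset fcode year

* Part c): pooled OLS on 1987-1988, omitting firm productivity
regress lscrap grant lsales lemploy d88 if year != 1989

* Part d): fixed effects (within) estimator on the same sample
xtreg lscrap grant lsales lemploy d88 if year != 1989, fe
```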
e. We can also remove the bias by estimating the same regression using first differences. This requires "first differencing" the data, i.e., subtracting each variable's value in the previous time period from its value in the current time period. This can be done by using the following command: gen dlscrap=D1.lscrap. Generate a first-differenced variable for every variable in the regression.
f. Provide summary statistics for each of your first-differenced variables from part e). What do you notice about the variable dd88? Does this make sense? Why or why not?
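One possible way to generate all of the differenced variables at once is a loop over the regression variables. This is only a sketch: lsales and lemploy are assumed variable names, and the d-prefix naming simply extends the dlscrap and dd88 pattern used in the text.

```stata
* The D1. operator requires the panel to be declared first.
use "JTRAIN.DTA", clear
xtset fcode year

* Creates dlscrap, dgrant, dlsales, dlemploy and dd88.
* lsales and lemploy are assumed names for log(sales) and log(employ).
foreach v of varlist lscrap grant lsales lemploy d88 {
    gen d`v' = D1.`v'
}
```

Each differenced variable is missing for a firm's first year, because there is no earlier period to subtract.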
g. Now estimate the regression using all of your first-differenced variables (including dd88), again only for 1987 and 1988. What do you conclude about the impact of the grant on the scrap rate?
h. What variable falls out of your first-differenced regression? Where can we find the estimated coefficient for this variable?
i. A final way that we can estimate the fixed effects regression is by creating a binary variable (a "fixed effect") for every firm in the dataset. This can be done by estimating the following regression: xi: reg Y X i.id, where "id" is the code for the cross-sectional unit in the dataset. Estimate the same regression (again only for 1987 and 1988) using the above command. What is the impact of the grant program on the scrap rate?
j. What assumption are we making about the nature of the omitted variables bias in this regression by using fixed effects? Do you think that this is a reasonable assumption in this context?
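The three approaches in parts d), g) and i) can be compared on simulated data before turning to JTRAIN. The sketch below is purely illustrative: the panel, the variable names (id, t, a, x, y) and the parameter values are all made up, and areg with absorb() is swapped in as a compact stand-in for the xi: dummy-variable regression.

```stata
clear
set seed 12345
set obs 200                          // 200 hypothetical firms
gen id = _n
gen a  = rnormal()                   // unobserved, time-invariant firm effect
expand 2                             // two periods per firm
bysort id: gen t = _n
gen x = 0.5*a + rnormal()            // regressor correlated with the firm effect
gen y = 1 + 2*x + a + 0.3*(t == 2) + rnormal()
xtset id t

regress y x i.t                      // pooled OLS, firm effect omitted
xtreg y x i.t, fe                    // within (fixed effects) estimator
regress D.y D.x                      // first differences; constant = period effect
areg y x i.t, absorb(id)             // dummy-variable estimator
```

In a balanced two-period panel like this one, the last three commands return the same coefficient on x, while pooled OLS generally does not.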
Randomization and Instrumental Variables
2. For the next few questions, we will use the dataset from a paper called "Happiness on Tap", which looks at the impact of a randomly assigned treatment (a voucher program) on the likelihood of getting connected to a water source in Morocco. The dataset (moroccowater.dta) is on the course website.
(2) $\text{connection}_i = \beta_0 + \beta_1 \text{treatment}_i + u_i$
a. Suppose that we run the above regression, where "treatment" is a binary variable equal to one if the household was assigned to the treatment group (households offered a voucher/zero-interest loan for connection) and "connection" is a binary variable equal to one if the household got connected to a water source (which was still a choice). What must be true for the authors to claim that the estimated coefficient on "treatment" is unbiased?
b. While the above equation is interesting, we are in fact more interested in the following regression equation:
(3) $\text{chlorine}_i = \alpha_0 + \alpha_1 \text{connection}_i + v_i$
which shows how being connected to a water tap might affect a household's water quality (as measured by the presence of chlorine). Since "connection" is not randomly assigned, and we are not controlling for other right-hand side variables, we might be concerned that it suffers from bias. Name one variable that might be omitted from the above regression and discuss how it might bias the coefficient on "connection".
c. Estimate the regression in equation (3). What is the impact of "connection" on the presence of chlorine? Be sure to talk about the magnitude and statistical significance.
d. Since the OLS coefficient on connection may be biased, we could use an instrumental variables approach to remove the bias. This requires that we find an instrument, Z, for connection. What are the two assumptions that we must make about this instrument?
e. A potential instrument that we can use for connection is "treatment". We first want to test whether this is a "strong" instrument (in other words, whether treatment explains the variation in connection or is strongly correlated with connection). This is also known as our "first stage" regression. Estimate the first stage regression. Do you think that "treatment" is a strong instrument for "connection"? Why or why not? [Note: You can look both at the statistical significance of the treatment variable and the F-statistic for the overall regression to support your conclusion].
f. To estimate the instrumental variables estimator, we want to plug in the predicted values of "connection" into our second stage equation (equation 3). We can do this in Stata by running the following command: ivregress 2sls Y (connection=treatment). Estimate the 2SLS estimator for connection. What do you conclude about the impact of connection on the presence of chlorine? How does this compare to your OLS estimate in part c)?
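The "plug in the predicted values" logic of parts e) and f) can be sketched as follows. The variable names chlorine, connection and treatment are assumed to match the names used in the text, conn_hat is a hypothetical name for the fitted values, and the file is assumed to be in the working directory.

```stata
* Variable names are assumed to match those used in the problem text.
use "moroccowater.dta", clear

* First stage: does the randomized offer shift connection?
regress connection treatment

* Manual second stage: replace connection with its fitted values.
* The standard errors from this regression are not valid.
predict conn_hat, xb
regress chlorine conn_hat

* 2SLS in one step, with appropriate standard errors
ivregress 2sls chlorine (connection = treatment)
estat firststage                     // reports the first-stage F statistic
```

When both stages use the same observations, the coefficient on conn_hat equals the ivregress point estimate. The manual version is shown only to make the mechanics visible; ivregress is the command to report.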
II. Multicollinearity
3. For the next few questions, we will use the colombia.dta dataset of Colombian workers. The model we are interested in estimating is the following:
$\ln(W_i) = \beta_0 + \beta_1 S_i + \beta_2 E_i + \beta_3 A_i + u_i$
where $W_i$ is the wage, $S_i$ is years of schooling, $E_i$ is years of experience defined by $E_i = A_i - S_i - 6$, and $A_i$ is age in years. We will assume that u is homoskedastic and uncorrelated with the RHS variables.
a. What are possible reasons for including A in the equation after already controlling for S and E?
b. Estimate the above model. Remember that experience is measured by the following expression: exper=age-school-6. What in the Stata printout tells you that the regression suffers from perfect multicollinearity? If you estimate the regression more than once, do you get the same result each time?
c. In light of the perfect multicollinearity, we may want to exclude A from our regression. Rewrite the above equation to show that ln(W) is a simple linear function of S, E and an error term. (Hint: First rewrite the expression for experience so that A is on the left-hand side, and then substitute this expression for A into the wage equation above.)
d. How should we interpret our new coefficients on S and E from the equation in part (c)? (Hint: Look at the lecture notes for this).
e. Instead of substituting in an expression for A, we can simply run a regression of ln(W) on E and S. According to the formula we used for omitted variables bias, what is the expected value of the coefficient on S in our regression that omits A? (Define carefully and completely any notation you introduce.)
f. If you ran a regression of A on S and E, what coefficient would you get on S? (Use what you know about the relationship between these variables to give a specific numerical value.) Explain.
g. Plug in this information to the formula for part (e). What does the resulting equation tell us about the expected value of the coefficient on S? How does this compare to the answer you got in part d?
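To see how Stata reacts to an exact linear relationship like the one in part b) without touching colombia.dta, the following sketch builds a small artificial sample. All names (school, exper, age, lwage) and coefficient values are made up for illustration; only the identity age = school + exper + 6 is taken from the problem.

```stata
clear
set seed 2024
set obs 500                              // 500 hypothetical workers
gen school = floor(17*runiform())        // years of schooling
gen exper  = floor(40*runiform())        // years of experience
gen age    = school + exper + 6          // exact linear relationship
gen lwage  = 1 + 0.10*school + 0.03*exper + rnormal(0, 0.5)

regress lwage school exper age           // all three regressors
regress lwage school exper               // age excluded
```

Comparing the two printouts shows how the software handles a regressor that is an exact linear combination of the others.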