Part 1 -
Q1. Star Field is home to the Lions professional baseball team. The team's new marketing director, Janna Kay, has been trying to develop a better understanding of the key drivers of attendance at the stadium in order to increase ticket revenues, optimize concession inventories and staffing, and schedule the timing of promotional giveaways. Using historical game data, we specify the following model:
attendance = b0 + b1 nightgame + b2 temp_f + b3 sunday + b4 saturday + b5 friday + b6 promo + b7 openingday + b8 school + u,
where attendance is the total attendance at the game, temp_f is the high temperature on the game day, nightgame is a dummy variable indicating whether the game is played at night, friday, saturday and sunday are dummies for the day of the week, and promo, openingday and school are dummies indicating whether there is a promotional activity, whether it is opening day, and whether the local public school system is in session.
1. Estimate the model.
2. According to the estimation results, which day of the week usually has the highest attendance? Why?
3. How much is the attendance expected to change if the high temperature on the game day becomes 10 degrees higher?
4. What is the estimated difference of the average attendance between night games and daytime games?
5. Predict the attendance on a regular (not opening day) Monday afternoon with a high temperature of 100 degrees when the school system is not in session, assuming that no promotion is available.
Suppose we believe that attendance does not always increase as the high temperature increases, so we add the squared high temperature, temp_f^2, to the model:
attendance = b0 + b1 nightgame + b2 temp_f + b3 sunday + b4 saturday + b5 friday + b6 promo + b7 openingday + b8 school + b9 temp_f^2 + u
6. Do the estimation results for this model support our conjecture or not?
7. Using the estimated coefficients of temp_f and temp_f^2, briefly explain how average attendance changes as the high temperature increases.
8. According to the results, do you want to keep temp_f^2 in the model or not? Why?
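For items 6 and 7, the quadratic specification implies that the effect of one more degree is b2 + 2 b9 temp_f, which changes sign at temp_f = -b2 / (2 b9). The following is a minimal sketch of that arithmetic in Python (standard library only). The function names and the coefficient values are hypothetical placeholders, not estimates from the Lions data; substitute your own estimates.

```python
def marginal_effect(b2, b9, temp_f):
    """Change in expected attendance per extra degree, evaluated at temp_f."""
    return b2 + 2.0 * b9 * temp_f


def turning_point(b2, b9):
    """Temperature where the fitted quadratic peaks (b9 < 0) or bottoms out (b9 > 0)."""
    if b9 == 0:
        raise ValueError("no turning point when the squared term is zero")
    return -b2 / (2.0 * b9)


# Hypothetical coefficients for illustration only, NOT estimates from the data.
b2_example, b9_example = 600.0, -4.0

print(turning_point(b2_example, b9_example))          # temperature at the peak
print(marginal_effect(b2_example, b9_example, 60.0))  # effect below the peak
print(marginal_effect(b2_example, b9_example, 90.0))  # effect above the peak
```

With a positive b2 and a negative b9, the marginal effect is positive below the turning point and negative above it, which is the pattern the conjecture describes.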
Q2. Suppose you want to estimate the seasonal effect on the revenue. There is a constant term included in the regression as usual. How many dummies are needed to perform such analysis?
Q3. Determinants of price per ounce of cola. Cathy Schafer, a student of mine, estimated the following regression from cross-section data of 77 observations.
Pi = B0 + B1D1i + B2D2i + B3D3i + ui
where Pi = price per ounce of cola
D1i = 001 if discount store, = 010 if chain store, = 100 if convenience store
D2i = 10 if branded good, = 01 if unbranded good
D3i = 0001 if 67.6 ounce (2 liter) bottle, = 0010 if 28-33 ounce bottles, = 0100 if 16 ounce bottle, and = 1000 if 12 ounce cans
The results were as follows:
P̂i = 0.143 - 0.00000D1i + 0.0090D2i + 0.00001D3i
t = (-0.3837) (8.3927) (5.8125) R^2 = 0.6033
where the figures in parentheses are the estimated t values.
(a) Comment on the way the dummies have been introduced in the model.
(b) How would you interpret the results, assuming the dummy setup is acceptable?
(c) The coefficient of D3 is positive and statistically significant. How would you rationalize this result?
Q4. Load package gcookbook and type data(diamonds) to load the data set. The definitions of table and depth can be found in the following picture.
1. A diamond's quality can be measured by cut, ordered by Ideal, Premium, Very Good, Good, and Fair. Create dummy D1 to represent Ideal and Premium, and D2 to represent Very Good and Good.
2. Regress price on carat, depth, table, D1 and D2, and all interaction terms between the dummies and the quantitative variables (carat, depth and table). Interpret your results.
3. Create a random sample of size 1000 from the diamonds data. Draw the scatterplot of carat vs log(price), color coded by cut.
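The dummy and interaction construction in items 1 and 2 can be sketched as follows. This is a language-neutral illustration in Python (standard library only), not the R workflow the question points at. It assumes Fair is the reference category, since it belongs to neither D1 nor D2. The helper names and the example row values are hypothetical.

```python
# Map each cut to the dummy it switches on; Fair is the assumed reference category.
CUT_TO_GROUP = {
    "Ideal": "D1", "Premium": "D1",
    "Very Good": "D2", "Good": "D2",
    "Fair": None,
}


def cut_dummies(cut):
    """Return (D1, D2) for a given cut label."""
    group = CUT_TO_GROUP[cut]
    return (1 if group == "D1" else 0, 1 if group == "D2" else 0)


def interaction_row(carat, depth, table, cut):
    """Build one row of regressors: main effects plus dummy x quantitative terms."""
    d1, d2 = cut_dummies(cut)
    row = {"carat": carat, "depth": depth, "table": table, "D1": d1, "D2": d2}
    for name, value in (("carat", carat), ("depth", depth), ("table", table)):
        row["D1:" + name] = d1 * value
        row["D2:" + name] = d2 * value
    return row


# Made-up values for illustration, not a row from the diamonds data.
print(interaction_row(0.5, 61.0, 55.0, "Premium"))
```

Each interaction coefficient then measures how the slope on carat, depth or table in that cut group differs from the slope in the reference group.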
Q5. Load the me.csv file. The population regression function is
Y = 2 + X + u.
W1 = X + ε1 and W2 = X + ε2 are two measurements of X with error.
1. Regress Y on X, plot the scatter plot of (Y, X) and the regression line. Is the estimate β̂1 close to 1?
2. Regress Y on W1, plot the scatter plot of (Y, W1) and the regression line.
3. Regress Y on W2, plot the scatter plot of (Y, W2) and the regression line.
4. The bias of β̂1 is given by β̂1 - 1. Which case yields the largest bias? Why?
5. Now use 2SLS to solve the measurement error problem. Regress Y on W1 and use W2 as the IV for W1. Compare your result with those from items 1, 2 and 3.
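The logic of items 2 to 5 can be previewed on synthetic data. The sketch below is in Python (standard library only) and does not read me.csv. It assumes X, u, ε1 and ε2 are independent standard normal draws, which is an assumption made here for illustration and is not stated in the question. With a single regressor and a single instrument, the 2SLS slope reduces to cov(W2, Y) / cov(W2, W1), which is what iv_slope computes.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ols_slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)


def iv_slope(y, x, z):
    """IV (2SLS) slope of y on x using z as the single instrument."""
    return cov(z, y) / cov(z, x)


def simulate(n=20000, seed=0):
    """Synthetic draw from Y = 2 + X + u with two noisy measurements of X."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [2 + xi + rng.gauss(0, 1) for xi in x]
    w1 = [xi + rng.gauss(0, 1) for xi in x]  # assumed error variance 1
    w2 = [xi + rng.gauss(0, 1) for xi in x]  # assumed error variance 1
    return y, x, w1, w2


y, x, w1, w2 = simulate()
print("OLS on X :", ols_slope(y, x))
print("OLS on W1:", ols_slope(y, w1))
print("IV W1~W2 :", iv_slope(y, w1, w2))
```

Under these assumptions the slope on W1 is pulled toward zero by the factor var(X) / (var(X) + var(ε1)), while the IV slope should land close to the slope obtained with the true X.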
Part 2 -
Q1. Suppose you wish to estimate the effect of class attendance on student performance in econometrics. Let Yi denote student i's final exam score, atti the attendance rate, and GPAi the GPA from the previous semester.
Yi = β0 + β1atti + β2GPAi + ui
1. Let dist be the distance from the student's residence to the classroom. Do you think dist is uncorrelated with u?
2. Assuming that dist and u are uncorrelated, what other assumption must dist satisfy to be a valid IV for att? Also argue that dist satisfies this assumption.
Q2. The beer data gives information for 50 US states and Washington, DC for the years 1985-2000 on the following variables:
Variable | Definition
beer_sales | per capita beer sales in the state
income | in dollars
beer_tax | state tax rate on beer
fips_state | state id
1. Fit a pooled OLS regression of beer sales on income and beer tax.
2. Fit a fixed effect model.
3. Repeat 1 and 2 using logs of the three variables.
4. What is the expected effect of beer tax on beer sales? Do the results support your expectation?
5. Would you expect income to have a positive or negative effect on beer consumption? If it is negative, what does that mean?
Q3. Suppose we want to estimate the Cobb-Douglas production function for different production plants using panel data:
Yit = Kit^β1 · Lit^β2 · exp(ηi) · exp(uit),
where (Kit, Lit) are capital and labor inputs. Plants are indexed by i (= 1, 2, ..., N) and time by t (= 1, 2, ..., T). uit is the disturbance term with zero mean, independent of K and L. Because different plants may use different technologies or be exposed to different technological shocks, we introduce the plant-specific fixed effect ηi. It is easy to see that the higher the ηi, the higher the output level for the same input level (K, L).
1. Describe how to remove the fixed effect ηi.
2. Load data_production.csv. Estimate (β1, β2) via dummy variable regression.
3. Estimate (β1, β2) via fixed effect regression (plm). Do you obtain the same regression coefficients as in 2?
4. Suppose that there is no fixed effect (so pooled OLS would work), Yit = Kit^β1 · Lit^β2 · exp(uit), but we don't have a good measurement of capital. Instead, we have the book value of capital, W1it, and the market value of capital, W2it. Describe how to use 2SLS to estimate the production function.
5. Regress log(Yit) on log(W1it) and log(Lit). Consider four cases: first, simple regression; second, 2SLS with log(W2it) as the IV; third, 2SLS with fixed effects; fourth, fixed effects only (no IV). The true coefficients are (0.7, 0.3). Which model gives the most precise answer? Compare the results from these four models.
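For item 1 of this question, taking logs gives log(Yit) = β1 log(Kit) + β2 log(Lit) + ηi + uit, and subtracting each plant's time average from every variable removes ηi. The sketch below illustrates this within transformation in Python (standard library only) on synthetic data. It does not read data_production.csv. The simulation design (200 plants, 10 periods, capital correlated with ηi) and all helper names are assumptions made for illustration; only the true coefficients (0.7, 0.3) come from the question.

```python
import random


def demean_by_group(values, groups):
    """Subtract each group's mean from its observations (within transformation)."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def ols_two_regressors(y, x1, x2):
    """OLS without intercept for y = b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)


def simulate_panel(n_plants=200, n_periods=10, seed=0):
    """Synthetic log-linear panel with a plant effect correlated with capital."""
    rng = random.Random(seed)
    plant, log_y, log_k, log_l = [], [], [], []
    for i in range(n_plants):
        eta = rng.gauss(0, 1)
        for _ in range(n_periods):
            k = 0.5 * eta + rng.gauss(0, 1)  # assumed link between capital and eta
            l = rng.gauss(0, 1)
            u = rng.gauss(0, 0.1)
            plant.append(i)
            log_k.append(k)
            log_l.append(l)
            log_y.append(0.7 * k + 0.3 * l + eta + u)
    return plant, log_y, log_k, log_l


plant, log_y, log_k, log_l = simulate_panel()
b1, b2 = ols_two_regressors(
    demean_by_group(log_y, plant),
    demean_by_group(log_k, plant),
    demean_by_group(log_l, plant),
)
print(b1, b2)  # within estimates of (β1, β2)
```

No intercept is needed after demeaning because every transformed variable has mean zero within each plant. A regression with one dummy per plant yields the same slope estimates as this within regression.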
Attachment:- Assignment Files.zip