The following two problems will require a lot of calculations in STATA (or however you opt to execute the calculations). They will generate many pages of output. Here is how you should organize it. The first pages should contain your answers to all the questions, along with any key algebraic equations or explanations you need along the way. After that, include a printout of the output from the regressions you executed in support of your answers. Highlight any numbers in this output that you used in the first section. (To save paper, you may print this section double-sided and/or in 2-up format.) Last, include a copy of the DO file that contains the commands you asked STATA to execute. Be sure you organize these in a way that will be clear to the reader.
1. (52 points total, 4 points each part) With this assignment you will find a STATA data file called boston.dta. For reference, the variables in this file are:
nox = nitric oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted distances to five Boston employment centers
ptratio = pupil-teacher ratio by town
lstat = percent lower status of population
medv = median value of owner-occupied homes (in thousands of dollars)
Open this dataset within STATA (only STATA can open it). Before you begin answering the following, it is a good idea to ask STATA to summarize the data using the summarize command. You should also start a log file to store your results.
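The setup described above can be sketched as a short DO-file fragment. The log file name ps_output.log is a hypothetical choice, not something the assignment specifies, and the sketch assumes boston.dta sits in the current working directory.

```stata
* Start a plain-text log (ps_output.log is a hypothetical file name)
log using ps_output.log, text replace

* Load the data and print summary statistics for every variable
use boston.dta, clear
summarize

* ... the regressions for this problem go here ...

* Close the log once all commands have run
log close
```

Keeping the log and use commands at the top of the DO file means the whole analysis can be rerun from scratch in one pass.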
a.) Run the following regression:
b.) Hypothesize the sign of the bias, if any, resulting from excluding age from the regression. Explain your reasoning. (There is no wrong answer as long as you tell a sensible story.)
c.) Use the data to verify (or not) your claim from b). Break down the bias into its pieces.
d.) Now, run the regression of medv on all six explanatory variables: nox, rm, age, dis, ptratio, and lstat.
e.) At a significance level of α=.05, for which coefficients βi, if any, would you reject the null hypothesis that βi=0?
f.) What is the predicted medv with nox=0.5, rm=4, age=60, dis=3, ptratio=20, and lstat=10?
g.) Redo (f) but with nox=0.6. What is the difference in predicted medv between these two communities? Compare this with the coefficient of nox.
h.) Ceteris paribus, compared to (f), what is the impact of reducing the pupil-teacher ratio to 18?
i.) What percentage of the variation in medv is explained by the six X-variables?
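For the predictions in parts (f) through (h), one possible approach is to build the fitted value from the coefficients STATA stores after a regression. This is only a sketch. It assumes the regression in part (d) includes all six variables, as the later regress commands in this problem suggest.

```stata
* Assumed form of the part (d) regression: medv on all six variables
regress medv nox rm age dis ptratio lstat

* Predicted medv at the part (f) values, built from the stored coefficients
display _b[_cons] + _b[nox]*0.5 + _b[rm]*4 + _b[age]*60 + _b[dis]*3 + _b[ptratio]*20 + _b[lstat]*10

* The same linear combination via lincom, which also reports a standard error
lincom _cons + 0.5*nox + 4*rm + 60*age + 3*dis + 20*ptratio + 10*lstat
```

Changing 0.5 to 0.6 for nox, or 20 to 18 for ptratio, gives the comparisons asked for in parts (g) and (h).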
Now change the measurement of nox. Use the 'gen' command:
gen noxppm=nox/10
and then use this in place of nox in the regression command
regress medv noxppm rm age dis ptratio lstat
j.) Compare the coefficient, standard error, and t-ratio for noxppm to those for nox. Interpret the difference between this model and the previous one.
k.) Also compare the R² and remaining coefficients. Interpret the difference between this model and the original regression model.
Now change the variable age to newage
gen newage=100-age
and then use this in place of age in the original regression command. That is, execute:
regress medv nox rm newage dis ptratio lstat
l.) Compare the coefficient, standard error, and t-ratio for newage to those for age. Interpret the difference between this model and the previous one.
m.) Also compare the R² and remaining coefficients. Interpret the difference between this model and the original regression model.
2. (48 points total. 5 points each part, +3 for free.) For the following problem, use the STATA dataset called crime.dta. This data set was compiled by Christopher Cornwell and William Trumbull to study factors that influence crime rates. The data set contains observations for 90 counties in North Carolina for 1981. The definitions of the variables represented in the data set are:
crmrte = crime rate
prbarr = probability of arrest
prbconv = probability of conviction
prbpris = probability of a prison sentence
avgsen = average sentence in days
polpc = number of police per capita
density = population density
pctmin = percent minority
pctymle = percent young males
wmfg = average weekly wage in manufacturing
wcon = average weekly wage in construction
wtuc = average weekly wage in transportation, utilities, and communications
wtrd = average weekly wage in wholesale and retail trade
wfir = average weekly wage in finance, insurance, and real estate
wser = average weekly wage in services
wfed = average weekly wage in federal government
wsta = average weekly wage in state government
wloc = average weekly wage in local government
According to the economic model of crime rates, lower crime rates are associated with better labor markets (higher wages), more police presence and tougher sentences, and lower population density. We will use this data set to examine these hypotheses. Use a significance level of α=.05 for all hypothesis tests.
a.) Run a regression of crmrte on all of the other variables. Call this Model 1.
b.) Do any t-statistics indicate a variable is not statistically significant? Which?
c.) Interpret the F-statistic STATA has calculated for Model 1.
d.) Test the hypothesis that the coefficients on wfed and wsta are equal to each other. Use the t-test method described in the lectures. What transformation do you need to do here? Be specific.
e.) Test the hypothesis that the coefficients on wfed, wsta and wloc are all equal to each other. Do this by writing down the formula for the relevant F-statistic. Calculate it (by running the appropriate restricted regression) and test the hypothesis. Report these results. This restricted version of the regression will be called Model 2.
f.) Return to Model 1. Now test the hypothesis that pctmin and pctymle both equal zero. Do this by writing down the formula for the relevant F-statistic. Calculate it (by running the appropriate restricted regression) and test the hypothesis. Report these results. This restricted version of the regression will be called Model 3.
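For reference, the F-statistic for comparing a restricted model with an unrestricted one is commonly written in the form below. The notation is an assumption and may differ from the lectures: SSR_r and SSR_ur are the sums of squared residuals from the restricted and unrestricted regressions, q is the number of restrictions, n is the number of observations, and k is the number of regressors in the unrestricted model.

```latex
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}
\;\sim\; F_{q,\; n-k-1} \quad \text{under the null hypothesis}
```

Both sums of squared residuals appear in the ANOVA table STATA prints above each set of regression results, in the "Residual" row.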
The model could potentially be simplified by replacing all the wage variables with an average. Specifically, let us define
avgwage = (wmfg + wcon + wtuc + wtrd + wfir + wser + wfed + wsta + wloc)/9
Generate this variable.
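A minimal sketch of the generate step, assuming avgwage is the simple unweighted mean of the nine wage variables listed above (that reading of "average" is an assumption):

```stata
* Assumption: avgwage is the unweighted mean of the nine weekly wage variables
gen avgwage = (wmfg + wcon + wtuc + wtrd + wfir + wser + wfed + wsta + wloc)/9

* Quick check that the new variable looks sensible
summarize avgwage
```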
g.) Return to Model 1 and run using avgwage in place of the individual wage variables. Check the validity of this restriction. As before, do this by writing down the formula for the relevant F-statistic. Calculate it (by running the appropriate restricted regression) and test the hypothesis. Report these results. This restricted version of the regression will be called Model 4.
h.) Let's focus our attention on the coefficient for the variable polpc. How do the value of this coefficient and its statistical significance change as we move from model to model? To answer this, write down a table containing the results for this coefficient for each of the four models. In this table, include the coefficient values, the values of the t-statistic (for the hypothesis that the coefficient equals 0), and whether you would reject the hypothesis.
i.) What do your results in the last question imply about the relationship between the number of police and the crime rate? Are you confident in these results based on the work you have done? Why or why not?
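The hand calculations in parts (d) through (f) can be cross-checked with STATA's built-in test command. This is a sketch, not a substitute for the methods the questions ask for. It assumes the variables listed above are the only ones in crime.dta, so that they make up "all of the other variables" in Model 1.

```stata
use crime.dta, clear

* Model 1: crmrte on all of the other listed variables
regress crmrte prbarr prbconv prbpris avgsen polpc density pctmin pctymle wmfg wcon wtuc wtrd wfir wser wfed wsta wloc

* Part (d): equality of the wfed and wsta coefficients
* (with a single restriction, the reported F equals the square of the t-statistic)
test wfed = wsta

* Part (e): wfed, wsta and wloc all equal, which is two restrictions
test (wfed = wsta) (wsta = wloc)

* Part (f): pctmin and pctymle jointly zero
test pctmin pctymle
```

Each test command must follow the Model 1 regression, since it works on the most recently estimated model.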