Part -1:
Dirty Data Assignment
Hardcore working with real data in Stata. Collapsing, Merging, Regressing.
1. Our goal for this problem is to learn how to clean up a small but difficult to use set that has information on population by county so that we can merge it to our original dataset of 990 forms.
a. Download the following dataset:https://www.dropbox.com/s/bg71qxiwxh0c8pn/UrbanityCodes.dta?dl=0
The password is PSAA643.
b. As always, get to know this new dataset. (Show your code and outputs) (sum describe list)
c. What information does this dataset give you?
d. How is the variable "Population_2010" coded? Why might this method of coding not be useful? Why might a different type of coding be more useful?
e. We want to create a new variable for 2010 population that is a number rather than a string. To do this, we will use the command "destring". Type "help destring" in your stata window. Read what it says about destring.
f. Create a new variable called "pop_2010_number" using the destring command. Hint: You will want to have a ", generate" in your command or you will not create a new variable.
g. Check to see if your new variable looks the same as your old variable by opening up the data browser and looking at the two variables side by side. Another way to check would be to do a cross-tab, but Stata IC often has problems doing cross-tabs with a large number of values, so the eye-ball method should be fine. (It is sufficient to say that you looked at it after you have done this check.)
h. Summarize your new variable. Do the means, max, and min look reasonable (as populations by county) to you? If not, what looks unusual?
i. We will be merging this dataset to your original dataset with a variable for county fip code. This is a way of keeping track which county is which across datasets. In UrbanityCodes.dta there are two variables that we could possibly use for this purpose. FIPS and fips. What is the difference between these two codes? (Hint, use "describe".)
j. How is the fips variable coded in NCCS_CORE_2013_orig.dta ?
k. In order to merge the two datasets on county fip, we will need to have county fip variables in each set that are coded the exact same way, either as strings or as long. Create a new variable in the UrbanityCode.dta set called fipsformerge. gen fipsformerge = FIPS . Why did I choose to go with the string option instead of the long option?
l. Save your dataset under a new name, UrbanityCode_formerge.dta
2. The goal for this problem is to collapse your original dataset to get a number for total expenses from all nonprofits by county. You will merge your collapsed data to your population data above because you are interested to see if the total amount of nonprofit expenditures in a county is correlated with the county's population.
a. Load NCCS_CORE_2013_orig.dta
b. Type: "help collapse". Read the section on collapsing data.
c. Use the following code to collapse expenditures (adding them up) to the county cell level: collapse (sum) exps, by(fips) .
d. You are interested in the total expenditures for all nonprofits in a county. How would your code change if instead you were interested in the *average* expenditures for all nonprofits in in a county?
e. Name at least two other measures besides adding and averaging that you could get with the collapse command and explain why they might be of interest.
f. gen fipsformerge = fips . Why do you need to make this variable? Note that there is 1 missing variable when you do this. Noticing that is an important attention-to-detail thing that you should get used to doing. In this case, I can tell you that that missing variable comes from the 487 observations in the original dataset that do not have any county information given.
g. Since missing is unlikely to all be the same missing county, merging missings together would add measurement error. drop if fipsformerge==""
h. Save your dataset under the name NCCS_CORE_2013_collapse.dta
3. The goal for this problem is to merge your two datasets together so that you can say something about how charitable expenditures are correlated with the population in a county.
a. Type "help merge" into stata. Read about merging.
b. You will be merging your collapsed version of the original dataset to the population set using the variable "fipsformerge". This will be
a 1:1 merge because each county only shows up once in each dataset. Merge your two new datasets together with fipsformerge as the merging variable.
c. If you did this correctly, you will notice that 57 observations didn't match. Sometimes not matching means you messed up. Sometimes not matching means that some numbers are in one dataset but not in the other or vice versa. It is important to determine which is the case. Look through the dataset carefully. Sometimes you'll notice that an entire state is missing from one dataset or another (for example one set might have Puerto Rico and another might not). That does not seem to be the situation in this case. After you look through the dataset, write down any comments you have about unmatched data in your solutions.
d. Note that it could be that 990 data are missing in some counties because there are no non-profits in those counties. If that is the case, then you would want to replace their expenditures with "0". We don't know enough at this time to figure out if that is something we should be doing. Right now, let's treat them as missing-it isn't a large portion of your dataset so hopefully unlikely to bias your results too much. In your solutions write down, "I need to know more about the 990 dataset to know what to do with missing 990 information."
4. Now let's do some data analysis.
a. Regress with exps as the Y variable and pop_2010_number as the X variable.
b. In words, how does an increase of 1 in the county population affect the total county non-profit 990 expenditures?
c. Is this correlation significant?
d. Can you say for certain that an increase in population causes an increase in non-profit expenditures? What else might be going on?
5. Bringing it all together.
a. Do you feel smarter?
b. How are you going to remember these skills when you need them in the future? (Note that you do not need to actually remember the exact code, just that these are things that you can do in stata and that you can look them up.)
Part -2:
Exercises
1. Use gss2006_chapter8 . dta. Imagine that you heard somebody say that there was no reason to provide more educational opportunities for women because so many of them just stay at home anyway. You have a variable measuring education, educ, and a variable measuring hours worked in the last week, hrsl. Do a correlation and regression of hours worked in the last week on years of educations. Then do this separately for women and for men. Interpret the slope for the overall sample and then for women and for men separately. Is there an element of truth to what you heard?
2. Use gss2006_chapter8.dta. What is the relationship between the hours a person works and the hours his or her spouse works? Do this for women and for men separately. Compute the correlation, the regression of these. Next test if the correlation is statistically significant and interpret the results, and the scattergrams.
3. Use gss2006_chapter8.dta. Repeat figure 8.2 using your own subsample of 250 observations. Then repeat the figure using a jitter (3) option. Compare the two figures. Set your seed at 111.
4. Use gss2006_chapter8.dta. Compute the correlations between happy, and health by using correlate and then again by using pwcorr. Why are the results slightly different? Then estimate the correlations by using pworr, and get the significance level and the number of observations for each case. Finally, repeat the pwcorr command so that all the Ns are the same (that is, t casewise/listwise deletion).
5. Use gss2002_c.hapter8.dta. There are two variables called happy7 and satfan7. Run the codebook command on these variables. Notice how the higher score goes with being unhappy or being dissatisfied. You always want the higher score to mean more of a variable, so generate new variables (happynew and satfamnew) that reverse these codes so that a score of 1 on happynew means very swum, and a score of 7 means very happy. Similarly, a score of 1 on satfamnew means very dissatisfied and a score of 7 means very satisfied. Now do a region happiness on family satisfaction with the new variables. How correlated are these variables? Write the regression equation. Interpret the constant and the slope.
Attachment:- Chi-square.rar