Be sure to read each question carefully and answer all parts. For all Stata questions, be sure to provide the log output (which should be edited and commented so that it is easier to grade).
Q1. Find an article THAT YOU ARE INTERESTED IN (any article, book chapter, etc.) from other classes, the news, research journals, policy briefs, and so on, that has statistics in it. "Statistics" is meant in the broad sense: it could be anything we have learned in class, or something "statistical" outside the scope of 631.
a. Print it out or copy it, and attach it.
b. Do one of the following:
1. Talk about something in this article that this class has made understandable for you. Explain what you are referring to, what part of the class clarified it for you, and what you think it means.
2. Talk about something in this article that you do not understand. Explain what you are referring to and what you do not understand about it. Ask questions that you think might help clarify it for you.
Q3. Use a graphical analysis (which often looks more impressive than just doing statistical analysis) to support your answer to the following question: You are concerned with staff wasting time over lunch. Government employees in your branch are allowed a 60-minute lunch break. For a week you monitor employees (without their knowledge) and record how long they spend at lunch. You consider a lunch break of longer than 65 minutes unacceptable and a waste of taxpayer resources. Do your employees have a problem with long lunch breaks? (Note: you may want to enter your data twice and compare the two copies to make sure you have no data entry errors. Also be sure to explain how your graphical analysis supports your answer.) What statistics could you run to support your graphical analysis? Run them.
Minutes Spent at Lunch
64 93 66 68
60 65 63 85
78 86 73 77
69 63 64 87
93 61 80 65
82 75 60 64
62 84 75 63
63 61 70 67
76 73 72 91
80 70 89 82
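The assignment asks for Stata output, but the arithmetic behind the summary statistics can be cross-checked with a short Python sketch. It assumes the table is read row by row, takes the 65-minute limit from the question, and uses a one-sample t statistic against 65 as one possible supporting statistic (that choice of test is an assumption, not something the question prescribes).

```python
import math
import statistics

# Lunch durations in minutes, transcribed row by row from the table above.
minutes = [
    64, 93, 66, 68,
    60, 65, 63, 85,
    78, 86, 73, 77,
    69, 63, 64, 87,
    93, 61, 80, 65,
    82, 75, 60, 64,
    62, 84, 75, 63,
    63, 61, 70, 67,
    76, 73, 72, 91,
    80, 70, 89, 82,
]

LIMIT = 65  # longest acceptable lunch break, as stated in the question

n = len(minutes)
mean_minutes = statistics.mean(minutes)
sd_minutes = statistics.stdev(minutes)  # sample standard deviation (n - 1)

# How many observed breaks exceed the limit, and what share of the sample.
over_limit = sum(1 for m in minutes if m > LIMIT)
share_over = over_limit / n

# One-sample t statistic for H0: mean = LIMIT (an assumed choice of test).
t_stat = (mean_minutes - LIMIT) / (sd_minutes / math.sqrt(n))

print(f"n = {n}, mean = {mean_minutes:.3f}, sd = {sd_minutes:.3f}")
print(f"breaks over {LIMIT} minutes: {over_limit} of {n} ({share_over:.0%})")
print(f"t statistic against {LIMIT}: {t_stat:.3f}")
```

Comparing `n` with the number of entries in the table is a quick way to catch a data entry error before moving on to the graphs.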
***Copy and paste STATA results for each part of question 4.
Q4. Download the GSS dataset from e-campus if you do not already have this dataset. This is a shortened version of the full dataset which was downloaded from the NORC website. Documentation can be found here: https://gss.norc.org/Get-Documentation. (Use STATA)
a. Exploring
i. Explore the dataset with Stata using the tools you have learned. As with all questions, show your output. (Note that if your output is too lengthy, before printing your solutions you may clip out the middle part with a note that you have done so.)
ii. For what years is this dataset available in the shortened version I uploaded?
b. Is this dataset longitudinal, panel, cross-section, repeated cross section, case study, some combination (if so, of what), or some other form of dataset?
c. Cut the dataset so you only have data for the year 2012 left.
d. Look carefully through the variables and variable descriptions. You have been asked to compare various statistics by gender and age.
i. Pick an outcome that you are interested in from the list of variables available. Make sure that it is not missing from the dataset.
ii. Cut the dataset down so that the only variables remaining are for year, sex, age, and your variable(s) of interest.
iii. Save this smaller set as a new dataset.
iv. Explore this new dataset.
e. How is your outcome of choice coded?
i. Is it coded in a way that will make sense for analysis?
ii. If it is coded in a way that will make sense for analysis, say N/A for this part. If it is not:
1. Can it be coded in a way that will make sense for analysis (hint: you may want to turn a categorical variable into a binary variable or a continuous variable)?
2. Do so if it can (otherwise you may want to choose another outcome, and move back to part d).
3. Explain any assumptions you made when you changed this variable (or made a new variable from the old one).
f. Test to see if males and females act differently with respect to your variable. Be sure to include your hypothesis tests and whether or not your results are significant. In your opinion, is the magnitude of the difference big?
g. Create a variable for older. Explain how you define "older" vs. "younger," and any other assumptions that you make. [Hint: be EXTRA careful with missing values]
h. Are your results different for gender (from f) if you look only at older people? Be sure to include your hypothesis tests and whether or not your results are significant.
i. Reopen the original dataset.
i. If you have not already, create a .do file that walks through steps d through h (so that it does not cut the original dataset by year).
ii. Cut the dataset so the year is 2000. If your outcome variable in your .do file does not exist for the year you have picked, choose another outcome that does and modify your .do file accordingly.
iii. Run your .do file on the dataset for the year 2000. [As always, show your output.]
iv. Are your results on the hypothesis tests from 4f and 4h different for the year 2000 compared to 2012? If so, why might they be different?
***Show STATA output for question 5
Q5. Using the GSS
a. Pick two variables of your choice.
b. Make sure they are in an appropriate format for a scattergram.
c. Create a scattergram using STATA.
d. Repeat the scattergram using the jitter option.
e. Repeat again with the sunflower option.
f. How are these plots different?
Q6. You have been given a large budget, have successfully bribed your local IRB, and have access to a local prison population. Just kidding! You actually have access to hundreds of psychology undergraduate students. You have been asked to evaluate whether or not listening to classical music while studying for Psych 101 benefits students' midterm test scores. [Note that you must answer the question in the context of the problem; it is not sufficient to just copy from your class notes.]
a. What is the best way to answer this question?
b. Using the procedure you have learned in class, formally guide your (not-bribed) IRB committee through the steps you would take to answer this question, including the pros and cons of each step if there are any.
c. What kind of statistical analysis would you use at the end?
Q7. You want to know the relationship between the number of US troops per citizen in an occupied foreign city and the civilian death rate in those cities. Your statistical team comes back with the following information: "Using data from all US occupied foreign cities in the past 10 years, we ran the following regression: Y = BX + alpha, where Y is the civilian death rate (measured from 0 to 100) and X is the ratio of US troops to citizens times 100 (also measured from 0 to 100). We found that B = 19 and the standard error on B is 5.1. Alpha is 0.3. The R² on this regression is 0.35."
a. What is the t-statistic for X? Is B significant? (Use STATA if necessary)
b. What does B mean in this case?
c. Your newly hired analyst points out, "Your R² is only 0.35. Therefore we should ignore this regression because the fit is really low." Is (s)he right? What should you explain to him/her?
d. What are possible omitted variables?
e. After you have explained part c to your analyst, (s)he recommends that, based on this regression, you remove all US troops from all occupied cities. Why does (s)he recommend this? Should you take his/her advice? Why or why not?
f. How might you better answer this question? (Note: the actual numbers on the Iraq war alone from the Brookings Institution show a small negative sign on B.)
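For part a, the t-statistic is the estimated coefficient divided by its standard error. The sketch below (Python rather than Stata, purely as an arithmetic check) computes that ratio from the two numbers the team reported. Because the sample size is not given, the p-value uses a large-sample normal approximation; that approximation is an assumption, and with a known n the t distribution with the appropriate degrees of freedom would be the usual reference.

```python
from statistics import NormalDist

B = 19.0     # reported slope on X
SE_B = 5.1   # reported standard error of the slope

# t statistic for H0: B = 0 is the coefficient over its standard error.
t_stat = B / SE_B

# The number of cities is not reported, so a standard normal reference
# distribution is assumed here in place of a t distribution.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.3f}")
print(f"approximate two-sided p-value = {p_approx:.5f}")
```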
Q8. You are working for the department of public health. Your supervisor has recently had a bad experience involving NoDoz and is sure that caffeine is evil. (S)he tells you to find all the literature you can on the evils of caffeine so that your office can make a public health announcement against the stuff.
a. Should you ignore the literature on the potential benefits of caffeine? Why or why not? (Give both moral and immoral answers.)
b. You find a medical science article on heart attacks and caffeine intake. It runs the regression Y = BX + alpha on a group of 70,000 men aged 45 to 65 over a period of 5 years, where X is the number of cups of coffee a person drinks each day and Y is the number of heart attacks the man has had during those 5 years.
i. B = .0000002, and the reported t-statistic is 47. What does this mean in words? Is this relationship significant? Is this an important relationship?
ii. What are possible omitted variables?
c. You find another medical science article on the effect of caffeine on fetal growth. It follows 10 mothers throughout their pregnancies and finds the following: Y = BX + alpha, where X = the number of cups of coffee beyond two that the mother drinks each day and Y = birth weight in pounds.
i. B = -2, t = .9. What does this mean in words? Is the relationship significant? Is it important?
ii. What are possible omitted variables?
iii. Does this regression justify looking for other articles on this topic? Why or why not? Would your answer be the same if n = 10,000?
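The two caffeine regressions contrast statistical significance with the size of an effect. A rough Python sketch of the arithmetic follows: it backs out each implied standard error as the coefficient divided by its t-statistic, and for the 10-mother study it compares the t-statistic with the two-sided 5% critical value for 8 degrees of freedom (n - 2 for a regression with one slope and an intercept). The 5% level and the 10-cups-a-day scaling are illustrative assumptions, not part of the articles described.

```python
# Heart attack study (part b): tiny coefficient, very large t statistic.
b_heart, t_heart = 0.0000002, 47.0
se_heart = b_heart / t_heart                 # implied standard error
extra_attacks_10_cups = 10 * b_heart         # predicted change for 10 cups a day

# Fetal growth study (part c): large coefficient, small t statistic, n = 10.
b_fetal, t_fetal, n_fetal = -2.0, 0.9, 10
se_fetal = abs(b_fetal) / t_fetal            # implied standard error
df_fetal = n_fetal - 2                       # one slope plus an intercept

# Two-sided 5% critical value for 8 degrees of freedom (standard t table).
T_CRIT_DF8 = 2.306
fetal_significant = abs(t_fetal) > T_CRIT_DF8

# Approximate 95% confidence interval for the fetal growth slope.
ci_fetal = (b_fetal - T_CRIT_DF8 * se_fetal, b_fetal + T_CRIT_DF8 * se_fetal)

print(f"heart study: SE = {se_heart:.3e}, 10 cups -> {extra_attacks_10_cups:.1e}")
print(f"fetal study: SE = {se_fetal:.3f}, df = {df_fetal}")
print(f"fetal study significant at 5%: {fetal_significant}")
print(f"fetal study 95% CI: ({ci_fetal[0]:.3f}, {ci_fetal[1]:.3f})")
```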
***See attached file for question nine.
Q9. Use the Stata command ttesti to answer the MB&B t-test problem of your choice from chapters 11, 12, or 14 (make sure it is a problem that can be solved using ttesti, not ttest or sampsi). Provide your hypotheses and log file output, and make it clear which question you are addressing. (Yes, you may use a problem that you have already solved by hand and/or that has solutions in the back of the book.)
Q10. Give (real or hypothetical) examples of the following. Do not use the examples that I gave you in your class notes or examples from Wikipedia:
a. A situation where someone is led astray by misuse of the Representativeness Heuristic.
b. A situation where someone is led astray by misuse of the Availability Heuristic.
c. A situation where Framing could change someone's answer to a survey question.