You may use Stata or SurfStat, but show all formulas and calculations even if you use Stata or SurfStat.
Part 1. deaths of infants in the US in the first year of life
In the US, the probability that a child dies in the first year of life is 0.0085.
In a certain rural county, for a recent year, 950 infants were born.
(Assume that these are all singleton births.)
18 of these infants died in their first year of life.
The county supervisors ask you whether this indicates that the county's experience differs from that of the US population, that is, whether too many children are dying in their first year of life.
1. Explain what probability distribution you will use to assess these data, and exactly why. Discuss all relevant features and assumptions.
2. If the probability of dying in the first year of life for infants born in this county is the same as the US probability, what is the probability that 18 or more children of the 950 born would die in their first year of life? Give all details.
3. Based on your probability calculation, what do you conclude?
Write a summary sentence for the county supervisors.
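The upper-tail probability asked for in question 2 can be cross-checked with a few lines of Python. This is only a sketch: it assumes the binomial model (950 independent births, each with the US death probability), and the function name `binomial_upper_tail` is made up for illustration.

```python
from math import comb

def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), using the exact pmf."""
    # Sum the lower tail P(X <= k - 1), which has only k terms, then subtract from 1.
    lower = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k))
    return 1.0 - lower

n_births = 950        # infants born in the county
p_us = 0.0085         # US probability of death in the first year
observed_deaths = 18  # deaths observed in the county

expected_deaths = n_births * p_us
p_value = binomial_upper_tail(n_births, p_us, observed_deaths)

print(f"expected deaths under the US rate: {expected_deaths:.3f}")
print(f"P(X >= {observed_deaths}) = {p_value:.5f}")
```

The same number should come out of a hand calculation or Stata; the code does not replace the explanation of assumptions asked for in question 1.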
Part 2. reasoning about disease from a test result
Physicians often order a lung biopsy for individuals with chronic persistent cough and no indication of disease on chest X-ray. However, routine lung biopsies cannot differentiate between lung cancer and another, non-fatal lung disease called sarcoidosis. An abnormal biopsy finding could be either disease and so requires follow-up testing.
Suppose for men in their 60s with chronic persistent cough who have never smoked tobacco we know the following:
P(A | C) = 0.95, P(A | S) = 0.90, P(A | N) = 0.005
and P(C) = 0.005, P(S) = 0.035, P(N) = 0.96,
where A = abnormal biopsy, C = lung cancer, S = sarcoidosis, and N = no lung disease.
1. Set up a tree diagram to display this information, with the top set of branches representing cancer, sarcoidosis, and no lung disease and the second set of branches the biopsy outcome.
2. Find the overall probability that a man in his 60s with chronic persistent cough who has never smoked tobacco gets an abnormal biopsy.
3. Find the probability of sarcoidosis (S) given the biopsy is abnormal (A).
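The tree-diagram arithmetic for questions 2 and 3 can be sketched in Python. The code reads A as an abnormal biopsy and C, S, N as cancer, sarcoidosis, and no lung disease; the dictionary names are illustrative only.

```python
# P(A | state): probability of an abnormal biopsy given each disease state.
p_abnormal_given = {"C": 0.95, "S": 0.90, "N": 0.005}
# P(state): prior probability of each disease state (first set of branches).
prior = {"C": 0.005, "S": 0.035, "N": 0.96}

# Each path through the tree that ends in A: P(state) * P(A | state).
joint = {state: prior[state] * p_abnormal_given[state] for state in prior}

# Law of total probability: add the three paths that end in an abnormal biopsy.
p_abnormal = sum(joint.values())

# Bayes' rule: each state's share of the total abnormal-biopsy probability.
posterior = {state: joint[state] / p_abnormal for state in joint}

print(f"P(A) = {p_abnormal:.5f}")
for state, value in posterior.items():
    print(f"P({state} | A) = {value:.4f}")
```

The three posterior probabilities must add to 1, which is a quick check on the hand calculation.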
Part 3. power and sample sizes
common scenario for questions 1 to 5:
The investigators plan to use a sample of size 60 to carry out tests of hypotheses with the null hypothesis µ = 250.
They think the population standard deviation is equal to 50.
They really want their hypothesis tests to correctly reject the null hypothesis if µ = 235 .
Stata use is suggested!
1. If the investigators use α = .05 and a one-sided alternative hypothesis µ < 250, what is the power of the test? Show how you found the answer.
2. If the investigators use α = .01 and a one-sided alternative hypothesis µ < 250, what is the power of the test? Show how you found the answer.
3. Compare your answers to questions 1 and 2 and comment on the difference.
4. If the investigators use α = .05 and a two-sided alternative hypothesis µ ≠ 250, what is the power of the test? Show how you found the answer.
5. Compare your answers to questions 1 and 4 and comment on the difference.
6. Assume that the population mean really is 235. What size sample would be needed to have 90% power to correctly reject the null hypothesis µ = 250 if the investigators use a one-sided alternative hypothesis µ < 250? Continue to use 50 as an estimate of the population standard deviation. Show how you found the answer.
7. Assume that the population mean really is 235. What size sample would be needed to have 90% power to correctly reject the null hypothesis µ = 250 if the investigators use a two-sided alternative hypothesis µ ≠ 250? Continue to use 50 as an estimate of the population standard deviation.
8. Compare your answers to questions 6 and 7 and comment on the difference.
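The power and sample-size calculations in this part can be cross-checked with a short Python sketch based on the normal (z) approximation with known σ. The function names are illustrative. Questions 6 and 7 do not state α, so the α = 0.05 used for them below is an assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_z_test(mu0, mu1, sigma, n, alpha, two_sided=False):
    """Power of a z test of H0: mu = mu0 when the true mean is mu1.

    The one-sided version assumes the alternative points toward mu1.
    """
    shift = abs(mu0 - mu1) / (sigma / sqrt(n))  # true difference in standard errors
    z_crit = Z.inv_cdf(1 - alpha / 2) if two_sided else Z.inv_cdf(1 - alpha)
    power = Z.cdf(shift - z_crit)
    if two_sided:
        power += Z.cdf(-shift - z_crit)  # small chance of rejecting in the wrong tail
    return power

def sample_size_z_test(mu0, mu1, sigma, power, alpha, two_sided=False):
    """Smallest n with at least the requested power (ignores the wrong tail)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2) if two_sided else Z.inv_cdf(1 - alpha)
    z_beta = Z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / abs(mu0 - mu1)) ** 2)

# Common scenario: n = 60, sigma = 50, H0: mu = 250, true mean 235.
print(power_z_test(250, 235, 50, 60, alpha=0.05))                  # question 1
print(power_z_test(250, 235, 50, 60, alpha=0.01))                  # question 2
print(power_z_test(250, 235, 50, 60, alpha=0.05, two_sided=True))  # question 4

# ASSUMPTION: alpha = 0.05 for questions 6 and 7 (not stated in the questions).
print(sample_size_z_test(250, 235, 50, power=0.90, alpha=0.05))                  # question 6
print(sample_size_z_test(250, 235, 50, power=0.90, alpha=0.05, two_sided=True))  # question 7
```

Stata's power commands may give slightly different sample sizes if they use the t distribution rather than the normal approximation.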
Part 4. association between breast cancer and age at birth of first child
As part of a large study looking at factors associated with breast cancer in women, the investigators were interested in the hypothesis that development of breast cancer is associated with a woman's age at the birth of her first child.
This is a case-control study; 3,220 women with breast cancer and 10,245 controls were selected.
The women in the study all had at least one child. (The children born to women with breast cancer were born at least a year before their mothers' cancers were diagnosed.) The women's age at the birth of their first child was categorized as less than 30 years or 30 years of age or older.
1. State the null hypothesis and alternative hypothesis for a test of association in a way that explicitly reflects the case-control design.
2. Explain why these data satisfy the assumptions to use the χ² test and χ² distribution to assess association.
3. Find the χ² test statistic. (Stata use is fine, but include output!)
4. Find the degrees of freedom.
5. Find the P value.
6. State your conclusion. Write a detailed summary sentence, taking account of the context and study design.
7. Calculate the odds ratio (OR) for these data.
8. Interpret the odds ratio in words.
9. Use Stata's cci command to calculate a confidence interval for the OR and paste it in here, using Courier New font to line up the table nicely and removing extra lines.
Be sure that you get the same OR estimate as the one you calculated in question 7.
10. Write a sentence fully interpreting the OR and the confidence interval.
11. Explain why these data do not provide evidence that delay in birth of first child causes breast cancer.
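The formulas behind questions 3 to 9 can be sketched in Python for a generic 2×2 case-control table. The cell counts below are PLACEHOLDERS chosen only to add up to the stated group sizes (3,220 cases and 10,245 controls); they are not the exam's data and must be replaced with the actual counts. The function name and the Woolf (log-based) interval are choices made for this sketch, and the interval may differ from the one Stata's cci reports.

```python
from math import erfc, exp, log, sqrt

def two_by_two_summary(a, b, c, d):
    """Chi-square test (no continuity correction) and odds ratio for a 2x2 table.

    Layout: cases    -> a exposed, b unexposed
            controls -> c exposed, d unexposed
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    # Woolf 95% interval, built on the log odds ratio scale.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (exp(log(odds_ratio) - 1.96 * se_log_or),
          exp(log(odds_ratio) + 1.96 * se_log_or))
    return chi2, p_value, odds_ratio, ci

# PLACEHOLDER counts (hypothetical): "exposed" = first birth at age 30 or older.
cases_30plus, cases_under30 = 700, 2520         # sums to 3,220 cases
controls_30plus, controls_under30 = 1500, 8745  # sums to 10,245 controls

chi2, p_value, odds_ratio, ci = two_by_two_summary(
    cases_30plus, cases_under30, controls_30plus, controls_under30
)
print(f"chi-square = {chi2:.2f} on 1 df, P = {p_value:.3g}")
print(f"OR = {odds_ratio:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```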
Part 5. association between FEV1 and height in children ages 6 to 10
FEV1, forced expiratory volume, the amount of air a person can blow out in 1 second after breathing in deeply, is an easy way to assess lung function. The units are liters per 1 second.
Researchers studying asthma were interested in how children's FEV1 varied with their height.
Here is the graph of the relationship between FEV1 and height for a random sample of 351 children, along with the fitted least squares line:
1. Based on this plot, does a line give a good summary of the relationship between FEV1 and height? Explain your reasoning.
2. Is the variability of the observed points around the fitted least squares line roughly constant? Explain your reasoning. Focus on the observations with heights between 50 and 65 inches, where there are good numbers of observations.
3. Explain why we do not have to look at the distribution of the standardized residuals to know that it is OK to use methods based on the t distribution to find confidence intervals and carry out hypothesis tests for these data.
4. Use the output from the summary and corr commands to calculate the slope estimate in terms of the Pearson correlation coefficient and the standard deviations of fev1 and height. State the units for the slope.
5. Calculate the intercept estimate. State the units for the intercept.
6. Write out the equation for the fitted regression line.
7. Explain why it is not of any concern that the least squares line has a negative value for the intercept.
8. The investigators want to know if these data provide evidence to support the hypothesis that mean FEV1 is higher for children who are taller.
State the null and alternative hypotheses for the investigators' question.
9. Calculate the test statistic.
10. What are the degrees of freedom for the test?
11. Sketch the area under the t curve that represents the P value.
12. Based on your sketch and the Stata output, explain how you know the P value is very small.
13. Using P < .01 as the criterion for statistical significance, state the test conclusion in a sentence for the investigators.
14. What is the standard deviation of the residuals, and what are its units?
What does the standard deviation of the residuals tell us about the closeness of the observed points to the fitted line?
15. Calculate R² from the sums of squares and interpret it.
16. Explain why the standard error for the estimated population mean FEV1 for height = 60 inches is smaller than the standard error for the estimated population mean FEV1 for height = 48 inches.
17. Calculate a 95% confidence interval for the population mean FEV1 for all children with height = 54 inches.
18. Calculate a 95% prediction interval for the FEV1 for a child with height = 66 inches.
19. Explain why it would not be wise to use your regression model to estimate the mean FEV1 of children 42 inches in height.
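The relationships used in questions 4, 5, 9, and 10 can be sketched in Python. Only the sample size (351) comes from the problem; the correlation, means, and standard deviations below are PLACEHOLDERS, not values from the Stata output, and the function names are illustrative.

```python
from math import sqrt

def slope_intercept(r, sd_y, sd_x, mean_y, mean_x):
    """Least squares slope and intercept from summary statistics."""
    slope = r * sd_y / sd_x              # units of y per unit of x
    intercept = mean_y - slope * mean_x  # units of y
    return slope, intercept

def slope_t_statistic(r, n):
    """t statistic for H0: slope = 0 and its degrees of freedom (n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r**2), n - 2

# PLACEHOLDER summary statistics (hypothetical, for illustration only).
r = 0.8            # Pearson correlation between fev1 and height
sd_fev1 = 0.5      # liters
sd_height = 4.0    # inches
mean_fev1 = 2.0    # liters
mean_height = 57.0 # inches
n = 351            # sample size stated in the problem

slope, intercept = slope_intercept(r, sd_fev1, sd_height, mean_fev1, mean_height)
t_stat, df = slope_t_statistic(r, n)

print(f"slope = {slope:.4f} liters per inch")
print(f"intercept = {intercept:.4f} liters")
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

With these placeholder values the intercept is negative, which illustrates the situation described in question 7: the intercept is the fitted value at height 0, far outside the range of the data.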
Part 6. conceptual foundations
A. what does the central limit theorem say?
One of your colleagues has a simple random sample of 500 mothers who gave birth to their first child in 2013 at a certain hospital. The study was restricted to women who gave birth to a single infant close to full term, 37 weeks gestation or more. (In other words, mothers who gave birth to twins or triplets were not included in the study.) Your colleague has made a histogram of the weight these mothers gained during their pregnancy. The histogram shows that the distribution is skewed and has outliers at the high end.
Your colleague says, "Because the sample size is 500, the Central Limit Theorem says that we may assume that the distribution of the population of gestational weight gains for women giving birth to a first child, a singleton infant born at full term, follows a normal curve."
1. Explain why this statement is wrong.
2. What does the Central Limit Theorem say about this simple random sample of 500 that might be useful for your colleague's analysis of these data?
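A small simulation can illustrate the distinction between the distribution of individual observations and the distribution of the sample mean. This sketch uses an exponential distribution as a hypothetical stand-in for a right-skewed population; it is not real weight-gain data, and the seed and number of replicates are arbitrary.

```python
import random
from statistics import fmean

def skewness(values):
    """Sample skewness: mean cubed deviation over the cubed standard deviation."""
    m = fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(2013)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n values from a right-skewed (exponential) stand-in population."""
    return [rng.expovariate(1.0) for _ in range(n)]

# One sample of 500: its histogram stays skewed, like the population.
one_sample = draw_sample(500)

# Means of 1,000 independent samples of 500: their distribution is close to normal.
sample_means = [fmean(draw_sample(500)) for _ in range(1000)]

skew_raw = skewness(one_sample)
skew_means = skewness(sample_means)
print(f"skewness of one sample of 500 observations: {skew_raw:.2f}")
print(f"skewness of 1,000 sample means (n = 500):   {skew_means:.2f}")
```

A large sample does not change the shape of the population or of the data; it changes the shape of the sampling distribution of the mean.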
B. what do P values and confidence intervals tell us?
A study was conducted to compare the percent of individuals receiving treatment for Hepatitis C who experience major side effects under the standard treatment vs. a newly-developed treatment regimen. 200 patients were recruited for the study. They were randomly assigned to the standard treatment or the newly-developed treatment, keeping the sample sizes equal. Research team members, who did not know which treatment the patients had received, interviewed the patients after three months about their treatment experience.
30 of the patients who received the standard treatment reported experiencing major side effects, compared to 22 of the patients who received the newly-developed treatment.
The P value for the test for equality of proportions using the normal approximation and a two-sided alternative hypothesis is 0.197.
The 95% confidence interval for the difference of the population percents, standard treatment minus the new treatment, is -4% to 20%.
3. Explain why the investigators should not conclude that the population proportions for the 2 treatment regimens are equal.
4. One of the research team says, "The probability is 95% that the difference in population percents is between -4% and 20%." Why is this statement incorrect?
5. Another member of the research team says, "95% of future samples will give a difference in percents between -4% and 20%." Why is this statement incorrect?
6. Write a sentence explaining exactly what 95% confidence means in the context of this example.
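The reported P value and confidence interval can be reproduced with the usual normal-approximation formulas. This sketch assumes 100 patients per group (200 recruited, equal sample sizes), a pooled standard error for the test, and an unpooled standard error for the interval; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

x_std, n_std = 30, 100  # standard treatment: patients with major side effects, group size
x_new, n_new = 22, 100  # new treatment: patients with major side effects, group size

p_std = x_std / n_std
p_new = x_new / n_new
diff = p_std - p_new  # standard minus new

# Test of equal proportions: pooled standard error under the null hypothesis.
p_pool = (x_std + x_new) / (n_std + n_new)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_std + 1 / n_new))
z = diff / se_pooled
p_value = 2 * (1 - Z.cdf(abs(z)))

# 95% confidence interval for the difference: unpooled standard error.
se_unpooled = sqrt(p_std * (1 - p_std) / n_std + p_new * (1 - p_new) / n_new)
z_star = Z.inv_cdf(0.975)
ci = (diff - z_star * se_unpooled, diff + z_star * se_unpooled)

print(f"z = {z:.3f}, two-sided P = {p_value:.3f}")
print(f"95% CI for the difference: {100 * ci[0]:.0f}% to {100 * ci[1]:.0f}%")
```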
C. comparing two groups
In problem set 7, we analyzed data comparing the number of days on which migraine sufferers who received 8 weeks of acupuncture needed pain medication during a 3-week monitoring period with the number for similar patients on the waitlist.
The investigators calculated the following 95% confidence intervals for the mean numbers of days pain medication was needed:
waitlist 3.5 to 5.3 days
acupuncture 2.7 to 3.7 days
7. One of the members of the research team says, "Because these confidence intervals overlap, we can conclude that the null hypothesis that population means are the same is not rejected at α = 0.05." Why is this statement incorrect?
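The arithmetic behind the overlap question can be sketched from the two intervals alone. This is an approximation under stated assumptions: the two groups are independent, and both intervals use roughly the same critical value, so the margins of error combine the way standard errors do. The exact t-based interval would need the group sizes, which are not given here.

```python
from math import sqrt

def mean_and_margin(lower, upper):
    """Recover the point estimate and margin of error from interval endpoints."""
    return (lower + upper) / 2, (upper - lower) / 2

mean_wait, margin_wait = mean_and_margin(3.5, 5.3)  # waitlist
mean_acu, margin_acu = mean_and_margin(2.7, 3.7)    # acupuncture

diff = mean_wait - mean_acu

# ASSUMPTION: independent groups and a common critical value. The margin for the
# difference is sqrt(m1^2 + m2^2), which is always smaller than m1 + m2.
margin_diff = sqrt(margin_wait**2 + margin_acu**2)
ci_diff = (diff - margin_diff, diff + margin_diff)

print(f"difference in means: {diff:.2f} days")
print(f"sum of the two margins: {margin_wait + margin_acu:.2f}")
print(f"margin for the difference: {margin_diff:.2f}")
print(f"approximate 95% CI for the difference: {ci_diff[0]:.2f} to {ci_diff[1]:.2f}")
```

Two intervals overlap whenever the difference in means is smaller than the sum of the two margins, but the test depends on the smaller combined margin.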