Question 1 -
The following table lists some variables that might be of interest in your next data analysis. For each variable, complete the associated table indicating whether it is categorical (and if so, is it nominal or ordinal) or numerical (and if so, is it discrete or continuous).
Item | Variable | Categorical: nominal | Categorical: ordinal | Numerical: discrete | Numerical: continuous
Example | Eye Color | X | | |
1a | Sex | | | |
1b | Number of runs scored in a baseball game | | | |
1c | Profession | | | |
1d | Temperature, measured in Fahrenheit | | | |
1e | Confidence in one's ability to do statistics as measured by "yes/no" to the statement: "I will do well" | | | |
1f | Number of siblings | | | |
1g | Distance an individual can run in five minutes | | | |
1h | Ethnicity | | | |
1i | Number of MDs who also have a PhD | | | |
1j | Lack of coordination as measured by time it takes an individual to complete a certain puzzle | | | |
Question 2 -
Here is a hypothetical situation. In 2015, a program aimed at reducing infant mortality was implemented in two regions, Pepi and Quepi. The following (hypothetical) table shows the numbers of births and infant deaths in each region in each of two years: 2014 and 2016.
Year | Pepi: Births | Pepi: Infant Deaths | Quepi: Births | Quepi: Infant Deaths
2014 | 100,000 | 300 | 1,000,000 | 5,000
2016 | 100,000 | 60 | 1,000,000 | 4,000
2a. In which region is there more convincing evidence that the reduction in mortality was caused by the program?
2b. If the program can be continued in one region ONLY, which would you choose? In developing your answer, you may assume that the reductions shown were in fact caused by the program.
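Comparisons like those in 2a and 2b usually start by turning raw counts into rates and then looking at both the relative and the absolute change. The following is a minimal sketch of that arithmetic in Python. The function names and the counts are hypothetical and chosen only for illustration; they are not the Pepi or Quepi figures.

```python
def rate_per_1000(deaths, births):
    """Infant deaths per 1,000 births."""
    return 1000 * deaths / births


def relative_reduction(rate_before, rate_after):
    """Fraction by which the rate fell, relative to the starting rate."""
    return (rate_before - rate_after) / rate_before


# Hypothetical counts for an illustrative region (assumed, not from the table).
births = 50_000
deaths_before, deaths_after = 200, 100

before = rate_per_1000(deaths_before, births)
after = rate_per_1000(deaths_after, births)

print(f"rate before: {before:.1f} per 1,000 births")
print(f"rate after:  {after:.1f} per 1,000 births")
print(f"relative reduction: {relative_reduction(before, after):.0%}")
print(f"absolute reduction: {deaths_before - deaths_after} deaths")
```

A large relative reduction and a large absolute reduction need not occur in the same region, which is why both quantities are printed.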
Question 3 -
The following are some data on some famous statisticians. Yes! Florence Nightingale, among her other talents, was a statistician!
Statistician | Gender | Year of Birth | Year of Death
Sir Francis Galton | 2 | 1822 | 1911
Karl Pearson | 2 | 1857 | 1936
William Sealy Gosset | 2 | 1876 | 1937
Ronald Aylmer Fisher | 2 | 1890 | 1962
Harald Cramer | 2 | 1893 | 1985
Prasanta Mahalanobis | 2 | 1893 | 1972
Jerzy Neyman | 2 | 1894 | 1981
Egon S. Pearson | 2 | 1895 | 1980
Gertrude Cox | 1 | 1900 | 1978
Samuel S Wilks | 2 | 1906 | 1964
Florence Nightingale | 1 | 1909 | 1995
David John Tukey | 2 | 1915 | 2000
3a. By any means you like (by hand is just fine), create a stem-and-leaf summary of the data on the variable YEAR OF BIRTH. Display it here. Then use this visual summary to answer questions #3b - #3e below.
3b. Are there any outliers (i.e., extreme values) in this distribution? Explain.
3c. How would you describe the shape of this distribution? Explain.
3d. What is/are the most frequently occurring score(s) in this distribution? How many times does it/do they occur?
3e. Can we use this stem-and-leaf to obtain the original set of values for this variable? Explain.
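For readers who prefer software to pencil and paper, a stem-and-leaf display can be sketched in a few lines of Python. This is one possible construction, not a required method: the helper name `stem_and_leaf` is hypothetical, the last digit is taken as the leaf, and the scores below are made-up values rather than the birth years above.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: sorted leaves}; the leaf is the last digit after dividing by leaf_unit."""
    plot = defaultdict(list)
    for v in sorted(values):
        scaled = v // leaf_unit
        plot[scaled // 10].append(scaled % 10)
    # Keep stems with no leaves so that gaps in the data stay visible.
    lowest, highest = min(plot), max(plot)
    return {stem: plot.get(stem, []) for stem in range(lowest, highest + 1)}


# Hypothetical exam scores, used only for illustration.
scores = [58, 61, 64, 67, 72, 72, 75, 79, 95]

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem:>3} | {' '.join(map(str, leaves))}")
```

Empty stems are printed on purpose: a row with no leaves is how a gap, and therefore a possible extreme value, shows up in the display.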
Question 4 -
4a. When a distribution is skewed to the right
i) TRUE or FALSE: The median is greater than the mean.
ii) TRUE or FALSE: The distribution is uni-modal
iii) TRUE or FALSE: The majority of observations are less than the mean.
4b. The shape of a frequency distribution can be described using:
i) TRUE or FALSE: A box and whisker plot.
ii) TRUE or FALSE: A table of frequencies
iii) TRUE or FALSE: A histogram
4c. For the sample 3, 1, 7, 2 and 2:
i) TRUE or FALSE: The sample mean is 3
ii) TRUE or FALSE: The sample median is 7
iii) TRUE or FALSE: The range is 1
iv) TRUE or FALSE: The sample variance is 5.5
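The summary statistics named in 4c can be checked with Python's standard `statistics` module. The sketch below uses a hypothetical sample (not the one in 4c) so that the exercise is left to the reader; `statistics.variance` computes the sample variance with the n-1 divisor.

```python
from statistics import mean, median, variance

# Hypothetical sample, chosen only to illustrate the calculations.
sample = [4, 8, 6, 5, 12]

sample_mean = mean(sample)                 # sum of values / n
sample_median = median(sample)             # middle value of the sorted data
sample_range = max(sample) - min(sample)   # largest minus smallest
sample_variance = variance(sample)         # sum of squared deviations / (n - 1)

print("mean:", sample_mean)
print("median:", sample_median)
print("range:", sample_range)
print("variance:", sample_variance)
```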
Question 5 -
The following table shows the numbers of geriatric admissions, each week from May through September, to a certain facility in each of two years, 2012 and 2013.
Week | # Admissions 2012 | # Admissions 2013 | Week | # Admissions 2012 | # Admissions 2013
1 | 24 | 20 | 12 | 11 | 25
2 | 22 | 17 | 13 | 6 | 22
3 | 21 | 21 | 14 | 10 | 26
4 | 22 | 17 | 15 | 13 | 12
5 | 24 | 22 | 16 | 19 | 33
6 | 15 | 23 | 17 | 13 | 19
7 | 23 | 20 | 18 | 17 | 21
8 | 21 | 16 | 19 | 10 | 28
9 | 18 | 24 | 20 | 16 | 19
10 | 21 | 21 | 21 | 24 | 13
11 | 17 | 20 | 22 | 15 | 29
5a. By any means you like (by hand is just fine), summarize these data graphically. Display it here. Then use this visual summary to answer question #5b.
5b. Why do you think these two years were different? Note - There is no single correct answer here. I will accept any well-reasoned interpretation. I'm looking for you to think about what you see!
Question 6 -
6a. You read that the median income of U.S. households in 2010 was $49,455. In 1-2 sentences at most, explain in plain language what "the median income" is.
6b. The Census Bureau website gives several choices for "average income" in its historical income data. In 2010, the median income of American households was $49,455. The mean household income was $67,530. The median income of families was $60,395, and the mean family income was $78,361. The Census Bureau says, "Households consist of all people who occupy a housing unit. The term 'family' refers to a group of two or more people related by birth, marriage, or adoption who reside together". In at most 5 sentences, explain carefully why mean incomes are higher than median incomes and why family incomes are higher than household incomes.
6c. A January 2012 magazine article reported that the average income for readers of the business magazine Forbes was $217,000. In your opinion, is the median wealth of these readers greater or less than $217,000? In at most 1-2 sentences, explain your reasoning.
6d. The distribution of individual incomes in the United States is strongly skewed to the right. In 2008, the mean and median incomes of the top 1% of Americans were $558,726 and $1,137,680. Which of these numbers is the mean and which is the median? In at most 1-2 sentences, explain your reasoning.
6e. By any means you like (by hand is fine), determine which of the following two data sets is more spread out. Show your work. In at most 1-2 sentences, explain your reasoning.
Data set "A": 4 0 1 4 3 6
Data set "B": 5 3 1 3 4 2
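The mean-versus-median comparisons in 6b through 6d rest on how each measure reacts to a few very large values. The short Python sketch below illustrates this with hypothetical incomes (assumed values, not Census figures): one large income moves the mean a great deal and the median hardly at all.

```python
from statistics import mean, median

# Hypothetical incomes in dollars; the last value is deliberately extreme.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 400_000]

income_mean = mean(incomes)
income_median = median(incomes)
below_mean = sum(1 for x in incomes if x < income_mean)

print(f"mean:   {income_mean:,.0f}")
print(f"median: {income_median:,.0f}")
print(f"{below_mean} of {len(incomes)} incomes fall below the mean")
```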
Question 7 -
A box plot is the graph of a five number summary. The central box spans the quartiles. The line in the box marks the median. The size of the box is a measure of spread. The lines extending out from the box give an indication of extremes, if any. Side-by-side box plots are useful for comparing two distributions. As an example, consider the following table. It lists the average monthly temperature (Fahrenheit) of Springfield, Massachusetts and San Francisco, California.
Month | Ave Temp (F) Springfield | Ave Temp (F) San Francisco
January | 32 | 49
February | 36 | 52
March | 45 | 53
April | 56 | 55
May | 65 | 58
June | 73 | 61
July | 78 | 62
August | 77 | 63
September | 70 | 64
October | 58 | 61
November | 45 | 55
December | 36 | 49
7a. Obtain the five number summary for the average monthly temperatures, separately for each data set, Springfield versus San Francisco. Use these values to complete the following table.
Statistic | Springfield | San Francisco
Minimum | |
Q1 | |
Q2 = median | |
Q3 | |
Maximum | |
7b. By any means you like (by hand is fine), produce a side-by-side box and whisker plot of the two distributions of average monthly temperatures. You will use this visual to answer question #7c.
7c. i) Are the 2 cities similar in their typical (median) average temp?
ii) Are the 2 cities similar in terms of temperature spread? Explain
iii) Which city requires owning a larger wardrobe of clothes?
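The five number summary described at the start of this question can be sketched in Python as follows. This is one convention among several, stated here as an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, leaving out the overall median when the sample size is odd. Statistical software may use a different quartile rule and give slightly different values. The function name and the data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n, skip the middle observation so the two halves have equal size.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# Hypothetical measurements, used only for illustration.
data = [12, 15, 11, 20, 18, 14, 16, 19]
print(five_number_summary(data))
```

The width of the box in a box plot is Q3 minus Q1, so two samples can be compared for spread directly from the tuples this function returns.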
Question 8 -
This last exercise gives you practice working with the fundamentals of calculations of the sample mean, the sample variance and the sample standard deviation. It also gives you practice producing and interpreting a histogram.
The table below contains data on X = blood glucose levels (mmol/L) obtained from a simple random sample of n=40 first year medical students. The students are indexed using a subscript "i" that ranges from i = 1 to i = 40.
8a. First calculate the sample mean. To do this, obtain the sum of the individual blood glucose values and divide this by the sample size.
i) $\sum_{i=1}^{40} x_i$ =
ii) n =
iii) Sample mean = $\bar{x} = \frac{1}{n}\sum_{i=1}^{40} x_i$ = fill in/fill in =
8b. Next, calculate the individual squared values of individual blood glucose levels. In developing your answer complete the entries to the 3rd column of the table. All done? Now obtain the sum of the squared values of the individual blood glucose levels. Enter this total at the bottom.
8c. Next, calculate the individual squared values of the deviations of the individual blood glucose levels about the sample mean. In developing your answer complete the entries to the 4th and 5th columns of the table. All done? Now obtain the sum of the individual squared values of the deviations of the individual blood glucose values about the sample mean. Enter this total at the bottom of the 5th column.
i | $x_i$ | $x_i^2$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$
1 | 4.7 | | |
2 | 4.2 | | |
3 | 3.9 | | |
4 | 3.4 | | |
5 | 3.6 | | |
6 | 4.1 | | |
7 | 4.8 | | |
8 | 4.0 | | |
9 | 3.8 | | |
10 | 4.4 | | |
11 | 3.3 | | |
12 | 3.8 | | |
13 | 2.2 | | |
14 | 5.0 | | |
15 | 3.3 | | |
16 | 4.1 | | |
17 | 4.7 | | |
18 | 3.7 | | |
19 | 3.6 | | |
20 | 3.8 | | |
21 | 4.1 | | |
22 | 3.6 | | |
23 | 4.6 | | |
24 | 4.4 | | |
25 | 3.6 | | |
26 | 2.9 | | |
27 | 3.4 | | |
28 | 4.9 | | |
29 | 4.0 | | |
30 | 3.7 | | |
31 | 4.5 | | |
32 | 4.9 | | |
33 | 4.4 | | |
34 | 4.7 | | |
35 | 3.3 | | |
36 | 4.3 | | |
37 | 5.1 | | |
38 | 3.4 | | |
39 | 4.0 | | |
40 | 6.0 | | |
Total of column | | | |
8d. Calculate the sample variance using the appropriate column totals in TWO ways. Show your work. Tip - You should get the same answer both ways. The second formula is a shortcut for calculations by hand, and seeing the two agree should clear up any confusion from having met more than one formula for the sample variance.
i) $s^2 = \frac{\sum_{i=1}^{40} (x_i - \bar{x})^2}{n-1}$
ii) $s^2 = \frac{\left[\sum_{i=1}^{40} x_i^2\right] - n\bar{x}^2}{n-1}$
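The agreement between the two formulas in 8d can also be checked numerically. The Python sketch below implements both on a small hypothetical data set (not the blood glucose values); the function names are illustrative. With rounded intermediate values the shortcut form can drift slightly, so carrying full precision for the mean is advisable.

```python
def variance_definitional(xs):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """(Sum of squares minus n times the squared mean), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return (sum(x * x for x in xs) - n * x_bar ** 2) / (n - 1)


# Hypothetical data, used only to show that the two formulas agree.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

print(variance_definitional(data))
print(variance_shortcut(data))
```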
8e. Finally, calculate the sample standard deviation.
8f. By any means you like (by hand is fine), produce a histogram of these data.
8g. Calculate the mean ±1 standard deviation and the mean ±2 standard deviations. Indicate these points on your histogram.
8h. What term best describes the shape of the distribution of blood glucose in this sample: symmetrical, skewed to the right, or skewed to the left?
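A histogram such as the one asked for in 8f, with the marks from 8g, can be roughed out in text form. The Python sketch below counts observations in equal-width bins and prints the mean ±1 and ±2 standard deviation intervals. The bin start, bin width, helper name, and data are all assumptions made for illustration; they are not the blood glucose values, and other bin choices are equally valid.

```python
from statistics import mean, stdev


def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical measurements, used only for illustration.
values = [2.5, 3.5, 3.5, 4.5, 4.5, 4.5, 5.5, 6.5]
start, width, n_bins = 2.0, 1.0, 5

counts = bin_counts(values, start, width, n_bins)
for k, count in enumerate(counts):
    low = start + k * width
    print(f"[{low:.1f}, {low + width:.1f}) {'*' * count}")

m, s = mean(values), stdev(values)
for k in (1, 2):
    print(f"mean ± {k} SD: {m - k * s:.2f} to {m + k * s:.2f}")
```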