Question 1. Inferential statistics (choose each correct answer)
a. Describe your dataset
b. Draw conclusions about the populations
c. Identify patterns
d. Predict future observations
Question 2. Quartiles are the 75th (Q3) and 25th (Q1) percentiles.
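As a quick illustration of the quartile definitions in Question 2, here is a minimal Python sketch using only the standard library. The data values are hypothetical, and software packages use slightly different interpolation rules, so quartiles from other tools may differ a little.

```python
import statistics

# Hypothetical sample (not from the quiz): the integers 1 through 11.
data = list(range(1, 12))

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The interquartile range is the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```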
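Question 3. Why are squared values used so often when calculating statistics? Squaring makes every deviation positive, so deviations above and below the mean do not cancel each other out, and it gives larger deviations more weight. Squared quantities are also easier to work with mathematically, and they feed directly into further interpretation of results (for example, variance and standard deviation).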
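One way to see the role of squaring in Question 3 is to compare raw and squared deviations from the mean. The sketch below uses a made-up sample; the variable names are illustrative only.

```python
import statistics

# Hypothetical sample used only for illustration.
data = [2, 4, 4, 6, 9]
mean = statistics.mean(data)

# Raw deviations from the mean cancel out and sum to zero.
raw_sum = sum(x - mean for x in data)

# Squared deviations are all non-negative, so they accumulate.
squared_sum = sum((x - mean) ** 2 for x in data)

# The sample variance divides the squared sum by n - 1.
variance = squared_sum / (len(data) - 1)

print(raw_sum, squared_sum, variance)
```

Because the raw deviations always sum to zero, they cannot describe spread on their own; the squared sum is what the variance is built from.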
Question 4. Variance is a description of
a. Range
b. Standard deviation
c. How dispersed the data is around the mean
d. Error
Question 5. An estimated mean value from your sample is inferring the true mean value from the population.
Question 6. The value describing the most frequent response of a variable is the
a. Mean
b. Median
c. Mode
d. Weighted mean
Question 7. There can be more than one mode in a dataset (T/F). TRUE
Question 8. In a symmetrically distributed dataset, the mean is the same as the median. (T/F) TRUE
Question 9. Data that is skewed is best described by the
a. Mean
b. Median
c. Mode
d. Weighted mean
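To make the effect of skew in Question 9 concrete, the following sketch compares the mean and median of two hypothetical samples that differ only in one extreme value. The numbers are invented for illustration.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetric sample
skewed = [1, 2, 3, 4, 100]     # same sample with one extreme high value

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)

skew_mean = statistics.mean(skewed)
skew_median = statistics.median(skewed)

# The single extreme value moves the mean a long way but leaves the median alone.
print(sym_mean, sym_median, skew_mean, skew_median)
```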
Question 10. When setting alpha for a hypothesis test, are you controlling the Type I or Type II error rate? ______ A Type I error is when we wrongly reject the null hypothesis. Type I errors cannot be avoided because they are part of the design of the statistical test, but they can be made less common by setting the significance level (commonly 5%) lower. However, setting the significance level lower, say at 1%, increases the chance of a Type II error.
A Type II error is when you wrongly fail to reject the null hypothesis.
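The link between alpha and the Type I error rate can be checked with a small simulation. This is a rough sketch under stated assumptions: samples are drawn from a normal population where the null hypothesis (mean 0, known standard deviation 1) is true, and a two-sided z-test is applied. The function name, sample size, trial count, and seed are all arbitrary choices for illustration.

```python
import math
import random

def type_i_error_rate(critical_z, trials=5000, n=30, seed=0):
    """Fraction of two-sided z-tests that reject a TRUE null (mean 0, sd 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (1 / math.sqrt(n))  # standard error uses the known sd of 1
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials

rate_05 = type_i_error_rate(1.96)    # two-sided critical value for alpha = 0.05
rate_01 = type_i_error_rate(2.576)   # two-sided critical value for alpha = 0.01

print(rate_05, rate_01)
```

The observed rejection rates should land close to the chosen alpha values, which is the sense in which alpha controls how often a true null is wrongly rejected.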
Question 11. Which of the following statements is true?
a. Linear Regression error values have to be normally distributed, but this is not the case for Logistic Regression
b. Logistic Regression error values have to be normally distributed, but this is not the case for Linear Regression
c. Both Linear Regression and Logistic Regression error values have to be normally distributed
d. Neither Linear Regression nor Logistic Regression error values have to be normally distributed
Question 12. A boxplot graphically displays
a. Median, interquartile range, outliers
b. Mean, +/- 2 standard deviations, min and max
c. Mean, interquartile range, min and max
d. Median, +/- 1.96 standard deviations, outliers
For questions 13-15, consider a dataset with 1000 observations with the variables Treatment (placebo, drug), Response (improved, did not improve), age, weight, zip code, gender, and race.
Question 13. Write null and alternative hypothesis to test for an association between Response and Treatment:
Null hypothesis: $\mu_1 - \mu_2 = 0$. Alternative hypothesis: $\mu_1 < \mu_2$, or $\mu_1 - \mu_2 < 0$ [Mean reaction time is lower for the young group]
Question 14. What is the best test statistic to use for this hypothesis? H0: p = .15 Ha: p > .15
Question 15. What is the best test statistic to identify what factors influence treatment response?
First, compute $\hat{p} = \frac{80}{400} = .20$. Then $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{.20 - .15}{\sqrt{.15(.85)/400}} = \frac{.05}{.01785} = 2.80$ ________________
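The one-proportion calculation in the worked answer (80 successes out of 400, null value 0.15) can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the one-sided p-value step is an addition that assumes the usual normal approximation.

```python
import math

successes, n = 80, 400   # counts from the worked answer
p0 = 0.15                # proportion under the null hypothesis

p_hat = successes / n
standard_error = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# One-sided p-value for Ha: p > p0, using the standard normal upper tail.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(round(p_hat, 2), round(standard_error, 5), round(z, 2), p_value)
```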
Question 16. Blood type (A, B, AB, and O) is what type of data?
a. Scaled
b. Categorical
c. Continuous
d. Ordinal
Question 17. Examples of a population include (choose all that apply)
a. People who received a kidney transplant in the United States
b. Medical Devices manufactured by Company X during 2017
c. Patients visiting a primary care physician whose social security number ends in '9'
d. 1000 children with Type I diabetes
Question 18. Effect size is defined as:
a. The difference you need to measure in order to reject your null hypothesis
b. The variability in your measurement of the dependent variable
c. The true difference between the parameters of your populations being compared
d. The clinically interesting difference between your populations being compared
Question 19. If the residuals (ε) from a regression model are random, they will follow a _______ distribution
a. Uniform
b. Independent
c. Normal
d. Poisson
Question 20. What does it mean that observations must be independent?
a. Observations are randomly sampled
b. The X and Y variable are uncorrelated
c. The measurement of one subject has no relationship to results from the other subjects in your sample
d. Observations are normally distributed
Question 21. Which are some actions you can take to improve the power of your analysis (choose all that apply)?
a. Increase sample size
b. Decrease effect size
c. Improve your measurement precision to decrease variance
d. Choose a different model
e. Do a one-sided test
Question 22. If the true value of (population) mean age of people who live in nursing homes is 80 years, and your sample of nursing homes in Phoenix yields an estimate of 85.3 years and confidence interval [84.1,86.4], this estimate is (choose all that apply)
a. An example of a type I error
b. An example of a type II error
c. Accurate, but not precise
d. Precise, but not accurate
Question 23. You run a study to compare 2 doses of Drug X to placebo. To test whether any dose of Drug X is better than placebo, use _____ test. In order to tell which dose is better than placebo, use __________ test.
Question 24. If you use pooled variances for an ANOVA test and the variance of group B is actually much larger than that of group A, are you more likely to generate a Type I or a Type II error? _____. How could you avoid this error?
Question 25. Suppose you have MRI results that estimate tumor size before and after treatment with chemotherapy (If the tumor is not seen on the "after" MRI, tumor size is recorded as 0). Choose the best test statistic to demonstrate chemotherapy effectiveness (if any):
a. Use a two-sided t-test to compare the mean tumor size after treatment to the mean tumor size before treatment
b. Use a one-sided t-test comparing the "after" measurement for difference from 0 because you only care if the tumor disappears
c. Use a paired t-test to compare the mean difference between before and after measurements
d. Use an ANOVA with 'before' size as the covariate and 'after' size as the dependent variable
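For reference, the paired t statistic mentioned in option (c) of Question 25 is computed from the per-subject differences. The sketch below uses invented tumor sizes purely for illustration; it computes only the test statistic, which would then be compared against a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics

# Hypothetical tumor sizes for five patients (not from the quiz).
before = [30, 25, 40, 35, 20]
after = [20, 15, 35, 25, 0]   # 0 means the tumor was not seen on the "after" MRI

# A paired test works on the within-patient differences.
diffs = [b - a for b, a in zip(before, after)]

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)            # sample standard deviation (n - 1)
se_diff = sd_diff / math.sqrt(len(diffs))

t_stat = mean_diff / se_diff
degrees_of_freedom = len(diffs) - 1

print(mean_diff, round(t_stat, 3), degrees_of_freedom)
```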
Question 26. One method to analyze the performance of Logistic Regression is AIC, which is similar to R-Squared in Linear Regression. Which of the following is true about AIC?
a. We prefer a model with minimum AIC value
b. We prefer a model with maximum AIC value
c. Both but depend on the situation
d. None of these
You are analyzing the results of a clinical trial intended to show that your company's new drug X is superior to the standard treatment. You collect data from 100 subjects in an "open-label" study (patients and doctors know which treatment was given). You run your test, reject the null hypothesis with p = 0.023, and conclude that your drug is superior. However, two other independent double-blind trials, testing 50 subjects each, failed to show superiority. Are the following statements true or false?
Question 27. ___ Your study had a Type I error
Question 28. ___ Your study was under-powered
Question 29. ___ The competitor's studies may have been under-powered
Question 30. ___ Your study failed to account for the placebo effect
Question 31. In multiple linear regression (choose all that apply):
a. The relationship between Xi and Y is linear
b. Some transformation of Xi has a linear relationship with Y
c. Each unit change in Xi causes a unit change of βi in Y
d. The relationship between Xi and Xj must be linear
e. The intercept (β0) must be 0.
Question 32. In a coin-tossing trial with a fair coin, the probability of getting heads is 0.5. What are the odds of getting heads?
a. 0
b. 0.5
c. 1
d. 2
Question 33. What is the odds ratio for variable X if the calculated coefficient (from a logistic regression) is .037? ________
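Questions 32 and 33 both rest on two standard conversions: odds are p / (1 - p), and a logistic regression coefficient is turned into an odds ratio by exponentiating it. The sketch below is only an illustration of those two formulas; the helper name `odds` is made up here.

```python
import math

def odds(p):
    """Convert a probability into odds in favor of the event."""
    return p / (1 - p)

# Fair coin from Question 32.
fair_coin_odds = odds(0.5)

# Logistic regression coefficient from Question 33.
coefficient = 0.037
odds_ratio = math.exp(coefficient)

print(fair_coin_odds, round(odds_ratio, 4))
```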
List 3 reasons data might be censored:
Question 34. __________________
Question 35. __________________
Question 36. __________________
Question 37. Why is the logit link used in logistic regression? ____
For questions 38-42, consider the case where a multiple linear regression model predicting patient weight yields the following results. The variables are Age (in years), Gender (0=female, 1=male), fat intake (average daily intake where 1 = 10-19 g/day, 2 = 20-29 g/day, 3 = 30-39 g/day, etc.), and exercise frequency (0 = <5 days a week, 1 = >5 days a week).
Variable | Parameter estimate | Standard error | t-value | Pr>|t|
Intercept | 106.3 | 10.2 | 3.9 | .003
Age | 1.4 | .36 | 7.5 | <0.001
Gender | 20.4 | 4.5 | 3.3 | .01
Fat intake | 3.8 | 5.6 | .5 | .70
Exercise | -10.7 | 2.8 | 6.7 | 0.001
P=0.0078
Adjusted R-square=0.48
Question 38. Which variables should be included in the final model? ______________________________
Question 39. What is the predicted weight of a 35 year old male with fat intake=6 and exercise=1? __________________
Question 40. Why is fat intake not significant even though it has a higher coefficient than age? _______
Question 41. What is the difference between the overall p-value and the individual p-values? _______________
Question 42. Based on the adjusted R-square of 0.48, is this a useful model to predict patient weight? ______
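The prediction asked for in Question 39 is a weighted sum of the reported parameter estimates. The sketch below plugs the stated patient values into the full model exactly as reported in the results table; it assumes all listed coefficients are used, and a reduced model with a variable dropped would need to be refit and would give different estimates.

```python
# Parameter estimates copied from the regression results table.
coefficients = {
    "intercept": 106.3,
    "age": 1.4,
    "gender": 20.4,
    "fat_intake": 3.8,
    "exercise": -10.7,
}

# Patient from Question 39: 35-year-old male, fat intake = 6, exercise = 1.
patient = {"age": 35, "gender": 1, "fat_intake": 6, "exercise": 1}

# Intercept plus each coefficient times the patient's value for that variable.
predicted_weight = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in patient.items()
)

print(round(predicted_weight, 1))
```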
Question 43. Which of the following statements is true?
a. Linear Regression residual values must be normally distributed, but this is not true for Logistic Regression
b. Logistic Regression error values must be normally distributed, but this is not true for Linear Regression
c. Both Linear Regression and Logistic Regression error values have to be normally distributed
d. Neither Linear Regression nor Logistic Regression error values have to be normally distributed
Question 44. Examples of "time 0" in a Survival analysis could include:
a. Date of diagnosis
b. Date of surgery
c. Date of onset of disease
d. Date of consent to participate in a clinical trial
Question 45. In Cox modelling, the baseline hazard is taken from
a. Sum of Squares of the residuals
b. The intercept
c. Solving the final model using the mean value of each independent variable as each Xi
d. Exp(coef(b1))
For questions 46-48, consider a dataset Teen_Drinking with the following variables:
Drinks | 0=no, 1=yes
Gender | 1=male, 2=female
Depressed | 0=no, 1=yes
Age |
Attends college | 0=no, 1=yes
Parents divorced | 0=no, 1=yes
Question 46. Do any variables need to be transformed prior to performing logistic analysis? If so, explain ____________
Question 47. Write out the R code to predict Drinks (assuming no multicollinearity) ______
Question 48. In the model results from Question 47, the odds ratio for 'Depressed' is 0.5. What is the relationship between Depression and Drinking? ___
Mark the remaining statements (Questions 49-64) as true or false:
Question 49. ___ Scatterplots are a good way to tell if 2 variables are correlated
Question 50. ___ If X is correlated with Y, then a change in X causes a change in Y
Question 51. ___ A correlation of 0.2 is unimportant even if the result of a cor.test is significant
Question 52. ___ A correlation of 1.1 is a "strong" correlation
Question 53. ___ A correlation of 0 means that there is no relationship between X and Y
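For the correlation statements above, it can help to compute Pearson's r by hand on a small example. The sketch below uses made-up data in which Y depends entirely on X through a curve, alongside a perfectly linear case for comparison. The function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]

# Y is fully determined by X, but the relationship is a parabola, not a line.
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])

# A perfectly linear relationship for comparison.
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])

print(r_quadratic, r_linear)
```

Pearson's r measures only linear association, and its value always lies between -1 and 1.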
Regarding logistic regression,
Question 54. ___ The independent variables must be binary
Question 55. ___ Measures the probability that a person will experience the event of interest
Question 56. ___ Measures the proportion of your sample that experienced the event of interest
Question 57. ___ A negative estimated coefficient means that decreasing values of X cause a decrease in the odds of experiencing the event of interest
Question 58. ___ The outcome variable must be binary
Regarding Survival analysis,
Question 59. ___ Kaplan-Meier curves graphically present the results of a Cox model
Question 60. ___ Significance of a log-rank test to compare Kaplan-Meier curves is based on the Chi-square test.
If the Kaplan-Meier curves for a survival dataset CROSS,
Question 61. ___ The proportional hazards assumption has been violated for the grouping variable
Question 62. ___ The log-rank test can be used to test if the curves are different
Question 63. ___ The grouping variable should be excluded from Cox proportional hazards analysis
Question 64. ___ The Cox model could be run in two parts depending on where the curves cross
EXTRA CREDIT:
Question 1. In DNA or RNA sequencing, "mapping" refers to:
a. Determining which gene each read came from
b. Aligning each read to its chromosomal location according to the reference genome
c. Removing non-human sequences
d. Identifying what species your sample came from
Question 2. Phred scores are (choose all that apply)
a. A way to identify low vs. high confidence base calls
b. Based on the intensity of the fluoresced light signals from labelled nucleotides
c. On a linear scale
d. Reported as a single quality score for each read in your sample
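As background for the Phred question, the standard definition is Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. The sketch below simply converts in both directions; the function names are made up for illustration.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for a given base-call error probability."""
    return -10 * math.log10(p)

# Each 10-point increase in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```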
Question 3. What are some reasons why the number of RNA reads may not be comparable across samples?
a. Samples are taken from different tissue types
b. Proteins underwent alternative splicing
c. One of the samples was degraded and read depth is low
d. The samples have a different mix of cell types