Assignment:
1. You will explore the techniques for the course by examining data on the number of visits to a health care professional in Australia from 1977-78. The data have been placed on Wattle. The variables are:
- sex: 1 if female, 0 if male
- age: Age in years divided by 100 (measured as mid-point of 10 age groups from 15-19 years to 65-69, with 70 or more treated as 72)
- income: Annual income in Australian dollars divided by 1000 (measured as mid-point of coded ranges Nil, less than 200, 200-1000, 1001-, 2001-, 3001-, 4001-, 5001-, 6001-, 7001-, 8001-10000, 10001-12000, 12001-14000, with 14001- treated as 15000)
- insurance: insurance contract (medlevy : Medibank levy, levyplus : private health insurance, freepoor : government insurance due to low income, freerepa : government insurance due to old age, disability or veteran status)
- illness: number of illnesses in past 2 weeks
- actdays: number of days of reduced activity in past 2 weeks due to illness or injury
- hscore: general health score using Goldberg's method (from 0 to 12). High score indicates bad health
- chcond: chronic condition (np : no problem, la : limiting activity, nla : not limiting activity)
- doctorco: number of consultations with a doctor or specialist in the past 2 weeks
- nondocco: number of consultations with non-doctor health professionals (chemist, optician, physiotherapist, social worker, district community nurse, chiropodist or chiropractor) in the past 2 weeks
- hospadmi: number of admissions to a hospital, psychiatric hospital, nursing or convalescent home in the past 12 months (up to 5 or more admissions which is coded as 5)
- hospdays: number of nights in a hospital, etc. during most recent admission: taken, where appropriate, as the mid-point of the intervals 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-79 with 80 or more admissions coded as 80. If no admission in past 12 months then equals zero.
- prescrib: total number of prescribed medications used in past 2 days
- nonpresc: total number of non-prescribed medications used in past 2 days
(a) Conduct an exploratory data analysis, where the response y = doctorco + nondocco (i.e. the total number of visits to a health care professional in the past two weeks) is examined in relation to the other variables, which should be considered explanatory variables (covariates). In doing your analysis, make sure to identify any unusual points and discuss why they are unusual. For this assignment do not remove any unusual points, only comment on them (if they exist).
(b) Fit a multiple linear regression model with the response variable and with the other variables in the data as explanatory variables. Do not consider any transformations of the covariates or interactions. Present the main residual plot of the residuals against the fitted values for this model, along with a lowess smoother. Are there any obvious problems with the underlying assumptions?
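The fit and residual plot described in (b) can be sketched as follows. This is only an illustration on synthetic count data, not the data set on Wattle. The helper names `ols_fit` and `lowess_curve` are hypothetical, and the smoother is a plain local-linear fit with tricube weights (lowess without its robustness iterations), written out so that the example needs nothing beyond NumPy.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares; X must already include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return beta, fitted, y - fitted

def lowess_curve(x, y, frac=2 / 3):
    """Local linear smoother with tricube weights (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))   # points in each neighbourhood
    smooth = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[k - 1]            # distance to the k-th nearest point
        if h > 0:
            w = np.clip(1 - (d / h) ** 3, 0, None) ** 3
        else:
            w = (d == 0).astype(float)
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(n), x - x[i]])
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        smooth[i] = coef[0]              # local intercept = smoothed value at x[i]
    return smooth

# Synthetic stand-in for the assignment data (made-up values, illustration only).
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(0.19, 0.72, n)
illness = rng.poisson(1.5, n).astype(float)
y = rng.poisson(np.exp(-1.0 + 1.0 * age + 0.3 * illness)).astype(float)

X = np.column_stack([np.ones(n), age, illness])
beta, fitted, resid = ols_fit(X, y)
trend = lowess_curve(fitted, resid)
# Plot resid against fitted and overlay trend (ordered by fitted) with any plotting tool.
```

A smoother that stays close to zero across the fitted values is consistent with a correct mean structure; a fan-shaped spread of residuals around it points to non-constant variance, which is common for count responses.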
(c) Consider a few transformations of y, such as log(y + 1), √y, and y^(1/4). Fit a multiple linear regression model with the transformed response variable and with the other variables in the data as explanatory variables. Do not consider any transformations of the covariates or interactions. Again present the main residual plot of the residuals against the fitted values for this new model, along with a lowess smoother. Do any of the transformations applied to the response variable appear to have corrected any problems you identified in part (b)?
(d) Try using the Box-Cox approach to find a transformation. Again present the main residual plot of the residuals against the fitted values for this new model, along with a lowess smoother. Does the transformation applied to the response variable appear to have corrected any problems you identified in parts (b) and (c)? Based on your analysis, decide whether a transformation should be considered and, if so, clearly state which one. Use this transformation through the rest of the assignment.
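One way to carry out the Box-Cox search in (d) is to profile the log-likelihood over a grid of λ values. The sketch below uses synthetic data and hypothetical helper names (`boxcox_transform`, `profile_loglik`). Box-Cox requires a strictly positive response, so for a count response that contains zeros, applying it to y + 1 is one possible workaround; that shift is an assumption here, in line with log(y + 1) in part (c).

```python
import numpy as np

def boxcox_transform(y, lam):
    """Box-Cox transform of a strictly positive response."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam

def profile_loglik(X, y, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    z = boxcox_transform(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ beta) ** 2)
    n = len(z)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

# Synthetic positive response whose log is linear in x (made-up data).
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0, 4, n)
y_pos = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.3, n))
X = np.column_stack([np.ones(n), x])

lambdas = np.round(np.arange(-2.0, 2.0001, 0.05), 2)
loglik = np.array([profile_loglik(X, y_pos, lam) for lam in lambdas])
lam_hat = lambdas[np.argmax(loglik)]

# Approximate 95% interval: lambdas whose log-likelihood is within
# 3.84 / 2 = 1.92 of the maximum (3.84 is the 0.95 quantile of chi-square(1)).
inside = lambdas[loglik >= loglik.max() - 1.92]
lam_lo, lam_hi = inside.min(), inside.max()
```

If the interval covers a convenient value such as 0 (log), 0.5 (square root) or 0.25, that rounded value is usually preferred over the exact maximiser because it is easier to interpret.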
(e) Construct two added variable plots: one for income and one for age. Comment on the plots.
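An added variable plot for a covariate, as in (e), plots two sets of residuals against each other: the response regressed on all other covariates, and the covariate of interest regressed on those same covariates. The sketch below uses made-up data and hypothetical helper names (`residualize`, `added_variable`); it also checks the standard fact that the least-squares slope in the plot equals that covariate's coefficient in the full model.

```python
import numpy as np

def residualize(v, Z):
    """Residuals from regressing v on the columns of Z (Z includes the intercept)."""
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

def added_variable(X, y, j):
    """Return the two residual vectors shown in an added variable plot for column j."""
    others = np.delete(X, j, axis=1)
    e_y = residualize(y, others)         # response adjusted for the other covariates
    e_x = residualize(X[:, j], others)   # covariate j adjusted for the other covariates
    return e_x, e_y

# Made-up data in which income is correlated with age (illustration only).
rng = np.random.default_rng(2)
n = 250
age = rng.uniform(0.19, 0.72, n)
income = 0.3 + 0.8 * age + rng.normal(0, 0.3, n)
y_t = 0.5 + 1.2 * age - 0.4 * income + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), age, income])

e_x, e_y = added_variable(X, y_t, j=2)   # column 2 holds income
slope = (e_x @ e_y) / (e_x @ e_x)        # least-squares line through the origin
beta_full, *_ = np.linalg.lstsq(X, y_t, rcond=None)
# Plot e_y against e_x; the line through the origin with this slope is the partial effect.
```

A clear linear trend in the plot suggests the covariate adds information beyond the others, while curvature or a few isolated points driving the slope would be worth commenting on.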
(f) Construct confidence intervals for all pairwise differences for the factor insurance with a family level α = 0.05. Which differences are statistically significant, if any?
(g) Construct confidence intervals for all pairwise differences for the factor chcond with a family level α = 0.05. Which differences are statistically significant, if any?
(h) Examine (but do not present) the ANOVA (Analysis of Variance) table and summary output for the model which you chose in (d). Now adjust the order of the explanatory variables so that you can test the following nested hypotheses.
H0: β_insurance = β_sex = β_age = β_income = β_nonpresc = 0
H0: β_insurance = β_sex = β_age = 0
H0: β_insurance = 0
Present the ANOVA table for the re-ordered model and discuss the result of the partial (nested) F-tests for the above hypotheses. Fully write out the tests. Do your results suggest some possible modification(s) you could make to the model? If so then make those modifications.
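The partial (nested) F-tests in (h) compare the residual sum of squares of the full model with that of the reduced model obtained by dropping the tested terms. The sketch below computes the statistic directly on made-up data; `rss` and `partial_f` are hypothetical helper names. Note that a factor such as insurance with four levels contributes three dummy columns, so q counts columns rather than variables.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def partial_f(X_full, y, drop_cols):
    """F statistic for H0: the coefficients of the columns in drop_cols are all zero."""
    n, p = X_full.shape
    q = len(drop_cols)
    X_red = np.delete(X_full, drop_cols, axis=1)
    rss_full, rss_red = rss(X_full, y), rss(X_red, y)
    f_stat = ((rss_red - rss_full) / q) / (rss_full / (n - p))
    return f_stat, q, n - p

# Made-up data: only x1 truly affects the response (illustration only).
rng = np.random.default_rng(3)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y_t = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)
X_full = np.column_stack([np.ones(n), x1, x2, x3])

f_x1, q1, df1 = partial_f(X_full, y_t, [1])           # H0: beta_x1 = 0
f_noise, q2, df2 = partial_f(X_full, y_t, [2, 3])     # H0: beta_x2 = beta_x3 = 0
```

The numerator `rss_red - rss_full` equals the sum of the sequential sums of squares of the dropped terms when they are entered last in the model, which is why the re-ordered ANOVA table gives the same tests. Each statistic is compared with an F distribution on (q, n - p) degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, q, n - p)` returns the p-value.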
(i) Investigate whether the variable sex has an interaction effect with any of the other variables.
(j) For your model, construct a plot of the internally Studentized residuals against the fitted values, a normal Q-Q plot of the residuals, and a bar plot of Cook's distances for each observation. Use these plots (and other means) to comment on the model assumptions and on any unusual data points.
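The quantities behind the three plots in (j) can be computed from the hat matrix. The sketch below uses made-up data with one planted unusual point; `influence_measures` is a hypothetical helper name, and the plotting positions (i - 0.5)/n for the Q-Q plot are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

def influence_measures(X, y):
    """Leverages, internally Studentized residuals and Cook's distances."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)                 # reduced QR, so the hat matrix is Q Q^T
    h = np.sum(Q ** 2, axis=1)             # leverages h_ii
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2 = (e @ e) / (n - p)                 # residual variance estimate
    r = e / np.sqrt(s2 * (1.0 - h))        # internally Studentized residuals
    cooks = r ** 2 * h / (p * (1.0 - h))   # Cook's distances
    return h, r, cooks, X @ beta

# Made-up data with one planted high-leverage outlier at index 0.
rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y_t = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)
x[0], y_t[0] = 6.0, 28.0
X = np.column_stack([np.ones(n), x])

h, r, cooks, fitted = influence_measures(X, y_t)

# Normal Q-Q plot coordinates: sorted residuals against standard normal quantiles.
theo = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
sample = np.sort(r)
# Plot r against fitted, sample against theo, and cooks as a bar plot by observation index.
```

Observations with a large Cook's distance relative to the rest, or Studentized residuals well outside roughly ±2, are candidates for the discussion of unusual points.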
(k) Fully interpret the results of your final model. Provide plots of (y against age), (y against income), and (y against hscore), with regression lines for the different levels of the factor insurance. Additionally, add 95% point-wise confidence intervals for the regression lines (each confidence interval can have an α = 0.05). Finally, use different plotting symbols for male and female.