Question 1:
A key problem facing the UK health service is to estimate the cost of treating individual patients, so that Hospital Trusts can charge Clinical Care Groups and other Purchasers for treating their patients. (Costs have to be estimated because management information systems are not yet capable of producing the true costs.)
Two scoring systems that can be calculated from readily available information have been developed outside the UK, and it is thought that they may be good indicators of patient costs in the UK. To test these indicators, actual costs (in £) have been collected for a sample of 60 patients, together with their scores on the two indicators (ind_A and ind_B).
This data is summarised at the end of the question in the form of descriptive statistics, produced using SPSS. The cost of hospital treatment (ActualCost) has been regressed separately against the two indicators using SPSS, with the results shown in the following pages under the headings MODEL 1 and MODEL 2 respectively.
(i) Which of these models show evidence of a significant linear relationship between treatment cost and the explanatory variable? Justify your answer. Express any significant relationship(s) in words.
(ii) On the basis of the output provided, comment on the apparent quality of the two models.
(iii) The results of using SPSS to predict the cost of treating a patient whose scores on indicators A and B are 4000 and 1000 respectively using the two models are shown below.
Cost (£)
          Prediction   LMCI    UMCI    LICI    UICI
MODEL 1   517.2        487.0   547.5   287.9   746.6
MODEL 2   557.0        531.7   582.3   360.2   753.8
Use your preferred model to estimate the average cost of a patient with these characteristics, and of an individual patient with these characteristics. Comment on the levels of accuracy of your predictions. Explain, as if to a hospital manager, the sources of inaccuracy in your predictions.
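As an aid to reading the table in part (iii), the following minimal sketch computes the width of each interval. It assumes that LMCI/UMCI are the lower and upper limits of the interval for the mean cost and that LICI/UICI are the limits of the interval for an individual patient, which is how SPSS normally labels these saved values. The confidence level is not stated in the question. The dictionary keys and the function name are illustrative only.

```python
# Values copied from the prediction table in part (iii); all in pounds.
# Assumption: LMCI/UMCI bound the mean cost, LICI/UICI bound an individual's cost.
predictions = {
    "MODEL 1": {"pred": 517.2, "lmci": 487.0, "umci": 547.5, "lici": 287.9, "uici": 746.6},
    "MODEL 2": {"pred": 557.0, "lmci": 531.7, "umci": 582.3, "lici": 360.2, "uici": 753.8},
}


def interval_widths(row):
    """Return (width of mean interval, width of individual interval) in pounds."""
    mean_width = round(row["umci"] - row["lmci"], 1)
    individual_width = round(row["uici"] - row["lici"], 1)
    return mean_width, individual_width


for name, row in predictions.items():
    mean_width, individual_width = interval_widths(row)
    print(f"{name}: mean interval width = {mean_width}, "
          f"individual interval width = {individual_width}")
```

Comparing the two widths for the same model shows how much more uncertain a prediction for a single patient is than a prediction for the average patient with the same score.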
(iv) Assuming that the model implied by your preferred regression is appropriate, sketch a diagram showing how the residuals would be distributed when plotted against your chosen indicator variable. Indicate the correct scales on your diagram, as far as possible.
(v) Suppose the data used to build your regression model came from two different hospitals, one of which had ‘unit costs’ (i.e. costs for comparable patients) that were 20% higher than those of the other. How would you detect this problem during your residuals analysis?
MODEL 1 (Indicator A)
MODEL 2 (Indicator B)
Question 2:
A triathlon event consists of three stages: swim, cycle ride and run. Competitors undertake the three stages in that order and are timed from the start of their swim to the end of their run. The competitor who takes the shortest time wins. Data has been collected from four previous events, each of which restricted entry to 200 male competitors between the ages of 18 and 55 (see file TriathlonCWA2014.sav). The available data is as follows:
Time (total race time in seconds)
Event (the four previous triathlons have been labelled A, B, C and D)
Swim (distance in metres)
Cycle (distance in miles)
Run (distance in miles)
Age (years)
The four events were over similar terrain, but differed in the lengths of their parts.
Event   Swim (metres)   Cycle (miles)   Run (miles)
A       400             25              5
B       500             15              3.5
C       300             20              5
D       400             15              6
In stamina events of this sort, competitors usually achieve their best times between the ages of 25 and 35.
(a) Carry out a preliminary analysis of the data using Scatterplots, Correlations, and anything else you think appropriate. Report your preliminary findings.
(b) Use multiple linear regression to investigate the relationship between Total Race Time and the explanatory variables: swim distance, cycle distance, run distance and age. Justify your choice of model.
(c) Carry out a residuals analysis to check whether or not the usual regression assumptions seem to hold for your preferred model. Carefully justify your conclusions, noting any reservations you have about your model.
(d) In the light of your answer to (c) carry out any further improvements to your model you think appropriate (if any), and explain why you believe it is an improvement.
(e) Use your preferred model to predict the Total Race Time of ‘average’ 35-year-old and 60-year-old competitors taking part in a triathlon with a 450 metre swim, 18 mile cycle and 5 mile run. Do you have any reservations about your predictions? How well would you expect the 35-year-old to do in comparison to other 35-year-olds if he completes the event in 82 minutes?
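Two practical checks are useful before tackling part (e): Time is recorded in seconds while the finishing time is quoted in minutes, and the prediction inputs should be compared with the ranges covered by the data. The sketch below is a hedged illustration only. The distance ranges are taken from the event table, and the age range uses the stated entry limits (18 to 55), which bound the ages in the sample but may be wider than the ages actually observed. All names are hypothetical.

```python
# Ranges covered by the four previous events (from the event table) and the
# stated entry limits on age. Assumption: the true observed age range may be
# narrower than the entry limits used here.
OBSERVED_RANGES = {
    "swim_m": (300, 500),
    "cycle_miles": (15, 25),
    "run_miles": (3.5, 6),
    "age_years": (18, 55),
}


def outside_observed(scenario):
    """Return the names of inputs that lie outside the ranges in the data."""
    return [
        name
        for name, value in scenario.items()
        if not (OBSERVED_RANGES[name][0] <= value <= OBSERVED_RANGES[name][1])
    ]


def minutes_to_seconds(minutes):
    """Convert a time in minutes to seconds, the unit used for Time."""
    return minutes * 60


course = {"swim_m": 450, "cycle_miles": 18, "run_miles": 5}
for age in (35, 60):
    flagged = outside_observed({**course, "age_years": age})
    print(f"age {age}: inputs outside the data = {flagged}")

print("82 minutes =", minutes_to_seconds(82), "seconds")
```

Any input flagged by the range check means the corresponding prediction is an extrapolation beyond the data used to fit the model.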