The variables from the data set are:
| Variable Name | Description |
|---|---|
| Identification number | 1-113 |
| Length of stay | Average length of stay in hospital (in days) |
| Age | Average age of patients (in years) |
| Infection risk | Average estimated probability of acquiring infection in hospital (in percent) |
| Routine culturing ratio | Ratio of number of cultures performed to number of patients without signs or symptoms of pneumonia, times 100 |
| Routine chest X-ray ratio | Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100 |
| Number of beds | Average number of beds in hospital |
| Medical school affiliation | 0 = Yes, 1 = No |
| Average daily census | Average number of patients in hospital per day |
| Number of nurses | Average number of full-time licensed practical nurses |
| Available facilities and services | Percent of 35 potential facilities and services that are provided by the hospital |
The goal is to fit the best multiple regression model to the response (infection risk).
Base the analysis on the first 108 observations; the last 5 are held out to check predictions.
Use stepwise regression to select a model, then repeat the selection using best subsets regression.
Do the two methods agree?
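As a rough illustration of how the two selection methods can be compared, here is a minimal sketch using NumPy only. It makes several assumptions that are not part of the assignment: the data are simulated (they are not the hospital data), both procedures score models by AIC, and the helper names `design`, `aic`, `stepwise`, and `best_subset` are hypothetical. Statistical packages often run stepwise selection with F-to-enter and F-to-remove tests instead, so results from such software may differ.

```python
import itertools
import numpy as np

def design(X, cols):
    """Design matrix: an intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, j] for j in cols])

def aic(X, y, cols):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of coefficients)."""
    A = design(X, cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    n = len(y)
    return n * np.log(rss / n) + 2 * A.shape[1]

def stepwise(X, y):
    """Bidirectional stepwise search: add or drop one predictor per step while AIC improves."""
    current = set()
    best = aic(X, y, sorted(current))
    while True:
        # Toggling predictor j adds it if absent and removes it if present.
        moves = [current ^ {j} for j in range(X.shape[1])]
        scores = [aic(X, y, sorted(m)) for m in moves]
        k = int(np.argmin(scores))
        if scores[k] >= best:
            return sorted(current)
        current, best = moves[k], scores[k]

def best_subset(X, y):
    """Exhaustive search over every subset of predictors, scored by the same AIC."""
    p = X.shape[1]
    subsets = [c for r in range(p + 1) for c in itertools.combinations(range(p), r)]
    return list(min(subsets, key=lambda c: aic(X, y, c)))

# Simulated stand-in data (NOT the hospital data): 108 rows, 5 candidate predictors,
# of which only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 5))
y = 2 + 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=108)

step_cols = stepwise(X, y)
best_cols = best_subset(X, y)
print("stepwise:    ", step_cols)
print("best subset: ", best_cols)
print("agree:       ", step_cols == best_cols)
```

Because the exhaustive search examines every subset, its AIC can never be worse than the stepwise result; the two disagree only when the stepwise path stops at a local optimum.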
Are there any outliers in the data?
Look for x-outliers, y-outliers, and high-influence points.
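The three kinds of unusual points are usually screened with leverage values (x-outliers), studentized deleted residuals (y-outliers), and Cook's distance (influence). The sketch below computes all three with NumPy. It is an illustration under stated assumptions: the data are simulated with three planted points, the function name `influence_measures` is hypothetical, and the cutoffs shown (leverage above 2p/n, absolute deleted residual above 3) are common rules of thumb rather than requirements of the assignment.

```python
import numpy as np

def influence_measures(X, y):
    """Return leverage, studentized deleted residuals, and Cook's distance for an OLS fit."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    p = A.shape[1]                         # number of coefficients
    H = A @ np.linalg.pinv(A)              # hat matrix
    h = np.diag(H)                         # leverage values h_ii
    e = y - H @ y                          # ordinary residuals
    mse = e @ e / (n - p)
    r = e / np.sqrt(mse * (1 - h))         # internally studentized residuals
    t = r * np.sqrt((n - p - 1) / (n - p - r**2))   # studentized deleted residuals
    cooks = r**2 * h / (p * (1 - h))       # Cook's distance
    return h, t, cooks

# Simulated stand-in data (NOT the hospital data) with three planted points.
rng = np.random.default_rng(0)
n = 108
X = rng.normal(size=(n, 2))
X[0] = [8.0, 8.0]     # x-outlier: extreme predictors, response stays on the true plane
X[2] = [6.0, -6.0]    # extreme predictors; its response is shifted below
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)
y[1] += 10.0          # y-outlier: ordinary predictors, unusual response
y[2] += 15.0          # high leverage plus unusual response, so high influence

h, t, cooks = influence_measures(X, y)
p = X.shape[1] + 1
print("high leverage (h > 2p/n):   ", np.where(h > 2 * p / n)[0])
print("y-outliers (|t| > 3):       ", np.where(np.abs(t) > 3)[0])
print("largest Cook's distance at: ", int(np.argmax(cooks)))
```

A point can have high leverage without being influential, as the first planted point shows: it lies far from the other predictor values but close to the fitted surface.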
Come up with one model that you think best describes the data and can be used for future predictions.
Show the residual plot for this model. Does the model seem appropriate?
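A residuals-versus-fitted plot is one common way to make this check. The following is a minimal sketch that assumes NumPy and Matplotlib are available and uses simulated data in place of the hospital data. A patternless band around zero is consistent with an appropriate model; curvature or a funnel shape suggests a missing term or non-constant variance.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Simulated stand-in data (NOT the hospital data).
rng = np.random.default_rng(2)
X = rng.normal(size=(108, 2))
y = 3 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.6, size=108)

# Ordinary least squares fit with an intercept.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted

# Residuals against fitted values, with a reference line at zero.
fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0.0, linestyle="--")
ax.set_xlabel("Fitted value")
ax.set_ylabel("Residual")
ax.set_title("Residuals vs. fitted values")
```

Calling `fig.savefig("residuals.png")` afterwards writes the figure to disk; the file name here is only an example.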
Use this model to predict y for the last 5 observations, with a prediction interval for each, and check whether the observed values fall inside the intervals.
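For a new observation with predictor vector x0, the usual prediction interval is the fitted value plus or minus t(1 - alpha/2; n - p) times sqrt(MSE * (1 + x0' (X'X)^-1 x0)). The sketch below applies that formula to a hold-out of five rows. It assumes NumPy and SciPy are available, uses simulated data rather than the hospital data, and the function name `prediction_intervals` is hypothetical.

```python
import numpy as np
from scipy import stats

def prediction_intervals(X_train, y_train, X_new, alpha=0.05):
    """OLS predictions for X_new with (1 - alpha) prediction limits."""
    n = len(y_train)
    A = np.column_stack([np.ones(n), X_train])            # training design matrix
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    resid = y_train - A @ beta
    mse = resid @ resid / (n - p)
    XtX_inv = np.linalg.inv(A.T @ A)
    # Standard error for a new observation: sqrt(MSE * (1 + x0' (X'X)^-1 x0))
    quad = np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)
    se = np.sqrt(mse * (1 + quad))
    tcrit = stats.t.ppf(1 - alpha / 2, n - p)
    pred = A_new @ beta
    return pred, pred - tcrit * se, pred + tcrit * se

# Simulated stand-in data (NOT the hospital data): 113 rows, fit on 108, hold out 5.
rng = np.random.default_rng(1)
X = rng.normal(size=(113, 3))
y = 4 + X @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.7, size=113)

pred, lower, upper = prediction_intervals(X[:108], y[:108], X[108:])
covered = (y[108:] >= lower) & (y[108:] <= upper)
for yi, lo, hi, ok in zip(y[108:], lower, upper, covered):
    print(f"observed {yi:6.2f}   interval [{lo:6.2f}, {hi:6.2f}]   inside: {ok}")
```

With only five held-out points, one observation falling outside a 95% interval is not strong evidence against the model; several misses, or misses all on the same side, would be more worrying.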