Portfolio task 1 - Regression
In the real world, you will be expected to communicate the results from a statistical analysis you perform to non-statisticians, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.
Task 1 The following dataset is taken from a paper by Stanley and Miller (1979) regarding the values of various variables for 22 US aircraft. The variables are as follows:
FFD first flight date, in months after January 1940. This variable is a proxy for how technologically advanced the aircraft is.
SPR specific power, proportional to power per unit weight
RGF flight range factor, categorised into three levels: low: flight range factor < 4.2; med: 4.2 ≤ flight range factor ≤ 4.87; high: flight range factor > 4.87
PLF payload as a fraction of gross weight of aircraft
SLF sustained load factor
CAR a binary variable which takes the value 1 if the aircraft can land on a carrier, 0 otherwise.
The data are available in the files jet2.csv, jet2.mtw and jet2.sav.
You are required to analyse these data using appropriate regression techniques.
a) Perform an initial exploratory data analysis using plots and the usual descriptive statistical methods. This should include individual plots of FFD against the three continuous variables, with the data grouped by the two categorical variables.
As a consequence, discuss the suitability of the data for regression analysis.
b) Fit a regression model using all the main effects (you will need to produce suitable dummy variables), and in addition, the following 2-way interactions:
RGF×SLF; RGF×SPR; RGF×PLF; RGF×CAR; CAR×SPR; CAR×PLF; CAR×SLF
N.B. Interactions involving RGF require two dummy variable levels.
Fully discuss your results.
Produce an overlay plot of the Student and deleted residuals against the predicted values. Discuss your results. scatterplot and correlation coefficient using Minitab, and use these to assess the suitability for regression analysis.
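As an illustration of the dummy variables and interaction columns that part b) asks for, here is a minimal Python sketch. It assumes "low" is the reference level for RGF and uses invented column names (RGF_med, RGF_high, RGF_medxSLF and so on); neither choice comes from the data files. The single row shown uses the aircraft described in part f).

```python
# Illustrative sketch only. Column names and the choice of "low" as the
# reference level are assumptions, not taken from jet2.csv.
RGF_LEVELS = ("low", "med", "high")


def rgf_dummies(level):
    """Return two 0/1 indicators (med, high) for a three-level RGF value."""
    if level not in RGF_LEVELS:
        raise ValueError(f"unknown RGF level: {level!r}")
    return {"RGF_med": int(level == "med"), "RGF_high": int(level == "high")}


def design_row(spr, rgf, plf, slf, car):
    """Build the main effects and the seven requested two-way interactions."""
    continuous = {"SPR": spr, "PLF": plf, "SLF": slf}
    dummies = rgf_dummies(rgf)
    row = dict(continuous, CAR=car, **dummies)
    # Each RGF interaction needs one column per dummy variable.
    for d_name, d_value in dummies.items():
        for other in ("SLF", "SPR", "PLF"):
            row[f"{d_name}x{other}"] = d_value * continuous[other]
        row[f"{d_name}xCAR"] = d_value * car
    # CAR is already a 0/1 variable, so each interaction is a single column.
    for other in ("SPR", "PLF", "SLF"):
        row[f"CARx{other}"] = car * continuous[other]
    return row


example = design_row(spr=4.0, rgf="med", plf=0.16, slf=3.0, car=0)
n_columns = len(example)  # 6 main-effect columns + 11 interaction columns
```

With the intercept, these 17 columns give 18 coefficients to estimate from 22 aircraft, which is worth keeping in mind when interpreting the fit.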
c) Attempt to reduce this model by comparing R² and adjusted R² results for the best model of each size. (I recommend using a Best Subsets procedure and plotting your results in Excel.) Is this appropriate? What difficulties do you encounter?
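For reference, the adjusted R-squared penalises the ordinary R-squared for the number of predictors in the model. The sketch below shows the usual formula in Python; the R-squared value of 0.90 is a made-up placeholder, and only n = 22 comes from the task description.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for k predictors (excluding the intercept), n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical R-squared of 0.90 with n = 22 aircraft.
small_model = adjusted_r2(0.90, n=22, k=5)   # 5 predictor columns
large_model = adjusted_r2(0.90, n=22, k=17)  # 17 predictor columns
```

The same R-squared is worth much less once the number of predictor columns approaches the number of observations.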
d) Use your results above and your knowledge of other aspects of regression modelling techniques to explain why the approach of fitting all interactions is inappropriate for these data.
e) Find a suitable model for these data using only the main list of variables (SPR, RGF, PLF, SLF and CAR). Use appropriate methodology to identify this model.
This should include:
• Use of Standard Regression output and appropriate interpretation
• Residual diagnostics
• Influence diagnostics
• Other appropriate diagnostics
• Use of transformations
• Use of selection methods [R², Cp, stepwise, etc.]
• Use of suitable overall approach
f) Use your model to find the predicted FFD of a US aircraft with SPR = 4.0, RGF = medium, PLF = 0.16, SLF = 3.0 and CAR = 0.
Calculate a suitable confidence interval for your prediction.
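As a reminder of the standard results for a linear model fitted by least squares, the two intervals usually considered are sketched below. The notation is an assumption, not taken from the lectures: X is the design matrix with n rows and p columns (including the intercept and any dummy variables), x_0 is the corresponding row for the new aircraft, and RSS is the residual sum of squares. If FFD has been transformed, the interval is obtained on the transformed scale first.

```latex
\hat{y}_0 = x_0^\top \hat{\beta}, \qquad s^2 = \frac{\mathrm{RSS}}{n - p}

% Confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{n-p,\,1-\alpha/2}\; s \sqrt{x_0^\top (X^\top X)^{-1} x_0}

% Prediction interval for a single new aircraft at x_0
\hat{y}_0 \pm t_{n-p,\,1-\alpha/2}\; s \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
```

The prediction interval is always the wider of the two, because it also allows for the variability of an individual aircraft about the mean.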
Portfolio tasks 2 and 3
Task 2 involves using Analysis of Variance to explore the factors which are likely to affect the value of customer transactions made to a bank in the Czech Republic.
You are required to investigate both main effects of the explanatory variables, and their interactions.
To complete it, you will need to refer to material from the third and fourth lectures and the relevant chapters of Field (2013).
Task 3 is a logistic regression exercise, based on lectures 7 and 8, as well as the appropriate chapter from Field (2013).
Task 2 The dataset credit contains data on transactions by account holders at a bank in the Czech Republic. The data are adapted from a dataset provided for a data mining competition held before the third international conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), which took place in Prague in 1999. The variables are:
Tcredit a transformation of the average value of credits made per day
Sex F/M
Second Y if there is a second account holder
N if there is not a second account holder
Loan yes if the account holder has a loan with the bank
no if the account holder does not have a loan with the bank
Card yes if the account holder has a credit card with this bank
no if the account holder does not have a credit card with this bank
Region one of:
Prague; south Bohemia; north Bohemia; west Bohemia; central Bohemia; east Bohemia; north Moravia and south Moravia.
The data are available in the files credit.csv, credit.mtw and credit.sav. There are 4,500 observations.
a) Produce suitable plots to investigate the relationship between each of the explanatory variables and Tcredit. Comment on your results.
b) Fit models in order to predict Tcredit using:
i. The model containing all main effects.
ii. The model containing all main effects and all two-way interactions.
iii. The model containing all main effects, all two-way interactions and all three-way interactions.
iv. The model containing all main effects, all two-way interactions, all three-way interactions and all four-way interactions.
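To get a feel for the size of each of these models, the sketch below counts the degrees of freedom contributed by the terms of each order. It assumes every factor enters as a categorical variable with the number of levels listed in the variable descriptions, and that every combination of levels occurs in the data.

```python
from itertools import combinations
from math import prod

# Number of levels of each explanatory factor, from the variable list.
LEVELS = {"Sex": 2, "Second": 2, "Loan": 2, "Card": 2, "Region": 8}


def df_for_order(order):
    """Total degrees of freedom of all terms of a given order (1 = main effects)."""
    return sum(
        prod(LEVELS[factor] - 1 for factor in combo)
        for combo in combinations(LEVELS, order)
    )


df_by_order = {order: df_for_order(order) for order in range(1, 5)}
# Cumulative totals give the number of coefficients (excluding the intercept)
# in models i. to iv. respectively.
cumulative = [sum(df_by_order[o] for o in range(1, k + 1)) for k in range(1, 5)]
```

Under these assumptions the four models have 11, 45, 91 and 120 coefficients in addition to the intercept.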
c) Carry out suitable tests to compare the models fitted above. In particular, you should compare:
i. model iii) with model iv)
ii. model ii) with model iii)
iii. model i) with model ii)
Hence, explain which of the four models fitted in b) above is the most appropriate.
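Comparisons of nested models of this kind are commonly made with an extra-sum-of-squares F test. The following is a minimal sketch of the arithmetic; the residual sums of squares and degrees of freedom in the example are invented numbers, not results from the credit data.

```python
def nested_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for a reduced model against a fuller model that contains it.

    df_reduced and df_full are residual degrees of freedom.
    Returns (F, numerator df, denominator df).
    """
    extra_df = df_reduced - df_full
    if extra_df <= 0 or df_full <= 0:
        raise ValueError("the full model must use more degrees of freedom")
    f_value = ((rss_reduced - rss_full) / extra_df) / (rss_full / df_full)
    return f_value, extra_df, df_full


# Hypothetical values for illustration only.
f_value, num_df, den_df = nested_f_statistic(
    rss_reduced=120.0, df_reduced=100, rss_full=100.0, df_full=90
)
```

The resulting statistic is referred to an F distribution with the returned numerator and denominator degrees of freedom; your software will normally report the corresponding p-value.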
d) Using the model you selected in c) above, reduce it by removing components one at a time (e.g. one interaction term) as appropriate.
Continue with this reduction until it is inappropriate to further reduce the model.
e) Validate the final model you found in d) above. Produce suitable residual plots/analysis (NOTE: No further analyses, e.g. influence analysis or transformations, are required for these data).
Comment on all your results.
f) Fully discuss all your findings and the appropriateness of your final model.
Task 3 The following data are an extract from a survey exploring the factors that influence the pattern of consumption of psychotropic drugs, published in Murray et al. (1981).
The variables are as follows:
Sex 0 male
1 female
Agegr age group (one of 16-29; 30-44; 45-64; 65-74; >74)
GHQ score on the General Health Questionnaire: 0 low
1 high
Drug number taking drugs
Tot total number in each variable combination
The data are available as psy.csv, psy.mtw and psy.sav.
N.B. For SPSS the data have been stacked in the required form with Drug being the number either taking or not taking the drug and Resp being the response (=1 if taken drugs, = 0 if not taken drugs)
Printouts of the data in the CSV/Minitab and SPSS formats are given in the appendices.
a) Explain why the assumption of a binomial distribution is suitable to use in modelling these data. Also, without any calculation, explain in what way you might expect each of the variables Sex, Agegr and GHQ to affect the dependent variable.
b) We wish to investigate how the variables (i.e. age group, sex and position on general health (GHQ)) may affect the chance of being a drug user. For this purpose you should fit various models to the Drug variable (Resp in SPSS) using appropriate software.
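The relationship between the grouped layout (Drug and Tot) and the stacked SPSS layout (Drug and Resp) can be sketched as follows. The two grouped rows are invented for illustration and are not values from psy.csv; the tuple layout is also an assumption.

```python
def stack_grouped(rows):
    """Convert grouped rows (sex, agegr, ghq, drug, tot) into the stacked layout.

    Each grouped row becomes two stacked rows: one for those taking drugs
    (Resp = 1) and one for those not taking drugs (Resp = 0).
    """
    stacked = []
    for sex, agegr, ghq, drug, tot in rows:
        common = {"Sex": sex, "Agegr": agegr, "GHQ": ghq}
        stacked.append({**common, "Resp": 1, "Drug": drug})
        stacked.append({**common, "Resp": 0, "Drug": tot - drug})
    return stacked


# Hypothetical counts, for illustration only.
grouped = [(0, "16-29", 0, 5, 100), (1, "30-44", 1, 20, 80)]
stacked = stack_grouped(grouped)
```

In the stacked layout the Drug column acts as a frequency weight, so the total of that column equals the total of Tot in the grouped layout.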
You should begin by fitting the following models:
i. A model with all main effects, two-way interactions and three-way interactions (i.e. the saturated model)
ii. A model with all main effects and all two-way interactions
Utilising the output from these models, explain why we can conclude that the three-way interaction is not needed.
c) Starting with model b) ii, remove each two-way interaction (not each individual term) one at a time.
Utilising the output from these models, explain which if any such terms may be removed.
Continue reducing the model until no more terms may be removed.
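Decisions of this kind are usually based on the change in deviance between two nested models, referred to a chi-squared distribution. The sketch below shows the arithmetic under that assumption. The deviance of 3.2 is a made-up value; the critical values are the standard upper 5% points of the chi-squared distribution.

```python
# Upper 5% points of the chi-squared distribution for 1 to 4 degrees of freedom.
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def term_is_needed(dev_reduced, dev_full, df_diff):
    """Return (change in deviance, True if the dropped term is significant at 5%)."""
    change = dev_reduced - dev_full
    return change, change > CHI2_CRIT_5PCT[df_diff]


# The saturated model reproduces the grouped data exactly, so its deviance is 0.
# Dropping Sex x Agegr x GHQ frees (2-1)*(5-1)*(2-1) = 4 degrees of freedom.
# The reduced-model deviance of 3.2 is hypothetical.
change, needed = term_is_needed(dev_reduced=3.2, dev_full=0.0, df_diff=4)
```

A change smaller than the critical value suggests the dropped term can be omitted without a significant loss of fit.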
d) Using the best model that you selected in part c, produce suitable residual plots to investigate the adequacy of your model.
Clearly discuss your findings.
e) Using the best model that you selected in c), interpret clearly all the parameter estimates in the model.
State clearly whether they are as you would expect.
f) Use your model to predict the probability that a female aged 48 who scored high on the GHQ will be a psychotropic drug user.
Comment on the validity of your prediction in this case.
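A fitted logistic model gives a linear predictor on the logit scale, which must be converted back to a probability. The sketch below illustrates that step, assuming a main-effects model. All coefficient values are placeholders chosen purely for illustration, not estimates from the psy data, and the label used for the oldest age group is an assumption.

```python
import math


def age_group(age):
    """Map an age in years onto the Agegr categories used in the survey."""
    if age < 16:
        raise ValueError("the survey categories start at age 16")
    for upper, label in ((29, "16-29"), (44, "30-44"), (64, "45-64"), (74, "65-74")):
        if age <= upper:
            return label
    return ">74"  # assumed label for the oldest group


def inv_logit(eta):
    """Convert a linear predictor on the logit scale into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Placeholder coefficients (hypothetical, NOT fitted values).
intercept, b_female, b_ghq_high = -3.0, 0.5, 1.0
b_age = {"16-29": 0.0, "30-44": 0.8, "45-64": 1.5, "65-74": 1.9, ">74": 2.0}

group = age_group(48)
eta = intercept + b_female * 1 + b_ghq_high * 1 + b_age[group]
probability = inv_logit(eta)
```

Note that the model only sees the age group, not the exact age, which is relevant when commenting on the validity of the prediction.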
Portfolio task 4 - Loglinear modelling
In the real world, you will be expected to communicate the results from a statistical analysis you perform to non-statisticians, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.
All data are in SPSS format. N.B. Minitab doesn't perform loglinear analysis.
Note that in the lecture we followed Field's use of the Model Selection function within Analyze → Loglinear. This allowed us to evaluate the significance of the different levels of interaction available. However, it didn't allow us to refine the model further.
In this coursework you may begin with an overall assessment of the saturated model using this option, but you will need to use the General method to explore and refine the model to produce one that best fits the data.
Statistical Modelling Gary Hearne (adapted from Dr. Theresa Brunsdon)
2014 / 2015 School of Science and Technology
Task 4 A manufacturer of VLSI chips has been acting as exclusive supplier to three computer assembler firms (M1, M2 and M3). Over the last three years, faulty chips in operational machines have been returned via the assemblers to the manufacturer, together with information regarding the usage of the machine (low to average, above average).
Chips tend to fail because of one of two problems: insulation breakdown or circuit fracture. The numbers of returned faulty chips, categorised by type of fault, assembler and machine usage, are recorded as numerically coded variables.
a) Load the data into SPSS. Perform an exploratory analysis on the structure of the data and the relationships within it using chi-squared tests. Carefully note your findings.
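For reference, the Pearson chi-squared statistic that SPSS reports for a two-way table of counts can be reproduced by hand. The sketch below shows the calculation; the 2 x 2 table is hypothetical and is not taken from the chip data.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts, e.g. fault type (rows) by machine usage (columns).
counts = [[10, 20], [20, 10]]
chi2, df = pearson_chi2(counts)
```

Each pairwise test only looks at two variables at a time, which is one reason the later parts of the task move on to loglinear models.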
b) Use Analyze → Loglinear → Model selection to build a saturated model. Interpret the output, in particular the table of K-Way and Higher-Order Effects to determine whether the model is a good fit to the data. Explain your findings.
c) Repeat the procedure, but this time using Analyze → Loglinear → General. Ensure that in the Options menu you select Estimates.
Note that for this procedure, you will not have to declare the value ranges for the variables, but you will have to use the Weight Cases function.
Use the Parameter Estimates table to check that the output matches the result for the Model selection procedure.
d) Repeat the procedure, removing insignificant terms hierarchically until you have the model which best fits the data. Use this model to explain the relationships and interactions within the data.