Analyse the data for multicollinearity


Assignment:

The file Hepatitis SPSS contains information on 152 patients. The goal is to use these data to build a model that predicts whether a patient will be diagnosed with Diabetes or not (class variable Histology = 1 or 0, respectively). The data include four attributes that describe the class variable: Bilirubin, Phosphate, Albumin and Protime.

Split the data into training and test datasets using a 70%:30% ratio.
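
A minimal sketch of the 70%:30% split, assuming the 152 records have already been exported from SPSS into a Python list. The function name train_test_split_rows, the seed value and the placeholder record IDs are illustrative assumptions, not part of the assignment. In SPSS itself the same effect is usually achieved with a random filter variable.

```python
import random


def train_test_split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test).

    A fixed seed keeps the partition reproducible between runs.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder record IDs standing in for the 152 patients (not real data).
patients = list(range(152))
train, test = train_test_split_rows(patients)
print(len(train), len(test))  # 106 46
```

Shuffling before splitting avoids any ordering in the file (for example, records sorted by diagnosis) leaking into one of the two partitions.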

Answer the following questions:

a) Why should the data be partitioned into training and test sets? What will the training set be used for? What will the test set be used for?

b) What level of measurement do the influencing (predictor) variables have?

c) Explore the data set by running descriptive analysis and boxplots (histograms are optional). Based on these methods, describe the data and draw relevant conclusions about missing values, outliers and data distribution. Do the data need cleaning? Why?

d) Are there any missing values? If there are, take action to handle them. Explain what you did.

e) Analyse the data for multicollinearity and report if there are any attributes that may create this problem.

f) Run a logistic regression model on the training data, using the ‘Entry’ attribute selection method (all attributes are included in the model). Interpret the outputs of the regression analysis.

g) Repeat the logistic regression on the training data using a ‘stepwise’ attribute selection method to find the model with the best fit to the data (in SPSS: Backward LR). Explain what this type of attribute selection method does. Which predictors are now used in the logistic regression model?

h) Represent the logistic regression model mathematically and report on the statistical significance of the attributes. Report on the accuracy of the model. What is the danger in the best predictive model that you found using the training data?

i) Using the attributes that were selected for the logistic regression model on the training data, perform the regression analysis on the test data.

j) Report on the final accuracy of the logistic regression model and compare it to the accuracy of the model based on the training data.
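
For question e), one common screen for multicollinearity is the variance inflation factor (VIF). The sketch below uses the fact that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. Everything in it is an assumption made for illustration: the helper names (pearson, correlation_matrix, invert, vif), the cut-off of 5, and above all the data, which are synthetic numbers generated so that one pair of columns is deliberately collinear. They are not values from the Hepatitis file.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Square matrix of pairwise Pearson correlations."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[i][i] for i in range(len(columns))]


# SYNTHETIC stand-ins for the four predictors (not the real patient data).
rng = random.Random(0)
n = 152
bilirubin = [rng.gauss(1.4, 1.0) for _ in range(n)]
phosphate = [rng.gauss(105.0, 50.0) for _ in range(n)]
albumin = [rng.gauss(3.8, 0.6) for _ in range(n)]
# Built from bilirubin on purpose, so this pair shows up as collinear.
protime = [60.0 - 8.0 * b + rng.gauss(0.0, 2.0) for b in bilirubin]

columns = {"Bilirubin": bilirubin, "Phosphate": phosphate,
           "Albumin": albumin, "Protime": protime}
vifs = dict(zip(columns, vif(list(columns.values()))))

for name, value in vifs.items():
    flag = "check" if value > 5 else "ok"  # 5 is a rule of thumb, not a hard limit
    print(f"{name:10s} VIF = {value:6.2f}  ({flag})")
```

A VIF close to 1 suggests a predictor shares little variance with the others; values above roughly 5 to 10 are often treated as a warning sign. In SPSS the comparable figures appear as Tolerance and VIF in the collinearity diagnostics of a linear regression run on the same predictors.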
