Sta3031002 fit a lasso logistic regression ie logistic


Abalone data for Q1

In this problem, we analyze a dataset of 4177 abalone with 8 predictor variables, and try to predict whether an abalone has 9 or more rings. The complete dataset description can be found at https://archive.ics.uci.edu/ml/datasets/Abalone. The variables in the dataset are:

- Sex: nominal variable - takes levels M, F, and I (infant)
- Length: continuous variable (mm) - longest shell measurement
- Diameter: continuous variable (mm) - perpendicular to length
- Height: continuous variable (mm) - with meat in shell
- Whole weight: continuous variable (grams) - whole abalone
- Shucked weight: continuous variable (grams) - weight of meat
- Viscera weight: continuous variable (grams) - gut weight (after bleeding)
- Shell weight: continuous variable (grams) - after being dried
- Rings: integer

We are interested in predicting whether the Rings variable is 9 or more (Rings > 8), so you need to create a binary response based on it:

faba <- read.table("abalone.data",sep=",")

faba$y <- ifelse(faba$V9>8,1,0)

head(faba)

##   V1    V2    V3    V4     V5     V6     V7    V8 V9 y
## 1  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15 1
## 2  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7 0
## 3  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9 1
## 4  M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10 1
## 5  I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7 0
## 6  I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8 0

Ships data for Q2

We are interested in the number of accidents per month for a sample of ships (a classic example given by McCullagh & Nelder, 1989). The data are in the file "ships.csv", which contains 40 observations on 14 variables. The response variable is called ACC. The explanatory variables are:

- TYPE: there are 5 ship types, labelled 1 to 5. Type is a categorical variable, coded by the 5 dummy variables TA, TB, TC, TD, TE.
- CONSTRUCTION YEAR: the ships were constructed in one of four periods, leading to the dummy variables T6064, T6569, T7074, T7579.
- MONTHS: a measure for the amount of service months that the ship has already carried out.

ships <- read.table("ships.csv", header=TRUE, sep=",")
str(ships)

head(ships)

##   TYPE TA TB TC TD TE T6064 T6569 T7074 T7579 O6074 O7579 MONTHS ACC
## 1    1  1  0  0  0  0     1     0     0     0     1     0    127   0
## 2    1  1  0  0  0  0     1     0     0     0     0     1     63   0
## 3    1  1  0  0  0  0     0     1     0     0     1     0   1095   3
## 4    1  1  0  0  0  0     0     1     0     0     0     1   1095   4
## 5    1  1  0  0  0  0     0     0     1     0     1     0   1512   6
## 6    1  1  0  0  0  0     0     0     1     0     0     1   3353  18

Q1. Binary classification of Abalone data.

(1a) We use the first 3133 samples to train the model and the remaining samples as the test set. Show your R code for constructing the training and test data. Find the mean and standard deviation of the continuous variables (V2-V8) in the training set. Standardize all the continuous predictors (V2-V8) in the training set using the formula (X - mean(X)) / sd(X). Then use the training-set mean and sd to standardize the corresponding predictors in the test set.

xtrain <- faba[1:3133, 1:8]
ytrain <- as.factor(faba[1:3133, 10])
xtest <- faba[-(1:3133), 1:8]
ytest <- as.factor(faba[-(1:3133), 10])
# continue to write your code
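
The standardization step described above can be sketched as follows. This is only a minimal illustration on a toy data frame built from three columns of the six rows printed earlier, with an arbitrary 4/2 split; the names toy, mu and sdev are placeholders and not part of the assignment.

```r
# Toy data: columns V2-V4 of the six rows shown by head(faba) (illustration only)
toy <- data.frame(
  V2 = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425),
  V3 = c(0.365, 0.265, 0.420, 0.365, 0.255, 0.300),
  V4 = c(0.095, 0.090, 0.135, 0.125, 0.080, 0.095)
)

# Arbitrary split for this sketch: first 4 rows train, last 2 rows test
train <- toy[1:4, ]
test  <- toy[-(1:4), ]

# Mean and standard deviation are computed on the training rows only
mu   <- colMeans(train)
sdev <- apply(train, 2, sd)

# scale() subtracts `center` and divides by `scale`, column by column;
# the test rows reuse the training statistics
train_std <- scale(train, center = mu, scale = sdev)
test_std  <- scale(test,  center = mu, scale = sdev)

round(colMeans(train_std), 10)
```

Reusing mu and sdev for the test rows keeps test-set information out of the preprocessing, which is why the test columns will generally not have mean 0 and sd 1 exactly.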

(1b) Fit a LASSO logistic regression (i.e., logistic regression with a LASSO penalty) using glmnet. Use 10-fold cross-validation to choose the optimal value of the regularization parameter λ; show your R code and print the optimal λ obtained from the cross-validation. Then predict on both the training and the test set, and for each print the confusion matrix and report the mean error rate (fraction of incorrect labels).

# Training the model on the standardized training set
# alpha=0 for ridge penalty; alpha=1 for the LASSO penalty
library(glmnet)

# .....
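
A cross-validated LASSO fit could look roughly like the sketch below. It runs on simulated data rather than the abalone file, so the objects x, y, train and cvfit are stand-ins chosen for illustration; the glmnet calls used are cv.glmnet and its predict method.

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for the standardized predictors (not the real abalone data)
n <- 300; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- factor(rbinom(n, 1, plogis(1.5 * x[, 1] - x[, 2])))
train <- 1:200

# alpha = 1 selects the LASSO penalty; nfolds = 10 gives 10-fold cross-validation
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                   alpha = 1, nfolds = 10)
cvfit$lambda.min   # lambda with the smallest cross-validated deviance

# Predicted class labels at the selected lambda
pred_train <- as.vector(predict(cvfit, newx = x[train, ],
                                s = "lambda.min", type = "class"))
pred_test  <- as.vector(predict(cvfit, newx = x[-train, ],
                                s = "lambda.min", type = "class"))

# Confusion matrices and mean error rates
table(predicted = pred_train, actual = y[train])
table(predicted = pred_test,  actual = y[-train])
err_train <- mean(pred_train != as.character(y[train]))
err_test  <- mean(pred_test  != as.character(y[-train]))
c(train = err_train, test = err_test)
```

glmnet expects a numeric matrix, so a factor column such as V1 would first have to be expanded into dummy variables, for example with model.matrix.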

(1c) Plot the receiver operating characteristic (ROC) curve on the test data. Use the ROCR package to compute the ROC curve and ggplot2 to plot it. Report the area under the ROC curve (AUC).
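
As a rough sketch of the ROCR and ggplot2 workflow, the code below computes and plots a ROC curve for simulated labels and scores. The vectors labels and probs are placeholders; with a fitted glmnet model the scores would instead be the predicted probabilities on the test set.

```r
library(ROCR)
library(ggplot2)

set.seed(1)
# Simulated test-set labels and predicted probabilities (placeholders)
labels <- rbinom(200, 1, 0.5)
probs  <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, -1)))

# ROCR: build a prediction object, then extract TPR vs FPR and the AUC
pred <- prediction(probs, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc  <- performance(pred, measure = "auc")@y.values[[1]]

# Collect the curve in a data frame so ggplot2 can draw it
roc_df <- data.frame(fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
roc_plot <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate",
       title = sprintf("ROC curve (AUC = %.3f)", auc))
print(roc_plot)
```

The dashed diagonal marks a classifier that guesses at random, which corresponds to an AUC of 0.5.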

(1d) Plot the ROC curve on the test data for the model fitted with the ridge penalty, and report its area under the ROC curve (AUC).

Q2. Analysis of ships data.

(2a) Make a histogram of the variable ACC. Comment on its form.

ships=read.table("ships.csv",header=T,sep=",")

# ...

Comments:
. . .
(2b) Estimate the Poisson regression model including all explanatory variables and a constant term. Show your R code and summary output, and comment on the coefficient for the variable MONTHS: is it significant?
Be careful when fitting the Poisson model. Note that if you include all the type (TA-TE) and construction-year (T6064-T7579) dummy variables, an error message would be generated and no estimation would be performed. To avoid this, use TA as the reference category for type and T6064 as the reference category for construction year.

ships=read.table("ships.csv",header=T,sep=",")
options(scipen=5)

#...
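
One way to fit the model with reference categories is sketched below on simulated data that only mimics the dummy-variable layout described above; the data frame sim and its coefficients are invented for illustration, and the operation-period columns are left out as an assumption of this sketch.

```r
set.seed(1)
# Simulated stand-in with the same dummy layout as ships.csv (not the real data)
type   <- rep(c("A", "B", "C", "D", "E"), each = 8)
period <- rep(c("6064", "6569", "7074", "7579"), times = 10)
sim <- data.frame(
  TA = as.integer(type == "A"), TB = as.integer(type == "B"),
  TC = as.integer(type == "C"), TD = as.integer(type == "D"),
  TE = as.integer(type == "E"),
  T6064 = as.integer(period == "6064"), T6569 = as.integer(period == "6569"),
  T7074 = as.integer(period == "7074"), T7579 = as.integer(period == "7579"),
  MONTHS = round(runif(40, 50, 3000))
)
sim$ACC <- rpois(40, exp(0.5 + 0.4 * sim$TB + 0.0003 * sim$MONTHS))

# TA and T6064 are omitted from the formula, so they act as reference categories
fit <- glm(ACC ~ TB + TC + TD + TE + T6569 + T7074 + T7579 + MONTHS,
           family = poisson, data = sim)

# Estimate, standard error, z value and p-value for MONTHS
summary(fit)$coefficients["MONTHS", ]
```

Each remaining dummy coefficient is then read as a log rate ratio relative to a type-A ship built in 1960-64.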

Comments on the coefficient for the variable MONTHS:
. . .
(2c) Perform a Wald test for the joint significance of all the type dummy variables. Specify H0 and Ha, and state your conclusion.
#....
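
A joint Wald test of this kind takes H0: all selected coefficients are zero, against Ha: at least one is nonzero, and uses W = b' V^{-1} b, which is approximately chi-square with as many degrees of freedom as tested coefficients. The helper wald_joint below is a hypothetical function written for this sketch, demonstrated on R's built-in warpbreaks data rather than the ships data.

```r
# Joint Wald test that the coefficients named in `terms` are all zero
wald_joint <- function(fit, terms) {
  b  <- coef(fit)[terms]                        # estimated coefficients
  V  <- vcov(fit)[terms, terms, drop = FALSE]   # their covariance matrix
  W  <- as.numeric(t(b) %*% solve(V) %*% b)     # Wald statistic b' V^{-1} b
  df <- length(b)
  c(statistic = W, df = df,
    p.value = pchisq(W, df = df, lower.tail = FALSE))
}

# Demonstration on a built-in dataset (not the ships data):
# test whether the two tension dummies are jointly zero
demo_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
res <- wald_joint(demo_fit, c("tensionM", "tensionH"))
res
```

For a ships model that uses TA as the reference category, the tested terms would be the four remaining type dummies, for example wald_joint(fit, c("TB", "TC", "TD", "TE")).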

(2d) Consider a ship of type TA, constructed in the period 1965-69, with MONTHS = 1000. Predict the number of accidents per month. Also estimate (1) the probability that no accidents will occur for this ship, and (2) the probability that at least two accidents will occur.

#..
# prob of (1)
#..
# prob (2)
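
Once a predicted mean count is available, both probabilities follow directly from the Poisson distribution. In the sketch below the value of mu is a made-up placeholder; in practice it would come from predict(..., type = "response") on the fitted model.

```r
# Hypothetical predicted mean count (placeholder, not an estimate from the data)
mu <- 1.5

# (1) P(Y = 0) = exp(-mu)
p_none <- dpois(0, lambda = mu)

# (2) P(Y >= 2) = 1 - P(Y = 0) - P(Y = 1) = 1 - P(Y <= 1)
p_two_plus <- 1 - ppois(1, lambda = mu)

c(p_none = p_none, p_two_plus = p_two_plus)
```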

Q3. Analysis of 3-way contingency table

 

 

You investigate the relationship between serum cholesterol (C), gender (G) and heart disease (H), and acquire the following data.

                         Heart disease
Gender    Cholesterol    Yes      No      Total
Male      High            16     256        272
          Low             28    2897       2925
Female    High            13     319        332
          Low             23    2565       2588
Total                     80    6037       6117

(3a) State the loglinear model that contains only the main effects of the three characteristics on the expected counts. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with heart disease), under this model.
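
A main-effects loglinear model assumes that G, C and H are mutually independent, so each fitted count is n times the product of the three marginal proportions. The sketch below enters the table by hand and compares a Poisson glm fit with that closed form; the object names tab and m_ind are chosen for illustration.

```r
# Counts from the table; expand.grid varies H fastest, then C, then G
tab <- expand.grid(H = c("Yes", "No"), C = c("High", "Low"),
                   G = c("Male", "Female"))
tab$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

# Main-effects model: log(mu) = lambda + lambda_G + lambda_C + lambda_H
m_ind <- glm(count ~ G + C + H, family = poisson, data = tab)
fit_glm <- unname(fitted(m_ind)[1])   # row 1 is (Male, High, Yes)

# Closed form under mutual independence: n(Male) * n(High) * n(Yes) / n^2
n <- sum(tab$count)
fit_closed <- with(tab, sum(count[G == "Male"]) * sum(count[C == "High"]) *
                        sum(count[H == "Yes"])) / n^2

c(glm = fit_glm, closed_form = fit_closed)
```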

(3b) State the loglinear model that contains all the main effects, an interaction between Cholesterol and Gender, and an interaction between Cholesterol and Heart disease. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with heart disease), under this model.
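
With the C:G and C:H interactions but no G:H term, the model treats gender and heart disease as conditionally independent given cholesterol, and the fitted counts have the closed form n(C,G) * n(C,H) / n(C). A minimal sketch, with the table entered by hand and illustrative object names:

```r
# Counts from the table; expand.grid varies H fastest, then C, then G
tab <- expand.grid(H = c("Yes", "No"), C = c("High", "Low"),
                   G = c("Male", "Female"))
tab$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

# Main effects plus the C:G and C:H interactions
m_cond <- glm(count ~ G + C + H + C:G + C:H, family = poisson, data = tab)
fit_glm <- unname(fitted(m_cond)[1])   # row 1 is (Male, High, Yes)

# Closed form for the (Male, High, Yes) cell
n_CG <- with(tab, sum(count[C == "High" & G == "Male"]))  # high-cholesterol males
n_CH <- with(tab, sum(count[C == "High" & H == "Yes"]))   # high cholesterol with disease
n_C  <- with(tab, sum(count[C == "High"]))                # all high cholesterol
fit_closed <- n_CG * n_CH / n_C

c(glm = fit_glm, closed_form = fit_closed)
```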

Which of the models in (3a) and (3b) is better? Base your conclusion on AIC and a likelihood ratio test.
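
Because the main-effects model is nested in the model with interactions, the two can be compared with AIC and with a likelihood ratio test on the difference in deviance. The sketch below assumes the table is entered by hand; m_a and m_b are illustrative names for the two fits.

```r
# Counts from the table; expand.grid varies H fastest, then C, then G
tab <- expand.grid(H = c("Yes", "No"), C = c("High", "Low"),
                   G = c("Male", "Female"))
tab$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

m_a <- glm(count ~ G + C + H, family = poisson, data = tab)
m_b <- glm(count ~ G + C + H + C:G + C:H, family = poisson, data = tab)

# Smaller AIC indicates the preferred model
aic_tab <- AIC(m_a, m_b)
aic_tab

# Likelihood ratio test: deviance difference on 2 degrees of freedom
lrt <- anova(m_a, m_b, test = "Chisq")
lrt
```

A small p-value in the anova table would indicate that the two interaction terms improve the fit beyond what the main effects alone achieve.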
