Solved: Sta3031002 fit a lasso logistic regression ie logistic, Applied Statistics


Abalone data for Q1

In this problem, we analyze a dataset of 4177 subjects with 8 predictor variables, and try to predict whether the number of rings of an abalone is greater than 9. The complete dataset description can be found at https://archive.ics.uci.edu/ml/datasets/Abalone. The variables in the dataset are:

- Sex: nominal variable - takes levels M, F, and I (infant)
- Length: continuous variable (mm) - longest shell measurement
- Diameter: continuous variable (mm) - perpendicular to length
- Height: continuous variable (mm) - with meat in shell
- Whole weight: continuous variable (grams) - whole abalone
- Shucked weight: continuous variable (grams) - weight of meat
- Viscera weight: continuous variable (grams) - gut weight (after bleeding)
- Shell weight: continuous variable (grams) - after being dried
- Rings: integer

We are interested in predicting whether the rings variable is greater than 9. So you need to create the binary response based on it:

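# Read the raw data; the file has no header, so columns are named V1..V9
faba <- read.table("abalone.data", sep = ",")

# Binary response: y = 1 when Rings (V9) > 8, i.e. Rings >= 9
faba$y <- ifelse(faba$V9 > 8, 1, 0)

head(faba)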

##		V1	V2	V3	V4	V5	V6	V7	V8	V9	y
##	1	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15	1
##	2	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7	0
##	3	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9	1
##	4	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10	1
##	5	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7	0
##	6	I	0.425	0.300	0.095	0.3515	0.1410	0.0775	0.120	8	0

Ships data for Q2

We are interested in the number of accidents per month for a sample of ships (a classic example given by McCullagh & Nelder, 1989). The data are in the file "ships.csv", which contains 40 observations on 14 variables. The response variable is called ACC. The explanatory variables are:

- TYPE: there are 5 ship types, labelled 1 to 5. TYPE is a categorical variable, coded by the 5 dummy variables TA, TB, TC, TD, TE.
- CONSTRUCTION YEAR: the ships were constructed in one of four periods, leading to the dummy variables T6064, T6569, T7074, T7579.
- MONTHS: a measure of the number of service months that the ship has already carried out.

ships <- read.table("ships.csv", header = TRUE, sep = ",")
str(ships)

head(ships)

##		TYPE	TA	TB	TC	TD	TE	T6064	T6569	T7074	T7579	O6074	O7579	MONTHS	ACC
##	1	1	1	0	0	0	0	1	0	0	0	1	0	127	0
##	2	1	1	0	0	0	0	1	0	0	0	0	1	63	0
##	3	1	1	0	0	0	0	0	1	0	0	1	0	1095	3
##	4	1	1	0	0	0	0	0	1	0	0	0	1	1095	4
##	5	1	1	0	0	0	0	0	0	1	0	1	0	1512	6
##	6	1	1	0	0	0	0	0	0	1	0	0	1	3353	18

Q1. Binary classification of Abalone data.

(1a) We use the first 3133 samples to train the model, and the rest as the test set. Show your R code to get the training data and testing data. Find the mean and standard error of the continuous variables (V2-V8). Standardize all the continuous predictors (V2-V8) in the training set using the formula (X - mean(X))/sd(X). Use the mean and sd from the training set to standardize the corresponding predictors in the testing set.

xtrain <- faba[1:3133, 1:8]
ytrain <- as.factor(faba[1:3133, 10])
xtest <- faba[-c(1:3133), 1:8]
ytest <- as.factor(faba[-c(1:3133), 10])
# continue to write your code
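A minimal sketch of the standardization step, assuming columns V2 to V8 are the continuous predictors as described above. So that it runs without `abalone.data`, it uses a small synthetic data frame; `standardize_with_train` is a hypothetical helper name, not something the assignment prescribes.

```r
# Hypothetical helper: scale the chosen columns of train and test using the
# TRAINING mean and sd only, so no information from the test set is used.
standardize_with_train <- function(train, test, cols) {
  mu  <- sapply(train[cols], mean)
  sdv <- sapply(train[cols], sd)
  for (v in cols) {
    train[[v]] <- (train[[v]] - mu[[v]]) / sdv[[v]]
    test[[v]]  <- (test[[v]]  - mu[[v]]) / sdv[[v]]
  }
  list(train = train, test = test, mean = mu, sd = sdv)
}

# Synthetic toy data (NOT the abalone data), only to show the call
set.seed(1)
toy <- data.frame(V1 = sample(c("M", "F", "I"), 10, replace = TRUE),
                  V2 = runif(10), V3 = runif(10))
tr <- toy[1:7, ]
te <- toy[-(1:7), ]

out <- standardize_with_train(tr, te, c("V2", "V3"))
out$mean                                  # training means used for both sets
out$sd                                    # training sds used for both sets
colMeans(out$train[c("V2", "V3")])        # numerically zero after scaling
```

With the assignment's objects the call would presumably be `standardize_with_train(xtrain, xtest, paste0("V", 2:8))`. The test columns will generally not have mean 0 and sd 1 exactly, which is expected.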

(1b) Fit a LASSO logistic regression (i.e., logistic regression with a LASSO penalty) model using glmnet. Use 10-fold cross-validation to choose the optimal value of the regularizer, show your R code and print the optimal λ obtained from the cross-validation. Predict on the training and testing sets, and for each print the confusion matrix and report the mean error rate (fraction of incorrect labels).

# Training the model on the standardized training set
# alpha=0 for ridge penalty; alpha=1 for the LASSO penalty
library(glmnet)

# .....
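One way the cross-validated LASSO fit might look is sketched below. It assumes the glmnet package is installed and runs on synthetic data rather than the abalone file; the column names `sex`, `x1`, `x2`, `x3` are made up for illustration.

```r
library(glmnet)

set.seed(1)
# Synthetic stand-in for standardized predictors (NOT the abalone data)
n <- 400
dat <- data.frame(sex = factor(sample(c("M", "F", "I"), n, replace = TRUE)),
                  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2)))

# glmnet needs a numeric matrix, so expand the factor into dummy columns
X <- model.matrix(y ~ ., data = dat)[, -1]
train <- 1:300

# 10-fold CV over the lambda path; alpha = 1 selects the LASSO penalty
cvfit <- cv.glmnet(X[train, ], dat$y[train], family = "binomial",
                   alpha = 1, nfolds = 10)
cvfit$lambda.min   # lambda minimizing the CV loss (binomial deviance by default)

# Class predictions at the selected lambda, confusion matrix, error rate
pred_test <- as.vector(predict(cvfit, newx = X[-train, ],
                               s = "lambda.min", type = "class"))
actual    <- as.character(dat$y[-train])
conf_test <- table(predicted = pred_test, actual = actual)
conf_test
err_test <- mean(pred_test != actual)
err_test
```

The same `predict` call with `newx = X[train, ]` gives the training confusion matrix and error rate. Whether to report `lambda.min` or the more conservative `lambda.1se` is a judgment call; the sketch uses `lambda.min`.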

(1c) Plot the receiver operating characteristic (ROC) curve on the test data. Use package ROCR to get the ROC curve and use ggplot2 to plot the ROC curves. Report the area under the ROC curve (AUC).
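The ROCR and ggplot2 steps can be sketched as follows, assuming both packages are installed. The scores and labels here are synthetic placeholders; in the assignment the scores would be the predicted probabilities for the test set, for example from glmnet's `predict(..., type = "response")`.

```r
library(ROCR)
library(ggplot2)

set.seed(1)
# Synthetic labels and scores standing in for test-set predictions
labels <- rbinom(200, 1, 0.5)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, -1)))

# ROCR: build a prediction object, then extract TPR/FPR pairs and the AUC
pred <- prediction(scores, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc  <- performance(pred, measure = "auc")@y.values[[1]]

# ggplot2: plot the curve with the chance diagonal for reference
roc_df <- data.frame(fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
p <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate",
       title = sprintf("ROC curve (AUC = %.3f)", auc))
print(p)
```

To overlay several curves, the per-model data frames can be stacked with an extra column naming the model and mapped to `colour` in `aes()`.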

(1d) Plot the receiver operating characteristic (ROC) curve on the test data using the ridge penalty (alpha=0). Also, report the area under the ROC curve (AUC).

Q2. Analysis of ships data.

(2a) Make a histogram of the variable ACC. Comment on its form.

ships=read.table("ships.csv",header=T,sep=",")

# ...

Comments:
. . .
(2b) Estimate the Poisson regression model including all explanatory variables and a constant term. Show your R code and summary output, and comment on the coefficient for the variable MONTHS: is it significant?
Be careful when fitting the Poisson model. Note that if you include all the type (TA-TE) and year (T6569-T7579) dummy variables, an error message would be generated, and no estimation would be performed. To avoid this, TA was chosen as the reference category for type, and T6064 as the reference category for construction year.

ships=read.table("ships.csv",header=T,sep=",")
options(scipen=5)

#...

Comments on the coefficient for the variable MONTHS:
. . .
(2c) Perform a Wald test for the joint significance of all the type dummy variables. Specify H0 and Ha, and state your conclusion.
#....
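With TA as the reference category, the hypotheses are H0: the coefficients of TB, TC, TD and TE are all zero, against Ha: at least one of them is nonzero. The sketch below computes the Wald statistic by hand in base R. It runs on synthetic data that only borrows the dummy-variable names from ships.csv, and `wald_joint` is a hypothetical helper name.

```r
set.seed(1)
# Synthetic data reusing the ships.csv dummy names (NOT the real data)
n <- 200
type <- sample(1:5, n, replace = TRUE)
sim <- data.frame(TB = as.integer(type == 2), TC = as.integer(type == 3),
                  TD = as.integer(type == 4), TE = as.integer(type == 5),
                  MONTHS = runif(n, 50, 3000))
sim$ACC <- rpois(n, exp(0.2 + 0.8 * sim$TB - 0.5 * sim$TC + 0.0003 * sim$MONTHS))

# Poisson regression with type 1 (TA) as the omitted reference category
fit <- glm(ACC ~ TB + TC + TD + TE + MONTHS, family = poisson, data = sim)

# Wald statistic W = b' V^{-1} b for H0: all coefficients in `vars` are zero.
# Under H0, W is approximately chi-square with length(vars) degrees of freedom.
wald_joint <- function(fit, vars) {
  b <- coef(fit)[vars]
  V <- vcov(fit)[vars, vars]
  W <- as.numeric(t(b) %*% solve(V) %*% b)
  c(statistic = W, df = length(vars),
    p.value = pchisq(W, df = length(vars), lower.tail = FALSE))
}

res <- wald_joint(fit, c("TB", "TC", "TD", "TE"))
res
```

H0 is rejected at level 0.05 when the p-value falls below 0.05. On the real data the model would also contain the construction-year dummies, which does not change how `wald_joint` is called.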

(2d) Consider a ship of category TA, constructed in the period 65-69, with MONTHS=1000. Predict the number of accidents per month. Also, estimate (1) the probability that no accidents will occur for this ship, and (2) the probability that at least two accidents will occur.

#..
# prob of (1)
#..
# prob (2)
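Once the Poisson mean for this ship has been predicted, both probabilities follow from the Poisson distribution: P(Y = 0) = exp(-mu) and P(Y >= 2) = 1 - P(Y <= 1). The sketch below uses a hypothetical value of `mu` purely for illustration; it is not the fitted value from the ships data.

```r
# mu is a HYPOTHETICAL predicted mean. In the assignment it would come from
# predict(<fitted Poisson model>, newdata = <TA, T6569 = 1, MONTHS = 1000>,
#         type = "response")
mu <- 2.5

p_none      <- dpois(0, lambda = mu)       # P(Y = 0)  = exp(-mu)
p_at_least2 <- 1 - ppois(1, lambda = mu)   # P(Y >= 2) = 1 - P(Y <= 1)

c(p_none = p_none, p_at_least2 = p_at_least2)
```

For the ship described in the question, the `newdata` row would set TB, TC, TD, TE, T7074 and T7579 to 0, T6569 to 1, and MONTHS to 1000.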

Q3. Analysis of 3-way contingency table

You investigate the relationship between serum cholesterol (C), gender (G) and heart disease (H), and acquire the following data.

                        Heart disease
Gender   Cholesterol    Yes     No      Total
Male     High           16      256     272
Male     Low            28      2897    2925
Female   High           13      319     332
Female   Low            23      2565    2588
         Total          80      6037    6117

(3a) State the loglinear model that only expresses the main effects of the three characteristics on the expected counts. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with the disease), according to the model.
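Under the main-effects model, log mu = lambda + lambda_G + lambda_C + lambda_H, the three factors are mutually independent, so each fitted count is the product of the three one-way margins divided by N^2. A sketch of that calculation with the counts from the table, cross-checked against a Poisson `glm` fit (the object names are illustrative):

```r
# Counts from the table (row order: Male/High, Male/Low, Female/High, Female/Low)
tab <- data.frame(
  G = c("Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  C = c("High", "High", "Low", "Low", "High", "High", "Low", "Low"),
  H = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  n = c(16, 256, 28, 2897, 13, 319, 23, 2565)
)
N <- sum(tab$n)                            # 6117

# Mutual independence: mu_gch = n_g++ * n_+c+ * n_++h / N^2
n_male <- sum(tab$n[tab$G == "Male"])      # 3197
n_high <- sum(tab$n[tab$C == "High"])      # 604
n_yes  <- sum(tab$n[tab$H == "Yes"])       # 80
mu_hat <- n_male * n_high * n_yes / N^2
mu_hat                                     # about 4.13

# Cross-check: the Poisson loglinear fit reproduces the same fitted value
fit_ind <- glm(n ~ G + C + H, family = poisson, data = tab)
fitted(fit_ind)[1]
```

The closed form and the `glm` fit agree because the independence model has explicit maximum likelihood estimates given by the margins.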

(3b) State the loglinear model that expresses all the main effects, and also an interaction between Cholesterol and Gender, and an interaction between Cholesterol and Heart disease. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with the disease), according to the model.
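The model with C:G and C:H interactions, often written (CG, CH), says that gender and heart disease are conditionally independent given cholesterol. Its fitted counts have the closed form mu_gch = n_gc+ * n_+ch / n_+c+. A sketch with the counts from the table, again cross-checked with `glm` (object names are illustrative):

```r
# Counts from the table (row order: Male/High, Male/Low, Female/High, Female/Low)
tab <- data.frame(
  G = c("Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  C = c("High", "High", "Low", "Low", "High", "High", "Low", "Low"),
  H = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  n = c(16, 256, 28, 2897, 13, 319, 23, 2565)
)

# Conditional independence of G and H given C:
# mu_gch = n_gc+ * n_+ch / n_+c+
n_male_high <- sum(tab$n[tab$G == "Male" & tab$C == "High"])   # 272
n_high_yes  <- sum(tab$n[tab$C == "High" & tab$H == "Yes"])    # 29
n_high      <- sum(tab$n[tab$C == "High"])                     # 604
mu_hat <- n_male_high * n_high_yes / n_high
mu_hat                                                         # about 13.06

# Cross-check against the Poisson loglinear fit
fit_cgch <- glm(n ~ G + C + H + C:G + C:H, family = poisson, data = tab)
fitted(fit_cgch)[1]
```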

For the models in (3a) and (3b), which one is better? Base your conclusion on AIC and a likelihood ratio test.
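Because the main-effects model is nested in the model with the two interactions, the two can be compared with a likelihood ratio test on 2 degrees of freedom as well as by AIC. A sketch using base R's `glm`, `AIC` and `anova`; the object names `fit_a` and `fit_b` are illustrative:

```r
# Counts from the table (row order: Male/High, Male/Low, Female/High, Female/Low)
tab <- data.frame(
  G = c("Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  C = c("High", "High", "Low", "Low", "High", "High", "Low", "Low"),
  H = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  n = c(16, 256, 28, 2897, 13, 319, 23, 2565)
)

fit_a <- glm(n ~ G + C + H, family = poisson, data = tab)              # main effects only
fit_b <- glm(n ~ G + C + H + C:G + C:H, family = poisson, data = tab)  # adds C:G and C:H

# Smaller AIC indicates the preferred model
AIC(fit_a, fit_b)

# Likelihood ratio test: G^2 = deviance(fit_a) - deviance(fit_b), df = 2
lrt_stat <- deviance(fit_a) - deviance(fit_b)
lrt_df   <- df.residual(fit_a) - df.residual(fit_b)
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
c(statistic = lrt_stat, df = lrt_df, p.value = lrt_p)

# The same test via anova()
anova(fit_a, fit_b, test = "Chisq")
```

A small p-value means the interaction terms significantly improve the fit, in which case the larger model is preferred; the AIC comparison should be read alongside it.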
