STA Homework
1. This is a basic illustration of using Bootstrap for inference.
(i) Generate a random sample of size n = 100 following the univariate regression model
Yi = -5 + 2Xi + εi
where the Xi's are independent chi-square random variables with 6 degrees of freedom, and the εi's are i.i.d. N(0, σ^2) with σ = 1. Fix a random seed so that the results are reproducible.
(ii) Fit the least squares regression line to the data and obtain estimates of (β0, β1, σ^2).
(iii) Obtain resampling-based 95% confidence intervals for β0 and β1 using a parametric (i.e., residual-based) bootstrap procedure with 400 bootstrap replicates.
(iv) How do the confidence intervals in (iii) compare with the theoretical confidence intervals for β0 and β1? [To compare the accuracy of the confidence intervals, repeat the procedure in steps (i)-(iii) 10 times (using a different random seed for each simulation run) and report the average lengths of the bootstrap confidence intervals and of the corresponding theoretical confidence intervals.]
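One possible reading of steps (i)-(iv) is sketched below in Python, using only the standard library. It is an illustration under stated assumptions, not a prescribed solution: the bootstrap resamples the fitted residuals with replacement while holding the Xi's fixed (a fully parametric variant would instead draw new errors from N(0, estimated σ^2)), the intervals are simple percentile intervals, the theoretical intervals are the usual t-based ones, and the t quantile for 98 degrees of freedom is hard-coded as an approximation. All function and variable names are hypothetical.

```python
import math
import random
import statistics

N = 100              # sample size from part (i)
T_975_DF98 = 1.9845  # assumed approximate t quantile for N - 2 = 98 df


def simulate(rng, n=N):
    """Draw (x, y) with Y = -5 + 2X + eps, X ~ chi-square(6), eps ~ N(0, 1)."""
    # A chi-square(6) variable is a Gamma variable with shape 3 and scale 2.
    x = [rng.gammavariate(3.0, 2.0) for _ in range(n)]
    y = [-5.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


def ols(x, y):
    """Least squares fit: estimates, standard errors and residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)  # unbiased estimate of sigma^2
    se0 = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    se1 = math.sqrt(s2 / sxx)
    return {"b0": b0, "b1": b1, "s2": s2, "se0": se0, "se1": se1, "resid": resid}


def bootstrap_cis(x, fit, rng, n_boot=400):
    """Percentile 95% intervals for (b0, b1) from a residual bootstrap."""
    b0_star, b1_star = [], []
    for _ in range(n_boot):
        # Resample residuals with replacement and rebuild the responses.
        e_star = rng.choices(fit["resid"], k=len(x))
        y_star = [fit["b0"] + fit["b1"] * xi + ei for xi, ei in zip(x, e_star)]
        refit = ols(x, y_star)
        b0_star.append(refit["b0"])
        b1_star.append(refit["b1"])

    def percentile_ci(values):
        cuts = statistics.quantiles(values, n=40)  # cut points at 2.5%, ..., 97.5%
        return cuts[0], cuts[-1]

    return percentile_ci(b0_star), percentile_ci(b1_star)


def theoretical_cis(fit, tq=T_975_DF98):
    """t-based 95% intervals for (b0, b1); tq is only valid for 98 df."""
    return tuple((fit[b] - tq * fit[se], fit[b] + tq * fit[se])
                 for b, se in (("b0", "se0"), ("b1", "se1")))


def average_lengths(n_runs=10):
    """Repeat steps (i)-(iii) with seeds 1..n_runs and average the CI lengths."""
    totals = {"boot_b0": 0.0, "boot_b1": 0.0, "theory_b0": 0.0, "theory_b1": 0.0}
    for seed in range(1, n_runs + 1):
        rng = random.Random(seed)
        x, y = simulate(rng)
        fit = ols(x, y)
        boot = bootstrap_cis(x, fit, rng)
        theory = theoretical_cis(fit)
        for name, (lo, hi) in zip(("boot_b0", "boot_b1"), boot):
            totals[name] += hi - lo
        for name, (lo, hi) in zip(("theory_b0", "theory_b1"), theory):
            totals[name] += hi - lo
    return {name: total / n_runs for name, total in totals.items()}


if __name__ == "__main__":
    for name, length in average_lengths().items():
        print(f"{name}: {length:.3f}")
```

Seeding a separate random.Random instance per run keeps each simulation reproducible on its own. In R the same steps would typically use rchisq, rnorm, lm, sample and qt.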
2. In this example, compare the k-NN classification method, linear discriminant analysis, and logistic regression on a two-class classification problem. For this, consider the iris data available in R.
(i) Extract the data corresponding to flower types setosa and versicolor, numbering a total of 100 flowers. Set aside the last 10 measurements for each flower type as test data and use the remaining data consisting of 80 measurements as training data.
(ii) Fit a logistic regression model to the training data, using the variable Sepal.Length as the predictor. Obtain the estimates of the model parameters. Compute the confusion matrix for the test data set.
(iii) Compute the decision boundary for linear discriminant analysis, using Sepal.Length as the predictor variable. Compute the confusion matrix for the test data set.
(iv) Use the k-nearest neighbors classification method with k = 3, 4, 5, again using Sepal.Length as the predictor variable. In each case, compute the confusion matrix for the test data set.
(v) Write a very brief summary of the comparative performance of different classification procedures.
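The single-predictor versions of the LDA boundary in (iii), the k-NN rule in (iv), and the confusion matrix can be sketched in Python with the standard library alone. This is a rough illustration, not a reference solution: logistic regression is omitted, the function names are hypothetical, and the tie-breaking rule for even k (the vote goes to the class of the nearer neighbor) is an assumption that may differ from what an R package does. The numbers in the demo are invented for illustration and are not taken from the iris data.

```python
import math
from collections import Counter


def lda_boundary(x, labels, class_a, class_b):
    """Threshold where the two 1-D LDA discriminant functions are equal.

    Assumes a common (pooled) variance, priors equal to the class
    proportions in the training data, and distinct class means.
    """
    xa = [xi for xi, lab in zip(x, labels) if lab == class_a]
    xb = [xi for xi, lab in zip(x, labels) if lab == class_b]
    mu_a, mu_b = sum(xa) / len(xa), sum(xb) / len(xb)
    ss = sum((xi - mu_a) ** 2 for xi in xa) + sum((xi - mu_b) ** 2 for xi in xb)
    pooled_var = ss / (len(xa) + len(xb) - 2)
    log_prior_ratio = math.log(len(xb) / len(xa))  # log(prior_b / prior_a)
    # With equal priors this reduces to the midpoint of the class means.
    return (mu_a + mu_b) / 2 + pooled_var * log_prior_ratio / (mu_a - mu_b)


def lda_predict(x0, boundary, low_class, high_class):
    """Assign the class with the smaller mean to values below the boundary."""
    return low_class if x0 < boundary else high_class


def knn_predict(train_x, train_y, x0, k):
    """Majority vote among the k training points closest to x0."""
    # sorted() is stable, so equal distances keep their training order.
    neighbours = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    votes = Counter(lab for _, lab in neighbours)
    # Assumed tie rule: equal vote counts go to the class met first,
    # which is the class of the nearer neighbour.
    return votes.most_common(1)[0][0]


def confusion_matrix(actual, predicted, classes):
    """Counts keyed by (actual class, predicted class)."""
    table = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table


if __name__ == "__main__":
    # Invented sepal lengths for illustration only (NOT the iris measurements).
    classes = ["setosa", "versicolor"]
    train_x = [4.8, 5.0, 5.1, 5.3, 5.6, 5.9, 6.1, 6.4]
    train_y = ["setosa"] * 4 + ["versicolor"] * 4
    test_x = [5.0, 5.4, 5.7, 6.2]
    test_y = ["setosa", "versicolor", "versicolor", "versicolor"]

    boundary = lda_boundary(train_x, train_y, *classes)
    lda_pred = [lda_predict(x0, boundary, *classes) for x0 in test_x]
    print("LDA boundary:", round(boundary, 3))
    print("LDA:", confusion_matrix(test_y, lda_pred, classes))
    for k in (3, 4, 5):
        knn_pred = [knn_predict(train_x, train_y, x0, k) for x0 in test_x]
        print(f"k = {k}:", confusion_matrix(test_y, knn_pred, classes))
```

Because the training set in (i) has 40 flowers of each type, the prior term vanishes and the LDA boundary is simply the midpoint of the two class means of Sepal.Length.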
Reference -
1. James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer. [Chapters 3, 4 & 5].