Describe advantages of a parametric approach to regression


Problem 1: Provide detailed answers to the following questions:

a) Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

b) A dataset has two features: a predictor X and a quantitative response Y. Two models are fitted to the relation: a linear regression model Y = α0 + α1X + ε, and a degree-4 polynomial regression model Y = β0 + β1X + β2X² + β3X³ + β4X⁴ + ε.

Suppose that the true relationship between X and Y is linear. Would we expect the training RSS for one of the models to be lower than the other's, would we expect them to be the same, or is there not enough information to tell? What about the test RSS? What if the true relationship between X and Y is not linear, but we don't know how far it is from being linear? Does the number of observations matter?
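
A minimal simulation sketch for experimenting with this question (NumPy only; the synthetic data, the linear ground truth, and the helper `training_rss` are illustrative choices, not part of the assignment):

```python
# Exploration harness for part (b): generate data with a truly linear
# relationship, then compare training RSS of the two fitted models.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=50)
Y = 1.0 + 2.0 * X + rng.normal(scale=1.0, size=50)  # linear ground truth

def training_rss(degree):
    coeffs = np.polyfit(X, Y, degree)            # least-squares polynomial fit
    return np.sum((Y - np.polyval(coeffs, X)) ** 2)

print("linear   model training RSS:", training_rss(1))
print("degree-4 model training RSS:", training_rss(4))
# Splitting off a held-out test set instead would let you compare test RSS.
```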

c) Suppose that some statistical learning method is used to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how to estimate the standard deviation of the prediction (use mathematical formalism).
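One standard approach (though not the only valid formalism) is the bootstrap: refit the model on B resampled training sets, obtain predictions fhat_1(x0), ..., fhat_B(x0), and report their sample standard deviation, sqrt( (1/(B-1)) * sum_b (fhat_b(x0) - mean_b fhat_b(x0))^2 ). A minimal sketch, assuming scikit-learn; the linear model, the synthetic data, and the query point x0 are placeholders:

```python
# Bootstrap sketch: refit on B resampled training sets and take the standard
# deviation of the B predictions at the query point x0.
import numpy as np
from sklearn.linear_model import LinearRegression  # any fit/predict model works here

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))            # placeholder training data
y = 3.0 + 0.5 * X[:, 0] + rng.normal(size=100)
x0 = np.array([[5.0]])                           # particular value of the predictor

B = 1000
preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))   # resample n rows with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    preds[b] = model.predict(x0)[0]

print("estimated SD of the prediction at x0:", preds.std(ddof=1))
```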

Problem 2: True or False? (Only a short justification is required; 1 or 2 sentences at most)

a) The k-means algorithm for clustering is guaranteed to converge to a local optimum.

b) Increasing the depth of a decision tree cannot increase its training error.

c) With infinite data and infinitely fast computers, kNN is the only algorithm needed for classification tasks.

d) For datasets with high label noise (i.e. many training instances have incorrect labels), boosted decision trees would generally perform better than random forests.

e) Support vector machines provide calibrated posterior classification probabilities P(y = 1|x) and P(y = -1|x).

f) In logistic regression, we model the odds ratio p/(1 - p) as a linear function.

g) A 7-NN classifier has higher variance than a 12-NN classifier.

h) Using cross-validation to select hyperparameters will guarantee that the model does not overfit.

i) Hierarchical clustering methods require a predefined number of clusters.

j) A random forest is an ensemble learning method that attempts to lower the bias error of decision trees.

k) The number of parameters in a parametric model is fixed, while the number of parameters in a nonparametric model grows with the amount of training data.

l) The largest eigenvalue of the covariance matrix is associated with the direction of maximum variance in the data. (A numerical check follows this list.)
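
A minimal numerical check for item (l), assuming NumPy; the synthetic stretched-and-rotated point cloud is an illustrative construction:

```python
# Check: the variance of the data projected onto the eigenvector of the
# largest eigenvalue of the covariance matrix equals that eigenvalue.
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])  # stretched 2-D cloud
theta = np.pi / 6                                        # rotate by 30 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = base @ R.T

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
top_dir = eigvecs[:, -1]                 # eigenvector paired with the largest eigenvalue

proj = data @ top_dir                    # project data onto that direction
print("largest eigenvalue:         ", eigvals[-1])
print("variance along eigenvector: ", proj.var(ddof=1))
```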

Problem 3: Multiple Choice Questions: select ALL answers that apply (no justification required).

a) Which of the following are true of binary classification/regression trees?

i. Bagging decision trees is likely to increase the model variance.

ii. The deeper the decision tree is, the more likely it is to overfit.

iii. They are robust to small changes in the data.

iv. Random forests are less likely to overfit than decision trees.

b) A regression tree has substantially higher validation MSE than expected. Which of the following is likely to improve validation MSE in most real-world applications?

i. Adding quadratic features (i.e. XiXj, i, j = 1, ... , p) to the predictor space.

ii. Selecting a random subset of the features and using those in the regression tree.

iii. Pruning the tree, using cross-validation to decide how to prune.

iv. Normalizing each feature to have variance 1.

c) A decision tree is getting abnormally bad performance on both the training and test sets. What could be causing the problem?

i. The decision tree is too shallow.

ii. The number of features must be decreased.

iii. The model suffers from overfitting.

iv. None of the above.

d) A dataset has 3 points: A = (0, 2), B = (0, 1), C = (1, 0). The 2-means clustering algorithm is initialized with centers at A and B. Where will the centers converge? (A simulation sketch follows the options.)

i. A and C.

ii. A and the midpoint of the segment BC.

iii. C and the midpoint of the segment AB.

iv. B and the midpoint of the segment AC.
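
A quick way to check question (d) is to run Lloyd's algorithm directly on the three points. A minimal sketch assuming NumPy; the cap of 10 iterations is an arbitrary safety bound:

```python
# Trace of 2-means (Lloyd's algorithm) on the three points, initialized at A and B.
import numpy as np

pts = np.array([[0.0, 2.0],   # A
                [0.0, 1.0],   # B
                [1.0, 0.0]])  # C
centers = pts[[0, 1]].copy()  # centers initialized at A and B

for _ in range(10):
    # Assignment step: each point joins its nearest center.
    d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    new_centers = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):
        break                 # converged
    centers = new_centers

print(centers)
```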

e) Consider T1, a decision stump (i.e. a tree with one layer below the root) and T2, a decision tree grown to a maximum depth of 4 (at most 3 layers below the root). Which of the following is/are correct?

i. Bias(T1) < Bias(T2).

ii. Bias(T1) > Bias(T2).

iii. Variance(T1) < Variance(T2).

iv. Variance(T1) > Variance(T2).

f) Which of the following are true about subset selection? (A counting sketch follows the options.)

i. Subset selection is not necessary in general.

ii. Ridge regression frequently eliminates some of the features.

iii. Subset selection can reduce overfitting.

iv. The number of models to train in best subset selection increases exponentially with the number of features.
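
Item (iv) is easy to verify by counting: best subset selection fits one model per subset of the p features, 2^p in total if the null model is included. A tiny sketch using only the Python standard library; p = 10 is an arbitrary example:

```python
# Count the candidate models in best subset selection for p features.
from itertools import combinations

p = 10
subsets = [c for r in range(p + 1) for c in combinations(range(p), r)]
print(len(subsets), "models for p =", p)   # 1024 = 2**10, exponential in p
```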

g) How does the bias-variance decomposition of a ridge regression estimator compare with that of ordinary least squares regression?

i. Ridge regression has larger bias, larger variance

ii. Ridge regression has larger bias, smaller variance

iii. Ridge regression has smaller bias, larger variance

iv. Ridge regression has smaller bias, smaller variance

h) Both PCA and Lasso can be used for feature selection and/or dimension reduction. Which of the following statements are true? (A short contrast sketch follows the options.)

i. Lasso selects a subset (potentially the full set) of the original features

ii. PCA produces features that are linear combinations of the original features

iii. PCA and Lasso both allow you to specify how many features are chosen

iv. PCA and Lasso are the same if you use a decision tree
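
The contrast in question (h) is easy to see in code. A minimal sketch assuming scikit-learn; the synthetic data (where only the first two features matter) and the penalty alpha=0.1 are illustrative choices:

```python
# Lasso keeps a subset of the original features (zeroing the rest), while PCA
# builds new features as linear combinations of all original ones.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)   # the irrelevant features get exact zeros

pca = PCA(n_components=2).fit(X)
print("PCA loadings:")
print(pca.components_)                      # each row mixes all 5 original features
```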

i) Why would we use a random forest instead of a decision tree?

i. To reduce the training error.

ii. To reduce the variance of the model.

iii. To reduce the bias of the model.

iv. To obtain a model that is easier for a human to interpret.

j) The optimal Bayes decision rule (under the indicator, i.e. 0-1, loss function):

i. is the best that a classifier can achieve, on average.

ii. can be computed exactly from a large sample

iii. selects the class with the greatest posterior probability

iv. produces the smallest error rate among all classifiers.

k) Consider a probability-based binary classifier. Which of the following statement(s) is/are always true about the ROC curve and the area under the ROC curve (AUC)? (A short computational sketch follows the options.)

i. An AUC of 0.5 represents a classifier that performs worse than a random classifier, on average.

ii. The ROC curve is generated by varying the discriminative threshold of the classifier.

iii. The ROC curve can be used to visualize the tradeoff between true positive and false positive classifications.

iv. The ROC curve increases monotonically.
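
A minimal sketch of how an ROC curve and its AUC are computed, assuming scikit-learn; the labels and scores are synthetic placeholders for a real classifier's output:

```python
# The ROC curve is traced by sweeping the decision threshold over the
# classifier's scores; the AUC summarizes the whole curve in one number.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=200)                    # placeholder labels
scores = 0.5 * y_true + rng.normal(scale=0.5, size=200)  # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("thresholds swept:", len(thresholds))
print("AUC:", roc_auc_score(y_true, scores))             # 0.5 corresponds to random guessing
```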

l) Which of the following algorithms can learn nonlinear decision boundaries?

i. Quadratic discriminant analysis.

ii. Support vector machine with a Gaussian kernel.

iii. Logistic regression.

iv. Decision stump (a tree with at most 1 layer below the root).

Problem 4 [3 marks]: Provide a short answer to the following questions (about a paragraph each).

a) When is ridge regression preferable to LASSO regression?

b) What is the naive assumption in the naive Bayes classifier?

c) A classifier is trained on a cancer dataset, and achieves 96% accuracy on new observations. Why might this not be considered a good classifier? How could it be improved?

d) A regression model has low bias and high variance. How can it be improved?

e) How is kNN different from k-means clustering?

f) List 6 feature selection/dimension reduction methods.

Part C: refer to the printout of Workflow: Predicting Algae Blooms, pp. 8-11.

Problem 5: For this question, the focus is on your ability to interpret the various outputs of a machine learning workflow.

a) Prediction Models

i. Explain briefly why the linear model is not a great fit for a2.

ii. What variables are retained in the final linear model for a2?

iii. Give the decision rules (in the format IF ... THEN ... ) provided by the pruned regression tree for a2.

iv. What is the relative importance of the variables in that pruned tree?

b) Model Evaluation

i. Why do we use cross-validation in this problem?

ii. Briefly describe replicated k-fold cross-validation.

iii. NMSE is used to evaluate the various model performances. What is the range of good NMSE values? What is the range of bad NMSE values?

iv. According to the Bonferroni-Dunn CD diagram, what are the 5 best predictive algorithms for this task and dataset?

c) Model Prediction

i. For which target variable(s) does the model provide the best predictions? Justify your answer.

ii. For which target variable(s) does the model provide the worst predictions? Justify your answer.

iii. Why are there vertical lines of predictions in some of the scatter plots?

iv. For the variable(s) identified in part i., does the model make good predictions according to the problem description?

Problem 6: How could one attempt to improve on the results of the workflow? Provide 6 suggestions using course concepts, with justification.
