Problem 1: Explain why and when one would want to use k-means clustering. Furthermore, explain how the algorithm works given a set of data points.
Problem 2: Explain one method of picking the `k' in k-means clustering
Problem 3: What is the problem with picking the `k' initial centroids randomly? Can you devise a better method of picking the initial centroids that could resolve this problem?
Problem 4: How can we evaluate the k-means model on data?
Problem 5: Explain the similarities and differences between k-means and linear regression. When would you use linear regression instead of k-means?
Problem 6: Consider the following problem context:
We want to model the relationship between the number of students who complain about fees to the department head and the time the head spends dealing with those complaints. We know that there are 4 classes in the department: `information science 101' (11 students), `programming 101' (5 students), `statistics 202' (3 students) and `distributed systems 402' (12 students). No student is enrolled in more than one of these classes.
- (a) Suggest what the outcome and input variables could be, and whether the latter should be understood as categorical or numerical.
- (b) Write a mathematical expression for the regression line for this problem (see online help about how to write mathematical statements in LaTeX).
- (c) What would be the input variables for the above problem context?
- (d) State how the answers from (b) and (c) would then be used to complete the model.
- (e) Briefly discuss your strategy for validating the above model.
Problem 7: Explain the differences between the observed outcomes, the line-fitting error, the estimated/predicted values, and the residuals.
Problem 8: After designing a linear regression model for two variables, you discover the residual distribution shown in Figure 1. What does this mean?
Give an example of a plot of the data that would correspond to this residual graph.
Classification and Validation
Problem 9: Explain the use case for logistic regression, and state at least one similarity and one difference between logistic regression and linear regression.
Problem 10: Explain the function of the ROC curve for logistic regression.
Your answer should mention the null and alternative hypotheses, true positives, false positives, and classifier thresholds.