Homework: Data Mining
I. What's noise? How can noise be reduced in a dataset?
II. Define outlier. Describe 2 different approaches to detect outliers in a dataset.
III. Give 2 examples in which aggregation is useful.
IV. What's stratified sampling? Why is it preferred?
V. Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix A and your lecture notes.] State what the input and the output of PCA are.
VI. What's the difference between dimensionality reduction and feature selection?
VII. What's the difference between feature selection and feature extraction?
VIII. Give two examples of data in which feature extraction would be useful.
IX. What's data discretization and when is it needed?
X. How are correlation and covariance used in data pre-processing?
Textbook: Tan, P., Steinbach, M. & Kumar, V. (2019). Introduction to data mining. 2nd Edition. Boston: Pearson Addison Wesley. ISBN 0-13-312890-3. Chapter 2.
Format your homework according to the given formatting requirements:
a. The answers must be typed and double-spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
b. Include a cover page containing the title of the homework, the course title, the student's name, and the date. The cover page is not included in the required page length.
c. Also include a reference page. The references and citations should follow APA format. The reference page is not included in the required page length.