Problem 1: This problem is an example of data preprocessing needed in a data mining process.
Suppose that a hospital recorded the age and body fat percentage (%fat) of 18 randomly selected adults, with the following results:
Age  | 23  | 23   | 27  | 27   | 39   | 41   | 47   | 49   | 50
%fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | 27.2 | 31.2

Age  | 52   | 54   | 54   | 56   | 57   | 58   | 58   | 60   | 61
%fat | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 | 41.2 | 35.7
a. Draw the box plots for age and %fat. Interpret the distribution of the data.
b. Normalize the two attributes based on z-score normalization.
c. Regardless of the original ranges of the variables, normalization techniques transform the data into new ranges so that variables can be compared and used on the same scale. What are the value ranges of the following normalization methods? Explain your answer.
i. Min-max normalization
ii. Z-score normalization
iii. Normalization by decimal scaling.
d. Draw a scatter-plot based on the two variables and interpret the relationship between the two variables.
e. Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute the covariance matrix.
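As a rough aid for checking hand calculations in parts b, c and e, the three normalization methods and the correlation/covariance computation can be sketched in Python using only the standard library. This is a minimal sketch, not a prescribed solution: the function names are illustrative, and it assumes the population standard deviation and covariance (dividing by n rather than n - 1). If your course uses the sample versions, the numbers will differ slightly.

```python
from statistics import mean, pstdev

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def z_score(values):
    """Z-score normalization: (v - mean) / std (population std assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer so that max |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

def covariance(xs, ys):
    """Population covariance (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

age_z = z_score(age)
fat_z = z_score(fat)
r = correlation(age, fat)
cov_matrix = [[covariance(age, age), covariance(age, fat)],
              [covariance(fat, age), covariance(fat, fat)]]

print("correlation:", round(r, 4))
print("covariance matrix:", cov_matrix)
```

The diagonal of the covariance matrix holds the variances of age and %fat, and the sign of the off-diagonal entry matches the sign of the correlation coefficient.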
Problem 2: This problem is an example of data preprocessing needed in a data mining process.
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into bins using each of the following methods, smooth the data, and interpret the results:
a. equal-depth partitioning with 3 values per bin
b. equal-width partitioning with 3 bins
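The two partitioning schemes can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are made up for this example, smoothing is done by bin means (smoothing by bin medians or bin boundaries are equally valid choices), and for equal-width binning the maximum value is placed in the last bin.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

def equal_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (rounded to 2 decimals)."""
    return [[round(sum(b) / len(b), 2)] * len(b) if b else [] for b in bins]

depth_bins = equal_depth_bins(prices, 3)
width_bins = equal_width_bins(prices, 3)

print("equal-depth:", depth_bins, "->", smooth_by_means(depth_bins))
print("equal-width:", width_bins, "->", smooth_by_means(width_bins))
```

Comparing the two outputs shows how the large values 204 and 215 affect each scheme differently, which is the point to address in the interpretation.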
Problem 3:
a) Figure 1 illustrates the plots for some data with respect to two variables: balance and employment status. If you had to select one of these two variables to classify the data into two classes (circle class and plus class), which one would you select? Is there any approach or criterion that you can use to support your selection? Explain your answer.
Figure 1: Data Plots for Problem 3.a.
b) For the data in Figure 2, with three variables and two classes, which variable would you choose to classify the data? Show all the steps of your calculations and interpret your answer.
Figure 2: Data for Problem 3.b
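One criterion commonly used to compare candidate variables is information gain, the reduction in class entropy obtained by splitting on a variable. The problem does not name a specific criterion, so treating information gain as the intended one is an assumption; the Gini index is another common choice. The sketch below uses hypothetical toy records, not the data from Figure 1 or Figure 2, and the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data for illustration only (NOT the data in the figures).
labels = ["circle", "circle", "plus", "plus"]
separating_variable = ["a", "a", "b", "b"]      # splits the classes perfectly
uninformative_variable = ["a", "b", "a", "b"]   # each group is still 50/50

print(information_gain(separating_variable, labels))
print(information_gain(uninformative_variable, labels))
```

The variable with the larger information gain yields purer groups after the split and is the one this criterion would select. Numeric variables such as balance would first need a threshold or discretization before applying this function.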