Question 1
A dataset has 1000 records and 2 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
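One way to reason about this kind of question is sketched below. It assumes each value is missing independently with the stated probability, so a record survives only if all of its values are present. The function name and parameters are illustrative, not part of the question.

```python
def expected_records_removed(n_records: int, n_variables: int, p_missing: float) -> float:
    """Expected number of records with at least one missing value.

    Assumption: every value is missing independently with probability p_missing.
    """
    # Probability that a single record has no missing value at all.
    p_complete = (1.0 - p_missing) ** n_variables
    return n_records * (1.0 - p_complete)


if __name__ == "__main__":
    # Compare how the expected loss grows with the number of variables.
    for n_vars in (1, 2, 50):
        print(n_vars, round(expected_records_removed(1000, n_vars, 0.05), 1))
```

Under this independence assumption the expected loss grows quickly with the number of variables, which is why dropping incomplete records becomes costly for wide datasets.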
Question 2
Which of the following statement(s) is(are) correct?
a. The sensitivity of a classifier measures the false negative rate.
b. The specificity of a classifier measures the true negative rate.
c. Neither a. nor b.
d. Both a. and b.
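As a reminder of the usual definitions, the sketch below computes sensitivity and specificity from confusion-matrix counts. The counts are hypothetical and chosen only for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives classified as positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of actual negatives classified as negative."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts (not taken from the quiz).
tp, fn, tn, fp = 40, 10, 45, 5
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
false_negative_rate = 1.0 - sens  # complement of sensitivity
false_positive_rate = 1.0 - spec  # complement of specificity
print(sens, spec, false_negative_rate, false_positive_rate)
```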
Question 3
For classification and regression trees (CART), which of the following ways can be used to avoid overfitting?
a. Setting rules to stop tree growth.
b. Pruning the full-grown tree back to a level where it does not overfit.
c. Both a. and b.
d. Neither a. nor b.
Question 4
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: video -> game [support = ?%, confidence = ?%]
Which one of the following statements is correct?
a. Customers who buy videos also buy computer games.
b. It shows that video and game are not positively associated or correlated.
c. It shows that video and game are independent of each other.
d. It shows that video and game are not negatively associated or correlated.
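The support, confidence, and lift of a rule can be computed directly from the transaction counts. The sketch below uses the standard definitions; the helper name `rule_metrics` is an assumption made for illustration.

```python
def rule_metrics(n_total: int, n_antecedent: int, n_consequent: int, n_both: int):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = n_both / n_total                    # P(antecedent and consequent)
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    lift = confidence / (n_consequent / n_total)  # confidence relative to P(consequent)
    return support, confidence, lift


# Rule video -> game with the counts given in the question.
support, confidence, lift = rule_metrics(
    n_total=10_000, n_antecedent=7_500, n_consequent=6_000, n_both=4_000
)
print(f"support={support:.2%} confidence={confidence:.2%} lift={lift:.3f}")
```

A lift below 1 is usually read as a negative association between antecedent and consequent, a lift of 1 as independence, and a lift above 1 as a positive association.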
Question 5
The following question relates to similarity measurement. Please match each expression with the corresponding term.
a. Mahalanobis distance
b. Maximum coordinate distance
c. Correlation-based similarity
d. Manhattan distance
Question 6
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'), the naïve Bayes classification method classifies the sample as Play = 'Yes' (that is, to play).
True
False
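The naïve Bayes calculation for this sample can be sketched as follows. This is a minimal version that assumes plain relative-frequency estimates with no smoothing; the data are the 14 weather records above, and the function name is illustrative.

```python
from collections import Counter

# Columns: Outlook, Temperature, Humidity, Windy, Play
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def naive_bayes_scores(rows, sample):
    """Return {label: P(sample | label) * P(label)} using raw frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(label)
        for j, value in enumerate(sample):
            # Conditional P(attribute j = value | label), estimated by counting.
            n_match = sum(1 for row in rows if row[-1] == label and row[j] == value)
            score *= n_match / n_label
        scores[label] = score
    return scores


sample = ("Sunny", "Mild", "High", "False")
scores = naive_bayes_scores(ROWS, sample)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The class with the larger score is the predicted class. The individual conditional probabilities and priors asked for in the related questions are the factors multiplied inside the loop.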
Question 7
Which of the following statement(s) is(are) correct?
a. Each branch from the root to a leaf node in a classification tree represents a classification rule.
b. Each branch from the root to a leaf node in a classification tree is associated with a partitioned data set with a class label.
c. Both a. and b.
d. Neither a. nor b.
Question 8
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Windy = 'False' | Play = 'Yes').
Please keep 3 digits after the decimal point, for example, 0.521.
Question 9
Given a transaction database for mining association rule as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E
Which one of the following statements is correct?
a. There are 32 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
b. There are 5 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
c. There are 31 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
d. There are 5 item-sets that can be generated from the set of items {A, B, C, D, E}.
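Counting item-sets can be checked by brute-force enumeration. The sketch below lists every non-empty subset of a set of items and compares the count with the closed form 2^n - 1; the function name is illustrative.

```python
from itertools import combinations


def non_empty_itemsets(items):
    """Enumerate every non-empty subset (item-set) of the given items."""
    ordered = sorted(items)
    return [
        frozenset(combo)
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


itemsets = non_empty_itemsets({"A", "B", "C", "D", "E"})
# A set of n items has 2**n subsets, one of which is the empty set.
print(len(itemsets), 2 ** 5 - 1)
```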
Question 10
Which of the following statement(s) is(are) correct?
a. In multiple linear regression, dropping predictors that are uncorrelated with the dependent variable may decrease the variance of predictions.
b. In multiple linear regression, using predictors that are actually uncorrelated with the dependent variable may decrease the variance of predictions.
c. Both a. and b.
d. Neither a. nor b.
Question 11
Which of the following statement(s) is(are) correct?
a. When the number of neurons in the hidden layer increases, the chance that the neural network overfits the training data decreases.
b. When the number of neurons in the hidden layer increases, the chance that the neural network overfits the training data increases.
c. When the number of neurons in the hidden layer decreases, the chance that the neural network overfits the training data increases.
d. All of a., b., and c. are correct.
Question 12
Given a transaction database for mining association rule as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E
For a given support count of 2, which one of the following statements is incorrect?
a. The rules A->C and C->A have the same confidence value.
b. The item-set in rule A->C is a frequent item-set.
c. The rules A->C and C->A have the same support value.
d. The item-set in rule C->A is a frequent item-set.
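Support counts and confidences for rules over this small transaction database can be verified by direct counting. The sketch below is a minimal illustration; the helper names are assumptions, and the support threshold is the count of 2 stated in the question.

```python
TRANSACTIONS = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
    400: {"B", "E"},
}


def support_count(itemset):
    """Number of transactions that contain every item of the item-set."""
    return sum(1 for items in TRANSACTIONS.values() if set(itemset) <= items)


def confidence(antecedent, consequent):
    """Confidence of antecedent -> consequent: count(both) / count(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)


MIN_SUPPORT_COUNT = 2
ac_count = support_count({"A", "C"})       # same item-set for A->C and C->A
conf_a_to_c = confidence({"A"}, {"C"})
conf_c_to_a = confidence({"C"}, {"A"})
print(ac_count >= MIN_SUPPORT_COUNT, conf_a_to_c, conf_c_to_a)
```

Both rules share one item-set, so their support is identical, while confidence depends on which side is the antecedent.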
Question 13
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the confidence of the rule? (Please enter only the integer part of the value; for example, for 50%, enter 50.)
Question 14
Which of the following statement(s) is(are) correct?
a. When the number of neurons in the hidden layer increases, the chance that the neural network underfits the training data increases.
b. When the number of neurons in the hidden layer decreases, the chance that the neural network underfits the training data decreases.
c. When the number of neurons in the hidden layer increases, the chance that the neural network underfits the training data decreases.
d. All of a., b., and c. are correct.
Question 15
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Outlook = 'Sunny' | Play = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 16
The prediction error for record i is defined as the difference between its actual value and its predicted value: e_i = y_i - ŷ_i. For the expression [expression missing in source],
please select the appropriate acronym or the correct answer from the following:
a. MAPE
b. RMSE
c. MAE or MAD
d. Total SSE
e. Average Error
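For reference, the five accuracy measures named in the options can all be computed from the per-record errors. The sketch below uses the common textbook definitions with e_i = actual - predicted; the data values are hypothetical, and MAPE assumes no actual value is zero.

```python
from math import sqrt


def error_measures(actual, predicted):
    """Common prediction accuracy measures built from e_i = actual_i - predicted_i."""
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "average_error": sum(errors) / n,          # signed mean of the errors
        "mae": sum(abs(e) for e in errors) / n,    # mean absolute error (MAE/MAD)
        # Mean absolute percentage error; requires nonzero actual values.
        "mape": 100.0 * sum(abs(e / y) for e, y in zip(errors, actual)) / n,
        "rmse": sqrt(sse / n),                     # root mean squared error
        "total_sse": sse,                          # total sum of squared errors
    }


# Hypothetical actual and predicted values, for illustration only.
measures = error_measures([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(measures)
```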
Question 17
Which of the following statement(s) is(are) correct?
a. Outliers are the values that lie far away from the bulk of the data.
b. An outlier is a value that is more than 3 standard deviations away from the mean.
c. An outlier is an invalid data point.
d. Both a. and b.
Question 18
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the conditional probability P(X | Play = 'Yes').
Please keep 3 digits after the decimal point, for example, 0.521.
Question 19
If the probable nature of the clusters is unknown, which cluster distance function would be a good choice for clustering the data?
a. Single linkage distance
b. Complete linkage distance
c. Centroid distance
d. Average linkage distance
Question 20
The prediction error for record i is defined as the difference between its actual value and its predicted value: e_i = y_i - ŷ_i. For the given expression [expression missing in source], please select the appropriate acronym or the correct answer from the following:
a. MAE or MAD
b. RMSE
c. Total SSE
d. MAPE
e. Average Error
Question 21
The difference between the statistical regression models and the neural network model is(are)
a. The neural network model uses hidden layers.
b. The regression models have no input layer.
c. The regression models have no output layer
d. All of a., b., and c.
Question 22
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: video -> game [support = ?%, confidence = ?%]
What is the confidence of the rule? (Please keep 2 digits after the decimal point, for example, 0.25.)
Question 23
Which of the following statement(s) is(are) correct?
a. Each node in a classification tree corresponds to a dimension (column) of a data table.
b. Each node with its associated value in a classification tree is used to partition the data set along its corresponding dimension.
c. Both a. and b.
d. Neither a. nor b.
Question 24
In terms of input variables/predictors and output variable/response, there are four combinations in the following:
continuous input variables/predictors - continuous output variable/response
continuous input variables/predictors - categorical output variable/response
categorical input variables/predictors - categorical output variable/response
categorical input variables/predictors - continuous output variable/response
Which of the following data mining methods can be used for any one of the four combinations in XLMINER?
a. Neural network
b. Linear regression
c. Naïve Bayes method
d. Logistic regression
Question 25
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the lift of the rule? (please keep 2 digits after the decimal, for example, 0.25)
Question 26
Which of the following statement(s) is(are) correct?
a. Each node with its associated value in a classification tree defines a linear function.
b. A classification tree consists of many linear functions.
c. Both a. and b.
d. Neither a. nor b.
Question 27
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Outlook = 'Sunny' | Play = 'No'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 28
Alternative(s) to maximizing the accuracy of a classifier or a data mining model is(are):
a. Maximizing sensitivity subject to some minimum level of specificity.
b. Minimizing false positives subject to some maximum level of false negatives.
c. Neither a. nor b.
d. Both a. and b.
Question 29
The difference between the multiple linear regression model and the neural network model is(are)
a. The neural network model uses hidden layers.
b. The neural network model uses an activation function.
c. Both a. and b.
d. Neither a. nor b.
Question 30
The prediction error for record i is defined as the difference between its actual value and its predicted value: e_i = y_i - ŷ_i. For the expression [expression missing in source],
please select the appropriate acronym or the correct answer from the following:
a. TOTAL SSE
b. RMSE
c. MAE or MAD
d. MAPE
e. Average Error
Question 31
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Windy = 'False' | Play = 'No').
Question 32
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the support of the rule?
Question 33
For the given item-set {A, B, C, D}, how many valid rules can be generated from it?
a. 15
b. 11
c. 50
d. 4
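The number of candidate rules can be checked by enumeration. The sketch below assumes that a valid rule is any X -> Y where X and Y are non-empty, disjoint subsets of the items; under that assumption the count for d items equals 3^d - 2^(d+1) + 1. The function name is illustrative.

```python
from itertools import combinations


def count_rules(items):
    """Count rules X -> Y with X, Y non-empty and disjoint, drawn from items."""
    ordered = sorted(items)
    total = 0
    # Choose the item-set X ∪ Y first (at least 2 items), then split it.
    for size in range(2, len(ordered) + 1):
        for itemset in combinations(ordered, size):
            # Every non-empty proper subset can serve as the antecedent X;
            # the remaining items form the consequent Y.
            for x_size in range(1, size):
                total += sum(1 for _ in combinations(itemset, x_size))
    return total


n_rules_4 = count_rules("ABCD")
n_rules_6 = count_rules("ABCDEF")
print(n_rules_4, 3 ** 4 - 2 ** 5 + 1)
print(n_rules_6, 3 ** 6 - 2 ** 7 + 1)
```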
Question 34
Which of the following statement(s) is(are) correct?
a. Multiple linear regression can be used to predict the value of a continuous dependent variable for a new observation.
b. Logistic regression can be used to classify a new observation into one of the specific classes.
c. Both a. and b.
d. Neither a. nor b.
Question 35
Which of the following data mining methods in XLMINER is especially suited for (and limited to) categorical predictors and a categorical outcome variable?
a. Neural Network
b. K-Nearest Neighbor method.
c. Regression
d. Naïve Bayes method.
Question 36
For cluster analysis, which of the following statement(s) is(are) correct?
a. The K-means clustering method is not a centroid-based approach.
b. The K-means clustering method is a centroid-based approach.
c. The K-means clustering method is used to form clusters into a hierarchy.
d. The K-means clustering method is a hierarchical clustering method.
Question 37
One of the ways to handle missing values in preprocessing of data mining is
a. to drop the columns with missing values.
b. to replace the missing values with imputed values.
c. Both a. and b.
d. Neither a. nor b.
Question 38
Which of the following statements is correct in association rule mining or affinity analysis?
a. A strong rule with low support leads to its high confidence.
b. A strong rule with high support does not necessarily lead to its high confidence.
c. A strong rule with high support always leads to its high confidence.
d. A strong rule with low support leads to its low confidence.
Question 39
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the prior probability P(Play = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.521.
Question 40
The following question relates to distance measurement between two clusters. Please match each expression with the corresponding term.
where is the distance between two data points such that p belongs to cluster and p' belongs to cluster , and is the distance between cluster and cluster .
where is the distance between two data points such that p belongs to cluster and p' belongs to cluster , and is the distance between cluster and cluster .
where is the center of cluster , is the center or centroid of cluster , is the distance between and , and is the distance between cluster and cluster, where is the distance between two data points such that p belongs to cluster and p' belongs to cluster, and is the distance between cluster and cluster is the number of data points in cluster , and is the number of data points in cluster .
A. Centroid Distance
B. Single Linkage Distance
C. Average Distance
D. Complete Linkage Distance
Question 41
A dataset has 1000 records and one variable with 5% of the values missing, spread randomly throughout the records in the variable column. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
Question 42
Which of the following is(are) used to measure the impurity of data in the process of constructing a CART (classification and regression tree)?
a. Gini Index
b. Entropy
c. Both a. and b.
d. Neither a. nor b.
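Both impurity measures named in the options can be computed from class counts. The sketch below uses the usual definitions (Gini index as 1 minus the sum of squared class proportions, entropy in bits); the example counts of 9 and 5 are an assumption borrowed from the Yes/No split of the weather table used elsewhere in this quiz.

```python
from math import log2


def gini(counts):
    """Gini index: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits: -sum p * log2(p), skipping empty classes."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


class_counts = [9, 5]  # e.g. 9 "Yes" and 5 "No" records
print(round(gini(class_counts), 3), round(entropy(class_counts), 3))
```

Both measures are 0 for a pure node and reach their maximum when the classes are evenly mixed.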
Question 43
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute P(X | Play = 'No') * P(Play = 'No'), where * denotes multiplication.
Please keep 3 digits after the decimal point, for example, 0.521.
Question 44
A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
Question 45
Which of the following statement(s) is(are) correct in association rule mining or affinity analysis?
a. If an itemset is frequent, then its subset is also frequent.
b. If an itemset is frequent, then its superset is also frequent.
c. If an itemset is infrequent, then its superset is also frequent.
d. If an itemset is frequent, then its subset is also infrequent.
Question 46
The prediction error for record i is defined as the difference between its actual value and its predicted value: e_i = y_i - ŷ_i. For the given expression [expression missing in source], please select the appropriate acronym or the correct answer from the following:
a. RMSE
b. Total SSE
c. MAE or MAD
d. Average error
e. MAPE
Question 47
Given a database of customers' electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of customer transactions included computer games (game)
7,500 of them included videos (video),
4,000 of them included both computer games and videos
The generated association rule is: game -> video [support = ?%, confidence = ?%]
Which one of the following statements is correct?
a. It shows that game and video are not positively associated or correlated.
b. It shows that game and video are independent of each other.
c. Customers who buy computer games also buy videos.
d. It shows that game and video are not negatively associated or correlated.
Question 48
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute P(X | Play = 'Yes') * P(Play = 'Yes'), where * denotes multiplication.
Please keep 3 digits after the decimal point, for example, 0.521.
Question 49
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Temperature = 'Mild' | Play = 'No').
Question 50
The prediction error for record i is defined as the difference between its actual value and its predicted value: e_i = y_i - ŷ_i. For the given expression [expression missing in source], please select the appropriate acronym or the correct answer from the following:
a. RMSE
b. Total SSE
c. MAE or MAD
d. Average error
e. MAPE
Question 51
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the conditional probability P(X | Play = 'No').
Please keep 3 digits after the decimal point, for example, 0.521.
Question 52
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the prior probability, P(PLAY='No'). Please keep 3 digits after the decimal point.
Question 53
The overfitting problem can be detected
a. When the performance on both the training data set and validation data set improve.
b. When the performance on the training data set improves and the performance on the validation data set deteriorates.
c. When the performance on the training data set deteriorates and the performance on the validation data set improves.
d. When the performance on both the training data set and validation data set deteriorate.
Question 54
Given a transaction database for mining association rule as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E
For a given support count of 2, which one of the following statements is incorrect?
a. The rule B,C -> E and the rule C -> B,E have the same confidence value.
b. The rule B,C -> E and the rule C,E -> B have the same support value.
c. The rule B,C -> E and the rule C,E -> B have the same confidence value.
d. The rule B,C -> E and the rule B,E -> C have the same support value.
Question 55
In terms of the number of variables involved in the training process of supervised learning, which of the following statements is correct?
a. The more variables we include in the trained model, the greater the risk of overfitting.
b. The number of variables included in the trained model has no impact on the risk of overfitting.
c. The fewer variables we include in the trained model, the greater the risk of overfitting.
d. The more variables we include in the trained model, the less the risk of overfitting.
Question 56
In order to achieve a given degree of reliability with a given data set and a given data mining model, what is a good rule of thumb for the ratio between the number of input variables (predictor variables) and the number of records?
a. to have 10 records for each input variable (predictor variable).
b. to have at least 6 × m × p records, m is the number of outcome classes, and p is the number of input variables.
c. Either a. or b.
d. Neither a. nor b.
Question 57
Which of the following statement(s) is(are) correct?
a. The classification tree is used to generate class label for categorical output.
b. The regression tree is used to generate the numerical output for prediction or estimation.
c. Both a. and b.
d. Neither a. nor b.
Question 58
Which of the following statement(s) is(are) correct?
a. The sensitivity of a classifier measures the true positive rate.
b. The specificity of a classifier measures the false positive rate.
c. Both a. and b.
d. Neither a. nor b.
Question 59
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Humidity = 'High' | Play = 'No').
Question 60
The purpose of normalizing data value into unit value is
a. To reduce the impact or dominance of the data with large scale value.
b. To reduce the bias caused by the data with large scale value.
c. Both a. and b.
d. Neither a. nor b.
Question 61
The difference between the logistic regression model and the neural network model is(are)
a. The neural network model uses hidden layers.
b. The neural network model uses activation function.
c. Both a. and b.
d. Neither a. nor b.
Question 62
The purpose(s) of dimension reduction is(are)
a. reducing the effects of the curse of dimensionality.
b. eliminating the input variables/predictors that are uncorrelated to the output variables/response.
c. reducing the possibility of overfitting.
d. All of a., b., and c.
Question 63
The number of item-sets generated from the set of items {A, B, C, D, E, F} that can be used to formulate association rules is
a. 6
b. 63
c. 57
d. 602
Question 64
Given a transaction database for mining association rule as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E
For a given support count of 2, which one of the following item-sets is not a frequent item-set?
a. {B, E}
b. {A, C, D}
c. {B, C, E}
d. {C, E}
Question 65
Given a transaction database for mining association rule as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E
For a given support count of 2, which one of the following statements is incorrect?
a. The rule B,E -> C and the rule C -> B,E have the same confidence value.
b. The rule B,C -> E and the rule E -> B,C have the same confidence value.
c. The rule E -> B,C and the rule C -> B,E have the same confidence value.
d. The rule B -> C,E and the rule E -> B,C have the same confidence value.
Question 66
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
Please compute the conditional probability P(Temperature = 'Mild' | Play = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 67
Given a database table containing weather data as follows:
Outlook  | Temperature | Humidity | Windy | Class: Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Humidity = 'High' | Play = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 68
In the neural network model, what kind of parameter(s) is(are) used to avoid overfitting?
a. The learning rate
b. The momentum
c. Both a. and b.
d. Neither a. nor b.
Question 69
In the neural network model, what kind of parameter(s) is(are) used to avoid getting stuck in a local optimum?
a. The learning rate
b. The momentum
c. Both a. and b.
d. Neither a. nor b.
Question 70
A good classifier will give a high lift chart
a. when the classifier acts on a lot of cases.
b. when the classifier acts on only a few cases.
c. Both a. and b.
d. Neither a. nor b.