Question 1
Which of the following statement(s) is(are) correct?
a. In multiple linear regression, dropping predictors that are uncorrelated with the dependent variable may decrease the variance of predictions.
b. In multiple linear regression, using predictors that are actually uncorrelated with the dependent variable may decrease the variance of predictions.
c. Both a. and b.
d. Neither a. nor b.
Question 2
The prediction error for record i is defined as the difference between its actual value $y_i$ and its predicted value $\hat{y}_i$: $e_i = y_i - \hat{y}_i$. For the given expression $\sum_{i=1}^{n} e_i^2$,
please select the appropriate acronym or the correct answer from the following:
a. MAPE
b. RMSE
c. MAE or MAD
d. Average Error
e. Total SSE
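The error summaries named in these options can be compared side by side with a small sketch. The function name `error_metrics` and the sample values are illustrative assumptions, not part of the question; the formulas follow the definition $e_i = y_i - \hat{y}_i$ given above.

```python
import math


def error_metrics(actual, predicted):
    """Return common prediction-error summaries for paired actual/predicted values."""
    # e_i = y_i - y_hat_i for every record
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    n = len(errors)
    total_sse = sum(e ** 2 for e in errors)
    return {
        "average_error": sum(errors) / n,                  # (1/n) * sum of e_i
        "mae": sum(abs(e) for e in errors) / n,            # (1/n) * sum of |e_i|
        # MAPE assumes no actual value is zero
        "mape": 100 * sum(abs(e / y) for e, y in zip(errors, actual)) / n,
        "total_sse": total_sse,                            # sum of e_i squared
        "rmse": math.sqrt(total_sse / n),                  # sqrt((1/n) * sum of e_i squared)
    }


# Hypothetical data, chosen only to make the arithmetic easy to follow.
metrics = error_metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
```

Note that positive and negative errors cancel in the average error, while the absolute and squared versions do not.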
Question 3
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
The number of item-sets generated from the set of items {A, B, C, D, E} that can be used to formulate association rules is
a. 26
b. 5
c. 6
d. 12
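The counting behind this question can be checked by brute-force enumeration. The sketch below assumes that a rule needs a non-empty antecedent and a non-empty consequent, so only item-sets with at least two items can be split into a rule; the variable names are illustrative.

```python
from itertools import combinations

items = ["A", "B", "C", "D", "E"]

# Group every non-empty item-set by its size (1 to 5 items).
itemsets_by_size = {
    k: list(combinations(items, k)) for k in range(1, len(items) + 1)
}

# All non-empty item-sets: 2**5 - 1.
non_empty = sum(len(v) for v in itemsets_by_size.values())

# Item-sets with at least two items, i.e. those that can be split into
# a non-empty antecedent and a non-empty consequent.
rule_capable = sum(len(v) for k, v in itemsets_by_size.items() if k >= 2)
```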
Question 4
Which of the following statement(s) is(are) correct?
a. Each node in a classification tree corresponds to a dimension (column) of a data table.
b. Each node with its associated value in a classification tree is used to partition the data set along its corresponding dimension.
c. Both a. and b.
d. Neither a. nor b.
Question 5
A dataset has 1000 records and one variable with 5% of the values missing, spread randomly throughout the records in the variable column. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
Question 6
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample, X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'), classifying the sample with the naïve Bayes method indicates Play = 'Yes'.
True
False
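The naïve Bayes comparison for this sample can be reproduced with exact fractions. This is a minimal sketch, assuming plain relative-frequency estimates with no Laplace smoothing; the names `ROWS`, `score`, and `predicted` are illustrative and do not come from any particular tool.

```python
from fractions import Fraction

# Weather table from the question: (Outlook, Temperature, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def score(sample, label):
    """Return P(X | Play=label) * P(Play=label) under the naive independence assumption."""
    members = [row for row in ROWS if row[-1] == label]
    result = Fraction(len(members), len(ROWS))  # prior P(Play=label)
    for idx, value in enumerate(sample):
        matches = sum(1 for row in members if row[idx] == value)
        result *= Fraction(matches, len(members))  # P(attribute=value | Play=label)
    return result


sample = ("Sunny", "Mild", "High", "False")
scores = {label: score(sample, label) for label in ("Yes", "No")}
predicted = max(scores, key=scores.get)  # class with the larger score
```

Calling `float(scores["Yes"])` and `float(scores["No"])` gives the decimal values to round to three digits.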
Question 7
The difference between the statistical regression models and the neural network model is(are)
a. The neural network model uses hidden layers.
b. The regression models have no input layer.
c. The regression models have no output layer
d. All of a., b., and c.
Question 8
A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
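The expected count can be sketched under one explicit assumption: every value is missing independently with the same probability. A record then survives only if all of its values are present. The function name `expected_removed` is hypothetical.

```python
def expected_removed(n_records, n_variables, missing_rate):
    """Expected number of records with at least one missing value.

    Assumes each value is missing independently with probability missing_rate.
    """
    # Probability that a single record has no missing value in any variable.
    p_complete = (1 - missing_rate) ** n_variables
    return n_records * (1 - p_complete)


one_variable = expected_removed(1000, 1, 0.05)
two_variables = expected_removed(1000, 2, 0.05)
fifty_variables = expected_removed(1000, 50, 0.05)
```

The same function covers the one-variable and two-variable versions of this question by changing `n_variables`.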
Question 9
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample, X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the product P(X|PLAY='Yes') * P(PLAY='Yes'), where * denotes multiplication.
Please keep 3 digits after the decimal point, for example, 0.521.
Question 10
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Humidity = 'High' | PLAY = 'No').
Question 11
Which of the following statement(s) is(are) correct in association rule mining or affinity analysis?
a. If an itemset is frequent, then its subset is also frequent.
b. If an itemset is frequent, then its super set is also frequent.
c. If an itemset is infrequent, then its super set is also frequent.
d. If an itemset is frequent, then its subset is also infrequent.
Question 12
Which of the following statements is correct in association rule mining or affinity analysis?
a. A strong rule with high support does not necessarily lead to its high confidence.
b. A strong rule with low support leads to its low confidence.
c. A strong rule with low support leads to its high confidence.
d. A strong rule with high support always leads to its high confidence.
Question 13
Which of the following statement(s) is(are) correct?
a. Each node with its associated value in a classification tree defines a linear function.
b. A classification tree consists of many linear functions.
c. Both a. and b.
d. Neither a. nor b.
Question 14
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Outlook = 'Sunny' | PLAY = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 15
Alternatives to maximizing the accuracy of a classifier or a data mining model is(are):
a. Maximizing sensitivity subject to some minimum level of specificity.
b. Minimizing false positives subject to some maximum level of false negatives.
c. Neither a. nor b.
d. Both a. and b.
Question 16
The following questions are related to similarity measurement, please match each expression with the correct corresponding term.
a. Mahalanobis distance
b. Correlation-based similarity
c. Maximum coordinate distance
d. Manhattan distance
Question 17
Which of the following statement(s) is(are) correct?
a. The sensitivity of a classifier measures the true positive rate.
b. The specificity of a classifier measures the false positive rate.
c. Both a. and b.
d. Neither a. nor b.
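For reference, the two rates can be written directly in terms of confusion-matrix counts. This is a minimal sketch using the standard definitions; the counts below are hypothetical and serve only to show the arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: share of actual positives classified as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: share of actual negatives classified as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical confusion-matrix counts (not taken from the question).
tpr = sensitivity(true_pos=40, false_neg=10)
tnr = specificity(true_neg=45, false_pos=5)
fpr = 1 - tnr  # the false positive rate is the complement of specificity
```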
Question 18
Which of the following is(are) used to measure the impurity of data in the process of constructing CART (classification and regression tree)?
a. Gini Index
b. Entropy
c. Both a. and b.
d. Neither a. nor b.
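Both impurity measures named in the options can be computed from the class labels that fall into a node. The sketch below uses the standard formulas (Gini = 1 - sum of squared class proportions, entropy = -sum of p log2 p); the 9-to-5 class split is an illustrative assumption.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion of each class among the labels in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


# Illustrative node with 9 records of one class and 5 of the other.
node = ["Yes"] * 9 + ["No"] * 5
node_gini = gini_index(node)
node_entropy = entropy(node)
```

Both measures are zero for a pure node and largest when the classes are evenly mixed.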
Question 19
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
Which one of the following statements is correct?
a. There are 5 item-sets that can be generated from the set of items {A, B, C, D, E}.
b. There are 5 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
c. There are 31 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
d. There are 32 non-empty item-sets that can be generated from the set of items {A, B, C, D, E}.
Question 20
The prediction error for record i is defined as the difference between its actual value $y_i$ and its predicted value $\hat{y}_i$: $e_i = y_i - \hat{y}_i$. For the given expression $100\% \times \frac{1}{n} \sum_{i=1}^{n} \left| e_i / y_i \right|$, please select the appropriate acronym or the correct answer from the following:
a. Total SSE
b. RMSE
c. MAPE
d. MAE or MAD
e. Average Error
Question 21
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
For a minimum support count of 2, which one of the following item-sets is not a frequent item-set?
a. {B, C, E}
b. {A, C, D}
c. {C, E}
d. {B, E}
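Support counts for the four candidate item-sets can be checked by scanning the three transactions. This is a small sketch, assuming an item-set is frequent when its support count is at least the stated threshold of 2; the names `support_count` and `MIN_SUPPORT_COUNT` are illustrative.

```python
# Transaction database from the question, keyed by TID.
transactions = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
}

MIN_SUPPORT_COUNT = 2


def support_count(itemset):
    """Number of transactions that contain every item of the item-set."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)


candidates = [("B", "C", "E"), ("A", "C", "D"), ("C", "E"), ("B", "E")]

# Map each candidate to True when it meets the minimum support count.
is_frequent = {c: support_count(c) >= MIN_SUPPORT_COUNT for c in candidates}
```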
Question 22
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: video -> game [support = ?%, confidence = ?%]
What is the confidence of the rule? (Please keep 2 digits after the decimal point, for example, 0.25.)
Question 23
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the support of the rule?
Question 24
In order to achieve a given degree of reliability with a given data set and a given data mining model, which of the following is a good rule of thumb for the ratio between the number of input variables (predictor variables) and the number of records?
a. to have 10 records for each input variable (predictor variable).
b. to have at least 6 × m × p records, m is the number of outcome classes, and p is the number of input variables.
c. Either a. or b.
d. Neither a. nor b.
Question 25
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Humidity = 'High' | PLAY = 'Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 26
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the lift of the rule? (Please keep 2 digits after the decimal point, for example, 0.25.)
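Support, confidence, and lift for this scenario follow directly from the four counts. The sketch below uses the usual definitions (support = share of transactions containing both items, confidence = support of both divided by support of the antecedent, lift = confidence divided by support of the consequent); the variable names are illustrative.

```python
n_transactions = 10_000
n_game = 6_000
n_video = 7_500
n_both = 4_000

# Support of the item-set {game, video}; identical for both rule directions.
support = n_both / n_transactions

# Confidence depends on which item is the antecedent.
confidence_game_to_video = n_both / n_game
confidence_video_to_game = n_both / n_video

# Lift compares the rule's confidence with the consequent's baseline frequency.
lift = confidence_game_to_video / (n_video / n_transactions)
```

A lift below 1 means the two items occur together less often than would be expected if they were independent.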
Question 27
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: game -> video [support = ?%, confidence = ?%]
Which one of the following statements is correct?
a. It shows that video and game are not positively associated or correlated.
b. It shows that video and game are not negatively associated or correlated.
c. The customers who buy videos also buy computer games.
d. It shows that video and game are independent of each other.
Question 28
A good classifier will give a high lift chart
a. when the classifier acts on a lot of cases.
b. when the classifier acts on only a few cases.
c. Both a. and b.
d. Neither a. nor b.
Question 29
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Outlook = 'Sunny' | PLAY = 'No'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 30
Which of the following statement(s) is(are) correct?
a. Multiple linear regression can be used to predict the value of a continuous dependent variable for a new observation.
b. Logistic regression can be used to classify a new observation into one of the specific classes.
c. Both a. and b.
d. Neither a. nor b.
Question 31
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample, X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the conditional probability P(X|PLAY='No'). Please keep 3 digits after the decimal point, for example, 0.521.
Question 32
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the prior probability, P(PLAY='No'). Please keep 3 digits after the decimal point.
Question 33
The overfitting problem can be detected
a. When the performance on the training data set improves and validation data set deteriorates.
b. When the performance on the training data set deteriorates and validation data set improves.
c. When the performance on both the training data set and validation data set deteriorate.
d. When the performance on both the training data set and validation data set improve.
Question 34
The purpose(s) of dimension reduction is(are)
a. reducing the effects of the curse of dimensionality.
b. eliminating the input variables/predictors that are uncorrelated to the output variables/response.
c. reducing the possibility of overfitting.
d. All of a., b., and c.
Question 35
The prediction error for record i is defined as the difference between its actual value $y_i$ and its predicted value $\hat{y}_i$: $e_i = y_i - \hat{y}_i$. For the given expression $\frac{1}{n} \sum_{i=1}^{n} |e_i|$, please select the appropriate acronym or the correct answer from the following:
a. Average error
b. MAE or MAD
c. MAPE
d. RMSE
e. Total SSE
Question 36
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: game -> video [support = ?%, confidence = ?%]
Which one of the following statements is correct?
a. It shows that game and video are independent of each other.
b. The customers who buy computer games also buy videos.
c. It shows that game and video are not positively associated or correlated.
d. It shows that game and video are not negatively associated or correlated.
Question 37
The difference between the logistic regression model and the neural network model is(are)
a. The neural network model uses hidden layers.
b. The neural network model uses activation function.
c. Both a. and b.
d. Neither a. nor b.
Question 38
For cluster analysis, which of the following statement(s) is(are) correct?
a. K-means clustering method is used to organize the clusters into a hierarchy.
b. K-means clustering method is a centroid-based approach.
c. K-means clustering method is not a centroid-based approach.
d. K-means clustering method is a hierarchical clustering method.
Question 39
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Windy = 'False' | PLAY = 'Yes').
Please keep 3 digits after the decimal point, for example, 0.521.
Question 40
In the neural network model, what kind of parameter(s) is(are) used to avoid getting stuck in a local optimum?
a. The learning rate
b. The momentum
c. Both a. and b.
d. Neither a. nor b.
Question 41
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample, X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the product P(X|PLAY='No') * P(PLAY='No'), where * denotes multiplication.
Please keep 3 digits after the decimal point, for example, 0.521.
Question 42
In terms of the number of variables involved in the training process of supervised learning, which of the following statements is correct?
a. The more variables we include in the trained model, the less the risk of overfitting.
b. The more variables we include in the trained model, the greater the risk of overfitting.
c. The number of variables included in the trained model has no impact on the risk of overfitting.
d. The fewer variables we include in the trained model, the greater the risk of overfitting.
Question 43
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
For a minimum support count of 2, which one of the following statements is incorrect?
a. The rule B,C -> E and the rule C,E -> B have the same support value.
b. The rule B,C -> E and the rule C -> B,E have the same confidence value.
c. The rule B,C -> E and the rule B,E -> C have the same support value.
d. The rule B,C -> E and the rule C,E -> B have the same confidence value.
Question 44
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the prior probability P(PLAY='Yes'). Please keep 3 digits after the decimal point, for example, 0.521.
Question 45
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
For the given sample, X = (Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'High', Windy = 'False'),
please compute the conditional probability P(X|PLAY='Yes'). Please keep 3 digits after the decimal point, for example, 0.521.
Question 46
The purpose of normalizing data value into unit value is
a. To reduce the impact or dominance of the data with large scale value.
b. To reduce the bias caused by the data with large scale value.
c. Both a. and b.
d. Neither a. nor b.
Question 47
Given a database of customer electronics purchases, 10,000 transactions are analyzed and the data show:
6,000 of the customer transactions included computer games (game),
7,500 of them included videos (video),
4,000 of them included both computer games and videos.
The generated association rule is: game -> video [support = ?%, confidence = ?%]
What is the confidence of the rule? (Please enter only the integer part of the percentage; for example, for 50%, enter 50.)
Question 48
A dataset has 1000 records and 2 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
Question 49
Which of the following statement(s) is(are) correct for multiple linear regression method in XLMiner?
a. The categorical data value must be transformed into binary data.
b. The numerical data value must be transformed into categorical value.
c. Either a. or b.
d. Neither a. nor b.
Question 50
Which of the following statement(s) is(are) correct?
a. The classification tree is used to generate class label for categorical output.
b. The regression tree is used to generate the numerical output for prediction or estimation.
c. Both a. and b.
d. Neither a. nor b.
Question 51
The prediction error for record i is defined as the difference between its actual value $y_i$ and its predicted value $\hat{y}_i$: $e_i = y_i - \hat{y}_i$. For the given expression $\frac{1}{n} \sum_{i=1}^{n} e_i$, please select the appropriate acronym or the correct answer from the following:
a. MAPE
b. Total SSE
c. RMSE
d. Average error
e. MAE or MAD
Question 52
In the Neural network model, what kind of parameter(s) is(are) used to avoid overfitting?
a. The learning rate
b. The momentum
c. Both a. and b.
d. Neither a. nor b.
Question 53
The prediction error for record i is defined as the difference between its actual value $y_i$ and its predicted value $\hat{y}_i$: $e_i = y_i - \hat{y}_i$. For the given expression $\sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$,
please select the appropriate acronym or the correct answer from the following:
a. TOTAL SSE
b. RMSE
c. MAE or MAD
d. MAPE
e. Average Error
Question 54
Which of the following statement(s) is(are) correct?
a. Each branch from the root to a leaf node in a classification tree represents a classification rule.
b. Each branch from the root to a leaf node in a classification tree is associated with a partitioned data set with a class label.
c. Both a. and b.
d. Neither a. nor b.
Question 55
Which of the following data mining methods in XLMINER is especially suited for (and limited to) both categorical predictor and outcome variable?
a. Naïve Bayes method.
b. Regression
c. K-Nearest Neighbor method.
d. Neural Network
Question 56
In terms of input variables/predictors and output variable/response, there are four combinations in the following:
continuous input variables/predictors - continuous output variable/response
continuous input variables/predictors - categorical output variable/response
categorical input variables/predictors - categorical output variable/response
categorical input variables/predictors - continuous output variable/response
Which of the following data mining methods can be used for any one of the four combinations in XLMINER?
a. Neural network
b. Naïve Bayes method
c. Logistic regression
d. Linear regression
Question 57
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Windy = 'False' | PLAY = 'No').
Question 58
Which of the following statement(s) is(are) correct?
a. The sensitivity of a classifier measures the false negative rate.
b. The specificity of a classifier measures the true negative rate.
c. Neither a. nor b.
d. Both a. and b.
Question 59
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome).
Please compute the conditional probability P(Temperature = ‘Mild'|PLAY='Yes'). Please keep 3 digits after the decimal point, for example, 0.123.
Question 60
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
For a minimum support count of 2, which one of the following statements is incorrect?
a. The rules A -> C and C -> A have the same confidence value.
b. The item-set in rule C -> A is a frequent item-set.
c. The item-set in rule A -> C is a frequent item-set.
d. The rules A -> C and C -> A have the same support value.
Question 61
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
For a minimum support count of 2, which one of the following statements is incorrect?
a. The rule B -> C,E and the rule E -> B,C have the same confidence value.
b. The rule E -> B,C and the rule C -> B,E have the same confidence value.
c. The rule B,C -> E and the rule E -> B,C have the same confidence value.
d. The rule B,E -> C and the rule C -> B,E have the same confidence value.
Question 62
The difference between the multiple linear regression model and the neural network model is(are)?
a. The neural network model uses hidden layers.
b. The neural network model uses an activation function.
c. Both a. and b.
d. Neither a. nor b.
Question 63
Given a database table containing weather data as follows:
Outlook | Temperature | Humidity | Windy | Class: Play
Sunny | Hot | High | False | No
Sunny | Hot | High | True | No
Overcast | Hot | High | False | Yes
Rainy | Mild | High | False | Yes
Rainy | Cool | Normal | False | Yes
Rainy | Cool | Normal | True | No
Overcast | Cool | Normal | True | Yes
Sunny | Mild | High | False | No
Sunny | Cool | Normal | False | Yes
Rainy | Mild | Normal | False | Yes
Sunny | Mild | Normal | True | Yes
Overcast | Mild | High | True | Yes
Overcast | Hot | Normal | False | Yes
Rainy | Mild | High | True | No
Where Outlook, Temperature, Humidity, and Windy are the input variables (predictors), and Play is the output variable (response or outcome). Please compute the conditional probability P(Temperature = 'Mild' | PLAY = 'No').
Question 64
Given a transaction database for mining association rules as follows:
TID | Items
100 | A C D
200 | B C E
300 | A B C E
For a minimum support count of 2 and the item-set {B, C, E}, how many valid rules can be generated from the item-set {B, C, E}?
a. 2
b. 12
c. 3
d. 6
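The candidate rules from a single item-set can be listed by trying every split into a non-empty antecedent and a non-empty consequent. The sketch below only enumerates the candidates and their confidences; the question states no minimum confidence, so which candidates count as valid is left open here. The names `support_count` and `candidate_rules` are hypothetical.

```python
from itertools import combinations

# Transaction database from the question.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the item-set."""
    return sum(1 for t in transactions if itemset <= t)


def candidate_rules(itemset):
    """List (antecedent, consequent, confidence) for every split of the item-set.

    Assumes the item-set occurs in at least one transaction, so every
    antecedent has a non-zero support count.
    """
    full = frozenset(itemset)
    items = sorted(full)
    rules = []
    for k in range(1, len(items)):  # antecedent sizes 1 .. len - 1
        for antecedent in combinations(items, k):
            lhs = frozenset(antecedent)
            rhs = full - lhs
            # confidence = support(antecedent and consequent) / support(antecedent)
            confidence = support_count(full) / support_count(lhs)
            rules.append((tuple(sorted(lhs)), tuple(sorted(rhs)), confidence))
    return rules


rules = candidate_rules({"B", "C", "E"})
confidence_of = {(lhs, rhs): conf for lhs, rhs, conf in rules}
```

An item-set with k items yields 2**k - 2 candidate rules, since the empty set and the full set are excluded as antecedents.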
Question 65
Which of the following statement(s) is(are) correct?
a. Outliers are the values that lie far away from the bulk of the data.
b. An outlier is a value that is more than 3 standard deviations away from the mean.
c. An outlier is an invalid data point.
d. Both a. and b.
Question 66
One of the ways to handle missing values in preprocessing of data mining is
a. to drop the columns with missing values.
b. to replace the missing values with imputed value.
c. Both a. and b.
d. Neither a. nor b.
Question 67
For classification and regression trees (CART), which of the following ways can be used to avoid overfitting?
a. Setting rules to stop tree growth.
b. Pruning the full-grown tree back to a level where it does not overfit.
c. Both a. and b.
d. Neither a. nor b.