Question 1:
LBS is an investment management firm managing about $600 million in assets, primarily in stocks and mutual funds, for both institutional and individual investors. It believes that conventional approaches to money management are having an increasingly difficult time meeting or exceeding benchmarks. Further, it believes that the new generation of data-mining techniques can capture significant non-linear causal relationships for use in forecasting when market and security price behavior is dominated by non-linearity.
LBS wants to maximize the return on the assets it invests for its clients while minimizing their risk exposure. For LBS, it is not enough just to know which securities to purchase. In order to be successful, the asset management firm must also know when to buy and sell the securities. The firm feels that it can do this through a combination of high-quality analytic tools, highly efficient computer engineering, and market-savvy analysts.
The problem of developing a system to estimate future prices is daunting because financial processes are generally characterized by high levels of non-linearity and complexity. The amount of data available to an analyst is overwhelming. Also, financial markets are constantly evolving, so models must adapt to these changes. Consequently:
• The system needs to be able to quickly incorporate knowledge about a domain that often defies explicit definition. On a day-to-day basis, random shocks, crowd psychology, and short-lived trends influence financial markets. Also, different experts have widely varying interpretations of the data even after the fact. Even expert traders sometimes have difficulty explaining what general principle led them to make a specific trade.
• The system needs to be able to deal with and analyze complex data. As a result of the interactions among several different market forces, financial markets can exhibit highly non-linear and highly complex behavior.
• The system needs to be able to deal with the large amounts of economic and financial data that are generated daily. It is difficult or impossible even for the most skilled expert to assimilate this amount of data accurately and consistently. In the words of one experienced trader, "Even the smartest of us is not as smart as the market. In order to make sense of the data, we have little choice but to turn to the computer."
• The system needs to be able to adapt quickly over time. A trading strategy that works in a bull market may not fare well in a bear market. Markets evolve and adapt to different forces over time.
The firm has determined that a meaningful horizon is about 4 weeks. It is an "active" manager that seeks to outperform the market, as opposed to a "passive" manager that indexes its portfolio with the market and seeks only to match the market's performance.
The system needs to be able to interpret and analyze large amounts of market data and "update its view of the world" frequently and easily, accessing economic and market data from a variety of sources and, using these data, identifying those stocks that are "likely" to be winners, and those that are more "likely" to be losers, over the next 4 weeks. LBS will use simulated trading systems to test the models.
Models will be tested (or validated) by back testing over several historical years to determine how they would have performed. Models that recommend buying stocks in volumes that were not obtainable or conducting so many trades that transaction fees wiped out profits would not be considered successful.
LBS's data is plentiful, although not necessarily clean. The system does not need to make specific point predictions for prices on a specific date but only to provide the decision maker with estimates of a security's upside and downside potential. On the other hand, since a decision maker (typically a portfolio manager) would be interpreting the results of a prediction, it would be useful if the model could offer some insight into its analysis. It is also important that the system fits smoothly into LBS's workflow and current modeling tools. To do this, the system must interface smoothly with the financial databases where the market data are stored.
Since LBS wants a 4-week time horizon, the system need not function in real-time. On the other hand, the system must be able to perform the analysis on each individual security in a reasonable amount of time. The system also must be able to be expanded to accommodate additional securities and input factors.
In addition, LBS would like to take up as little of the firm's expert traders' time as possible. Expert time is valuable; each hour away from market analysis or trading can cost real dollars. Furthermore, and more importantly, LBS has found that it can be difficult for its expert traders and analysts to articulate their expertise, especially since the rules are complex and continually evolving.
(a) Briefly describe the modeling problem facing LBS, and identify what type of problem it is in terms of the types of data mining problems discussed in session 1 (prediction, estimation, classification, clustering, association, etc.). Justify your answer.
(b) What data mining model type would you propose for this problem? Justify your answer.
(c) What are two significant limitations of your proposed approach for the given problem?
Question 2: Assume that you have to build an online recommendation system for buying cars. Cars have hundreds of specifications/features. Comment on whether Naïve Bayes, K Nearest Neighbors or Decision Trees would be the best approach for this type of system. Justify your answer.
Question 3: Assume that using scanner data on customer purchases combined with demographic and behavioral data on customers stored in the corporate data warehouse, you would like to build a predictive model that would help classify customers into one of a set of distinct profitability segments (e.g., high, medium and low). Further, assume that although your company operates across the whole Southern US, you would like to focus on customers spending at least $500 per month on average for the past 12 months, at any of 5 stores in Texas. Discuss whether K-means clustering would be useful to identify the relevant customer set. Justify your answer.
Question 4: Which of the following is a symptom of a decision tree that is "over-fitted"? In each case, briefly justify your answer.
(a) The error rate (misclassification) chart for the model is as in the graphs below (for the training and validation sets):
(b) The tree is unbalanced (i.e., some paths from the root to leaf nodes are long while others are short)
(c) The confusion (classification) matrices for both the training set and the validation set have large values in the off-diagonal cells. (Hint: In a confusion matrix C, c_ij indicates the number of cases whose actual output value r_i was classified as r_j by the tree.)
(d) The tree has a high overall misclassification rate for the training set but not for the validation set.
(e) A number of the leaf nodes have very low support.
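As a small illustration of the hint in (c), the overall misclassification rate can be read off a confusion matrix as the share of cases that fall in the off-diagonal cells. The sketch below uses made-up counts for a two-class problem; the matrices `train` and `valid` and the function name are hypothetical and are not taken from any model in this question.

```python
def misclassification_rate(confusion):
    """Share of cases in off-diagonal cells of a square confusion matrix.

    confusion[i][j] = number of cases with actual class i classified as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# Hypothetical counts, for illustration only.
train = [[48, 2],
         [3, 47]]
valid = [[35, 15],
         [18, 32]]

print("training error rate:  ", misclassification_rate(train))
print("validation error rate:", misclassification_rate(valid))
```

Comparing the two rates side by side is one way to examine how differently a tree behaves on the data it was grown from and on data it has not seen.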
Question 5: Given the following data on purchase transactions expressed as itemsets:
1 | Bread | Juice | Ketchup
2 | Milk | Juice | Apples
3 | Pepper | Apples | Juice | Wine
4 | Juice | Ketchup | Wine | Salt
5 | Apples | Detergent | Wine
6 | Juice | Ketchup | Wine | Apples
7 | Bread | Milk | Juice
8 | Detergent | Wine | Apples
9 | Salt | Wine
10 | Juice | Ketchup | Milk | Apples
11 | Bread | Apples | Wine
12 | Milk | Juice | Detergent | Ketchup
Each row is an itemset (i.e., a collection of items that were bought together).
(a) Identify all the large itemsets with minsup = 0.25 (i.e., 25%). For each large itemset, compute its support as a percentage (%).
(b) Using the results in (a), state one association rule that has a confidence above 80% and acceptable lift. Compute its confidence, support and lift.
(c) If the Apriori approach described in class were used to identify association rules for this data set, identify three itemsets whose support would not have to be computed by the rule mining process. Explain why they would not be considered.
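For reference, the quantities asked for in (a) to (c) can be sketched in a few lines of Python. This is a minimal sketch, not necessarily the exact procedure described in class: it assumes support is the fraction of transactions containing an itemset, confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). The function names are illustrative, and `frequent_itemsets` is a simple level-wise search with Apriori-style pruning.

```python
from itertools import combinations

# The twelve transactions from the table above.
transactions = [
    {"Bread", "Juice", "Ketchup"},
    {"Milk", "Juice", "Apples"},
    {"Pepper", "Apples", "Juice", "Wine"},
    {"Juice", "Ketchup", "Wine", "Salt"},
    {"Apples", "Detergent", "Wine"},
    {"Juice", "Ketchup", "Wine", "Apples"},
    {"Bread", "Milk", "Juice"},
    {"Detergent", "Wine", "Apples"},
    {"Salt", "Wine"},
    {"Juice", "Ketchup", "Milk", "Apples"},
    {"Bread", "Apples", "Wine"},
    {"Milk", "Juice", "Detergent", "Ketchup"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) divided by support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A -> B) divided by support(B)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


def frequent_itemsets(transactions, minsup):
    """Level-wise search: a size-k candidate is counted only if all of its
    size-(k-1) subsets were frequent (the Apriori pruning idea)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    candidates = [frozenset([item]) for item in items]
    k = 1
    while candidates:
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= minsup}
        frequent.update(level)
        k += 1
        survivors = sorted(set().union(*level)) if level else []
        candidates = [
            frozenset(c)
            for c in combinations(survivors, k)
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


if __name__ == "__main__":
    for itemset, s in sorted(frequent_itemsets(transactions, 0.25).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), f"{s:.1%}")
```

Any candidate that contains an infrequent subset is never passed to `support`, which is the behavior part (c) asks about.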
Question 6:
Consider the following dataset about customers of a particular product. The column "Buyer" indicates whether each customer bought the product or not. You have been asked to use Naïve Bayes Classification to identify potential buyers.
Name | Married | Job | Hair | Gender | Buyer
Peter | No | Manager | Short | Male | Yes
Claudia | Yes | Engineer | Long | Female | No
Angela | No | Lawyer | Long | Female | No
Amy | No | Manager | Long | Female | Yes
Albert | Yes | Engineer | Short | Male | Yes
Karin | No | Manager | Long | Female | No
Nina | Yes | Engineer | Short | Female | Yes
Sergio | Yes | Manager | Long | Male | Yes
Would the following person be a buyer or not (show your calculations)?
John | Yes | Engineer | Short | Male | ?
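The calculation can be checked with a short script. This is a minimal sketch assuming the basic form of Naïve Bayes, with plain relative frequencies and no smoothing: the score for each class is its prior multiplied by the conditional frequency of each of John's attribute values within that class. The function name and data layout are illustrative only.

```python
from collections import Counter

# Columns: Married, Job, Hair, Gender, Buyer (one row per customer in the table).
rows = [
    ("No", "Manager", "Short", "Male", "Yes"),      # Peter
    ("Yes", "Engineer", "Long", "Female", "No"),    # Claudia
    ("No", "Lawyer", "Long", "Female", "No"),       # Angela
    ("No", "Manager", "Long", "Female", "Yes"),     # Amy
    ("Yes", "Engineer", "Short", "Male", "Yes"),    # Albert
    ("No", "Manager", "Long", "Female", "No"),      # Karin
    ("Yes", "Engineer", "Short", "Female", "Yes"),  # Nina
    ("Yes", "Manager", "Long", "Male", "Yes"),      # Sergio
]


def naive_bayes_scores(rows, query):
    """Return P(class) * product of P(value | class) for each class.

    Uses unsmoothed relative frequencies, so an attribute value never
    seen with a class drives that class's score to zero.
    """
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            score *= matches / n_label  # conditional P(value | class)
        scores[label] = score
    return scores


john = ("Yes", "Engineer", "Short", "Male")
scores = naive_bayes_scores(rows, john)
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The scores are unnormalized; dividing each by their sum gives posterior probabilities, but the class with the larger score is the predicted class either way.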
Question 7: Assume that you have joined a company that sells disk drives for PCs. It has decided to enter the market for mobile phones starting next year. The CEO has heard that neural nets are powerful tools for building classification and prediction models, and has asked you to build a Neural Network model for classifying mobile phone products proposed by your R&D department into one of the following three market potential categories: Low, Medium, High. You have been given access to detailed data on the company's products and sales for ten of the last eleven years (current year sales have still to be compiled). How would you respond?
Question 8: Your boss has suggested that rather than using a single type of classification model, it might be useful to combine the strengths of different model types. She has therefore suggested that you first build a set of neural network models to identify the key determinants of buying behavior in each segment, and then use these significant variables to build a decision tree model that would provide the key thresholds of each variable that influence the important outcomes in future buying behavior. How would you respond?