Assignment
Review Questions
1.1
1. What are the major challenges to large corporations in terms of information management for supporting decision making?
2. What are the main limitations of conventional information systems, i.e. the DBMS technology, in terms of information queries?
3. Name three types of information processes needed by large corporations, and give a brief explanation of each.
4. As a common user of a DB system, such as Dal online or RBC online banking, which type of information process do you deal with & why?
5. For the president of Dalhousie Univ. or the dean of FCS, what type of information are they interested in getting?
6. As a store manager of Wal-Mart or Superstore, what type of information do you need to know all the time?
1.2
1. Use examples to explain the differences between the terms Data, Information and Knowledge. How does each term/concept link to business information queries according to the three types of information processes?
2. Why does the IT industry need to develop DM and DW, considering that RDBMS/SQL are already available for storing and querying information?
3. Why, and in what way, is the DW model more advanced than the RDB model in supporting business management queries?
1.3
1. What are the simple rules for choosing solution tools for getting different types of business information?
2. What are the two general purposes of DM (or any scientific research)?
3. How are DM technologies categorized?
4. Can you name three major DM tasks & what is each task about?
1.4
1. What is the Empirical Cycle Model (ECM) of scientific research? Describe each stage of the process.
2. How are ML & DM associated with the ECM?
3. Why is it said that "a discovered knowledge only has temporary value"?
4. Why does discovered knowledge need to be corroborated by statistics?
5. What is the main difference and relationship between IR and DM?
6. What are the main differences between statistical analysis and DM?
2.1
1. Give three reasons why data need to be processed for DW and DM tasks.
2. What good properties should "quality data" have before conducting DM?
3. What are the typical tasks for DP (provide a brief description for each)?
3.1
1. Why is it said that businesses are the drivers of DW and OLAP technologies?
2. For decision makers of a large enterprise to see the big picture of the organization and its business, what are the two general approaches for integrating information from different databases? Describe each.
3. What are the general DW properties that distinguish the DW model from the Relational DB model, and how do they fit the main objectives of DW?
4. What is the "Multi-Dimensional Space" model for DW/OLAP technology, and how is it described by the data cube lattice (know how to draw it)?
5. What are "Cuboids" in an MDS model? How do you calculate the total number of cuboids in a DW? How do you estimate the complexity of a DW design?
6. How does a DW model differ from a conventional DB model, and why?
7. What are the 3 typical DW schemas (describe each)?
3.2
1. What is the "Multi-Dimensional Space" model for DW/OLAP technology, and how is it described by the data cube lattice (know how to draw it)?
2. What are "Cuboids" in an MDS model? How do you calculate the total number of cuboids in a DW? How do you estimate the complexity of a DW design?
3. How does a DW model differ from a conventional DB model, and why?
4. What are the 3 typical DW schemas (describe each)?
5. What are the main differences between OLAP and OLTP process operations for answering users' queries?
3.3
1. What is the difference between logic dimensional space for OLAP analysis and the physical space for 3D computer graphics?
2. How are OLAP queries defined? (It may be answered based on the Starnet query model.)
3. What is the visualization metaphor for displaying the OLAP results of multiple logic spaces (data cubes), and how can it best be used?
4. Use the demo example of the "Wealth vs. Health" analysis (YouTube video: "Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four") to explain how DW/OLAP may be able to support DM applications.
4.1
1. Both DW and DM technologies are about deriving new information from the stored operational data. What are the differences between them (describe at least two aspects)?
2. Name three major general DM tasks. What general property do they share, and how do they differ from each other?
3. Why is it said that association pattern mining is a backbone technology for CRM recommender systems?
4. What is frequent pattern analysis? How is a rule that may be derived from a frequent itemset defined (provide two examples)?
5. What are the usefulness measures for association rules? Define each measure.
4.2
1. How do you estimate the search space given n unique items?
2. What is the Apriori knowledge/property, and why is it significant for the algorithm?
3. How is the Apriori property applied for pruning the search space? Trace the Apriori algorithm to identify the places where the Apriori knowledge is applied, and how?
4.3
1. Given an input dataset and the Apriori algorithm, how do you trace the algorithm for intermediate results?
2. How do you derive strong rules from the given frequent itemsets L and a conf_rate?
3. How can the efficiency of the rule generation procedure be improved by applying the Apriori property?
4. What are the two general purposes of DM? Use some examples of mined association patterns to explain each purpose.
5. How can the association mining process be mapped to the empirical cycle model of scientific research?
5.1
1. Why is classification mining a supervised learning process? What about association mining?
2. What are the major phases of conducting a classification mining application?
3. Can you describe a mapping between a classification application process and the empirical cycle?
4. What is the general idea/strategy/approach of DT induction for classification mining?
5.2
1. What is the general strategy of Inductive Learning (via observing examples)?
2. What are the major technical issues of the DT induction approach for classification mining?
3. What is the heuristic function used in the ID3 algorithm for evaluating search directions?
4. What is the notion of Information Gain, and how is it applied in the ID3 algorithm?
5. How do you convert the ID3 algorithm into an implementation code structure?
5.3
1. Briefly describe how to quantify information contained in a message.
2. How can the concept from question 1 be applied to a classification method, such as the ID3 algorithm?
3. Define the concepts: Entropy and Information Gain.
4. How do you use information gain to choose an attribute?
5. What is ID3's inductive bias? Explain why.
6. Describe a technical solution to reduce the impact of the problem in question 5.
5.4
1. How does a classification task done by DT induction differ from one done by a Naïve Bayes classifier? (*Give 3 differences.)
2. What are the two assumptions for using the NB classifier?
3. Why is the Naïve Bayes algorithm more suitable for high-dimensional data?
4. What is text classification? What is the basic idea for converting unstructured text data into structured data for classification?
5. How do you estimate the number of prior probabilities that need to be calculated for text classification?
6. Why is the Naïve Bayes algorithm more suitable for text data mining? What is its limitation?
5.5
1. What are the main differences between topic oriented and sentiment oriented text classification?
2. Based on what principle can all classification methods be divided into two general categories (name each)? Provide two example classification algorithms to illustrate each category.
3. Comment on the ANN classification approach in terms of its principle and trade-offs.
6.1
1. What are the main differences between Classification and Clustering DM (list 3 from different perspectives)?
2. Provide two application examples of clustering DM, and explain how the DM result may be used for supporting business decision making.
3. What are the general criteria for judging the quality of clustering DM results?
4. What data attribute types can be directly applied for clustering mining? How do you prepare your data with various attribute types for clustering?
5. What are the basic data structures for clustering mining?
6. How do you calculate the dissimilarity of object pairs for a dataset with mixed attribute types for clustering?
6.2
1. What are the basic data structures for clustering mining?
2. How do you calculate the dissimilarity of object pairs for a dataset with mixed attribute types for clustering?
3. What are the main categories of clustering mining methods?
4. How do you trace the K-means algorithm on a given dataset?
5. Explain why clustering DM is said to discover concepts hidden in large datasets.
6.3
1. What are outliers? How should they be handled in applications?
2. What are the main differences between the two partition based methods: k-means and k-medoids?
3. What are the main strength and limitation of the k-medoids algorithm?
4. Use a small dataset to trace each algorithm.
5. What are the differences between partition based clustering and hierarchical clustering?
6. How are the mined clusters stored in a dendrogram data structure in a hierarchical clustering result?
7. What kind of constraints may be applied to cluster analysis?
6.4
1. What is the difference in computational complexity between the "K-means" and "K-medoids" algorithms? Point out the key steps of each algorithm that determine this difference.
2. What is the principle for measuring similarity between two documents for text clustering?
3. What is the term weight, and how do you calculate the tf-idf measure?
4. How is the vector space formed, and how is the doc-term matrix generated?
6.5
1. Why is Web mining more challenging than DM on non-Web-based data (provide three main reasons)?
2. What are the three major categories of web data types?
3. Briefly explain a possible Web mining task for each major Web mining category.