Data Mining Project Assignment
Choice One: Data Analysis
In this project, you are asked to identify a dataset suitable for data mining, perform data mining tasks on it (such as classification, association analysis, or clustering), and report your results and observations.
The following steps are suggested for completing the project and the report.
Step 1. Identify suitable datasets and application
To identify a suitable dataset, start with an application domain that interests you. Many publicly available datasets exist, such as NBA performance data, climate data, intrusion detection benchmark data, manufacturing data, and public wish list data. The class notes contain a collection of websites (in the first self-learning slides), but you are encouraged to use a search engine to find datasets that interest you. Be prepared to spend substantial time preprocessing and exploring the datasets that you choose.
The data set needs to have at least 20 variables and at least 3,000 observations; OR at least 10 variables and at least 30,000 observations. You need to have both categorical and quantitative interval variables.
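The size rule above can be checked quickly before committing to a dataset. The following is a minimal sketch in Python; the function names and the "categorical"/"interval" labels are hypothetical and chosen only for illustration.

```python
def meets_size_requirement(n_variables: int, n_observations: int) -> bool:
    """Return True if the dataset satisfies either size rule from the assignment."""
    # Rule 1: at least 20 variables and at least 3,000 observations.
    wide_enough = n_variables >= 20 and n_observations >= 3_000
    # Rule 2: at least 10 variables and at least 30,000 observations.
    long_enough = n_variables >= 10 and n_observations >= 30_000
    return wide_enough or long_enough


def has_both_variable_types(column_types: dict[str, str]) -> bool:
    """Check that both variable types are present.

    column_types maps a column name to 'categorical' or 'interval'
    (hypothetical labels that you assign after inspecting the data).
    """
    kinds = set(column_types.values())
    return {"categorical", "interval"} <= kinds
```

For example, a dataset with 15 variables and 5,000 observations fails both rules, while one with 12 variables and 40,000 observations passes the second.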
What to turn in? The final report for this section should contain the following components:
a. An introduction of the application domain that you are interested in
b. A description of the dataset that you selected. It should include details such as the origin and size of the dataset, how the data is represented (e.g., graph, records, attributes), and statistics of the attribute values.
c. A description of any exploratory analysis that you performed on your dataset, and its results.
d. The raw and processed datasets (a brief summary and comparison).
e. A formal problem definition with what is given, what is the goal, and what are the constraints.
f. A plan for your data mining task.
Step 2. Perform data mining tasks on your dataset
In this phase, you will try the intended data mining tasks on your dataset. You can use SAS Enterprise Miner, write your own scripts/code, or combine the two to mine the data. Select one or more alternative methods to compare your method with. The alternative method(s) could be tree classification or logistic regression, trees with different maximum numbers of branches, clustering with different distance choices, etc. Compare the results of the different methods.
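One simple way to compare classification methods is to score each on the same held-out observations and rank them by accuracy. The sketch below assumes you already have each method's predictions as a list (exported from Enterprise Miner or produced by your own code); the function names and the toy labels are made up for illustration and are not results from any real dataset. Accuracy is only one possible measure, and you may prefer others such as misclassification cost or lift.

```python
def accuracy(actual, predicted):
    """Fraction of observations where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def compare_methods(actual, predictions_by_method):
    """Return (method name, accuracy) pairs sorted from best to worst."""
    scores = {
        name: accuracy(actual, predicted)
        for name, predicted in predictions_by_method.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy illustration with made-up labels (not real results).
held_out_labels = ["yes", "no", "yes", "yes"]
ranking = compare_methods(
    held_out_labels,
    {
        "tree": ["yes", "no", "no", "yes"],
        "logistic": ["yes", "yes", "no", "yes"],
    },
)
```

Using the same held-out observations for every method keeps the comparison fair, since each method is judged on identical data.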
What to turn in? A report describing
a. Your method, e.g. the algorithm, the workflow and any other tasks that you performed.
b. Your experimental results.
c. A comparison of the experimental results of your method and the alternative methods.
d. Possible explanations for the experiment results.
Step 3. Make the conclusion
Summarize what you did and what you learned from these data mining tasks. Describe any future work you think is worthwhile.
The final report should be no longer than 25 pages (single-sided, single-spaced, letter size 12), so please pick the most important information to include. Points will be deducted if the report is too long.
Choice Two: Free Data Mining Software Evaluation
In this project, you are asked to choose a free data mining software package and write a report that evaluates it or explains how to use it. Choices of free data mining software, and where to download them, can be found in the Topic 1 folder in D2L.
The following steps are suggested for completing the project and the report.
Step 1. Download the software
Include (but do not limit yourself to) the following in your report:
- Where to download the software?
- Are there any operating system requirements? For example, can it run on both Mac and Windows machines?
- How large is the package?
- In general, are downloading and installation straightforward? Does anything need attention during downloading and installation? If yes, provide step-by-step guidelines.
- The interface (the look) of the software after installation, and a general description of each component.
- Anything else you can think of that shows the characteristics of the software.
Step 2. Choose a data set and import it into the software
You can use any data set we used in this class for your project, including the IRIS data. Note that the IRIS data is so popular that it may already exist as one of the built-in data sets in the software.
Include (but do not limit yourself to) the following in your report:
- The requirement for the data format, or structure.
- Does the software support mining very large databases? What is the maximum amount of data the software can handle (maximum sample size, maximum number of observations, maximum number of variables, maximum number of levels/categories for a class variable, etc.)?
- Does it support multiple formats? If not, is it easy to convert other formats to the format the software requires?
- Briefly explain how to import a data set into the software. If any format conversion is needed, explain how to do that as well.
- How do you set up the modeling roles and measurement levels for variables?
Step 3. Data Exploration
Include (but do not limit yourself to) the following in your report:
- Does the software support graphs, maps, tables, rotation, etc.?
- What are the exploration tools available for interval variables?
- What are the exploration tools available for class variables?
- Illustrate some explorations based on your data (refer to homework 2, problem 2).
Step 4. Data Preparation
Include (but do not limit yourself to) the following in your report:
- How does the software identify missing, inconsistent, or incorrect data?
- How does the software fix the above problem?
- How does the software perform data conversion and transformations?
- How does the software assist with the sampling process?
- How does the software assist with selection of independent variables (before any modeling)?
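As a reference point when evaluating the software's data preparation features, the sketch below shows in plain Python what two of these operations amount to: counting missing values per variable and drawing a simple random sample. The set of missing-value markers and the function names are assumptions made for illustration; the software you evaluate may define missing values differently.

```python
import random

# Assumed markers for missing values; adjust to match your data.
MISSING_MARKERS = {"", "NA", "N/A", "?"}


def count_missing(records, markers=MISSING_MARKERS):
    """Count missing values per column in a list of dict records."""
    counts = {}
    for row in records:
        for column, value in row.items():
            counts.setdefault(column, 0)
            is_missing = value is None or (
                isinstance(value, str) and value.strip() in markers
            )
            if is_missing:
                counts[column] += 1
    return counts


def simple_random_sample(records, n, seed=0):
    """Draw n records without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, n)
```

Comparing the software's missing-value report against a count like this on a small file is one way to confirm which values it treats as missing.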
Step 5. Modeling
Run at least ONE model for the data, and include the following in your report if it is appropriate for the model you illustrate:
- Does the software support the major prediction (tree, regression, neural network, nearest neighbor, etc.) and description (clustering, principal components, etc.) approaches?
- How many data mining techniques are supported?
- A detailed illustration of how to run the model you choose: how to set up the parameters, how to run the model, how to get the results, and how the results and running time compare to Enterprise Miner.
- How is the scoring process (scoring means applying the model on a new data set) performed in the software?
If you built more than one model, also consider the following in your report:
- Does the software support model comparison? If yes, how?
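To make the idea of scoring in Step 5 concrete, the sketch below fits a model on training data and then applies it, unchanged, to new observations. A nearest-centroid classifier is used here only as a simple stand-in; it is not one of the methods required by the assignment, and the function names are hypothetical.

```python
import math


def fit_centroids(rows, labels):
    """Fit step: compute the mean vector (centroid) of each class from training data."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        totals = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def score(centroids, new_rows):
    """Scoring step: assign each new row the label of its nearest centroid.

    The fitted centroids are used as-is; nothing is re-estimated from new_rows.
    """
    return [
        min(centroids, key=lambda label: math.dist(row, centroids[label]))
        for row in new_rows
    ]
```

The key point is the separation of the two steps: the model is estimated once from the training data, and scoring only reads the fitted model to produce predictions for the new data set.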
Step 6. Conclusion and some additional evaluation
Overall, is the software easy to learn? Is it easy to use? How does it compare to Enterprise Miner in general? Would you recommend it to new data miners? ....
The final report should be no longer than 25 pages (single-sided, single-spaced, letter size 12).
About the Presentation
If you decide to present on SAS day, it will be a poster presentation. Please contact Dr. Priestley as soon as possible to reserve a slot and to get suggestions about the format of the poster.
Other teams should prepare slides for an in-class presentation about what they did.
We will discuss in class how long you have for the presentation. Every person should present part of the project. You will receive a separate grade for the presentation. Please refer to the syllabus for the grading rubrics.
Some suggestions for the slides
1. Include a clear outline/agenda/schedule.
2. Avoid using more than six lines of text, and minimize the number of words on each visual aid.
3. Simple is better; avoid unnecessary formatting. We are interested in your technical content, not your PowerPoint skills.
4. Put your company or university logo on your title slide only; this is a technical presentation to your peers, not a marketing pitch to a customer.
5. Use spell check.
6. Avoid flashy, Christmas-light-style color schemes and other distracting effects.
Some general suggestions for presentations:
• A fast presentation is one slide per minute. A more relaxed pace would be two minutes per slide.
• Practice the presentation. The grading rubrics in the syllabus describe what is expected of an outstanding presentation.
• Time your practice sessions to ensure you keep within your allotted time. Remember a team has a time limit and points will be deducted if the presentation is too long.
• Never read the slides verbatim.