Assignment
1. What are the three characteristics of Big Data, and what are the main considerations in processing Big Data?
2. Explain the differences between BI and Data Science.
3. Briefly describe each of the four classifications of Big Data structure types (i.e., structured to unstructured).
4. List and briefly describe each of the phases in the Data Analytics Lifecycle.
5. In which phase would the team expect to invest most of the project time? Why? Where would the team expect to spend the least time?
6. Which R command would create a scatterplot for the dataframe "df", assuming df contains values for x and y?
7. What is a rug plot used for in a density plot?
8. What is a type I error? What is a type II error? Is one always more serious than the other? Why?
9. Why do we consider K-means clustering an unsupervised machine learning algorithm?
10. Detail the four steps in the K-means clustering algorithm.
11. List three popular use cases of Association Rules mining algorithms.
12. With regard to Reasons to Choose and Cautions, what are four decision questions that most practitioners must consider?
13. Define Support and Confidence.
14. How do you use a "hold-out" dataset to evaluate the effectiveness of the rules generated?
15. List two use cases of linear regression models.
16. Compare and contrast linear and logistic regression methods.
17. Association rules are commonly used for mining transactions in databases. What are some of the possible questions that association rules can answer?
18. With regard to Diagnostics, list the five approaches to improving Apriori's efficiency.
19. What are some specific applications of k-means? Give a brief description of each application.
20. What are two common examples of object attributes of potential customers that can be used in analysis?
21. Apriori is one of the earliest and most fundamental algorithms for generating association rules. What is the most truthful statement about Apriori?
Apriori is the borders of the resulting clusters now that fall between two different association rules.
It uses non-frequent itemsets within association rules that is also known as market basket analysis.
It pioneered the use of support for pruning the itemsets and controlling the exponential growth of candidate itemsets.
It allows association rules to capture data that is frequently brought together by interval testing.
22. K-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects' proximity to the center of the k groups.
True
False
23. The Apriori algorithm takes a bottom-up iterative approach to uncovering the frequent itemsets by first determining all the possible items and then by identifying which among them are frequent.
True
False
24. Upon gathering output rules in validation and testing, the first approach to validate the results can be established by measures such as visualization, display itemsets, and threshold targeting.
True
False
25. Within the preceding algorithm, k clusters can be identified in a given dataset, but what value of k should be selected?
The value of k is selected by confidence intervals, which provide clusters in the most accurate way.
The value of k can be chosen based on a reasonable guess or some predefined requirement.
The value of k cannot be chosen until the object attributes are provided in the k-means analysis.
None of the above.
Format your assignment according to the following formatting requirements:
1. Answers should be typed and double-spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
2. Include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
3. Include a reference page. Citations and references should follow APA format. The reference page is not included in the required page length.