Assignment
1. A local retailer has a database that stores 10,000 transactions from last summer. After analyzing the data, a data science team has identified the following statistics:
• {battery} appears in 6,000 transactions.
• {sunscreen} appears in 5,000 transactions.
• {sandals} appears in 4,000 transactions.
• {bowls} appears in 2,000 transactions.
• {battery, sunscreen} appears in 1,500 transactions.
• {battery, sandals} appears in 1,000 transactions.
• {battery, bowls} appears in 250 transactions.
• {battery, sunscreen, sandals} appears in 600 transactions.
Answer the following questions:
a. What are the support values of the preceding itemsets?
b. Assuming the minimum support is 0.05, which itemsets are considered frequent?
c. What are the confidence values of {battery} -> {sunscreen} and {battery, sunscreen} -> {sandals}? Which of the two rules is more interesting?
d. List all the candidate rules that can be formed from the statistics. Which rules are considered interesting at a minimum confidence of 0.25? Out of these interesting rules, which rule is considered the most useful (that is, least coincidental)?
2. Describe how logistic regression can be used as a classifier.
3. In a decision tree, how does the algorithm pick the attributes for splitting?
4. A data science team is working on a classification problem in which the dataset contains many correlated variables, and most of them are continuous. The team wants the model to output the probabilities in addition to the class labels. Which classifier should the team consider using? Why?
5. Fit an appropriate ARIMA model on the following datasets included in R. Provide supporting evidence on why the fitted model was selected, and forecast the time series for 12 time periods ahead.
a. faithful: Waiting times (in minutes) between Old Faithful geyser eruptions
b. JohnsonJohnson: Quarterly earnings per J&J share
c. sunspot.month: Monthly sunspot activity from 1749 to 1997
6. Choose a topic of your interest, such as a movie, a celebrity, or any buzzword. Then collect 100 tweets related to this topic. Hand-tag them as positive, neutral, or negative. Next, split them into 80 tweets as the training set and the remaining 20 as the testing set. Run one or more classifiers over these tweets to perform sentiment analysis. What are the precision and recall of these classifiers? Which classifier performs better than the others?
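For reference, the metrics named in questions 1 and 6 (support, confidence, precision, and recall) can be sketched in a few lines of Python. This is only a minimal illustration using the standard definitions. The function names and the toy milk/bread counts and sentiment labels are hypothetical and are not taken from the assignment data. Per-class precision and recall are computed here one-vs-rest, which is one common convention for a three-class problem.

```python
from __future__ import annotations


def support(itemset_count: int, total_transactions: int) -> float:
    """Fraction of all transactions that contain the itemset."""
    return itemset_count / total_transactions


def confidence(count_x_and_y: int, count_x: int) -> float:
    """Confidence of the rule X -> Y: count(X and Y) / count(X)."""
    return count_x_and_y / count_x


def precision_recall(
    actual: list[str], predicted: list[str], label: str
) -> tuple[float, float]:
    """Precision and recall for one class label, treated one-vs-rest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # true positives
    fp = sum(a != label and p == label for a, p in pairs)  # false positives
    fn = sum(a == label and p != label for a, p in pairs)  # false negatives
    # Return 0.0 when a ratio is undefined (no predictions or no true cases).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy counts, unrelated to the assignment's statistics.
TOTAL = 1_000            # all transactions
COUNT_MILK = 400         # transactions containing {milk}
COUNT_MILK_BREAD = 100   # transactions containing {milk, bread}

milk_support = support(COUNT_MILK, TOTAL)
rule_confidence = confidence(COUNT_MILK_BREAD, COUNT_MILK)  # {milk} -> {bread}

# Hypothetical hand-tagged labels versus classifier output.
actual = ["positive", "negative", "neutral", "positive", "negative"]
predicted = ["positive", "positive", "neutral", "negative", "negative"]
pos_precision, pos_recall = precision_recall(actual, predicted, "positive")

print(f"support({{milk}}) = {milk_support}")
print(f"confidence({{milk}} -> {{bread}}) = {rule_confidence}")
print(f"positive class: precision = {pos_precision}, recall = {pos_recall}")
```

The same helper functions can be applied to the counts in question 1 and to the 20 test tweets in question 6. Repeating precision_recall for each of the three sentiment labels gives a per-class comparison between classifiers.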
Format your assignment according to the following formatting requirements:
1. The answer should be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
2. The response should also include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
3. Also include a reference page. Citations and references should follow APA format. The reference page is not included in the required page length.