Assignment 1:
1. Using heritage data (release 1) in SQL
a. Find support for all single itemsets
b. List all itemsets with 2 elements and support of at least 0.2
c. List all itemsets with 3 elements and support at least 0.2
2. In Weka
a. Load heritage data (release 1)
b. Apply at least two association rule generation algorithms and compare results
c. Apply the FP-tree algorithm (FPGrowth in Weka) with at least two rule metrics
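The quantities asked for above (itemset support, plus rule metrics such as confidence and lift) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are made-up toy data, not the heritage data, and the function names are hypothetical. The SQL and Weka work is still expected to be done in those tools.

```python
from itertools import combinations

# Hypothetical toy transactions (NOT the heritage data): each set holds the
# items (for example, condition codes) recorded for one member.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, size, min_support):
    """All itemsets of the given size whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(items, size):
        s = support(candidate, transactions)
        if s >= min_support:
            result[candidate] = s
    return result

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(A -> C) / support(C); values near 1 suggest independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(frequent_itemsets(transactions, 2, 0.2))
print(confidence({"A"}, {"B"}, transactions), lift({"A"}, {"B"}, transactions))
```

The brute-force enumeration of candidates is fine for a handful of items; Apriori and FP-growth exist precisely because it does not scale.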
Assignment 2:
1. In SQL/Weka:
a. Prepare heritage data for classification learning
b. Load heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))
c. Perform exploratory analysis
d. Create at least three classification models for predicting hospitalization based on Year 1 data.
e. Which model performs best on Year 2 data?
f. Create a regression model for predicting hospitalization days.
g. What is the difference between regression and classification models?
h. Present your results in the form of a short report that includes screenshots, tables, and any needed description.
Assignment 3:
Classification Part 2
1. Using the heritage release 3 data prepared in the last assignment
a. Include drug information into data
b. Include laboratory information into data
c. Import newly created data into Weka and run classification algorithms
d. Does inclusion of the information improve predictions?
There are many ways to complete question 4, so you need to make different decisions.
Try not to overcomplicate the problem.
2. In Weka using heritage 3 dataset
a. Apply the k-means algorithm for k = 2, 3, 5, 10
b. Apply EM algorithm. What is the optimal number of clusters obtained by EM?
c. Compare the created clusters to classification based on hospitalization in year 2.
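One possible way to compare clusters with the year 2 hospitalization classes is a contingency table plus a purity score. The sketch below assumes the cluster assignments and class labels have been exported as two parallel lists; the example values are invented and the function names are hypothetical. Weka's Cluster panel also offers a "Classes to clusters evaluation" mode that reports a similar cross-tabulation.

```python
from collections import Counter

def contingency(cluster_ids, labels):
    """Count how many instances fall into each (cluster, label) pair."""
    return Counter(zip(cluster_ids, labels))

def purity(cluster_ids, labels):
    """Share of instances that carry the majority label of their cluster."""
    table = contingency(cluster_ids, labels)
    best = {}
    for (cluster, _label), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(labels)

# Hypothetical example: cluster ids and year 2 hospitalization flags (1 = yes).
clusters = [0, 0, 0, 1, 1, 1]
hospitalized = [0, 0, 1, 1, 1, 0]

print(contingency(clusters, hospitalized))
print(purity(clusters, hospitalized))
```

Purity rises trivially as the number of clusters grows, so it is best read alongside the table itself when comparing k = 2, 3, 5, 10.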
Assignment 4:
3. Using the data table shown below.
a. Calculate distance between all points in 1-norm, 2-norm and infinity-norm. Show dissimilarity matrix.
b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.
c. Apply k-means clustering algorithm with k=2.
ID | Age | BMI | Gender | Total Cholesterol
1 | 30 | 24 | M | 180
2 | 70 | 19 | M | 190
3 | 65 | 26 | M | 220
4 | 40 | 32 | F | 260
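As a rough sketch of the three tasks above, the following Python computes dissimilarity matrices under the 1-norm, 2-norm and infinity-norm, rescales the columns, and runs a plain k-means. Several choices here are assumptions rather than part of the assignment: Gender is encoded as M=0, F=1, the ID column is dropped, min-max scaling is used as the preprocessing step, and k-means is seeded with the first k points. All function names are hypothetical.

```python
# Rows of the table above: (ID, Age, BMI, Gender, Total Cholesterol)
rows = [
    (1, 30, 24, "M", 180),
    (2, 70, 19, "M", 190),
    (3, 65, 26, "M", 220),
    (4, 40, 32, "F", 260),
]

def encode(row):
    """Drop the ID and map Gender to 0/1 (assumed encoding: M=0, F=1)."""
    _id, age, bmi, gender, chol = row
    return [float(age), float(bmi), 0.0 if gender == "M" else 1.0, float(chol)]

def min_max_scale(points):
    """Rescale every column to [0, 1] so no attribute dominates the distance."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(p, lo, hi)]
            for p in points]

def minkowski(p, q, order):
    """1-norm (order=1), 2-norm (order=2) or infinity-norm (order=float('inf'))."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if order == float("inf"):
        return max(diffs)
    return sum(d ** order for d in diffs) ** (1.0 / order)

def dissimilarity_matrix(points, order):
    """Square matrix of pairwise distances under the chosen norm."""
    return [[minkowski(p, q, order) for q in points] for p in points]

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm, seeded with the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: minkowski(p, centroids[c], 2))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

raw = [encode(r) for r in rows]
scaled = min_max_scale(raw)
for order in (1, 2, float("inf")):
    print(order, dissimilarity_matrix(raw, order))
print(kmeans(scaled, 2))
```

On the raw values, Total Cholesterol and Age dominate every distance simply because of their ranges, which is the usual argument for rescaling before clustering. Different seeds or a different Gender encoding can change the resulting clusters.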
Assignment 5:
Text Mining
1. Write regular expressions to:
a. Detect zip codes in text
b. Find last names of all patients whose first name is John (note that regular expressions may have some false positives/false negatives).
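For orientation, here is one possible pair of patterns using Python's `re` module. Both rest on assumptions: zip codes are taken to be US ZIP codes (five digits with an optional four-digit extension), and names are assumed to appear as "John Lastname" with an optional middle initial. The sample text is invented.

```python
import re

# US ZIP code: five digits, optionally followed by a hyphen and four digits.
# Word boundaries avoid matching inside longer digit runs, but any standalone
# five-digit number (for example a record ID) is still a false positive.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# Last name after the first name "John": assumes a capitalized last name that
# may contain hyphens or apostrophes, with an optional middle initial.
JOHN_RE = re.compile(r"\bJohn\s+(?:[A-Z]\.\s+)?([A-Z][a-zA-Z'-]+)")

# Invented sample text for illustration only.
text = "John Smith (Fairfax, VA 22030) and John A. O'Neil, 20052-0001, saw Dr. Johnson."

print(ZIP_RE.findall(text))
print(JOHN_RE.findall(text))
```

The name pattern misses "Smith, John" style entries (false negatives) and would pick up "Hopkins" from "John Hopkins Hospital" (a false positive), which is the kind of trade-off the note in 1b refers to.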
2. List the challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also include your own observations and comments.
3. Using the SMS data
a. Split data into training (80%) and testing (20%) sets
b. Build naïve Bayes classifier for detecting spam based on bag of words
i. List all words in the documents
ii. Count occurrences in spam and ham
iii. Assign likelihoods P(word|spam) and P(word|ham) for all words
iv. Convert test data into a list of words. For each message you need two columns: message id and word
v. Classify test data. This can be done by a series of joins with the data prepared in (iii).
vi. Evaluate your model (accuracy, precision, recall)
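Steps (a) and (i) through (vi) can be mirrored in a short Python sketch. It is not the join-based solution the assignment describes, only an illustration of the same arithmetic. Assumptions: the messages are invented toy data rather than the SMS data set, tokenization is a simple lowercase word split, add-one (Laplace) smoothing is applied so unseen word/class pairs do not produce zero likelihoods, and all function names are hypothetical.

```python
import math
import random
import re
from collections import Counter

# Hypothetical toy messages (NOT the SMS data set): (label, text) pairs.
messages = [
    ("spam", "Win a free prize now"),
    ("spam", "Free cash, call now"),
    ("spam", "Claim your free prize today"),
    ("spam", "Urgent: free entry, text WIN now"),
    ("spam", "Cash prize waiting, call now"),
    ("ham", "Are we still on for lunch today"),
    ("ham", "See you at home tonight"),
    ("ham", "Call me when you get home"),
    ("ham", "Lunch at noon works for me"),
    ("ham", "I will be home late tonight"),
]

def split(data, train_fraction=0.8, seed=0):
    """(a) Shuffle reproducibly, then cut into training and testing sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def tokenize(text):
    """(i, iv) Lowercased bag of words; this tokenization rule is an assumption."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(data):
    """(ii) Count word occurrences per class, plus messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, text in data:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def log_likelihood(word, label, model):
    """(iii) log P(word | label) with add-one smoothing (an added assumption)."""
    word_counts, _class_counts, vocab = model
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))

def classify(text, model):
    """(v) Pick the class with the larger log prior plus summed log likelihoods."""
    _word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Assumes both classes occur in the training data (otherwise log(0)).
        score = math.log(class_counts[label] / n)
        for word in tokenize(text):
            if word in vocab:  # words unseen in training are ignored
                score += log_likelihood(word, label, model)
        scores[label] = score
    return max(scores, key=scores.get)

def evaluate(data, model):
    """(vi) Accuracy, precision and recall, treating spam as the positive class."""
    pairs = [(label, classify(text, model)) for label, text in data]
    tp = sum(1 for actual, pred in pairs if actual == "spam" and pred == "spam")
    fp = sum(1 for actual, pred in pairs if actual == "ham" and pred == "spam")
    fn = sum(1 for actual, pred in pairs if actual == "spam" and pred == "ham")
    accuracy = sum(1 for actual, pred in pairs if actual == pred) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

train_set, test_set = split(messages)
model = train(train_set)
print(evaluate(test_set, model))
```

Summing log likelihoods instead of multiplying raw probabilities avoids numeric underflow on longer messages; in a join-based version the same effect comes from summing a log column grouped by message id.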
Attachment: Assignment 1.rar