In this programming assignment, you are asked to write Java code to perform sentiment analysis: classifying movie reviews as positive or negative.
You need to do the following tasks:
1. You will build a Naive Bayes model with Laplace (add-1) smoothing based on the given training dataset. The training dataset is in the folder "trainingdata", which contains two subfolders, "pos" and "neg".
Your classifier will use words as features. You will also explore the effect of stop-word filtering, which means removing common words such as "the", "a", and "it" from your training and testing sets. A stop list is provided in the file "stop.txt".
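The stop-word filtering step described above can be sketched as follows. This is a minimal illustration and not a required design: the class name `StopWordFilter`, the tokenization rule (lowercase, split on anything that is not a letter or apostrophe), and the assumption that "stop.txt" holds one word per line are all assumptions, since the assignment does not specify them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Assumption: the stop list contains one word per line.
    static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Assumption: tokens are lowercase runs of letters and apostrophes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keep only the tokens that are not in the stop list.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // An in-memory set stands in for the contents of "stop.txt" here.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "it"));
        List<String> tokens = tokenize("It was a great film, the best of the year.");
        System.out.println(removeStopWords(tokens, stopWords));
        // prints [was, great, film, best, of, year]
    }
}
```

Running the same pipeline with an empty stop-word set gives the unfiltered baseline, which makes it easy to compare results with and without filtering.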
2. You need to compute the prior probability of each class and the probability of each vocabulary word given each class. In particular, you need to print the following in your output:
(a) Vocabulary size of this training dataset
(b) Prior probabilities for each class (positive and negative)
(c) P(bad | positive) and P(bad | negative)
(d) P(great | positive) and P(great | negative)
(e) P(performance | positive) and P(performance | negative)
(f) P(life | positive) and P(life | negative)
(g) P(supposed | positive) and P(supposed | negative)
You should print one line for each item. Make sure you label each value with its class (positive or negative) and align the columns.
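One way to compute the quantities in task 2 is sketched below, using a tiny in-memory corpus in place of the "pos" and "neg" folders. The class and method names are hypothetical. The sketch assumes the usual add-1 estimate, P(w | c) = (count(w, c) + 1) / (total tokens in c + vocabulary size), and takes each prior as the fraction of training documents in that class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NaiveBayesCounts {
    // Word counts per class, filled in one document at a time.
    final Map<String, Integer> posCounts = new HashMap<>();
    final Map<String, Integer> negCounts = new HashMap<>();
    int posDocs, negDocs, posTokens, negTokens;

    void addDocument(List<String> tokens, boolean positive) {
        Map<String, Integer> counts = positive ? posCounts : negCounts;
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        if (positive) {
            posDocs++;
            posTokens += tokens.size();
        } else {
            negDocs++;
            negTokens += tokens.size();
        }
    }

    // Vocabulary = distinct words seen in either class.
    int vocabularySize() {
        Set<String> vocab = new HashSet<>(posCounts.keySet());
        vocab.addAll(negCounts.keySet());
        return vocab.size();
    }

    // Prior = fraction of training documents in the class.
    double prior(boolean positive) {
        return (double) (positive ? posDocs : negDocs) / (posDocs + negDocs);
    }

    // Add-1 (Laplace) smoothed estimate of P(word | class).
    double likelihood(String word, boolean positive) {
        Map<String, Integer> counts = positive ? posCounts : negCounts;
        int classTokens = positive ? posTokens : negTokens;
        return (counts.getOrDefault(word, 0) + 1.0) / (classTokens + vocabularySize());
    }

    public static void main(String[] args) {
        // Toy documents, not the real training data.
        NaiveBayesCounts nb = new NaiveBayesCounts();
        nb.addDocument(Arrays.asList("great", "performance", "great"), true);
        nb.addDocument(Arrays.asList("bad", "performance"), false);

        System.out.printf("%-26s %d%n", "Vocabulary size", nb.vocabularySize());
        System.out.printf("%-26s %.6f%n", "P(positive)", nb.prior(true));
        System.out.printf("%-26s %.6f%n", "P(negative)", nb.prior(false));
        for (String w : Arrays.asList("bad", "great", "performance")) {
            System.out.printf("%-26s %.6f%n", "P(" + w + " | positive)", nb.likelihood(w, true));
            System.out.printf("%-26s %.6f%n", "P(" + w + " | negative)", nb.likelihood(w, false));
        }
    }
}
```

The left-justified `%-26s` field is one simple way to keep the labels and values aligned. The sketch prints each class on its own line; adjust the format string if both classes of an item must share a line.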
3. You will use the Naive Bayes model trained above to classify ten test documents (stored in the folder "testdata"). Specifically, for each test document, you need to print out:
(a) document label
(b) probability for positive class (log base 10)
(c) probability for negative class (log base 10), and
(d) assignment of class (negative or positive)
The four items for each document should be printed on one line, giving ten lines in total. Make sure you align them well.
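The scoring in task 3 can be sketched as below. Summing base-10 logarithms avoids the underflow that multiplying many small probabilities would cause. Several details are assumptions rather than requirements: the class name `LogScoreClassifier`, the file name "doc1.txt", scoring words unseen in training with the add-1 formula and a count of 0 (ignoring such words is another common choice), and breaking ties in favor of the positive class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LogScoreClassifier {
    // log10 P(class) + sum over tokens of log10 P(token | class),
    // with add-1 smoothing applied to each token probability.
    static double logScore(List<String> tokens, double prior,
                           Map<String, Integer> classCounts,
                           int classTokens, int vocabSize) {
        double score = Math.log10(prior);
        for (String t : tokens) {
            double p = (classCounts.getOrDefault(t, 0) + 1.0) / (classTokens + vocabSize);
            score += Math.log10(p);
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy counts standing in for a model trained on the real data.
        Map<String, Integer> pos = new HashMap<>();
        pos.put("great", 2);
        pos.put("performance", 1);
        Map<String, Integer> neg = new HashMap<>();
        neg.put("bad", 1);
        neg.put("performance", 1);
        int vocabSize = 3;   // great, performance, bad
        int posTokens = 3;   // total tokens in the positive class
        int negTokens = 2;   // total tokens in the negative class

        // Tokens of one hypothetical test document.
        List<String> doc = Arrays.asList("great", "performance");
        double posScore = logScore(doc, 0.5, pos, posTokens, vocabSize);
        double negScore = logScore(doc, 0.5, neg, negTokens, vocabSize);

        // Assumption: ties go to the positive class.
        String label = posScore >= negScore ? "positive" : "negative";
        System.out.printf("%-12s %12.4f %12.4f  %s%n", "doc1.txt", posScore, negScore, label);
    }
}
```

For this toy document the positive score is log10(0.5 × 3/6 × 2/6) ≈ -1.0792 and the negative score is log10(0.5 × 1/5 × 2/5) ≈ -1.3979, so the document is assigned to the positive class.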
Training Data: https://www.dropbox.com/s/h1lwrxfc853sn7b/Training%20Data.zip?dl=0