Assignment
You are given the reviews dataset. It contains 194,439 Amazon reviews of cell phones and accessories.
I. Extract the reviewText and overall fields from this file. These are the only two fields we will work with.
II. Take the first 10,000 review texts. Apply only two pre-processing steps: lowercasing and punctuation removal. Compute the IDF of every word in these reviews. Report the twenty words with the highest IDF and the twenty words with the lowest IDF, along with their IDF scores.
III. Take the first 10 review texts. Perform sentence detection using spaCy. Each output line should contain the review ID (i.e., the line number from the file) and the sentence itself.
IV. Take the first 10 review texts. Perform word tokenization, lemmatization, and part-of-speech tagging using spaCy. Each output line should contain the review ID (i.e., the line number from the file), the token (i.e., the word), its lemma, and its POS tag.
V. Take the first 1000 review texts. Using gensim, create an LDA model with 10 topics. Report the top fifty words, with their probabilities, for each of the ten topics. Each output line should contain the topic number, the word, and the word's probability in that topic.
VI. Use the entire dataset. Take the first 80% of the dataset for training and the remaining 20% for testing. On the training set, obtain TF-IDF features (with a 50K vocabulary) and learn a multinomial Naïve Bayes model. Evaluate the model on the test set for this five-class classification problem, reporting class-wise precision, recall, and F1.
VII. Take the first 1000 reviews with a rating of 1.0. Summarize them to 1% of their length (in terms of words) using gensim and submit your summary. Also take the first 1000 reviews with a rating of 5.0. Summarize them to approximately 300 words using gensim and submit your summary.
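For step I, a minimal sketch of the field extraction. It assumes the dataset file holds one JSON object per line, which the assignment does not state, and it treats the review ID as the 1-based line number. The function name extract_fields and the sample records are hypothetical and are not taken from the dataset.

```python
import json
from typing import Iterable, Iterator, Tuple


def extract_fields(lines: Iterable[str]) -> Iterator[Tuple[int, str, float]]:
    """Yield (review_id, reviewText, overall) for each non-empty JSON line.

    review_id is the 1-based line number (an assumption about how IDs are counted).
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        record = json.loads(line)
        # A missing reviewText falls back to an empty string.
        yield line_no, record.get("reviewText", ""), float(record["overall"])


# Made-up records standing in for the real file.
sample = [
    '{"reviewerID": "A1", "reviewText": "Great case, fits well.", "overall": 5.0}',
    '{"reviewerID": "A2", "reviewText": "Stopped charging after a week.", "overall": 1.0}',
]
rows = list(extract_fields(sample))
```

With the real file, the same function can be fed an open file handle, since iterating over a text file yields its lines one at a time.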
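For step II, a minimal sketch of the IDF computation using only the standard library. The assignment does not fix an IDF formula, so this assumes the plain variant idf(w) = ln(N / df(w)), where N is the number of reviews and df(w) is the number of reviews containing w. Libraries such as scikit-learn use a smoothed variant by default, so their scores will differ. Punctuation removal here covers ASCII punctuation only. The names preprocess and compute_idf and the three sample texts are hypothetical.

```python
import math
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def compute_idf(texts):
    """Return {word: ln(N / df)} over the given review texts."""
    n_docs = len(texts)
    doc_freq = Counter()
    for text in texts:
        # set() so that a word counts once per review, however often it repeats
        doc_freq.update(set(preprocess(text)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Made-up texts standing in for the first 10,000 reviews.
docs = ["The case is great!", "Great phone, great price.", "The charger broke."]
idf = compute_idf(docs)

# Sort by descending IDF, breaking ties alphabetically for a stable report.
ranked = sorted(idf.items(), key=lambda kv: (-kv[1], kv[0]))
top_twenty = ranked[:20]
bottom_twenty = ranked[-20:]
```

Many words will share the same IDF (every word that appears in exactly one review, for example), so the tie-breaking rule determines which twenty are reported at the top. It is worth stating that rule alongside the results.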
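For step VI, a sketch of the ordered 80/20 split and of what class-wise precision, recall, and F1 mean, written by hand so that the definitions are visible. It is not the required pipeline: in practice scikit-learn's TfidfVectorizer (with max_features=50000), MultinomialNB, and classification_report cover the feature extraction, the model, and the report. The helper names and the toy labels below are hypothetical. A class with no predictions or no true instances is scored 0.0 here, which is a convention and not something the assignment specifies.

```python
def split_first_80(items):
    """Return (first 80%, remaining 20%) without shuffling."""
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]


def classwise_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report


# Toy example: ten items split in order, then made-up true and predicted ratings.
train, test = split_first_80(list(range(10)))
y_true = [5.0, 5.0, 1.0, 3.0, 5.0, 1.0]
y_pred = [5.0, 5.0, 5.0, 3.0, 1.0, 1.0]
report = classwise_prf(y_true, y_pred, labels=[1.0, 2.0, 3.0, 4.0, 5.0])
```

Because the split takes the first 80% in file order, the class balance of the training and test portions may differ. That is another reason to read the per-class figures and not a single overall accuracy number.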