Problem
Classifying Classified Ads Submitted Online. Consider the case of a website that caters to the needs of a specific farming community, and carries classified ads intended for that community. Anyone, including robots, can post an ad via a web interface, and the site owners have problems with ads that are fraudulent, spam, or simply not relevant to the community. They have provided a file with 4143 ads, each ad in a row, and each ad labeled as either -1 (not relevant) or 1 (relevant). The goal is to develop a predictive model that can classify ads automatically.
• Open the file farm-ads.csv, and briefly review some of the relevant and non-relevant ads to get a flavor for their contents.
• Following the example in the chapter, preprocess the data in R, and create a term-document matrix and a concept-document matrix. Limit the number of concepts to 20.
a. Examine the term-document matrix. i. Is it sparse or dense? ii. Find two nonzero entries and briefly interpret their meaning in words (you do not need to show how they were calculated).
b. Briefly explain the difference between the term-document matrix and the concept-document matrix. Relate the latter to what you learned in the principal components chapter (Chapter 4).
c. Partition the data (60% training, 40% validation) and, using logistic regression, develop a model to classify the documents as 'relevant' or 'non-relevant.' Comment on its efficacy.
d. Why use the concept-document matrix, and not the term-document matrix, to provide the predictor variables?
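To make the two matrices in parts (a) and (b) concrete, here is a minimal base-R sketch on a toy corpus. It is only an illustration under stated assumptions: the four ads are invented (they are not taken from farm-ads.csv), the entries are raw term counts rather than the weighting used in the chapter's example, and a plain singular value decomposition with 2 concepts stands in for the chapter's concept extraction with 20.

```r
# Toy corpus: hypothetical ads, NOT taken from farm-ads.csv
docs <- c(
  ad1 = "tractor for sale used tractor",
  ad2 = "cheap pills buy pills online",
  ad3 = "hay for sale fresh hay",
  ad4 = "buy cheap watches online"
)

# Tokenize on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-document matrix of raw counts: rows = terms, columns = documents
tdm <- vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
)
rownames(tdm) <- vocab

# Share of zero cells: a high value means the matrix is sparse
sparsity <- mean(tdm == 0)

# A nonzero entry is a count: "tractor" appears twice in ad1
tractor_in_ad1 <- tdm["tractor", "ad1"]

# Concept-document matrix: keep the first k singular directions of the
# term-document matrix (k = 2 here is an assumption for the toy data)
k <- 2
s <- svd(tdm)
concepts <- diag(s$d[1:k]) %*% t(s$v[, 1:k])
colnames(concepts) <- colnames(tdm)

print(tdm)
print(sparsity)
print(concepts)
```

Each column of concepts describes a document with only k numbers instead of one number per term, which is the same kind of dimension reduction that principal components perform on numeric predictors.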