In the sample, almost 40% of the e-mail messages were tagged as spam


Problem

Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the purpose of finding a classifier that can separate e-mail messages that are spam vs. non-spam (a.k.a. "ham"). The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial e-mail." The file Spambase.csv contains information on 4601 e-mail messages, among which 1813 are tagged "spam." The predictors include 57 attributes, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the e-mail. A few predictors are related to the number and length of capitalized words.

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and non-spam e-mails by comparing the spam-class average and non-spam-class average. Which are the 11 predictors that appear to vary the most between spam and non-spam e-mails? From these 11, which words or signs occur more often in spam?

b. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.

c. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, lift chart, and decile chart for the validation set for the evaluation.

d. In the sample, almost 40% of the e-mail messages were tagged as spam. However, suppose that the actual proportion of spam messages in these e-mail accounts is 10%. Compute the constants of the classification functions to account for this information.

e. A spam filter that is based on your model is used, so that only messages that are classified as non-spam are delivered, while messages that are classified as spam are quarantined. In this case, misclassifying a non-spam e-mail (as spam) has much heftier consequences. Suppose that the cost of quarantining a non-spam e-mail is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).
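
For parts d and e, one common approach is to shift each class's classification-function constant by the natural log of (prior probability × misclassification cost) for that class. The sketch below computes only those shifts, since the fitted constants themselves depend on your training partition. The function name `constant_offsets` and the class labels are illustrative, not part of the problem. It assumes the fitted constants do not already contain a log-prior term; if your software has already folded the sample proportions into the constants, subtract the log of those proportions first.

```python
import math


def constant_offsets(priors, costs=None):
    """Amount to add to each class's classification-function constant.

    priors: dict mapping class label -> assumed prior probability
    costs:  dict mapping class label -> cost of misclassifying a member
            of that class (defaults to equal costs of 1)
    Assumption: the fitted constants contain no prior or cost term yet.
    """
    if costs is None:
        costs = {label: 1.0 for label in priors}
    return {label: math.log(priors[label] * costs[label]) for label in priors}


# Part d: assumed true proportion of spam is 10%, equal costs.
offsets_d = constant_offsets({"spam": 0.10, "nonspam": 0.90})

# Part e: priors taken from the sample (1813 spam out of 4601 messages);
# quarantining a non-spam message costs 20 times a missed spam message.
n_total, n_spam = 4601, 1813
sample_priors = {
    "spam": n_spam / n_total,
    "nonspam": (n_total - n_spam) / n_total,
}
offsets_e = constant_offsets(sample_priors, costs={"spam": 1.0, "nonspam": 20.0})

for part, offsets in (("d", offsets_d), ("e", offsets_e)):
    for label, shift in offsets.items():
        print(f"part {part}: add {shift:+.4f} to the {label} constant")
```

Adding each printed shift to the corresponding fitted constant gives the adjusted classification functions. In both parts the non-spam function gains relative to the spam function, so fewer messages are classified as spam.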
