1. Objective: Named Entity Recognition
In this project, you will use scikit-learn and Python 3 to engineer an effective classifier for an important Information Extraction task called Named Entity Recognition.
2. Getting Started
Named Entity Recognition. The goal of Named Entity Recognition (NER) is to locate segments of text in an input document and classify them into one of the pre-defined categories (e.g., PERSON, LOCATION, and ORGANIZATION). In this project, you only need to perform NER for a single category, TITLE. We define a TITLE as an appellation associated with a person by virtue of occupation, office, birth, or as an honorific. For example, in the following sentence, both Prime Minister and MP are TITLEs.
Prime Minister Malcolm Turnbull MP visited UNSW yesterday.
Formulating NER as a Classification Problem. We can cast the TITLE NER problem as a binary classification problem as follows: for each token w in the input document, we construct a feature vector x and classify it into one of two classes, TITLE or O (meaning other), using the logistic regression classifier.
The goal of the project is to achieve the best classification accuracy by configuring your classifier and performing feature engineering.
In this project, you must use the LogisticRegression classifier in scikit-learn version 0.17.1 in your implementation.
3. Your Tasks
Training and Testing Dataset. We will publish a training dataset on the project page to help you build the classifier. The training dataset contains a set of sentences, where each sentence is already split into tokens. For each token, we also provide its part-of-speech (POS) tag and the correct named entity class (i.e., "TITLE" or "O"). The Python data structure for the example sentence above is:
[[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'), ('MP', 'NNP', 'TITLE'),
('visited', 'VBD', 'O'), ('UNSW', 'NNP', 'O'),
('yesterday', 'NN', 'O'), ('.', '.', 'O')]]
Therefore, the training dataset is a list of sentences. Each sentence is a list of 3-tuples, each of which contains the token itself, its POS tag, and its named entity class.
In Python, you can use the following code to load the training dataset (where training_data is the path of the provided training dataset file).
import pickle
with open(training_data, 'rb') as f:
    training_set = pickle.load(f)
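As a rough illustration of how the loaded structure can be traversed, the sketch below separates the (token, POS) inputs from the gold classes. It uses the example sentence from this section in place of the real file, and the helper name split_sentences is hypothetical.

```python
# The example sentence from this section, in the training-set format.
training_set = [[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
                 ('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'),
                 ('MP', 'NNP', 'TITLE'), ('visited', 'VBD', 'O'),
                 ('UNSW', 'NNP', 'O'), ('yesterday', 'NN', 'O'),
                 ('.', '.', 'O')]]


def split_sentences(dataset):
    """Separate (token, POS) inputs from gold classes, sentence by sentence."""
    inputs, labels = [], []
    for sentence in dataset:
        inputs.append([(token, pos) for token, pos, _ in sentence])
        labels.append([label for _, _, label in sentence])
    return inputs, labels


inputs, labels = split_sentences(training_set)

# Collect every token whose gold class is TITLE.
title_tokens = [token
                for sent_in, sent_lab in zip(inputs, labels)
                for (token, _), label in zip(sent_in, sent_lab)
                if label == 'TITLE']
print(title_tokens)  # ['Prime', 'Minister', 'MP']
```

Note that the inputs produced this way have the same shape as the test dataset, which makes it easy to reuse one feature extractor for both.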
The final test dataset (which will not be provided to you) will have the same form as the training dataset, except that each sentence will be a list of 2-tuples (only the token itself and its POS tag). For example, if the testing dataset contains only the sentence "Prime Minister Malcolm Turnbull MP visited UNSW yesterday.", then it will be formatted as:
[[('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP'),
('Turnbull', 'NNP'), ('MP', 'NNP'), ('visited', 'VBD'), ('UNSW', 'NNP'),
('yesterday', 'NN'), ('.', '.')]]
Feature Engineering. In order to build the classifier, you first need to extract features for each token. In this project, using the token itself as a feature usually achieves high accuracy on the training dataset; however, it can result in low accuracy on the test dataset due to overfitting. For example, it is not uncommon for the test dataset to contain titles that do not appear in the training dataset. Therefore, we encourage you to find or engineer meaningful, strong features and build a more robust classifier.
You will need to describe all the features that you have used in this project, and justify your choice in your report.
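To make the idea of per-token features concrete, here is a minimal sketch of a feature extractor. The function name token_features and every feature in it (capitalization, suffix, neighbouring POS tags) are illustrative assumptions, not a prescribed or recommended feature set.

```python
def token_features(sentence, i):
    """Return a feature dict for token i of a sentence.

    `sentence` is a list of tuples whose first two items are (token, POS),
    so both training 3-tuples and test 2-tuples are accepted.
    All features here are illustrative assumptions.
    """
    token, pos = sentence[i][:2]
    # Use padding symbols at the sentence boundaries.
    prev_tok, prev_pos = sentence[i - 1][:2] if i > 0 else ('<S>', '<S>')
    if i + 1 < len(sentence):
        next_tok, next_pos = sentence[i + 1][:2]
    else:
        next_tok, next_pos = '</S>', '</S>'
    return {
        'lower': token.lower(),               # the token itself (may overfit)
        'pos': pos,
        'is_capitalized': token[:1].isupper(),
        'is_all_caps': token.isupper(),
        'suffix3': token[-3:].lower(),
        'prev_lower': prev_tok.lower(),
        'prev_pos': prev_pos,
        'next_pos': next_pos,
        'next_is_capitalized': next_tok[:1].isupper(),
    }


example = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP'),
           ('Turnbull', 'NNP'), ('MP', 'NNP'), ('visited', 'VBD')]
features = token_features(example, 4)  # features for the token 'MP'
```

Features that describe the shape and context of a token, rather than its identity, are one way to generalize to titles that never appear in the training data.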
Build the logistic regression classifier. In this project, you need to use the logistic regression classifier (i.e., sklearn.linear_model.LogisticRegression) from the scikit-learn package. For more details, you can refer to the scikit-learn documentation and the relevant course materials.
You also need to dump the trained classifier using the following code (where classifier is the trained classifier and classifier_path is the path of the output file):
with open(classifier_path, 'wb') as f:
    pickle.dump(classifier, f)
You are also required to submit the dumped classifier (which must be named classifier.dat).
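One possible way to wire these pieces together is sketched below, assuming scikit-learn is installed. Wrapping a DictVectorizer and LogisticRegression in a Pipeline, so that a single pickled object carries both, is a design assumption of this sketch rather than a stated requirement; the simple_features helper and the one-sentence training set are purely illustrative.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_sentence = [('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
                  ('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'),
                  ('MP', 'NNP', 'TITLE'), ('visited', 'VBD', 'O'),
                  ('UNSW', 'NNP', 'O'), ('yesterday', 'NN', 'O'),
                  ('.', '.', 'O')]


def simple_features(token, pos):
    """Hypothetical, deliberately tiny feature set for illustration."""
    return {'pos': pos,
            'is_capitalized': token[:1].isupper(),
            'is_all_caps': token.isupper()}


X = [simple_features(token, pos) for token, pos, _ in train_sentence]
y = [label for _, _, label in train_sentence]

# DictVectorizer turns feature dicts into a sparse matrix;
# LogisticRegression is the classifier required by the project.
classifier = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('logreg', LogisticRegression(C=1.0)),
])
classifier.fit(X, y)

# Round-trip through pickle, as the dumped classifier would be.
restored = pickle.loads(pickle.dumps(classifier))
predictions = list(restored.predict(X))
```

The value C=1.0 is only a placeholder; the regularization strength is one of the hyper-parameters worth tuning.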
Suggested Steps. You may want to start by implementing a basic, initial NER classifier, and then improve its performance in multiple ways. For example, you can search for the best hyper-parameter settings for your model and features, and you can design and test different sets of features. We recommend that you use cross-validation and construct your own testing datasets.
You need to describe how you have improved the performance of your classifier in your report.
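As a small sketch of the cross-validation suggestion, the helper below splits a dataset into k folds at the sentence level, so that tokens from one sentence never end up on both sides of a split. The function name k_fold_splits and the defaults k=5 and seed=0 are assumptions made for illustration.

```python
import random


def k_fold_splits(sentences, k=5, seed=0):
    """Yield (train, held_out) pairs, splitting at the sentence level."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for fold in folds:
        held = set(fold)
        train = [sentences[i] for i in indices if i not in held]
        held_out = [sentences[i] for i in fold]
        yield train, held_out


# Usage with ten placeholder "sentences".
dummy_sentences = ['sentence %d' % i for i in range(10)]
splits = list(k_fold_splits(dummy_sentences, k=5))
```

Each held-out fold can then be scored separately, and the scores averaged to compare feature sets or hyper-parameter settings.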
Trainer. You need to submit a Python file named trainer.py. It receives two command line arguments, which specify the path to the training dataset and the path to the file where your trained classifier will be dumped. Your program will be executed as below:
python trainer.py <path_to_training_data> <path_to_classifier>
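The command-line handling for trainer.py could look roughly like the skeleton below. It is only a sketch: the function names are hypothetical, and the actual training step is passed in as a callable (train) so the skeleton stays independent of any particular feature set.

```python
import pickle
import sys


def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


def dump_pickle(obj, path):
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def main(argv, train):
    """argv is [script, training_data_path, classifier_path];
    train is a callable mapping the training set to a fitted classifier."""
    if len(argv) != 3:
        sys.exit('usage: python trainer.py TRAINING_DATA CLASSIFIER_OUT')
    training_set = load_pickle(argv[1])
    classifier = train(training_set)
    dump_pickle(classifier, argv[2])


# In a real trainer.py, the file would end with something like:
#     if __name__ == '__main__':
#         main(sys.argv, train_classifier)  # train_classifier: your own function
```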
Tester. You need to submit a Python file named tester.py. It receives three command line arguments, which specify the path to the testing dataset, the path to the dumped classifier, and the path to the file where your output results will be dumped. Your program will be executed as below:
python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>
For each token in the testing dataset, you should output its named entity class (i.e., TITLE or O). Internally, your output is a Python list of sentences, where each sentence is a list of (TOKEN, CLASS) tuples. For example, a possible output for the test dataset example above is:
[[('Prime', 'TITLE'), ('Minister', 'TITLE'), ('Malcolm', 'O'),
('Turnbull', 'O'), ('MP', 'O'), ('visited', 'O'), ('UNSW', 'O'),
('yesterday', 'O'), ('.', 'O')]]
Then you should dump it to a file using the following code (where result is your result list and path_to_results is the path of the dumped file):
with open(path_to_results, 'wb') as f:
    pickle.dump(result, f)
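Since a classifier typically returns one flat sequence of predictions, the result list has to be regrouped by sentence before it is dumped. The sketch below shows one way to do that; the helper name build_result is hypothetical, and it assumes exactly one prediction per token, in document order.

```python
def build_result(test_sentences, flat_predictions):
    """Regroup one prediction per token into per-sentence (token, class) lists."""
    flat_predictions = list(flat_predictions)
    n_tokens = sum(len(sentence) for sentence in test_sentences)
    if n_tokens != len(flat_predictions):
        raise ValueError('expected %d predictions, got %d'
                         % (n_tokens, len(flat_predictions)))
    predictions = iter(flat_predictions)
    return [[(token, next(predictions)) for token, _pos in sentence]
            for sentence in test_sentences]


test_sentences = [[('Prime', 'NNP'), ('Minister', 'NNP')], [('.', '.')]]
result = build_result(test_sentences, ['TITLE', 'TITLE', 'O'])
print(result)  # [[('Prime', 'TITLE'), ('Minister', 'TITLE')], [('.', 'O')]]
```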
Report. You need to submit a report (named report.pdf) which answers the following two questions:
- What features do you use in your classifier? Why are they important and what information do you expect them to capture?
- How do you experiment with and improve your classifier?
4. Execution
Your program will be tested automatically on a CSE Linux machine as follows:
- the following command will be used to build the classifier:
python trainer.py <path_to_training_data> <path_to_classifier>
where
  - path_to_training_data indicates the path to the training dataset
  - path_to_classifier indicates the path to the dumped classifier
- the following command will be used to test the classifier:
python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>
where
  - path_to_testing_data indicates the path to the testing dataset
  - path_to_classifier indicates the path to the dumped classifier
  - path_to_results indicates the path to the dumped result
Your program will be executed using Python 3 with only the following packages available:
- nltk 3.2.1
- numpy 1.11.1
- scipy 0.18.0
- scikit-learn 0.17.1
- pandas 0.18.1
5. Evaluation
We will use the F1-measure to evaluate the performance of your classifier. We will test your classifier using two test datasets:
- the first one is sampled from the same domain as the given training dataset, and
- the second one is sampled from a domain different from that of the given training dataset.
Each test will contribute 40 points to your final score. Your report contributes the remaining 20 points.
In order to minimize the effect of randomness, we will execute trainer.py three times, and use the best performance achieved.
For each test dataset, your best-performing classifier will be compared with a reference classifier C. The detailed marking scheme will be published later on the project website; in essence, it will reward you with more points if your classifier performs no worse than the reference classifier.
Neither the reference classifier C nor the two test datasets will be given to you. But you can create your own test dataset, submit it and get the performance of the reference classifier C on it. For more details, please refer to Section 6.
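For local experiments, the F1-measure can be approximated with a few lines of code. The sketch below computes it at the token level for the TITLE class; this is an assumption, since the exact definition used for marking (for example, token level versus whole-entity level) is not stated here. The function name f1_for_title is hypothetical.

```python
def f1_for_title(gold, predicted, positive='TITLE'):
    """Token-level F1 for the positive class.

    gold and predicted are lists of sentences, each a list of class labels.
    """
    tp = fp = fn = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for g, p in zip(gold_sent, pred_sent):
            if p == positive and g == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif g == positive:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)


# Gold classes of the example sentence, and a prediction that misses 'MP'.
gold = [['TITLE', 'TITLE', 'O', 'O', 'TITLE', 'O', 'O', 'O', 'O']]
predicted = [['TITLE', 'TITLE', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
score = f1_for_title(gold, predicted)
# precision is 2/2 and recall is 2/3, so F1 is 0.8
print(round(score, 3))
```

Because most tokens are O, F1 on the TITLE class is a more informative measure than plain accuracy, which a classifier could inflate by predicting O everywhere.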
6. Customized Datasets
You are encouraged to construct your own test datasets to test your classifier and help improve its performance. You can submit your datasets to obtain the F1-measure of the pre-trained reference classifier C on them. Furthermore, all submitted datasets are accessible to every student in the class. You should benefit from this sharing, as it will give you many initial ideas about meaningful features to use in your own classifier. You can upload your test cases through the data submission website (the URL will be sent via email). You will need to log in before uploading your datasets or downloading other datasets. Your ID is your student number (e.g., z1234567), and your password will be sent to you by email.
Once you have logged into the system, you can:
- submit your dataset
- view the performance of all the datasets
- download a chosen dataset
Your submitted test datasets should be in the same format as the training dataset. You can submit at most 10 datasets to the system. We will later release, on the project website, more details about how to use the system, along with some tools and tips to help you visualize and tag TITLE occurrences.
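Before uploading a custom dataset, it may be worth checking that it matches the training format described in Section 3. The checker below is a minimal sketch of such a check; the function name format_problems is hypothetical, and the submission system may apply additional rules that are not covered here.

```python
VALID_CLASSES = {'TITLE', 'O'}


def format_problems(dataset):
    """Return a list of problems found; an empty list means the format looks fine."""
    if not isinstance(dataset, list):
        return ['dataset is not a list of sentences']
    problems = []
    for s, sentence in enumerate(dataset):
        if not isinstance(sentence, list):
            problems.append('sentence %d is not a list' % s)
            continue
        for t, item in enumerate(sentence):
            if not (isinstance(item, tuple) and len(item) == 3):
                problems.append('sentence %d, token %d is not a 3-tuple' % (s, t))
            elif item[2] not in VALID_CLASSES:
                problems.append('sentence %d, token %d has class %r'
                                % (s, t, item[2]))
    return problems


good = [[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE')]]
bad = [[('Prime', 'NNP'), ('Minister', 'NNP', 'PERSON')]]
print(format_problems(good))  # []
print(len(format_problems(bad)))  # 2
```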