Assignment Specification and Deliverables:
You are given a dataset containing a survey of the distribution of household income. The survey collects a mix of continuous and discrete values on the source and amount of income, labour force information, and general demographic characteristics. The full dataset is provided in two Excel files, income_data and income_test. Each file contains one table: the income_data table (32561 rows) provides the training data, and the income_test table (16281 rows) provides data that can be used for testing.
You are asked to complete the following task:
Predict the income of individuals in the income_test table. The prediction task is to determine whether a person makes over 50K a year.
You need to complete the task in Modeler, using the techniques discussed in the lectures.
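For illustration only (the task itself must be completed in Modeler), the prediction above is a two-class problem, and accuracy can be read as the fraction of records whose predicted class matches the actual class. The following is a minimal Python sketch of that calculation. The label strings ">50K" and "<=50K" and the function name are assumptions made for the example, not values confirmed by the data files.

```python
from typing import Sequence

# Assumed label strings; check the actual values used in the income field.
HIGH = ">50K"
LOW = "<=50K"


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of records whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("no records to score")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


if __name__ == "__main__":
    # Four hypothetical records: three predicted correctly, one not.
    actual = [HIGH, LOW, LOW, HIGH]
    predicted = [HIGH, LOW, HIGH, HIGH]
    print(accuracy(actual, predicted))
```

The same figure can be compared against the accuracy that Modeler reports for a model, as a sanity check on a predictions table.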
The deliverables that you should produce are:
• A report describing the approach that you have followed, the study scenarios or streams that you have attempted, the pre-processing of the data and the best predictions that you have achieved.
• The predictions of the income achieved (stored as a table), the streams that you have created (stored as .str files) and the models that you have generated (stored as .gen files). All deliverables should be included in a zip file and submitted through Blackboard.
Below is an indication of the parts that your report should include, together with an indication of the overall weighting attached to each part.
1. A Cover page gives the title of your report, your name, student number and degree programme.
2. An Introduction Section of at most two pages introduces the problem, the approach followed and the analysis, and outlines the contents of each section of the report (overall weighting 5%)
3. The Main Section of at most ten pages develops the approach that you have followed, the study scenarios or streams that you have attempted, the pre-processing of the data and the best predictions that you have achieved (overall weighting 65% of which 15% will be allocated to the accuracy of the predictions)
4. A Conclusion Section of at most two pages summarises the main points of the report and draws your overall conclusions. Assume that a bank had requested this survey in order to offer low interest loans to the households with income over 50K. What would be your overall recommendations to the bank? Justify your answer based on your data mining experiments and results (overall weighting 20%)
5. Use of sources, presentation and language and referencing, if needed (overall weighting 10%)
Advice on the presentation of the report:
- Your report should not be longer than 15 A4 pages, including title page and all sections (except references and bibliography)
- Text – 12-point Times New Roman, justified within margins
- Margins – approximately 2cm
- Use italics only when you wish to emphasise a statement or citation
- Use underscores sparingly and only if they improve the readability of the text
- If you use references, use the reference style and format as in a published book or article
- Make good use of bullets and numbering to highlight the important points
- Use figures and examples to illustrate your work. Figures may be taken from Modeler or (if appropriate) drawn using applications such as MS PowerPoint or Excel and then inserted into the Word document.
Dataset Description
A description of the attributes in the dataset is given below.
| Attribute | Type | Description |
| --- | --- | --- |
| age | continuous | The age of the household income earner |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked | Employment status of the household income earner |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool | Information about the highest level of school completed or degree received from the household income earner |
| education-num | continuous | Number of years in education of the household income earner |
| marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse | Marital status of the household income earner |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces | The occupation of the household income earner |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried | The relationship of the interviewee with the income earner |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black | The race of the household income earner |
| Sex | Female, Male | The sex of the household income earner |
| capital-gain | continuous | The increase of the income from the previous year when the last survey was carried out |
| capital-loss | continuous | The decrease of the income from the previous year when the last survey was carried out |
| hours-per-week | continuous | The number of hours worked on average each week by the household income earner |
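As a rough illustration of how the attribute description above could be used during pre-processing, the sketch below encodes part of it as a Python schema and flags fields whose values fall outside it (for example blanks or unexpected codes). It is not part of the Modeler workflow. The names `CONTINUOUS`, `CATEGORICAL` and `invalid_fields` are hypothetical, only some of the categorical attributes are included to keep the example short, and the sample record is invented.

```python
from typing import Any, Dict, List, Set

# Attributes described as "continuous" in the table.
CONTINUOUS: Set[str] = {
    "age", "education-num", "capital-gain", "capital-loss", "hours-per-week",
}

# Allowed values for a subset of the categorical attributes in the table.
CATEGORICAL: Dict[str, Set[str]] = {
    "workclass": {
        "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
        "Local-gov", "State-gov", "Without-pay", "Never-worked",
    },
    "marital-status": {
        "Married-civ-spouse", "Divorced", "Never-married", "Separated",
        "Widowed", "Married-spouse-absent", "Married-AF-spouse",
    },
    "relationship": {
        "Wife", "Own-child", "Husband", "Not-in-family",
        "Other-relative", "Unmarried",
    },
    "race": {
        "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black",
    },
    "Sex": {"Female", "Male"},
}


def invalid_fields(record: Dict[str, Any]) -> List[str]:
    """Return the sorted names of fields that are missing or outside the schema."""
    bad = []
    for name in CONTINUOUS:
        value = record.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            bad.append(name)
    for name, allowed in CATEGORICAL.items():
        if record.get(name) not in allowed:
            bad.append(name)
    return sorted(bad)


if __name__ == "__main__":
    # Invented record used only to exercise the check.
    record = {
        "age": 39, "education-num": 13, "capital-gain": 0,
        "capital-loss": 0, "hours-per-week": 40,
        "workclass": "State-gov", "marital-status": "Never-married",
        "relationship": "Not-in-family", "race": "White", "Sex": "Male",
    }
    print(invalid_fields(record))
```

A check of this kind only reports which fields need attention; how to treat them (discard, impute, or recode) remains a pre-processing decision to justify in the report.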