You will be using the Real Estate data set that you used for last week's descriptive statistics assignment.

You will use the Real Estate data set to build a model that predicts what a house should sell for. A real estate agency will use this model to help clients understand what their house should sell for, so they can make an educated decision about the listing price. Secondarily, a home contractor will use the model: s/he would like to be able to tell clients how much an additional bathroom adds to the selling value.

Part 1 of the project involves the first three steps in the data mining process: sample, explore and modify. You will be preparing the data for model building, which will be done in Part 2 of the project.

You will need to make decisions regarding data that is in text form, missing data, potentially incorrect data, the inclusion of potential outliers, binning strategy and variable transformation. Please make sure your decisions are justified. Note that the specific requirements and relative weights are outlined in the rubric.

What is my first step?

You can't have text in the Excel file. The first step should be deciding which columns you want to delete (like the county, because all of the homes are in Chester County). YOU NEED A GOOD REASON FOR DELETING DATA.
Next, you have to decide how to code the text data. Is it continuous or categorical? If it is categorical, are you coding it as binary, nominal or ordinal?

You can get creative! There are different styles for homes. There is no "order" to traditional, colonial, farmhouse, etc., so you can NOT code it 1, 2, 3. Coding it as nominal would mean adding LOTS of new columns. Which style is most prevalent? Colonial, I think. You could code it as binary with 1=colonial and 0=all other styles.
Keep as much data as you can but don't make yourself nuts!
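
If you want to check your Excel coding another way, the two ideas above (dropping a column that never changes, and coding the most common style as 1 and everything else as 0) can be sketched in Python with pandas. This is only an illustration: the tiny table, the column names County and Style, and the new column name IsTopStyle are all made up for the example and are not taken from the real data set.

```python
import pandas as pd

# Hypothetical mini data set; the column names and values are assumptions.
homes = pd.DataFrame({
    "County": ["Chester", "Chester", "Chester", "Chester"],
    "Style": ["Colonial", "Farmhouse", "Colonial", "Traditional"],
})

# A column with only one distinct value cannot explain differences in price,
# which is a defensible reason to delete it.
constant_cols = [c for c in homes.columns if homes[c].nunique() == 1]
homes = homes.drop(columns=constant_cols)

# Find the most prevalent style, then code it 1 and all other styles 0.
top_style = homes["Style"].value_counts().idxmax()
homes["IsTopStyle"] = (homes["Style"] == top_style).astype(int)

print(constant_cols)                 # ['County']
print(top_style)                     # Colonial
print(homes["IsTopStyle"].tolist())  # [1, 0, 1, 0]
```

Counting the styles first, rather than guessing, also gives you a frequency you can cite when you justify the coding decision.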

What comes after coding all of the text data? Now you have to deal with missing values. Sometimes 0 was filled in when the information wasn't known. I can promise that square footage and taxes are not 0!
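
A minimal sketch of that idea in Python with pandas: treat 0 as "missing" only in columns where 0 is impossible, then count how many values are missing so you can decide what to do about them. The column names SqFt, Taxes and Garage and all of the numbers are invented for illustration, and the sketch deliberately stops at counting because how to handle the gaps is your decision to justify.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the column names are assumptions, not the real file.
homes = pd.DataFrame({
    "SqFt": [2100, 0, 1850, 3200],
    "Taxes": [5400, 6100, 0, 8800],
    "Garage": [2, 0, 1, 0],  # here 0 can be a real value (no garage)
})

# Only convert 0 to missing where 0 cannot be a true value.
zero_means_missing = ["SqFt", "Taxes"]
homes[zero_means_missing] = homes[zero_means_missing].replace(0, np.nan)

# Count the missing values per column before deciding how to handle them.
missing_counts = homes.isna().sum()
print(missing_counts)
```

Note that Garage is left alone: a 0 there may be real information, so blanking it out would throw away data.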

What should I bin? How should I do it?

Binning is taking a continuous variable (like age or acres or square footage) and turning it into a categorical (ordinal) variable (1=0-15, 2=16-30, 3=31-45, etc.).
How would I do that? I would probably add a column and then use Filters to do the coding. Sorting the data would be another way. But you are adding a NEW column that will have the ordinal 1, 2, 3, etc. value.
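
For comparison, here is a rough sketch of the same binning in Python with pandas, using the example ranges above (1=0-15, 2=16-30, 3=31-45). The Age column and its values are made up, and only three bins are shown; your own cut points should come from looking at the data.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries.
homes = pd.DataFrame({"Age": [0, 15, 16, 30, 31, 45]})

# Bin edges for 0-15, 16-30 and 31-45. include_lowest=True puts age 0
# in the first bin; each upper edge belongs to its own bin.
edges = [0, 15, 30, 45]
homes["AgeBin"] = pd.cut(
    homes["Age"], bins=edges, labels=[1, 2, 3], include_lowest=True
).astype(int)  # would fail if any age fell outside 0-45, so extend edges as needed

print(homes["AgeBin"].tolist())  # [1, 1, 2, 2, 3, 3]
```

As with the Excel approach, the original Age column is kept and the ordinal code goes in a NEW column.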

What are we turning in for Data Mining - Part 1?

You are turning in a Word (or PDF) file that will contain your discussions and any relevant Excel output (descriptive statistics, frequencies and correlations) in labeled, professional tables. I do NOT need the raw data in the Word doc.
You are asked to post the Excel file, but I will only be looking at it if there is a problem. Anything you want me to grade should be in the Word doc.
Please turn in a printed copy of the Word doc. It is easier for me to give feedback in that form.

