Boston Housing Value (BHV) Estimation: Project Proposal
Member 4 recently dropped, so we need one more member.
B. Which predictors are most important for estimating Boston housing values, and how does each of these predictors affect the housing value? We want to characterize the relationship between the predictor variables and the response variable (median home value, MEDV).
C. Description of the dataset:
There are 506 observations of 14 variables in this dataset: 13 continuous attributes and 1 binary-valued attribute.
1) CRIM refers to per capita crime rate by town
2) ZN refers to the proportion of residential land zoned for lots over 25,000 square feet
3) INDUS refers to the proportion of non-retail business acres per town
4) CHAS is a qualitative variable that refers to the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5) NOX refers to the nitric oxides concentration (parts per 10 million)
6) RM refers to average number of rooms per dwelling
7) AGE refers to the proportion of owner-occupied units built prior to 1940
8) DIS refers to weighted distances to five Boston employment centres
9) RAD refers to the index of accessibility to radial highways
10) TAX refers to the full-value property-tax rate per $10,000
11) PTRATIO refers to pupil-teacher ratio by town
12) B refers to 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13) LSTAT refers to % lower status of the population
14) MEDV refers to Median value of owner-occupied homes in $1000's
D. The techniques we think would be useful are cross-validation, PCA, and LDA. For cross-validation, we first call set.seed() so that the random split is reproducible, and then separate the data into a training set and a test set with suitable proportions. Because the response (MEDV) is continuous, we measure the test error with the mean squared error rather than a misclassification rate. With K-fold cross-validation, we split the data into K folds, hold out each fold in turn, fit the model on the remaining folds, and average the K test errors. We can then repeat this for several values of K and for each candidate model, and compare the resulting test errors to find the smallest one.
For PCA, we need to draw histograms of the variables entering the PCA (the predictors) to check their skewness and normality. If a variable is roughly normal, we scale it using the mean and standard deviation. If it is skewed, we scale it using the median and the median absolute deviation.
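The cross-validation step described above can be sketched as follows. This is only a minimal illustration in Python (standard library only), not our final analysis: the data are synthetic stand-ins rather than the Boston data, the model is a one-predictor least-squares line, and the helper names (k_fold_indices, fit_line, cv_mse) are hypothetical. The seeded random.Random plays the role that set.seed() plays in R.

```python
import random

def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 reproducibly and split the indices into k folds."""
    rng = random.Random(seed)  # fixed seed, analogous to set.seed() in R
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k=5, seed=1):
    """Average held-out mean squared error over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / len(fold_errors)

# Synthetic stand-in data (NOT the Boston data): a made-up "rooms" predictor
# and a made-up "value" response with Gaussian noise.
rng = random.Random(0)
rooms = [rng.uniform(4, 8) for _ in range(100)]
value = [-30 + 9 * r + rng.gauss(0, 2) for r in rooms]

mse = cv_mse(rooms, value, k=5, seed=1)
print(round(mse, 2))
```

Calling cv_mse again with the same seed reproduces the same folds and therefore the same error, which is the point of fixing the seed before splitting.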
We can then look at the loadings of each principal component (PC) to see which predictors contribute most to it, and decide which PCs to keep. For further investigation, we can use a biplot to visualize the PCs and the predictors together. We can then apply LDA (linear discriminant analysis) to the data. Because LDA is a classification method, we would first need to turn MEDV into classes, for example homes above and below the median value. We can then compare the LDA and PCA results to find the most informative predictors. These are the techniques we have learned in class so far. We may learn more useful techniques later in the course, so our approach to this data might change.
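The scaling and loading steps of the PCA plan can be sketched as follows. This is a rough illustration under stated assumptions, not our actual analysis: it uses Python's standard library only, the three columns are synthetic (not Boston variables), the function names are hypothetical, and the first PC is found by power iteration on the covariance matrix instead of a library PCA routine.

```python
import random
import statistics

def standardize(col):
    """Center by the mean and divide by the standard deviation."""
    m, s = statistics.mean(col), statistics.stdev(col)
    return [(v - m) / s for v in col]

def robust_scale(col):
    """Center by the median and divide by the median absolute deviation."""
    med = statistics.median(col)
    mad = statistics.median([abs(v - med) for v in col])
    return [(v - med) / mad for v in col]

def covariance(cols):
    """Sample covariance matrix of a list of equal-length columns."""
    n, p = len(cols[0]), len(cols)
    means = [sum(c) / n for c in cols]
    return [[sum((cols[i][t] - means[i]) * (cols[j][t] - means[j])
                 for t in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_pc(cov, iters=200, seed=0):
    """Loadings and variance of the first PC via power iteration."""
    rng = random.Random(seed)
    p = len(cov)
    v = [rng.gauss(0, 1) for _ in range(p)]  # random starting direction
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(p) for j in range(p))
    return v, eigenvalue

# Synthetic columns (NOT Boston variables): x1 and x2 move together,
# x3 is unrelated noise.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [2 * a + rng.gauss(0, 0.5) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(200)]

scaled = [standardize(c) for c in (x1, x2, x3)]
cov = covariance(scaled)
loadings, eigenvalue = first_pc(cov)
share = eigenvalue / sum(cov[i][i] for i in range(3))
print([round(l, 2) for l in loadings], round(share, 2))
```

For a skewed column, robust_scale would replace standardize. The printed share is the proportion of total variance captured by the first PC, which is the kind of quantity we would inspect when deciding how many PCs to keep.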
E. How many PCs should we select or use?
What should we do if there is more than one indicator variable?
How should we treat the outliers?
Do we need to apply a Box-Cox transformation to the data?