The city tax assessor was interested in predicting residential home sales prices in a midwestern city as a function of various characteristics of the home and surrounding property. Data on 300 arm's-length transactions were obtained for home sales during the year 2002. Each line of the data set provides information on 8 variables. The dataset is "house.txt".
The variables are:
price     Sales price of residence (dollars)
sqft      Finished area of residence (square feet)
bed       Total number of bedrooms in residence
air       Categorical variable indicating presence or absence of air conditioning: 1 if yes; 0 otherwise
garage    Number of cars the garage will hold
year      Year property was originally constructed
quality   Categorical variable indicating quality of construction: 1 indicates high quality; 0 indicates low quality
lot       Lot size (square feet)
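As a starting point for the variable types described above, a minimal R sketch might look like the following. It uses a six-row made-up data frame rather than the real house.txt, whose separator and header layout are not stated here, so the commented read.table() call is an assumption. Treating price, sqft and lot as the continuous variables is also only an illustrative choice.

```r
# Made-up stand-in for house.txt: 6 rows of illustrative values, NOT real data.
house <- data.frame(
  price   = c(150000, 210000, 185000, 320000, 275000, 199000),
  sqft    = c(1400, 2100, 1750, 3200, 2600, 1900),
  bed     = c(2, 3, 3, 5, 4, 3),
  air     = c(0, 1, 1, 1, 0, 1),
  garage  = c(1, 2, 2, 3, 2, 2),
  year    = c(1955, 1978, 1966, 1995, 1988, 1972),
  quality = c(0, 0, 0, 1, 1, 0),
  lot     = c(8000, 12000, 9500, 20000, 15000, 11000)
)

# With the real file, something like this may work, ASSUMING whitespace-separated
# columns and a header row (neither is confirmed by the problem statement):
# house <- read.table("house.txt", header = TRUE)

# Recode the two 0/1 indicators as factors so R treats them as categorical.
house$air     <- factor(house$air, levels = c(0, 1), labels = c("no", "yes"))
house$quality <- factor(house$quality, levels = c(0, 1), labels = c("low", "high"))

# Five-number summary (min, lower hinge, median, upper hinge, max) per variable.
num_vars <- c("price", "sqft", "lot")   # illustrative choice of continuous variables
five_num <- sapply(house[num_vars], fivenum)
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")

# Observation counts per category.
air_counts     <- table(house$air)
quality_counts <- table(house$quality)

str(house)
print(five_num)
print(air_counts)
print(quality_counts)
```

Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by summary() or quantile().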
1. Perform Exploratory Data Analysis.
(a)	Read the problem above carefully. Understand all variables.
(b)	Referring to the information in the introduction, check and, if necessary, change the variable types in R. Find the five-number summary of each continuous variable and, for each categorical variable, count the number of observations in each category.
(c)	Check the relationships among the variables graphically and numerically. Which variables seem to have a strong relationship with the response variable "price"?
(d)	Based on the results from the previous questions, do you think multicollinearity exists among the predictors?
2. Suppose only the main effects of the predictor variables are included in the model. Perform backward selection via AIC and via BIC, and report the model chosen by each criterion. (When reporting a chosen model, you may simply write it in this form: Y ~ X1 + X2)
3. Compare the models selected by the above methods using AIC, BIC, PRESS, and adjusted R-squared. Which model do you think is better?
4. Use a partial F-test to decide which model is better. State the full model, the reduced model, the null and alternative hypotheses, the test statistic value, the p-value, and your conclusion.
5. For whichever model you find is better in question 4, check the normality assumption and the constant variance assumption using numerical tests. Discuss whether there are any violations of the assumptions.
6. Is multicollinearity an issue in the above model?
7. Following question 5, conduct a Box-Cox transformation and choose the value from -1, 0, 0.5 and 2 that is closest to the λ found for the model. Refit the model and re-check the normality and constant variance assumptions. Have the violations been mitigated to some extent?
8. Suppose people are also interested in predicting whether a property is of high quality. Fit a logistic regression model using "price", "sqft", "bed" and "year" as predictor variables. State the fitted logistic regression function.
9. What is the estimated probability that a property built in 1980, with 4 bedrooms, a total finished area of 1700 square feet, and a sales price of $200,000 is a high quality property?
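For questions 8 and 9, the workflow can be sketched in R roughly as follows. The data here are simulated placeholders generated by a made-up rule, not house.txt, so the coefficients and the resulting probability are not answers to the questions. The names sim, fit, new_home, p_hat and p_manual are illustrative only.

```r
# Simulated stand-in for house.txt; values are NOT the real data.
set.seed(1)
n <- 300
sim <- data.frame(
  sqft = round(runif(n, 1000, 4000)),
  bed  = sample(2:5, n, replace = TRUE),
  year = sample(1920:2000, n, replace = TRUE)
)
sim$price <- round(50000 + 90 * sim$sqft + rnorm(n, sd = 40000))
# Made-up rule tying quality to price, only so the model has some signal.
sim$quality <- rbinom(n, 1, plogis((sim$price - 300000) / 80000))

# Logistic regression: log-odds of high quality as a linear function
# of the four predictors named in question 8.
fit <- glm(quality ~ price + sqft + bed + year, data = sim, family = binomial)
print(coef(fit))  # b0, b1 (price), b2 (sqft), b3 (bed), b4 (year)

# The property described in question 9.
new_home <- data.frame(price = 200000, sqft = 1700, bed = 4, year = 1980)

# Estimated probability via predict(); type = "response" returns the
# probability scale instead of the log-odds scale.
p_hat <- predict(fit, newdata = new_home, type = "response")

# The same number by hand from the fitted function:
#   p = 1 / (1 + exp(-(b0 + b1*price + b2*sqft + b3*bed + b4*year)))
eta      <- sum(coef(fit) * c(1, 200000, 1700, 4, 1980))
p_manual <- 1 / (1 + exp(-eta))

print(p_hat)
print(p_manual)
```

With the real data, replacing sim by the data frame read from house.txt should be the only change needed, provided quality is coded 0/1 or as a factor whose second level is the high quality category.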
Attachment: house.rar