Note: All dataset files have a header row, so be sure to indicate this in R/RStudio when loading each one.
1. Suppose we have postulated the model
$Y = \beta \sin(X)$.
To estimate $\beta$, a random sample $(x_1, y_1), \ldots, (x_n, y_n)$ is obtained. Then the equation $\hat{y}_i = \hat{\beta} \sin(x_i)$ denotes the fitted value of $Y$ when $X = x_i$.
Derive a formula for the least-squares estimate of $\beta$, i.e., $\hat{\beta}$.
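As a hedged sketch of the usual least-squares setup for this model (the name $S(\beta)$ for the sum of squared residuals is an assumption introduced here, and the result requires $\sum \sin^2(x_i) \neq 0$), the criterion is minimized by setting its derivative to zero:

```latex
S(\beta) = \sum_{i=1}^{n} \bigl(y_i - \beta \sin(x_i)\bigr)^2,
\qquad
\frac{dS}{d\beta} = -2 \sum_{i=1}^{n} \sin(x_i)\bigl(y_i - \beta \sin(x_i)\bigr) = 0
\;\Longrightarrow\;
\hat{\beta} = \frac{\sum_{i=1}^{n} y_i \sin(x_i)}{\sum_{i=1}^{n} \sin^2(x_i)}.
```

The second derivative, $2\sum_{i=1}^{n} \sin^2(x_i)$, is positive under the stated condition, so the stationary point is a minimum.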
2. A campaign manager conducts a survey to gauge voter support for her candidate, Lopez. She gathers data on the age of each registered voter (x) and whether this person supports Lopez (Y = 1) or somebody else (Y = 0).
An analysis yields the following logistic equation:
$\ln\left(\frac{\hat{p}(x)}{1 - \hat{p}(x)}\right) = -0.324 + 0.012x$,
where $\hat{p}(x)$ is the estimated probability of a vote for Lopez.
(a) Find the estimated probability that a 21-year-old voter will vote for Lopez.
(b) Compare the odds of support for Lopez between two people who are 10 years apart in age.
(c) Interpret the coefficient on x (0.012) in the logistic equation.
(d) At what age is the estimated probability of a vote for Lopez equal to 0.5?
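The quantities asked for in this question all follow from inverting the logit. The sketch below is written in Python for illustration only (the assignment itself expects R), and the function names are hypothetical helpers. The only inputs are the two coefficients stated in the fitted equation.

```python
import math

# Coefficients taken from the fitted logistic equation in the problem statement.
B0, B1 = -0.324, 0.012


def log_odds(x):
    """Estimated log-odds of a vote for Lopez at age x."""
    return B0 + B1 * x


def probability(x):
    """Inverse logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds(x)))


def odds_ratio(delta_x):
    """Multiplicative change in the odds when age increases by delta_x years."""
    return math.exp(B1 * delta_x)


def age_at_probability(p):
    """Solve B0 + B1 * x = ln(p / (1 - p)) for x."""
    return (math.log(p / (1.0 - p)) - B0) / B1


print("p(21)            =", probability(21))
print("odds ratio, +10y =", odds_ratio(10))
print("age where p=0.5  =", age_at_probability(0.5))
```

Because the model is linear in the log-odds, the odds ratio for a fixed age gap does not depend on the starting age.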
3. In your own words, describe the difference in the classification methods given by logistic regression and linear discriminant analysis.
4. Is there a relationship between female illiteracy and birth rate? In particular, can the birth rate in a given country be effectively predicted using the illiteracy rate? The file Illiteracy contains data on a sample of countries where female illiteracy is more than 5%. The variable Illit is the percentage of women over 15 years of age who are illiterate, and the variable Births is the number of births per woman in that country.
(a) Estimate the correlation between the illiteracy and birth rates. Interpret the value. What does it say about a possible relationship between the variables?
(b) Create a scatter plot of birth rate against illiteracy rate and comment on the relationship.
(c) Give the estimated regression equation and interpret the slope coefficient and the $R^2$ statistic.
(d) Create a residuals plot and comment on the appropriateness of a linear model.
(e) Can we say that improving literacy (i.e., reducing illiteracy) will result in a lower birth rate? Justify your answer with an appropriate hypothesis test.
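For orientation, the correlation, fitted line, $R^2$, and slope t-statistic asked for above can all be computed from the same three sums of squares. The sketch below uses Python for illustration (the assignment expects R). The six data points are made up for the example and are not taken from the Illiteracy file, and `simple_regression` is a hypothetical helper name.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y on x, returning the usual summary quantities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)

    # Residual sum of squares and the standard error of the slope.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)

    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "r_squared": 1.0 - sse / syy,
        "t_slope": slope / se_slope,  # compare with a t distribution, n - 2 df
        "df": n - 2,
    }


# Hypothetical values for illustration only (NOT the assignment data).
illit = [10, 20, 30, 40, 50, 60]
births = [2.1, 2.9, 3.2, 4.4, 4.8, 6.0]

fit = simple_regression(illit, births)
for name, value in fit.items():
    print(name, value)
```

In simple linear regression, `r_squared` equals the square of `r`, which is a useful check on any hand computation.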
5. The data set Walleye from the Minnesota Pollution Control Agency contains data on length (inches) and weight (pounds) measurements for a sample of 60 walleye caught in Minnesota lakes.
(a) Fit a linear model to the data that predicts weight based on length. Provide visual evidence that this model is not appropriate for the relationship between length and weight of fish.
(b) Applying an appropriate transformation to the data, fit a power model to the data, i.e., $W = aL^b$. Give the estimated values of a and b, as well as a 95% confidence interval for b.
(c) Applying an appropriate transformation to the data, fit an exponential model to the data, i.e., $W = ae^{bL}$. Give the estimated values of a and b, as well as a 95% confidence interval for b.
(d) Which model, power or exponential, provides the best fit to the data? Justify your answer.
(e) Using the model selected in (d), give a 95% bootstrap percentile interval for b. Compare this interval to the confidence interval found in either (b) or (c), depending on which model you selected.
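A bootstrap percentile interval for b can be sketched as follows: resample the (length, weight) pairs with replacement, refit the transformed model each time, and take the 2.5th and 97.5th percentiles of the resulting slopes. The example below is in Python for illustration (the assignment expects R) and assumes the power model, where b is the slope of ln(W) on ln(L). The data are simulated, not the walleye sample, and all function names and the choice of 2000 resamples are assumptions.

```python
import math
import random


def loglog_slope(lengths, weights):
    """Least-squares slope of ln(W) on ln(L); estimates b in W = a * L**b."""
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in weights]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_percentile_ci(lengths, weights, n_boot=2000, level=0.95, seed=0):
    """Percentile interval for b from resampling (L, W) pairs with replacement."""
    rng = random.Random(seed)
    n = len(lengths)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(
            loglog_slope([lengths[i] for i in idx], [weights[i] for i in idx])
        )
    slopes.sort()
    alpha = 1.0 - level
    # Simple order-statistic quantiles of the bootstrap distribution.
    lo = slopes[round(alpha / 2 * n_boot)]
    hi = slopes[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated data for illustration only: W = 0.0005 * L**3 with multiplicative noise.
sim = random.Random(1)
lengths = [sim.uniform(10, 30) for _ in range(60)]
weights = [0.0005 * L ** 3 * math.exp(sim.gauss(0, 0.1)) for L in lengths]

b_hat = loglog_slope(lengths, weights)
lo, hi = bootstrap_percentile_ci(lengths, weights)
print("b_hat =", b_hat, " 95% percentile interval =", (lo, hi))
```

For the exponential model the same resampling scheme applies, with the slope taken from ln(W) regressed on L instead of ln(L).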
6. The data set Carseats contains sales information for child car seats at 400 different stores.
(a) Fit a multiple linear regression model to predict Sales using the following predictors:
o Income - community income level (in thousands of dollars)
o Advertising - local advertising budget for company at each location (in thousands of dollars)
o Price - price company charges for car seats at each site
o ShelveLoc - a factor with levels Bad, Good, and Medium indicating the quality of the shelving location for the car seats at each site
o Age - average age of the local population
o Urban - a factor with levels No and Yes to indicate whether the store is in an urban or rural location
o US - a factor with levels No and Yes to indicate whether the store is in the U.S. or not
Also include an interaction between Income and Advertising, and between Price and Age. Is the model significant overall in predicting sales?
(b) Provide an interpretation of each coefficient in the model you fit in (a).
(c) For which of the predictors in (a) can you reject the null hypothesis $H_0: \beta_j = 0$? Justify your answer and explain what it means to reject $H_0$.
(d) Comment on the results of (c). Do they make intuitive sense?
(e) On the basis of your response to (c), fit a smaller model that only uses the predictors for which there is evidence of association with the response.
(f) How well do the models in (a) and (e) fit the data?
7. The data set Boston contains the following information about 506 neighborhoods around Boston.
o crim - per capita crime rate by neighborhood
o zn - proportion of residential land zoned for lots over 25,000 sq. ft.
o indus - proportion of non-retail business acres per neighborhood
o chas - Charles River dummy variable (1 if neighborhood touches river; 0 otherwise)
o nox - nitrogen oxides concentration (parts per 10 million)
o rm - average number of rooms per dwelling
o age - proportion of owner-occupied units built prior to 1940
o dis - weighted mean of distances to five Boston employment centers
o rad - index of accessibility to radial highways
o tax - full-value property-tax rate per $10,000
o ptratio - pupil-teacher ratio by neighborhood
o black - $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of black residents by neighborhood
o lstat - percentage of households with low socioeconomic status
o medv - median value of owner-occupied homes (in thousands of dollars)
In this question, you will develop models to predict whether a given neighborhood has a crime rate above or below the median.
(a) Create a binary variable crim01 that codes whether or not a neighborhood is above (1) or below (0) the median crime rate given by the data set.
(b) Fit a logistic regression model that predicts whether a neighborhood has a crime rate above or below the median using all other variables in Boston as predictors.
(c) In the model fit in (b), note that R codes the individual significance of nox at the highest level (***). Give an explanation as to why this variable would be significant in predicting the probability that a neighborhood would have a crime rate above the median.
(d) One by one, remove predictors from the model fit in (b) until only predictors with the highest level of individual significance (***) remain.
(e) Compare the deviance of the full model fit in (b) with the final model fit in (d). Comment on the difference between the deviance for these models and the differences when compared to the null model.
(f) Split the data into a training set and a test set. Use set.seed(352017).
(g) Using as predictors the variables identified as having the highest level of individual significance in (d), fit a logistic regression model to the training set and estimate the test error of the model using the test set.
(h) Using the same predictors used in (g), perform LDA on the training data to predict crim01 and estimate the test error of the model using the test set.
(i) Using the same predictors used in (g), perform QDA on the training data to predict crim01 and estimate the test error of the model using the test set.
(j) Which of these methods, logistic regression, LDA, or QDA, provides the best model for classifying neighborhoods with a high crime rate? Compare the performance of that model to that of the null model.
(k) Summarize your findings regarding predicting whether a given neighborhood has a crime rate above or below the median. What advice, based on these findings, would you give to a family moving to Boston on selecting a neighborhood to live in? Use appropriate visualizations to support your summary and advice.
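The bookkeeping shared by parts (a), (f), (g) through (j) is the same whichever classifier is used: code the response at the median, hold out a test set, and compare the test error with that of a null model that always predicts the majority class. The sketch below shows only that bookkeeping, in Python for illustration (the assignment expects R). The helper names and the 30% test fraction are assumptions, and although the seed value is the one given in (f), Python's generator will not reproduce the split that R's set.seed produces.

```python
import random
import statistics


def median_split(values):
    """Code each value as 1 if it is above the sample median, else 0."""
    med = statistics.median(values)
    return [1 if v > med else 0 for v in values]


def train_test_indices(n, test_fraction=0.3, seed=352017):
    """Shuffle 0..n-1 and split into (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def error_rate(actual, predicted):
    """Fraction of observations that are misclassified."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def null_error_rate(train_labels, test_labels):
    """Test error from always predicting the training set's majority class."""
    majority = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return error_rate(test_labels, [majority] * len(test_labels))


# Hypothetical crime rates for illustration only (NOT the Boston data).
crim = [0.02, 0.09, 0.3, 1.2, 4.5, 0.05, 8.9, 0.6, 0.01, 2.2]
crim01 = median_split(crim)
train, test = train_test_indices(len(crim))

train_y = [crim01[i] for i in train]
test_y = [crim01[i] for i in test]
print("null-model test error:", null_error_rate(train_y, test_y))
```

A fitted classifier's predictions on the test rows can be passed to `error_rate` in the same way, which puts logistic regression, LDA, and QDA on a common scale with the null model.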
Attachment: datasets.zip