1.- On a local map, a toxic waste site is located at x0 = 0.60, y0 = 0.65, and the locations of cancer cases (disease or death) are given by the coordinates x and y below (n = 18):
x = c(0.25,0.84,0.39,0.59,0.76,0.40,0.26,0.41,0.47,0.75,0.40,0.34,0.72,0.61,0.31,0.90,0.90,0.82)
y = c(0.47,0.83,0.38,0.80,0.67,0.68,0.57,0.63,0.38,0.12,0.21,0.28,0.36,0.87,0.28,0.24,0.88,0.14)
Is the configuration of observed points likely to have occurred by chance or is there evidence of clustering associated with the toxic waste site?
a) Calculate the observed mean distance from the point (x0, y0) to the 18 points. Call it d.obs.
b) To judge whether this mean distance is small (evidence of clustering) or could have occurred by chance, simulate the unknown distribution of the mean distance assuming no spatial pattern exists (for example, coordinates drawn uniformly over the map area). Generate 5,000 samples of n = 18 points and compute the 5,000 mean distances, say d.sim. Then calculate the differences between the 5,000 simulated means and the observed mean from (a), i.e., d.sim - d.obs. Count the number of times this difference is negative, i.e., the simulated mean distance is less than the observed mean distance. This count divided by 5,000 is your p-value. What is your conclusion?
c) Calculate the mean and variance of the 5,000 simulated means, say m.d.sim and v.d.sim.
Check whether these 5,000 simulated means follow a normal distribution (use a Q-Q plot, a histogram, or a normality test). Then calculate the Z statistic comparing d.obs with m.d.sim and get the p-value. Is this result similar to your conclusion in (b)?
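The steps in (a) to (c) can be sketched in R as follows. This is only one possible reading of the problem: it assumes that "no spatial pattern" means the x and y coordinates are independent and uniform on the unit square, and the seed and the names B, p.mc, z and p.z are illustrative choices not given in the problem statement.

```r
# Data from the problem statement
x0 <- 0.60; y0 <- 0.65
x <- c(0.25,0.84,0.39,0.59,0.76,0.40,0.26,0.41,0.47,0.75,0.40,0.34,0.72,0.61,0.31,0.90,0.90,0.82)
y <- c(0.47,0.83,0.38,0.80,0.67,0.68,0.57,0.63,0.38,0.12,0.21,0.28,0.36,0.87,0.28,0.24,0.88,0.14)
n <- length(x)

# (a) observed mean Euclidean distance to the waste site
d.obs <- mean(sqrt((x - x0)^2 + (y - y0)^2))

# (b) Monte Carlo null distribution
# ASSUMPTION: under "no spatial pattern" the points are uniform on [0,1] x [0,1]
set.seed(1)   # arbitrary seed, only for reproducibility
B <- 5000
d.sim <- replicate(B, {
  xs <- runif(n)
  ys <- runif(n)
  mean(sqrt((xs - x0)^2 + (ys - y0)^2))
})
# proportion of simulated means smaller than the observed mean
p.mc <- mean(d.sim - d.obs < 0)

# (c) normal approximation to the simulated means
m.d.sim <- mean(d.sim)
v.d.sim <- var(d.sim)
z <- (d.obs - m.d.sim) / sqrt(v.d.sim)
p.z <- pnorm(z)   # lower tail, since small distances suggest clustering
```

Normality of d.sim can be inspected with qqnorm(d.sim) followed by qqline(d.sim), or with hist(d.sim). The lower-tail p-value p.z is the one comparable to p.mc, because both measure how often a mean distance as small as d.obs would arise by chance.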
2.- For each dataset in the text file "4 regressions.txt" (included in sakai-Resources), plot the scatter diagram and fit a simple linear regression model. For which datasets is the model appropriate?
3.- For the data below, obtain a simple linear regression of y on x and a quadratic regression of y on x and x^2.
x = c(-4, -3, -2, -1, 0, 1) ; y = c(2.48, 0.73, -0.04, -1.44, -1.32, 0.00)
a) How do you decide which model is more appropriate?
b) Now change the y data by setting y[6] = -11.40 and answer the same question as in (a).
c) Is there a high leverage point in this new data, or is it just an outlier?
Hint: use lm.influence(lmod)$hat to get h_ii, the i-th diagonal element of H = X(X'X)^(-1)X'.
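One way to set up the comparisons in (a) to (c) is sketched below. It assumes that lmod in the hint refers to the straight-line fit on the modified data, and the names lin, quad, quad2, cmp, h, cutoff and r are illustrative. The cutoff 2p/n is a common rule of thumb for flagging high leverage, not something prescribed by the problem.

```r
x <- c(-4, -3, -2, -1, 0, 1)
y <- c(2.48, 0.73, -0.04, -1.44, -1.32, 0.00)

# (a) straight line versus quadratic
lin  <- lm(y ~ x)
quad <- lm(y ~ x + I(x^2))
cmp  <- anova(lin, quad)   # partial F-test for the added x^2 term
summary(quad)$coefficients # t-test on the quadratic coefficient

# (b) same comparison after changing the sixth response
y2 <- y
y2[6] <- -11.40
lmod  <- lm(y2 ~ x)        # ASSUMPTION: lmod is the straight-line fit
quad2 <- lm(y2 ~ x + I(x^2))
cmp2  <- anova(lmod, quad2)

# (c) leverage and standardized residuals for the modified data
h <- lm.influence(lmod)$hat      # diagonal of the hat matrix
p <- length(coef(lmod))          # number of regression coefficients
cutoff <- 2 * p / length(x)      # rule-of-thumb leverage threshold
r <- rstandard(lmod)             # large |r| points to an outlier in y
data.frame(x, y2, h, high.leverage = h > cutoff, r)
```

Leverage depends only on the x values, so h is the same for the original and the modified y. Whether the sixth point is an outlier has to be judged from the residuals, and plotting the data with both fitted curves helps in either case.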
4.- Generalized Regression Equation: For some data the assumption Var(ε) = σ^2 I needs to be replaced by Var(ε) = σ^2 Σ. The estimator of the regression coefficients then becomes β = (X'Σ^(-1)X)^(-1) X'Σ^(-1)y. The dataset cleaning.txt in sakai-Resources has the variables crews (X), rooms (Y) and sd. Use weights w = 1/(sd)^2 to fit the weighted regression, and obtain 95% prediction intervals for Y when x = 4 and x = 16.
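A minimal sketch of the weighted fit is given below. The helper wls_analysis is hypothetical, and the sketch assumes that cleaning.txt sits in the working directory, has a header row, and uses exactly the column names crews, rooms and sd. It also assumes Σ is diagonal with entries sd^2, and that the sd for a new observation can be taken from the rows of the data with the same crews value.

```r
# Hypothetical helper: weighted least squares with prediction intervals.
# ASSUMPTION: dat has columns crews, rooms, sd (names from the problem statement).
wls_analysis <- function(dat, newx = c(4, 16), level = 0.95) {
  dat$w <- 1 / dat$sd^2
  fit <- lm(rooms ~ crews, data = dat, weights = w)

  # Same estimate from the matrix formula, with Sigma^(-1) = diag(w)
  X <- cbind(1, dat$crews)
  W <- diag(dat$w)
  beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% dat$rooms)

  # ASSUMPTION: the sd at each new x is the sd recorded for that crews value;
  # if newx does not occur in dat$crews this is NA and so is the interval
  sd.new <- dat$sd[match(newx, dat$crews)]
  pred <- predict(fit, newdata = data.frame(crews = newx),
                  interval = "prediction", level = level,
                  weights = 1 / sd.new^2)
  list(fit = fit, beta = drop(beta), pred = pred)
}

# Run only if the course file is available (header = TRUE is an assumption)
if (file.exists("cleaning.txt")) {
  dat <- read.table("cleaning.txt", header = TRUE)
  out <- wls_analysis(dat)
  print(summary(out$fit))
  print(out$pred)
}
```

The weights argument of predict matters here: without it, predict.lm warns that it is assuming a constant prediction variance even though the model was fitted with weights, and the intervals at x = 4 and x = 16 would have the wrong widths.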