Problem 1. Suppose that X1, X2, X3 are i.i.d. normal random variables with mean 0 and variance 1.
(a) Define the random vector X = (X1, X2, X3)^T. What is the distribution of X?
(Note: characterizing a multivariate distribution means giving the name of the distribution and its parameters; in this case the parameters are the mean vector and the variance-covariance matrix.)
(b) What is the distribution of X̄ = (1/3) ∑_{i=1}^{3} Xi?
(c) Let Y = . Find the distribution of Y.
(d) Suppose that Z ∼ N(1, 2^2) and is independent of all Xi. Define Zi = Z + Xi for i = 1, 2, 3. What is the distribution of the random vector
? Determine the correlation coefficient between Z1 and Z2 and Z.
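Answers to Problem 1 can be sanity-checked by simulation. The following is a minimal sketch, not a derivation: it assumes the variance of Z is 2^2 = 4 (one reading of the notation in part (d)), and the sample size, seed, and variable names are arbitrary choices made for illustration.

```r
# Monte Carlo check for Problem 1 (illustrative sketch).
set.seed(42)                       # arbitrary seed, for reproducibility
n <- 100000                        # number of simulated draws (arbitrary)

# Each row is one draw of (X1, X2, X3), i.i.d. N(0, 1).
X <- matrix(rnorm(3 * n), nrow = n, ncol = 3)

# Part (b): sample mean and variance of Xbar = (X1 + X2 + X3) / 3.
xbar <- rowMeans(X)
print(c(mean = mean(xbar), var = var(xbar)))

# Part (d): Z independent of the Xi. ASSUMPTION: Var(Z) = 2^2, so sd = 2.
Z  <- rnorm(n, mean = 1, sd = 2)
Zi <- Z + X                        # Z is recycled down each column: Zi[, j] = Z + X[, j]

# Empirical mean vector, covariance matrix, and correlations of (Z, Z1, Z2, Z3).
W <- cbind(Z = Z, Z1 = Zi[, 1], Z2 = Zi[, 2], Z3 = Zi[, 3])
print(colMeans(W))
print(cov(W))
print(cor(W))
```

The printed sample moments should be close to the parameters obtained analytically; a large discrepancy suggests an error in the derivation.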
Problem 2. (a) Generate 200 random observations (a random sample) from the 3-dimensional multivariate normal distribution having mean vector µ = and covariance matrix
Σ =
using the Choleski factorization method (use the program given below). Use the R pairs plot to graph an array of scatter plots for each pair of variables. For each pair of variables, visually check that the location and correlation approximately agree with the theoretical parameters of the corresponding bivariate normal distribution. What should those parameters be for each scatter plot? Write them down next to the corresponding plot. Then use the program below to compute the corresponding sample values. Report both the theoretical and the sample values of the means, variances, and covariances, and compare them. There is no need to turn in the program for this part, since it is provided.
(b) Repeat the exercise in part (a) but now for µ = and covariance matrix
Σ =
Turn in an R script for the program used for this part.
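The course program referred to in Problem 2 is not reproduced in this text. As a rough illustration of the Choleski approach, here is a minimal sketch. The function name `rmvn_chol` and the values of `mu` and `Sigma` are hypothetical placeholders, not the parameters of the assignment, and should be replaced with the ones given in each part.

```r
# Sampling from N(mu, Sigma) via the Choleski factorization (illustrative sketch).
rmvn_chol <- function(n, mu, Sigma) {
  d <- length(mu)
  Q <- chol(Sigma)                          # upper triangular, t(Q) %*% Q == Sigma
  Z <- matrix(rnorm(n * d), nrow = n, ncol = d)   # rows are i.i.d. N(0, I)
  # Each row z becomes z %*% Q + mu, which has covariance t(Q) %*% Q = Sigma.
  Z %*% Q + matrix(mu, nrow = n, ncol = d, byrow = TRUE)
}

set.seed(1)                                 # arbitrary seed
mu    <- c(1, 2, 3)                         # PLACEHOLDER mean vector
Sigma <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)   # PLACEHOLDER covariance

X <- rmvn_chol(200, mu, Sigma)

pairs(X)              # array of pairwise scatter plots
print(colMeans(X))    # sample means, to compare with mu
print(cov(X))         # sample variances and covariances, to compare with Sigma
print(cor(X))         # sample correlations
```

For each pair (i, j), the theoretical bivariate parameters are the entries mu[i], mu[j] and the 2 x 2 submatrix Sigma[c(i, j), c(i, j)], which can be compared with the matching entries of `colMeans(X)` and `cov(X)`.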
Problem 3. For each of the following bivariate normal distributions, find its principal components (showing work), sketch a typical scatter plot of data from the distribution, and label the eigenvectors of its covariance matrix that you found on the scatter plot.
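Hand calculations of principal directions can be checked numerically with R's `eigen`. The sketch below uses a hypothetical 2 x 2 covariance matrix chosen only for illustration; it is not one of the distributions of the problem.

```r
# Numerical check of principal directions (illustrative sketch).
Sigma <- matrix(c(2, 1,
                  1, 2), nrow = 2, byrow = TRUE)   # HYPOTHETICAL covariance matrix

e <- eigen(Sigma)   # for a symmetric matrix, eigenvalues are returned in decreasing order
print(e$values)     # variances along the principal directions
print(e$vectors)    # columns are unit-length eigenvectors (signs are arbitrary)
```

For this placeholder matrix the eigenvalues are 3 and 1, with eigenvectors proportional to (1, 1) and (1, -1), so a typical scatter plot would be elongated along the 45-degree line.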
Problem 4. In this problem we will perform principal component analysis on the sepal and petal measurements of the first 50 flowers of the Iris data, i.e., iris[1:50, 1:4].
(a) Center the data matrix and denote the centered data matrix by X. Find the covariance matrix Sx of X.
(b) Report the eigenvectors of Sx and the variance of X along each eigenvector direction.
(c) Calculate the principal components of the first two flowers.
(d) Make a scatterplot of the first two principal components of the data. Do you see any correlation between the two principal components?
(e) If we wish to keep at least 85% of the total variance, how many principal components do we need to keep?
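The steps of Problem 4 could be organized along the following lines. This is a minimal sketch using base R only; the variable names are arbitrary, and it takes "principal components of a flower" to mean the projections of its centered measurements onto the eigenvectors of Sx, which is one common convention.

```r
# PCA on the first 50 iris flowers (illustrative sketch).

# (a) Center the data and compute the covariance matrix.
X  <- scale(as.matrix(iris[1:50, 1:4]), center = TRUE, scale = FALSE)
Sx <- cov(X)

# (b) Eigenvectors of Sx; each eigenvalue is the variance along its eigenvector.
e <- eigen(Sx)
print(e$vectors)
print(e$values)

# (c) Principal component scores; rows 1 and 2 belong to the first two flowers.
scores <- X %*% e$vectors
print(scores[1:2, ])

# (d) Scatter plot and sample correlation of the first two components.
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")
print(cor(scores[, 1], scores[, 2]))

# (e) Smallest number of components whose cumulative variance share is >= 85%.
cum_prop <- cumsum(e$values) / sum(e$values)
k <- which(cum_prop >= 0.85)[1]
print(cum_prop)
print(k)
```

The built-in `prcomp(iris[1:50, 1:4])` should give the same directions up to sign and can serve as a cross-check.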
Problem 5. Suppose that Z1, Z2, Z3 are the principal components of a data set and Y is a vector of the response variable. The correlation coefficients between Y and Z1, Z2, Z3 are 0.25, -0.4, 0.7, respectively.
(a) If we decide to use only two principal components in the PC regression of Y, which two principal components should we choose? Why?
(b) If ||Z1|| = 2, ||Z2|| = 1, ||Z3|| = 5, and ||Y|| = 4, find the coefficients in the PC regression in (a).
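The arithmetic behind Problem 5 can be sketched in a few lines of R, under assumptions that should be checked against the course notes: Y and the principal components are centered, so that each correlation equals the cosine of the angle between the vectors; the principal components are mutually orthogonal, so each coefficient is Zj'Y / ||Zj||^2 = r_j ||Y|| / ||Zj||; components are selected by largest absolute correlation with Y; and the norm of Z3 is read as ||Z3|| = 5.

```r
# PC regression coefficients from correlations and norms (illustrative sketch).
r      <- c(Z1 = 0.25, Z2 = -0.4, Z3 = 0.7)   # correlations of Y with each PC
norm_Z <- c(Z1 = 2, Z2 = 1, Z3 = 5)           # ASSUMPTION: ||Z3|| = 5
norm_Y <- 4

# ASSUMED criterion: keep the two PCs with the largest |correlation| with Y.
keep <- names(sort(abs(r), decreasing = TRUE))[1:2]
print(keep)

# With centered, mutually orthogonal PCs: beta_j = r_j * ||Y|| / ||Zj||.
beta <- r[keep] * norm_Y / norm_Z[keep]
print(beta)   # Z3: 0.56, Z2: -1.6
```

Because the components are orthogonal, dropping Z1 does not change the coefficients of Z2 and Z3, which is why each one can be computed separately.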