Assignment: Multivariate Data Analysis
Part A
Refer to the data in FoodConsumptionNutrients en.xlsx, which has information for about 175 countries. Choose 30 or so countries that interest you to work on. Be sure to use countries from at least three different country groups from different regions (see the sheet CountryGroupComposition for some ideas for groupings that you might use). Collect the information on energy consumption, fat consumption and protein consumption for your chosen countries onto a single sheet. Create a variable for the country group.
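As a rough illustration of the single-sheet layout described above, the sketch below merges three per-nutrient tables into one row per country and adds the country-group variable. It assumes each nutrient sits in its own table keyed by country name. The country names, group names and numbers are placeholders invented for the example, not values from the workbook, and `build_table` is a hypothetical helper. It is written in Python only to show the shape of the data; your submitted code should still be in R.

```python
# Hypothetical per-nutrient tables keyed by country name.
# All names and numbers are placeholders, not values from the workbook.
energy = {"Country A": 2500, "Country B": 3100, "Country C": 2200}
fat = {"Country A": 60, "Country B": 120, "Country C": 45}
protein = {"Country A": 65, "Country B": 100, "Country C": 55}

# Hypothetical country-group assignment (the new grouping variable).
group = {"Country A": "Group 1", "Country B": "Group 2", "Country C": "Group 3"}


def build_table(chosen, group, energy, fat, protein):
    """Return one row per chosen country: its group plus the three nutrients."""
    rows = []
    for country in chosen:
        rows.append({
            "country": country,
            "group": group[country],
            "energy": energy[country],
            "fat": fat[country],
            "protein": protein[country],
        })
    return rows


table = build_table(sorted(group), group, energy, fat, protein)
for row in table:
    print(row)
```

Keeping one row per country and one column per variable means the same table can feed the scatterplot, the classification rules and the cluster analyses without reshaping.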
1. Choose two of the three original variables. Draw a scatterplot with the country group of each point indicated. Comment.
2. Generate classification rules using
• Linear discriminant analysis
• Quadratic discriminant analysis
• Multinomial logistic regression
• Classification trees
3. Using the confusion matrix and the apparent error rate, compare the effectiveness of each of the classification rules.
4. Assume that you did not know which countries were in which groups. Use the following methods to group the observations.
• One hierarchical implementation of cluster analysis
• K-means cluster analysis
• Multidimensional scaling
Do any of these correctly divide all the observations into the original groups?
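Items 3 and 4 both come down to cross-tabulating the known country group against a second labelling (predicted groups or cluster memberships). The sketch below shows the arithmetic under stated assumptions: the labels are made up for illustration, the function names are hypothetical, and Python is used only to make the calculation explicit (in R, `table(actual, predicted)` gives the same cross-tabulation). The apparent error rate is taken here as the number of misclassified observations divided by the total number of observations, with the rule applied back to the data it was fitted on.

```python
from collections import Counter


def cross_tab(rows, cols):
    """Count how often each (row label, column label) pair occurs."""
    return Counter(zip(rows, cols))


def apparent_error_rate(actual, predicted):
    """Proportion of observations whose predicted group differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def clusters_recover_groups(groups, clusters):
    """True if groups and clusters match one-to-one, ignoring what the clusters are called."""
    pairs = set(zip(groups, clusters))
    return len(pairs) == len(set(groups)) == len(set(clusters))


# Made-up labels for eight observations (illustration only).
actual = ["G1", "G1", "G1", "G2", "G2", "G3", "G3", "G3"]
predicted = ["G1", "G1", "G2", "G2", "G2", "G3", "G3", "G1"]

confusion = cross_tab(actual, predicted)
aper = apparent_error_rate(actual, predicted)
print(confusion)
print(aper)

# Cluster numbers are arbitrary, so compare the partition, not the labels.
clusters_good = [2, 2, 2, 0, 0, 1, 1, 1]
clusters_bad = [0, 0, 1, 1, 1, 2, 2, 2]
print(clusters_recover_groups(actual, clusters_good))
print(clusters_recover_groups(actual, clusters_bad))
```

Because the apparent error rate is computed on the same observations used to build the rule, it tends to understate the error you would see on new data. Keep that in mind when comparing a flexible method such as a classification tree with linear discriminant analysis.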
Part B
Find two datasets using online sources that you can use to demonstrate the techniques that you have learned in this subject. Some good places to find interesting data are:
• https://blog.visual.ly/data-sources/
• https://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/
• https://www.tableausoftware.com/public/community/sample-data-sets
• https://www.kaggle.com/
• https://lib.stat.cmu.edu/DASL/
• https://www.models.kvl.dk/datasets
• https://research.library.gsu.edu/c.php?g=115854&p=754836
• https://www.stat.ufl.edu/~winner/datasets.html
• https://www.statsci.org/data/
You must get approval from me for your datasets before you begin. I may not approve two students using the same dataset.
Some datasets are quite extensive, and you may feel that you can illustrate a range of techniques with different subsets of the same dataset. If you think this applies to your chosen dataset, talk to me about it when you are getting approval for your dataset.
If you are having trouble thinking about what you need to be able to do, think back over the broad areas that we have covered in class: inferences about mean vectors, MANOVA (one- and two-way), multivariate linear regression, PCA and factor analysis, canonical correlation, and discrimination and classification, including clustering. You don't need to show that you can do all of these, but I would hope (read: expect) to see at least 5 of these broad areas represented in your answer.
For each of your chosen datasets, you need to pose one or more questions that you believe you can (try to) address using the dataset. You then need to use appropriate techniques to analyse the data to address the research question(s) that you have posed. Finally, you will need to reflect on the adequacy of the dataset to address the questions that you have posed, and make suggestions about how you might collect the data differently to better address your question (consider what to collect or how to collect, for instance).
Your answer to this question should include (separately for each of the two datasets, if appropriate):
• A report that describes the data, poses the research question(s), analyses the research question(s) and reflects on the usefulness of the data to answer the question(s). This should be in a report format, with essential output in the report and any other output that you use in an appendix. You should also indicate where you obtained the data from (e.g. reference to a paper or URL).
• A .R file containing your code.
• A .csv file containing the dataset (if it is not already in your .R file).
Attachment: FoodConsumptionNutrients en.xlsx