Statistics for Geographers
Assignment #2: Descriptive Statistics and Statistical Distributions Using R
Introduction
This assignment is an opportunity to continue to develop your skills in reporting descriptive statistics and in using the software package R.
As in assignment #1, you will be analyzing three variables of your choosing from the class survey data for the first few questions. Please do your best to apply the principles and techniques we have learned in class.
You are welcome to use your notes or any educational resources available on the internet, but please ensure that your output and interpretation are your original work. This assignment extends beyond descriptive statistics into statistical distributions.
The flexibility and open-source nature of R mean that there are many possible ways to generate the correct answers to each question. Be sure to document your work as best you can to maximize your opportunities to earn points. Comments in your R scripts will help you, and those evaluating your work, understand what you are trying to accomplish!
How to complete this assignment
R and RStudio software can be freely downloaded from the internet. The software packages are also available on many ASU computers. If you have any difficulty accessing the software, please contact the instructor or TA.
Recall that most installations of RStudio require R to be installed on the same machine, although you will only need to interact with RStudio to complete this assignment.
The best way to ensure you receive full credit on this assignment is to work from this document as a template. To provide answers to the questions below, you can copy and paste output from the RStudio command window, or from a .csv file that you output using your script. Figures can be copied and pasted, or saved, from the plots window in RStudio.
There are six questions that total 100 points, plus opportunities for bonus points.
All students will need to upload two files to Blackboard:
1. Their completed assignment worksheet (this document, with your content added)
2. A saved version of the R Script that contains all commands you used for your analysis (.R file)
1. Identify three variables you would like to analyze from the class survey data. (Recall that a clean version of the class survey data, with metadata, is available on Blackboard).
You should select one categorical variable, one ordinal variable, and one continuous variable. Try to pick variables that you think will have interesting differences between the groups of the categorical variable you select. Please do not use more than one of the three in-class variables (Taco Bell/Chipotle, Spiciness Preference, and Number of Countries) in your selection. You may use the same three variables you examined in assignment #1, so long as they meet the requirements above.
Variable Type | Variable Name
Categorical (Note: it will be easier to complete this assignment if you choose a categorical variable with 2-5 categories.) |
Ordinal (Note: it will be easier to complete this assignment if you choose an ordinal variable with 2-5 categories.) |
Continuous |
2. Download and open the "DSandProb_Lab.R" script that is available on the Blackboard site. You will be making modifications to this script to examine your variables of interest. Please pay careful attention to all of the elements of the script for each line of the code you are working on, such that the names of variables, numbers of categories, labels for tables and figures, etc., are appropriate for your data.
Open the existing R script, adjust the working directories, and load the class survey data set. Also run the lines of code that activate the ggplot2, e1071, and dplyr libraries. (You may need to install the associated packages if they are not already on your machine.)
Did you load all three libraries successfully?
When you initially read in the survey data from the .csv file, how many rows and columns did the data set have?
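The read-in step can be sketched as follows. This is only an illustration in base R: the file contents and the column names (Restaurant, Spiciness, Countries) are made up, and the real file name and columns come from Blackboard and the lab script. A tiny stand-in CSV is written to a temporary file so the example runs anywhere.

```r
# Stand-in for the class survey: a tiny, made-up CSV written to a temporary
# file. The column names are hypothetical, not the real survey columns.
survey_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Restaurant,Spiciness,Countries",
  "Chipotle,Medium,4",
  "Taco Bell,Mild,1",
  "Chipotle,Hot,7",
  ",,"
), survey_path)

# Read the file; blank cells and the string "NA" both become NA
survey <- read.csv(survey_path, na.strings = c("", "NA"),
                   stringsAsFactors = FALSE)

# dim() returns c(number of rows, number of columns)
survey_dim <- dim(survey)
print(survey_dim)

# Check whether each package is installed
# (this loads the namespace but does not attach it the way library() does)
pkgs <- c("ggplot2", "e1071", "dplyr")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)
```

Any package reported as FALSE can be installed with install.packages() before calling library().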
Create a subset of the survey data that only contains the columns relevant for your analysis. Remove any extra rows from the initial read-in of the data that contain only "NA" values.
After you created your subset of the survey data and removed empty rows, how many rows and columns did the data set have?
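One way to build the subset and drop the empty rows is sketched below in base R. The data frame and its column names are made up for illustration; the lab script may do the same thing with dplyr instead.

```r
# Made-up stand-in for the survey; the last two rows are entirely NA
survey <- data.frame(
  Restaurant = c("Chipotle", "Taco Bell", "Chipotle", NA, NA),
  Spiciness  = c("Medium", "Mild", NA, NA, NA),
  Countries  = c(4, 1, 7, NA, NA),
  Timestamp  = c("a", "b", "c", NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the three columns of interest (hypothetical names)
keep_cols <- c("Restaurant", "Spiciness", "Countries")
survey_sub <- survey[, keep_cols]

# Flag rows in which every kept column is NA; rows with only some NAs stay
all_na <- rowSums(!is.na(survey_sub)) == 0
survey_sub <- survey_sub[!all_na, ]

# Rows and columns after cleaning
print(dim(survey_sub))
```

Note that the third row is kept even though one of its values is missing, because only rows that are empty in every column are removed.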
3. For each of your three variables, generate a table of summary statistics (either a frequency table or a table of descriptive statistics) and an appropriate visual representation of the data (either a bar graph or histogram). On each graph, adjust at least one of the aesthetic elements (such as the bar color or font choice) to something other than the default setting. Ensure that the plots are appropriately labeled. If generating a histogram, ensure that the binning scheme is logical.
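To check whether a binning scheme is logical before drawing anything, the bins can be inspected directly. The sketch below uses base R and made-up values; the bin width of 5 is an arbitrary choice for illustration, and the lab script's ggplot2 code will set bins differently.

```r
# Made-up continuous variable (for example, number of countries visited)
countries <- c(0, 1, 1, 2, 3, 3, 4, 5, 7, 8, 12, 15)

# Explicit, evenly spaced bin edges of width 5 that cover the full range
bin_edges <- seq(0, 15, by = 5)

# plot = FALSE returns the binning without drawing it.
# Bins are right-closed: [0, 5], (5, 10], (10, 15]
h <- hist(countries, breaks = bin_edges, plot = FALSE)
print(h$counts)

# To draw it with a non-default color and explicit labels:
# hist(countries, breaks = bin_edges, col = "steelblue",
#      main = "Countries visited", xlab = "Number of countries")
```

Every observation should land in exactly one bin, so the bin counts should add up to the number of non-missing values.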
Frequency tables can be generated using either the "dplyr" or "table" options from the code. If you are generating a table of descriptive statistics, please show the mean, median, standard deviation, skewness, and interquartile range.
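A frequency table and a table of the five requested descriptive statistics can be sketched in base R as follows. The data are made up, and the skew() helper is a hypothetical stand-in written here so the example runs without e1071; it is intended to follow the moment-based definition that e1071::skewness uses by default, but confirm against the package output.

```r
# Made-up data standing in for two survey columns
restaurant <- c("Chipotle", "Taco Bell", "Chipotle", "Chipotle", "Taco Bell")
countries  <- c(2, 4, 4, 5, 10)

# Frequency table for a categorical variable
freq <- table(restaurant)
print(freq)

# Hypothetical helper: mean cubed deviation divided by the cubed sample sd
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# One-row table of descriptive statistics for a continuous variable
desc <- data.frame(
  mean     = mean(countries),
  median   = median(countries),
  sd       = sd(countries),
  skewness = skew(countries),
  iqr      = IQR(countries)
)
print(desc)
```

Here the mean (5) is above the median (4) and the skewness is positive, which is what a long right tail produces.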
Categorical variable:
Copy and paste your summary statistics table here (you can copy directly from the command window to receive full credit, or earn a bonus point by exporting your table as a .csv file and then copying and pasting in a table from Excel or other software):
Insert your graph below. Ensure that the labels on the graph are appropriate for your variable!
Ordinal variable:
Copy and paste your summary statistics table here (you can copy directly from the command window to receive full credit, or earn a bonus point by exporting your table as a .csv file and then copying and pasting in a table from Excel or other software):
Insert your graph below. Ensure that the labels on the graph are appropriate for your variable!
Continuous variable:
Copy and paste your summary statistics table here (you can copy directly from the command window to receive full credit, or earn a bonus point by exporting your table as a .csv file and then copying and pasting in a table from Excel or other software). For this variable, be sure that your descriptive statistics include (at least) the mean, median, standard deviation, and skewness.
What is the interval of values that is within one standard deviation of the mean? (e.g., "-5.4 to +3.2")
What is the interval of values that is within two standard deviations of the mean? (e.g., "-10.7 to +9.9")
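Both intervals follow directly from the mean and standard deviation. A minimal sketch with made-up data (the variable names are illustrative only):

```r
# Made-up continuous variable
countries <- c(2, 4, 4, 5, 10)

x_mean <- mean(countries)  # 5
x_sd   <- sd(countries)    # 3

# Interval within one standard deviation of the mean: 2 to 8
within_1sd <- c(lower = x_mean - x_sd, upper = x_mean + x_sd)

# Interval within two standard deviations of the mean: -1 to 11
within_2sd <- c(lower = x_mean - 2 * x_sd, upper = x_mean + 2 * x_sd)

print(within_1sd)
print(within_2sd)

# Share of observations that fall inside the one-sd interval
share_1sd <- mean(countries >= within_1sd["lower"] &
                  countries <= within_1sd["upper"])
print(share_1sd)
```

A lower bound can be negative even when the variable itself cannot be, as in the two-sd interval here.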
Insert your graph below. Ensure that the labels on the graph are appropriate for your variable!
4. Examine how the descriptive statistics of your continuous and ordinal variables vary across the groups of your categorical variable.
First, generate a table that shows the descriptive statistics of your continuous variable within each of the groups of your categorical variable. You can copy directly from the command window to receive full credit, or earn a bonus point by exporting your table as a .csv file and then copying and pasting in a table from Excel or other software. Ensure that your table has appropriate column headers. Insert the table below.
Next, generate a split histogram that shows the distribution of the continuous variable within each of the groups of your categorical variable. Ensure that your figure has appropriate labels. Insert the split histogram below.
Next, generate a frequency table that shows the counts of responses for each of the ordinal variable categories within each of the groups of your categorical variable. You can copy directly from the command window to receive full credit, or earn a bonus point by exporting your table as a .csv file and then copying and pasting in a table from Excel or other software. Ensure that your table has appropriate column headers. Insert the table below.
Finally, generate a split bar graph that shows the frequency of responses for the ordinal variable within each of the groups of your categorical variable. Ensure that your figure has appropriate labels. Insert the split bar graph below.
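The two grouped tables in this question can be sketched in base R as shown below. The data and column names are made up, the output file name is hypothetical, and the lab script may use dplyr's grouping functions instead of tapply() and table().

```r
# Made-up survey subset: categorical, ordinal, and continuous columns
survey <- data.frame(
  Restaurant = c("Chipotle", "Taco Bell", "Chipotle",
                 "Taco Bell", "Chipotle", "Taco Bell"),
  Spiciness  = factor(c("Mild", "Hot", "Medium", "Mild", "Hot", "Hot"),
                      levels = c("Mild", "Medium", "Hot"), ordered = TRUE),
  Countries  = c(2, 1, 6, 3, 4, 8)
)

# Descriptive statistics of the continuous variable within each group
grp_mean <- tapply(survey$Countries, survey$Restaurant, mean)
by_group <- data.frame(
  group  = names(grp_mean),
  n      = as.vector(tapply(survey$Countries, survey$Restaurant, length)),
  mean   = as.vector(grp_mean),
  median = as.vector(tapply(survey$Countries, survey$Restaurant, median)),
  sd     = as.vector(tapply(survey$Countries, survey$Restaurant, sd))
)
print(by_group)

# Counts of each ordinal category within each group
crosstab <- table(survey$Restaurant, survey$Spiciness,
                  dnn = c("Restaurant", "Spiciness"))
print(crosstab)

# Export the grouped table as a .csv file (written to a temp folder here)
out_path <- file.path(tempdir(), "countries_by_restaurant.csv")
write.csv(by_group, out_path, row.names = FALSE)
```

Declaring the ordinal variable as an ordered factor keeps its categories in their natural order (Mild, Medium, Hot) rather than alphabetical order in the table and in any graph built from it.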
5. We're now going to revisit our work with the binomial distribution. We will continue with our example where we are estimating the probability that a student who is making random guesses on a multiple-choice exam answers a certain number of questions correctly. Be sure to follow the code carefully and make adjustments as needed to ensure you are calculating the probabilities correctly for the parameters you specify below. This will be especially important when creating the blank matrix for your for loop in the last step!
Let's set some parameters for our work.
Choose a number of choices that are available on each question (2 through 5):
Divide one by the number above. This is the probability of success on each trial (variable m in the code):
On what day of the month (1-31) does your birthday fall?
Let's use that number as the number of questions on the exam (the number of trials, variable n).
Assume that a student needs to get at least 50% of the questions right to pass the exam. How many questions do they need to get correct?
In how many ways (combinations) can a student get exactly the number of questions you specified above correct, given the total number of questions n?
(For example, there are 252 ways to get 5 questions right on a 10-question exam.)
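The 10-question example can be reproduced with choose(). This sketch uses the example's numbers rather than your own parameters, and the variable names are illustrative, not those in the lab script.

```r
# Example exam from the text: 10 questions, pass mark of at least 50%
n_questions <- 10

# Smallest whole number of correct answers that is at least 50%
# (ceiling() rounds up, which matters when the number of questions is odd)
needed <- ceiling(0.5 * n_questions)

# Number of ways to choose which questions are the correct ones
n_ways <- choose(n_questions, needed)
print(n_ways)  # 252

# The same count from the factorial definition n! / (k! (n - k)!)
n_ways_factorial <- factorial(n_questions) /
  (factorial(needed) * factorial(n_questions - needed))
print(n_ways_factorial)  # 252
```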
Use the binomial probability distribution formula to estimate the probability that they get EXACTLY the needed number of questions correct. What is that probability?
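The binomial formula is P(X = k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch for an illustrative case (10 questions, 4 choices per question, exactly 5 correct) is below; the names n, p, and k are chosen here and may differ from the variable names in the lab script.

```r
# Illustrative parameters, not your assigned ones
n <- 10      # number of questions (trials)
p <- 1 / 4   # probability of guessing one question correctly (4 choices)
k <- 5       # exact number of correct answers

# Binomial formula written out term by term
exact_manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in binomial probability mass function, as a cross-check
exact_builtin <- dbinom(k, size = n, prob = p)

print(round(exact_manual, 4))   # 0.0584
print(round(exact_builtin, 4))  # 0.0584
```

Agreement between the hand-built formula and dbinom() is a quick way to confirm that the parameters were entered correctly.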
Next, use a for loop to estimate the probability that a randomly guessing student gets AT LEAST the needed number of questions correct. What is that probability?
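The "at least" probability is the sum of the exact probabilities for every passing score. The sketch below assumes the same illustrative case (10 questions, 4 choices, at least 5 correct) and stores the results in a blank matrix with one row per passing score; the lab script's matrix layout may differ.

```r
# Illustrative parameters, not your assigned ones
n <- 10
p <- 1 / 4
needed <- 5

# Every passing score, from the minimum needed up to a perfect exam
k_values <- needed:n

# Blank matrix sized to match the number of passing scores
probs <- matrix(NA_real_, nrow = length(k_values), ncol = 1)

# Fill in the exact binomial probability for each passing score
for (i in seq_along(k_values)) {
  k <- k_values[i]
  probs[i, 1] <- choose(n, k) * p^k * (1 - p)^(n - k)
}

# Probability of at least the needed number correct
at_least <- sum(probs)
print(at_least)

# Cross-check with the upper tail of the built-in binomial distribution:
# P(X >= needed) = P(X > needed - 1)
check <- pbinom(needed - 1, size = n, prob = p, lower.tail = FALSE)
print(check)
```

If the matrix has too few rows for the chosen n, the loop fails with a subscript error; if it has too many, the leftover NA entries make the sum NA. Either symptom points to a sizing mistake.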
6. We'll wrap up the assignment with a look at the count data you collected a few weeks ago. As you recall, many "true" count processes can be modeled by the Poisson distribution.
Read in the "GIS470_CountData_Spr2018.csv" file that is available on Blackboard. This file contains one column for each student and one row for each of the 20 observations you made. (If you did not complete the assignment, you can use another student's data, but be sure that you understand what data they collected.) Recall that the metadata are also available on Blackboard by following the link to "Feb 26 Class Activities Spreadsheet."
Which column/variable will you be examining?
What do the values in this column represent?
First, create a simple bar graph that shows the number of occurrences you recorded in each interval over the ten-minute time period. Insert the graph below. (Note: there is no template code for generating this bar graph at this point in the script. You may have to borrow some ideas from earlier in the code, or use your well-honed skills of searching the internet for guidance.)
Next, create a histogram that has the probability of each count occurring on the vertical axis (rather than the count itself - use the example provided in the code). Insert that histogram below.
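The probability of each count is its relative frequency: the number of intervals with that count divided by the total number of intervals. A base R sketch with 20 made-up observations follows; the example in the lab script may build the same quantities with ggplot2.

```r
# 20 made-up counts, one per observation interval
counts <- c(3, 1, 4, 2, 2, 0, 5, 3, 1, 2, 4, 3, 2, 1, 0, 3, 2, 6, 2, 4)

# Every possible value from 0 up to the largest count observed,
# so that values that never occurred still appear with frequency 0
value_range <- 0:max(counts)

# How many intervals had each count
observed_freq <- table(factor(counts, levels = value_range))

# Convert frequencies to probabilities (relative frequencies)
observed_prob <- as.vector(observed_freq) / length(counts)
names(observed_prob) <- value_range
print(observed_prob)

# To draw it:
# barplot(observed_prob, xlab = "Count per interval", ylab = "Probability")
```

The probabilities should sum to 1, which is a useful check before plotting.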
Recall that only one parameter is needed to specify the Poisson distribution. What is that parameter?
What is the value of that parameter for your variable of interest?
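The Poisson distribution is specified by its mean rate, usually written lambda, and a common estimate of lambda is the sample mean of the observed counts. A short sketch with made-up counts (the name lambda_hat is chosen here for illustration):

```r
# 20 made-up counts, one per observation interval
counts <- c(3, 1, 4, 2, 2, 0, 5, 3, 1, 2, 4, 3, 2, 1, 0, 3, 2, 6, 2, 4)

# Estimate of the Poisson parameter: the mean count per interval
lambda_hat <- mean(counts)
print(lambda_hat)  # 2.5

# A Poisson variable has variance equal to its mean, so the sample
# variance gives a rough first check on how Poisson-like the data are
count_var <- var(counts)
print(count_var)
```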
Generate a histogram for a pure Poisson distribution that has the parameter you specified above. The x-axis for your histogram should span the range of values that you observed in your data (in the version of the script you downloaded, the range is set from 0 to 12, but this might not be appropriate for your data). Insert that histogram below.
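The theoretical probabilities come from dpois(), evaluated over the range of counts seen in the data. The sketch below uses made-up counts and places the observed and expected probabilities side by side; the column names are illustrative.

```r
# 20 made-up counts, one per observation interval
counts <- c(3, 1, 4, 2, 2, 0, 5, 3, 1, 2, 4, 3, 2, 1, 0, 3, 2, 6, 2, 4)

lambda_hat  <- mean(counts)    # estimated Poisson parameter
value_range <- 0:max(counts)   # x-axis: 0 up to the largest observed count

# Probability of each count under a pure Poisson distribution
expected_prob <- dpois(value_range, lambda = lambda_hat)
names(expected_prob) <- value_range

# Observed relative frequency of each count, for comparison
observed_prob <- as.vector(table(factor(counts, levels = value_range))) /
  length(counts)

comparison <- data.frame(
  count    = value_range,
  observed = observed_prob,
  expected = round(expected_prob, 3)
)
print(comparison)

# To draw the theoretical distribution:
# barplot(expected_prob, xlab = "Count per interval", ylab = "Probability")
```

The expected probabilities sum to slightly less than 1 because the Poisson distribution assigns some probability to counts above the largest value observed.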
How well do the data you collected match what would be expected for data that meet the assumptions for a Poisson distribution? If the distributions are not particularly well matched, what are some factors that might have contributed to that discrepancy?
7. [necessary to receive credit] Be sure to complete the following actions before closing R:
• Save your R script
• Save this Word document
• Upload both of those documents to Assignment #2 on Blackboard.
Please use intuitive file names.
8. This assignment will be scored out of 100 total points. How many points do you think you earned?