You will be leveraging the "R in Action" book by Robert Kabacoff extensively in this assignment.
Preparing the Data
1. The first thing you need to do is convert the following variables to factors: record_type, day, state, homeowner, car_value, and married_couple. Refer to the variable descriptions to understand the meaning of the variables and to know how to apply the appropriate labels to the factors, where needed. If the variable description specifies labels - for instance, for married_couple 1=no and 2=yes - then include them in the factor. Again, ?factor will be helpful here.
IMPORTANT: In order to make the factor changes stick (i.e., have the factor names you assign be permanent), you have to assign the results of the factor() function back to the column in the allstate data frame, like this: allstate$record_type <- factor().
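Here is a minimal sketch of the factor conversion on a toy data frame. The data frame `toy` and its values are invented for illustration, and the married_couple codes simply follow the 1=no, 2=yes example above; check the variable descriptions for the real codes.

```r
# Toy stand-in for the allstate data frame (values are invented).
toy <- data.frame(
  married_couple = c(1, 2, 2, 1),
  state = c("NM", "ID", "NM", "UT"),
  stringsAsFactors = FALSE
)

# Coded variable: map each code to a readable label and assign the result
# back to the column so the change sticks.
toy$married_couple <- factor(toy$married_couple,
                             levels = c(1, 2),
                             labels = c("no", "yes"))

# Variable with no special labels: a plain factor() call is enough.
toy$state <- factor(toy$state)

str(toy)
```

The levels argument lists the raw codes and labels gives the names in the same order, so the two vectors must line up.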
Exploring structure and summary statistics
2. Now view the structure of your data frame with the str() function to ensure that record_type, day, state, homeowner, car_value, and married_couple are factors with the correct levels. Notice that it's the same view contained in the Environment tab in the upper-right quadrant of RStudio.
3. Produce a summary of the Allstate data frame. Is car_age skewed? Is cost skewed? If they are, in what direction are they skewed (i.e., left or right)? Explain how you know the answers to these questions.
4. In a two-dimensional table, display the count, mean, std dev, skew, kurtosis, and standard error of these variables: group_size, car_age, age_oldest, age_youngest, duration_previous, and cost. HINT: See Listing 7.2 of Kabacoff pg. 139. This is a great example of how to use sapply. It also provides you with the formulas for skew and kurtosis. If you don't remember the formula for standard error, "google" it.
If you don't understand what the e means in the numbers in your output, google "scientific notation".
5. If you set up your mystats function like Kabacoff did on pg. 139, you probably noticed that the statistics for the duration_previous column come out as NA because the column contains NA values. You can pass a TRUE value to the na.omit parameter of the mystats() function by including na.omit=T as the third parameter in the sapply function. Try it and notice how it affects the count of duration_previous.
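The sketch below shows the sapply pattern on invented data. The mystats() function here is modeled on the Kabacoff listing, with a standard error (standard deviation divided by the square root of n) added; treat it as one reasonable way to write it, not the only one. The data frame `toy` is hypothetical.

```r
# Descriptive statistics in the style of Kabacoff's mystats(), plus standard error.
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]   # drop missing values when asked
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  se <- s / sqrt(n)                # standard error of the mean
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt, se = se)
}

# Toy data (invented); duration_previous contains an NA on purpose.
toy <- data.frame(
  car_age = c(2, 5, 9, 14, 1),
  duration_previous = c(3, NA, 0, 7, 15)
)

with_na <- sapply(toy, mystats)                  # the NA propagates into the statistics
no_na   <- sapply(toy, mystats, na.omit = TRUE)  # extra arguments are passed on to mystats()
with_na
no_na
```

Any argument given to sapply after the function name is handed to that function on every call, which is why na.omit = TRUE reaches mystats().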
6. As you'll discover, there are a plethora of packages and ways to generate simple descriptive statistics in R. One that Kabacoff does not cover is the ddply() function found in the plyr package. Install the plyr package with this command: install.packages("plyr"). Then load it with the library() function and display the cost mean and standard error for each of the married_couple by homeowner groups using ddply(). Remember! Help is your friend: ?ddply. If you find help unhelpful (which is fairly normal), then google r ddply.
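A rough sketch of the ddply() pattern on invented data follows. It assumes the plyr package is already installed and skips the call if it is not; it uses the plyr:: prefix instead of library() only so the snippet stands alone. The data frame `toy` and the column names mean_cost and se_cost are made up for illustration.

```r
# Toy data (invented): two rows in each married_couple by homeowner group.
toy <- data.frame(
  married_couple = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")),
  homeowner      = factor(c("no", "no", "yes", "yes", "no", "no", "yes", "yes")),
  cost           = c(600, 620, 640, 660, 610, 630, 650, 690)
)

# Only run if plyr is available (assumption: it was installed beforehand).
if (requireNamespace("plyr", quietly = TRUE)) {
  by_group <- plyr::ddply(
    toy, c("married_couple", "homeowner"), plyr::summarise,
    mean_cost = mean(cost),
    se_cost   = sd(cost) / sqrt(length(cost))  # standard error within each group
  )
  print(by_group)
}
```

ddply() splits the data frame by the grouping variables, applies summarise to each piece, and stacks the results back into one data frame.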
Creating count tables
7. Create a table to display the number of records in each state. Which state is most represented in this data set?
8. Create a table to display how many shopping points and purchase points are in the data. What's the approximate ratio of purchase points to shopping points? Hopefully, you have noticed that the table() function is useful for creating count tables.
9. Now, create a two-way table showing the counts of days of week and states, BUT only include purchases (i.e., exclude shopping points).
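As a small illustration of table() on invented data, the sketch below builds a one-way count and then a two-way count restricted to a subset of rows. The data frame `toy` and its values are hypothetical.

```r
# Toy data (invented).
toy <- data.frame(
  record_type = factor(c("shopping", "shopping", "purchase", "shopping", "purchase")),
  day   = factor(c("Mon", "Tue", "Tue", "Fri", "Fri")),
  state = factor(c("NM", "NM", "ID", "ID", "NM"))
)

# One-way count table.
type_counts <- table(toy$record_type)
type_counts

# Two-way count table restricted to the purchase rows.
purchases <- toy[toy$record_type == "purchase", ]
day_by_state <- table(purchases$day, purchases$state)
day_by_state
```

Passing two vectors to table() cross-tabulates them, with the first along the rows and the second along the columns.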
10. Create a three-way table of counts using group_size, homeowner and risk_factor using the xtabs() function.
NOTE: The xtabs() function employs R's formula notation, which takes the pattern y ~ x1 + x2 + ... + xn, where y is the dependent variable and x1 through xn are the independent variables. It is common when using xtabs() to leave out the left-hand side of the formula if you just want to generate counts for each of the cross-tabulated groups. For example, with this data set, you might specify ~ risk_factor + day to get a two-way table. See Listing 7.11 on pg. 149 in the Kabacoff book for an example. If you want to sum the data in each cross-tabulated group, then you can specify the variable you want to sum on the left side of the formula. For example, if you wanted a sum of the costs in the previously specified two-way table, your formula would look like this: cost ~ risk_factor + day.
11. You probably noticed that the third dimension in that table is displayed kind of clunky. You can fix this by wrapping your table in the ftable() function. Again, refer to Listing 7.11 for an example. Go ahead and clean up your table with the ftable() function.
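To make the formula notation concrete, here is a short sketch on invented data showing a count formula (empty left-hand side), a sum formula (variable on the left), and ftable() for flattening a three-way table. The data frame `toy` is hypothetical.

```r
# Toy data (invented).
toy <- data.frame(
  group_size  = c(1, 2, 1, 2, 1, 2),
  homeowner   = factor(c("no", "yes", "yes", "no", "no", "yes")),
  risk_factor = c(1, 1, 2, 2, 1, 2),
  cost        = c(600, 640, 610, 650, 620, 670)
)

# Empty left-hand side: count the rows in each cross-tabulated cell.
counts <- xtabs(~ group_size + homeowner + risk_factor, data = toy)
counts          # printed as one two-way slice per risk_factor level
ftable(counts)  # the same counts flattened into a single compact table

# Variable on the left-hand side: sum it within each cell instead.
sums <- xtabs(cost ~ homeowner + risk_factor, data = toy)
sums
```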
Creating Other Aggregated Tables
12. Create a table showing the average car age for each of the car_value levels. NOTE: The prior questions dealt with counts; this one deals with means, so you'll need another function. Try aggregate(). If you look in the examples for aggregate(), you'll notice that you can use the R formula notation to aggregate the data.
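A minimal sketch of aggregate() with the formula interface, on invented data: the variable to summarize goes on the left of the tilde and the grouping variable on the right. The data frame `toy` and its values are hypothetical.

```r
# Toy data (invented).
toy <- data.frame(
  car_value = factor(c("a", "b", "a", "c", "b", "c")),
  car_age   = c(20, 10, 16, 3, 8, 5)
)

# Mean of car_age within each car_value level.
avg_age <- aggregate(car_age ~ car_value, data = toy, FUN = mean)
avg_age
```

The result is a data frame with one row per group, which is handy to keep around for later plotting.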
Creating plots and graphs
We'll start with bar plots. You've all seen them, but have you really thought about them? For instance, what kind of variable (i.e., categorical or continuous) do bar plots display? If you are thinking, "Hmmm, the different bars on the x-axis have to be driven by a categorical variable...", then you are absolutely correct. Well, that takes care of the x-axis. What about the y-axis? You might be tempted to think "Easy! Continuous!", but you would not be completely correct. The y-axis usually represents some aggregated value (e.g., a sum or a mean). So with that in mind, let's get going!
13. Create a bar plot displaying the number of records that are shopping points and the number of records that are purchase points. Give it an appropriate "main" title and axis labels. NOTE: You won't be able to feed the raw data frame into the barplot() function. You'll need to create a table first to create the aggregated values you want to plot. See pg. 118-120 of Kabacoff for examples.
14. Now add some color to the bar plot. Make shopping points blue and purchase points green. HINT: You can use "blue" and "green" in your color vector.
15. Create a bar plot (with color) that displays the average for the oldest person on the policy for only the purchase points for each of the risk_factor levels. Again you'll need to create a table first - try aggregate() or ddply(). You need to give barplot() a vector, not a matrix or data frame. CHALLENGE: If you want to play with different colors that are automatically generated, try using the RColorBrewer package.
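The sketch below illustrates the two bar plot patterns described above on invented data: plotting counts from table(), and plotting means after pulling a named vector out of the data frame that aggregate() returns. The data frame `toy`, the titles, and the colors for the second plot are all made up for illustration.

```r
# Toy data (invented).
toy <- data.frame(
  record_type = factor(c("shopping", "shopping", "purchase", "shopping", "purchase")),
  risk_factor = c(1, 2, 1, 2, 2),
  age_oldest  = c(30, 40, 50, 60, 70)
)

# Counts: table() produces the aggregated values that barplot() needs.
counts <- table(toy$record_type)
barplot(counts,
        main = "Records by type (toy data)",
        xlab = "Record type", ylab = "Number of records",
        col = c("green", "blue"))  # colors follow the level order: purchase, shopping

# Means: aggregate() returns a data frame, so build a named vector for barplot().
means <- aggregate(age_oldest ~ risk_factor, data = toy, FUN = mean)
heights <- setNames(means$age_oldest, means$risk_factor)
barplot(heights,
        main = "Mean oldest age by risk factor (toy data)",
        xlab = "Risk factor", ylab = "Mean age of oldest person",
        col = c("steelblue", "orange"))
```

Colors are matched to bars by position, so check the order of the factor levels before assigning them.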
Enough of bar plots. Time for histograms! Yay! A histogram is a special kind of bar plot intended to display the distribution of a variable. Why is it special? Well, there is no categorical variable on the x-axis. The x-axis is a bunch of "buckets" that break up an otherwise continuous variable along the x-axis. These buckets hold small ranges of the continuous variable you are plotting. So what kind of variable is on the y-axis? If you are thinking, "A continuous variable," then you are incorrect. It's not a continuous variable. It's the count of the number of values of the continuous variable that falls into each bucket. So a histogram really only involves a single variable.
16. Create a histogram of the cost variable using only the purchase points. Add a title and a label for the x-axis. Refer to pp. 125-126 of Kabacoff for help. Is cost normally distributed? (Oh, yeah! Now we are really wiping away the Statistics cob webs, huh? Normally distributed? What the heck is that? "google" it if you need to. It will be important when we get to the regression world.)
17. Now increase the number of bins in the histogram you created to 25.
The distribution of cost appears pretty much normally distributed (except for that rascally long tail on the left), but sometimes it's not easy to see the distribution with a histogram. That is when the density plot is useful. (No, McFly! You are not my density!)
18. Create a density plot for cost. Refer to Listing 6.7 in Kabacoff for help.
19. Now, if you are obsessive like I am, you are probably being driven nuts by that elongated left tail on our distribution plots. Find the values that are causing it and decide if you can remove them. If you remove them, create a new density plot. If you decide not to remove them, explain why. Either way, revisit the question of whether the cost variable is normally distributed and explain your thinking.
TIP: I suggest storing the cost column into a separate variable for this problem - like this: myCost <- [your data frame name]$cost. Then operate on myCost and use it to generate your new density plot.
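Here is a hedged sketch of the histogram and density plot workflow on simulated values. The vector myCost below is invented (200 roughly normal values plus two artificially low ones), and the cutoff of 400 is arbitrary and only suits this toy vector. Whether removal is justified for the real data is the judgment call the question asks you to make.

```r
set.seed(2)
# Invented costs: 200 roughly normal values plus two artificially low ones.
myCost <- c(rnorm(200, mean = 635, sd = 45), 260, 275)

# breaks = 25 is only a suggestion; R rounds to "pretty" break points,
# so the number of bins drawn may not be exactly 25.
h <- hist(myCost, breaks = 25,
          main = "Cost (toy data)", xlab = "Cost")

# Density plot of the same values.
plot(density(myCost), main = "Density of cost (toy data)")

# Look at the smallest values to see what is stretching the tail.
head(sort(myCost))

# Drop them with a cutoff (arbitrary here) and plot again.
trimmed <- myCost[myCost > 400]
plot(density(trimmed), main = "Density of cost, low values removed (toy data)")
```

Working on a copy such as myCost leaves the original data frame untouched, so you can always compare before and after.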
Now let's move on to box plots. Kabacoff pg. 129 has a good description of box plots, if you need refreshing. In short, box plots are another perspective on the distribution of a variable, only focused on the median and quartiles. (Oh man! What's the difference between a mean and a median?! Does a quartile have any meaning in relation to the mean?!)
20. Create a simple box plot of the age_youngest variable. Add an appropriate y-axis label and main title.
21. Now create a box plot to compare the distribution of the youngest age for policies with and without a married couple. HINT: You'll need to use the R formula notation. Add the proper axis labels.
22. Create a box plot to compare the age of a car on the policy with the value of the car. Add appropriate axis labels. Based on the box plot, which level of car_value do you think represents the cars of least value?
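As an illustration of the two box plot forms on invented data, the sketch below draws a single box plot and then a grouped one using the formula notation (continuous variable on the left, grouping factor on the right). The data frame `toy`, the titles, and the labels are hypothetical.

```r
# Toy data (invented).
toy <- data.frame(
  married_couple = factor(c("no", "no", "no", "yes", "yes", "yes")),
  age_youngest   = c(22, 30, 41, 35, 48, 60)
)

# Single box plot of one variable.
boxplot(toy$age_youngest,
        ylab = "Age of youngest person",
        main = "Youngest age (toy data)")

# One box per group: continuous ~ categorical.
bp <- boxplot(age_youngest ~ married_couple, data = toy,
              xlab = "Married couple on policy",
              ylab = "Age of youngest person",
              main = "Youngest age by marital status (toy data)")

bp$stats[3, ]  # third row of the stats matrix holds the group medians
```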
Now we'll move on to Chapter 11 of Kabacoff and cover scatter plots (pg. 256). Scatter plots are useful for examining the relationship (think correlation!) between two continuous variables. (That's right! Say goodbye to the categorical in this realm!)
23. Show the relationship between the oldest age and the cost of policies purchased in New Mexico with a scatter plot. Refer to Listing 11.1 for help. HINT: I recommend extracting out the records for New Mexico and purchases into a separate variable.
24. Now fit a smoothed line on the scatterplot using the lines() and lowess() functions to emphasize any relationship between the age of the oldest person on the policy and the cost of the policy. Again, see Listing 11.1 for help. NOTE: You might need to make the line a different color than the points on the plot.
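A minimal sketch of a scatter plot with a lowess line, using simulated values. The vectors and the downward relationship between them are invented purely so there is something to smooth; they say nothing about the real data.

```r
set.seed(3)
# Invented data: cost drifts down with age, plus noise.
age_oldest <- sample(18:75, 60, replace = TRUE)
cost <- 700 - 1.2 * age_oldest + rnorm(60, sd = 20)

plot(age_oldest, cost,
     main = "Cost vs. oldest age (toy data)",
     xlab = "Age of oldest person", ylab = "Cost")

# lowess() returns the smoothed x and y values; lines() draws them on the open plot.
fit <- lowess(age_oldest, cost)
lines(fit, col = "red", lwd = 2)
```

lines() adds to the plot that is already open, so plot() has to be called first.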
26. Now use a scatterplot to compare the duration of the customers' previous insurance issuer with the cost for both New Mexico and Idaho. Use only the "purchase" data. Include labels and boxplots on the x and y axes.
HINT: Again, subset out the data you need first.
HINT #2: Use the scatterplot() function from the car library as shown on pg. 258 in the Kabacoff book. You'll need to click the "Zoom" button in the "Plots" window to get a good view of it.
HINT #3: You will also need to reset the state factor after you subset the NM and ID data out. Something like this: nm_id_purchased$state <- factor(nm_id_purchased$state).
Which state has higher policy costs?
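The factor reset mentioned in the hints is needed because subsetting keeps the original levels even when no rows use them. The sketch below shows this on invented data; the data frame `toy` and the name nm_id are hypothetical.

```r
# Toy data (invented).
toy <- data.frame(
  state = factor(c("NM", "ID", "UT", "CO", "NM", "ID")),
  cost  = c(600, 640, 610, 650, 620, 670)
)

# Keep only the NM and ID rows.
nm_id <- toy[toy$state %in% c("NM", "ID"), ]

before <- levels(nm_id$state)  # still lists CO and UT, even though no rows use them
table(nm_id$state)             # the unused levels show up with zero counts

# Rebuilding the factor drops the unused levels.
nm_id$state <- factor(nm_id$state)
after <- levels(nm_id$state)
before
after
```

Plots that group by state would otherwise reserve space or legend entries for the empty levels.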
27. Using the subset of data you created for New Mexico and Idaho, create a scatter plot matrix (like the one shown on pg. 260) of the following variables: car_age, risk_factor, age_oldest, duration_previous, and cost. Add an appropriate title to the plot.
What are those plots down the diagonal of the plot matrix? Which variable is the most normally distributed? Which is the least?
Which two variables are the most positively correlated? negatively correlated?
Challenge Questions (no, not extra credit questions)
Now that you have really "whet your awa whistle" (as my 6-year-old might say when she's older and still can't pronounce her r's), I'm going to give you two open-ended questions that you will need your R-descriptive-statistics-plotting skills to answer. Good luck!
28. (Descriptive Table) Which state has the highest average policy cost for policies purchased on Fridays? HINT: This page will be helpful in finding the maximum of the average costs.
29. (Graph or Plot) Your boss has requested to see the separate distributions of the oldest people on purchased policies from CO, ID, NM, and UT. He wants to see the oldest age distribution of each of the four states on a single plot.