Assignment: EDA and Bootstrapping
Objective
Statistics play a vital role in data science for (at least) two reasons. First, it can be used to gain a deep understanding of data. This process is known as Exploratory Data Analysis (EDA).
Second, it can be used to infer the relationship between a sample and the population. This process is known as inference. In this assignment, you will learn about EDA and statistical inference through the analysis of a very interesting dataset - property tax report data. Specifically, you will learn the followings:
1. Be able to perform EDA on a single column (i.e., univariate analysis)
2. Be able to perform EDA on multiple columns (i.e., multivariate analysis)
3. Be able to extract insights from visualizations
4. Be able to ask critical questions about data
5. Be able to estimate a population parameter based on a sample
6. Be able to use the bootstrap to quantify the uncertainty of an estimated value
In this assignment, you can use pandas or PySpark to manipulate data, and use matplotlib or seaborn to make plots.
Part 1. EDA
Imagine you are a data scientist working at a real-estate company. In this week, your job is to analyze the Vancouver's housing price. You first download a dataset from property_tax_report_2018.zip (property_tax_report_2018.zip). The dataset contains information on properties from BC Assessment (BCA) and City sources in 2018. You can find the schema information of the dataset from attached file. But this is not enough. You still know little about the data. That's why you need to do EDA in order to get a better and deeper understanding of the data.
We first load the data as a DataFrame. To make this analysis more interesting, I added two new columns to the data: CURRENT_PRICE represents the property price in 2018; PREVIOUS_PRICE represents the property price in 2017.
In [ ]:import pandas as pd
df = pd.read_csv("property_tax_report_2018.csv")
df['CURRENT_PRICE'] = df.apply(lambda x: x['CURRENT_LAND_VALUE ']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)
df['PREVIOUS_PRICE'] = df.apply(lambda x: x['PREVIOUS_LAND_VAL UE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)
Now let's start the EDA process.
Question 1. Look at some example rows
Print the first five rows of the data:
Question 2. Get summary statistics
From the above output, you will know that the data has 28 columns. Please use the describe() function to get the summary statistics of each column.
Please look at the above output carefully, and make sure that you understand the meanings of each row (e.g., std, 25% percentile).
Question 3. Examine missing values
Now we are going to perform EDA on a single column (i.e., univariate analysis). We chose YEAR_BUILT, which represents in which year a property was built. We first check whether the column has any missing value.
Missing values are very common in real-world datasets. In practice, you should always be aware of the impact of the missing values on your downstream analysis results.
Question 4. Plot a line chart
We now start investigating the values in the YEAR_BUILT column. Suppose we want to know: "How many properties were built in each year (from 1900 to 2018)?" Please plot a line chart to answer the question.
Please write down the two most interesting findings that you draw from the plot. For example, you can say: "Vancouver has about 6300 properties built in 1996 alone, which is more than any other year". For each finding, please write no more than 2 sentences.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 5. Plot a bar chart
Next, we want to find that, between 1900 and 2018, which years have the most number of properties been built? Plot a bar chart to show the top 20 years.
In [ ]:
Please write down the two most interesting findings that you draw from the plot.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 6. Plot a histogram
What's the distribution of the number of properties built between 1900 and 2018? Please plot a histogram to answer this question.
In [ ]:
Please write down the two most interesting findings that you draw from the plot.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 7. Make a scatter plot
Suppose we are interested in those years which built more than 2000 properties. Make a scatter plot to examine whether there is a relationship between the number of built properties and the year?
In [ ]:
Please write down the two most interesting findings that you draw from the plot.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 8. PDF and CDF
Can you believe that you have already drawn 8 interesting findings by exploring a single column! This is the power of EDA combined with critical thinking. Now we are moving to multivariate analysis.
Suppose you want to compare the housing price between this year and last year, i.e., CURRENT_PRICE vs. PREVIOUS_PRICE. You can plot their distributions, and make the comparison. There are two ways to define a distribution: Probabilistic Distribution Function and Cumulative Distribution Function (CDF).
In the following, please make two plots and put them side-by-side.
In the first plot, use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE.
In the second plot, use histograms to plot the cumulative distributions of CURRENT_PRICE and PREVIOUS_PRICE.
There are a few properties which are way more expensive than the others. For both plots, please exclude those properties by setting xlim = (0, 5Million).
In [ ]:
Please write down the two most interesting findings that you draw from the plots.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 9. Use EDA to answer an interesting question (1)
In the above plots, we found that the overall housing price has increased, but we do not which type of property has increased more.
Now we add another variable LEGAL_TYPE (e.g., STRATA, LAND) to the analysis, and consider three variables (LEGAL_TYPE, CURRENT_PRICE, PREVIOUS_PRICE) in total.
In the following, please make two plots and put them side-by-side.
In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "STRATA".
In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "LAND".
In [ ]:
Please write down the two most interesting findings that you draw from the plots.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 10. Use EDA to answer interesting questions
Although the housing price of the entire Vancouver area is increasing, there might be some areas whose housing price is decreasing. To answer this question, we need to consider another column
-- PROPERTY_POSTAL_CODE.
PROPERTY_POSTAL_CODE (e.g., "V5A 1S6") is a six-character string with a space separating the third and fourth characters. We use the first three characters to represent an area.
We first filter out the areas which have less than 10 properties. For each of the remaining areas, we calculate the percentage of the properties whose price has decreased compared to the last year. For example, if an area "V5A" has 50 properties, and 30 of them have decreased, then the percentage is 60%.
Please write code to find the top-10 areas with the highest percentages. Create a bar chart to visualize them.
In [ ]:
Please write down the two most interesting findings that you draw from the plot.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Question 11. Come up with your own question.
You need to complete the following three tasks.
Firstly, please come up with an interesting question on your own (like Q9 and Q10).
A short description of the question: [ADD TEXT]
Secondly, please write code so that the output of your code can answer the question.
In [ ]:
Thirdly, please write the two most important findings.
Findings
1. [ADD TEXT]
2. [ADD TEXT]
Part 2. Bootstrapping
In Part 1, we run our analysis over the full dataset. In reality, however, you may not be that lucky. It is more often than not that you can only collect a sample of the data. Whenever you derive a conclusion from a sample (e.g., The Vancouver's housing price has increased by 10\% since last year), you should ALWAYS ask yourself: "CAN I TRUST IT?". In other words, you want to know that if the same analysis was conducted on the full data, would the same conclusion be derived? In Part 2, you will learn how to use bootstrapping to answer this question.
Please download the sample dataset property_tax_report_2018_sample.zip (property_tax_report_2018_sample.zip), and load it as a DataFrame.
In [ ]:df_sample = pd.read_csv("property_tax_report_sample.csv")
df_sample['CURRENT_PRICE'] = df_sample.apply(lambda x: x['CURR ENT_LAND_VALUE']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)
df_sample['PREVIOUS_PRICE'] = df_sample.apply(lambda x: x['PRE VIOUS_LAND_VALUE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)
df_sample = df_sample[df_sample['LEGAL_TYPE'] == 'STRATA']
Task 1. Analysis Result Without Bootstrapping
Please compute the median of PREVIOUS_PRICE and CURRENT_PRICE, respectively, and compare them in a bar chart.
In [ ]:
Task 2. Analysis Result With Bootstrapping
From the above chart, we find that the median of PREVIOUS_PRICE is about 0.6 M, and the median of CURRENT_PRICE is about 0.7 M. Since the numbers were obtained from the sample, "CAN WE TRUST THESE NUMBERS?"
In the following, please implement the bootstrap by yourself, and compute a 95%-confidence interval for each number. You can find the description of the algorithm in Section 7.
Attachment:- Assignment.rar