Problem Set
This problem set asks you to download and analyze data from the World Bank. Submit your solutions to me by email before class on Tuesday, Sept 13. Your submission should include two files: a Word document containing your written answers to each question and any graphs requested, and an Excel file showing your work. If you don't have Microsoft Office on your computer, you can complete this assignment with a free, open-source office suite such as LibreOffice (https://www.libreoffice.org) or OpenOffice (https://www.openoffice.org/). Both include word processing and spreadsheet programs that you can use in place of Word and Excel.
Download the following two indicators from the World Bank:
GNI per capita, PPP (current international $)
https://data.worldbank.org/indicator/NY.GNP.PCAP.PP.CD?view=chart
Adult literacy rate, population 15+ years, both sexes (%)
https://data.worldbank.org/indicator/SE.ADT.LITR.ZS?view=chart
Create a single data set for the year 2010, with countries in rows and variables in columns. The data set should contain individual countries only. Remove all aggregate observations, such as High Income, Euro area, and World.
There are two ways that you can create this data set.
1. The first option is to download the datasets individually from the links provided above. Then you can use copy and paste to put the data for 2010 into a single spreadsheet. You will have to look through the data and remove observations for regional aggregates.
2. The second option is to format the dataset using the World DataBank (https://databank.worldbank.org), then download a single file. If you choose to use the databank method, make sure that you do not download formatted data. If your data contain commas (e.g. 74,342.00), Excel will read the data as text, and you will be unable to perform calculations on them. Your data should not contain regional aggregates. You can exclude these in the DataBank selection, or delete them in Excel after downloading the data.
In either case, your data set should be organized like the ones on the next page.
After downloading your data, delete all observations that have missing data. To make this faster, first sort the data by literacy rate, then delete observations with missing data. Your final data set should have 43 observations with no missing data.
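If you would like to check your cleaning step outside Excel, the short Python sketch below applies the same logic to a few made-up rows. The country names, the numbers, the column keys, the short list of aggregate names, and the use of an empty cell to mark a missing value are all assumptions for illustration. Your downloaded file may mark missing values differently and will contain many more aggregates.

```python
# Made-up rows standing in for the merged 2010 data (not real World Bank values).
rows = [
    {"country": "Country A", "gni_pc": "1,500.00", "literacy": "55.2"},
    {"country": "Country B", "gni_pc": "", "literacy": "80.1"},
    {"country": "Country C", "gni_pc": "9000", "literacy": ""},
    {"country": "World", "gni_pc": "11000", "literacy": "84.0"},
    {"country": "Country D", "gni_pc": "74,342.00", "literacy": "97.0"},
]

# Hypothetical, incomplete list of aggregate names; matching is exact,
# so the spelling must agree with the names in your own file.
AGGREGATES = {"World", "High income", "Euro area"}


def to_number(text):
    """Convert a cell to a float, removing thousands separators.

    Returns None for an empty cell (assumed here to mean "missing").
    """
    text = text.strip().replace(",", "")
    return float(text) if text else None


def clean(rows, aggregates=AGGREGATES):
    """Keep country rows in which both variables are observed."""
    kept = []
    for row in rows:
        if row["country"] in aggregates:
            continue  # drop regional and income-group aggregates
        gni = to_number(row["gni_pc"])
        lit = to_number(row["literacy"])
        if gni is None or lit is None:
            continue  # drop observations with missing data
        kept.append({"country": row["country"], "gni_pc": gni, "literacy": lit})
    return kept


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

The `to_number` helper also shows why formatted values such as 74,342.00 cause trouble: the comma has to be removed before the text can be treated as a number.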
In Excel, your data set should look like this:
Question 1: Scatter plot
Create a scatter plot with GNI per capita on the vertical axis and adult literacy on the horizontal axis. Give your graph an informative title and label both axes appropriately. Do the two variables appear to have a linear relationship?
Question 2: Correlation
Calculate the correlation coefficient between the two variables. Follow the format used in the handout on correlation, page 7. Consider GNI per capita as the y-variable and adult literacy as the x-variable. Including the scatter plot, your Excel sheet should look like the picture below. What is the correlation between the two variables? How would you interpret this statistic?
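To check your spreadsheet arithmetic, the Python sketch below computes the correlation coefficient from sums of deviations from the means. It uses the standard textbook formula, which may be laid out differently from the handout, and the x and y values are toy numbers chosen for illustration, not World Bank data.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient computed from deviations from the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sums of squared deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Toy data: y is an exact multiple of x, so the relationship is perfectly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(correlation(x, y))
```

Because the toy y values are an exact multiple of the x values, the result should be 1 up to rounding. In Excel, the built-in CORREL function gives you a second way to verify your own column-by-column calculation.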
Question 3: Linear Regression Formula
Suppose you are interested in explaining GNI per capita in terms of adult literacy using linear regression. Which variable is your dependent variable? Which variable is your independent variable? Write the linear regression equation that you would use to explore this relationship.
Question 4: Linear Regression Coefficients
Calculate the slope and intercept coefficients from your linear regression equation. What are the coefficients? Interpret the meaning of the slope coefficient. Include the measurement units of each variable. (Hint: Are you interpreting a one percent change in the adult literacy rate or a one percentage point change in the adult literacy rate?)
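As a cross-check on your Excel work, the Python sketch below computes the two coefficients using the usual least-squares formulas: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The function name `fit_line` and the toy numbers are assumptions for illustration only.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data, not World Bank values.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

Excel's built-in SLOPE and INTERCEPT functions should agree with the values you calculate by hand in your spreadsheet.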
Question 5: Predicted values of GNI
Using your regression coefficients from Question 4, calculate the predicted value of GNI per capita for each observed value of adult literacy. Add these data points to your scatter plot as a line, as pictured below. Are all of the predicted values reasonable? Are there any that are impossible? Explain.
Question 6: Estimated residuals
Use the predicted values, yhat, to calculate the estimated residuals, uhat. What is the expected value (mean) of your estimated residuals?
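The mechanics of this step can be sketched in Python as follows: each predicted value is the intercept plus the slope times the observed x, and each residual is the observed y minus its predicted value. The function name `fit_line`, the variable names `y_hat` and `u_hat`, and the toy numbers are assumptions for illustration. Computing the mean of the residuals is left to you.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Toy data, not World Bank values.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]

intercept, slope = fit_line(x, y)

# Predicted value for each observed x, then residual = observed - predicted.
y_hat = [intercept + slope * xi for xi in x]
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]

for xi, yi, yh, uh in zip(x, y, y_hat, u_hat):
    print(xi, yi, round(yh, 4), round(uh, 4))
```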