1. Constructing a Data Set
For this assignment, you will download a data set from the University of Michigan Panel Survey Of Income Dynamics. Note that during this process you will need to create an account. To access the PSID, go to https://psidonline.isr.umich.edu/. (If you are using Windows Explorer, make sure you are compatibility mode.) From the home page you will select Data, and then Data Center under the section labeled Public Use. Once you get to the Data Center, select File. From this page, expand PSID Family- Level and PSID Main Family Data. Then expand the 2005 survey from which you will select your variables. Following the instructions given on the screen, select the following variables.
ER25017 AGE OF HEAD
ER25018 SEX OF HEAD
ER25020 # CHILDREN IN FU
ER25023 HEAD MARITAL STATUS
ER25128 BC21 MAIN IND FOR JOB 1: 2000 CODE (HD)
ER25160 BC41 YRS PRES EMP (H-E)
ER25161 BC41 MOS PRES EMP (H-E)
ER25910 G13 WAGES/SALARY OF HEAD
ER27393 L40 RACE OF HEAD-MENTION 1
ER27418 L55 HGHST COLLEGE DEGREE RECD-HD
ER27931 LABOR INCOME OF HEAD-2004
ER28003 HEAD WAGE RATE-2006
ER28047 COMPLETED ED-HD
After you add the data to your cart, go to your cart and complete the check-out process. Select PDF for the Code Book type and Microsoft Excel Spreadsheet for the data output type. Save both in a safe place.
The first thing you must do after downloading your data is to clean it. It will be easiest to do this by using the filter feature in Excel and deleting rows as appropriate. A separate handout will be posted for those of you who do not know how to use this.
a. We need to eliminate from our sample each household for which we are missing data. Typically, this will be a case where a respondent has refused to answer or does not know the answer. Eliminate observations as indicated in the table below:
Variable
|
Eliminate Observations with Code
|
AGE OF HEAD
|
999
|
HEAD MARITAL STATUS
|
8, 9
|
BC21 MAIN IND FOR JOB 1: 2000 CODE (HD)
|
0, 999
|
BC41 YRS PRES EMP (H-E)
|
98, 99*
|
BC41 MOS PRES EMP (H-E)
|
98, 99*
|
G13 WAGES/SALARY OF HEAD
|
0, 9,999,998, 9,999,999
|
L40 RACE OF HEAD-MENTION 1
|
0, 9
|
L55 HGHST COLLEGE DEGREE RECD-HD
|
8, 97, 98, 99
|
LABOR INCOME OF HEAD-2004
|
0
|
HEAD WAGE RATE-2004
|
0
|
COMPLETED ED-HD
|
99
|
*For these two variables you will need to filter twice so as to delete all observations for which both variables take the value of zero. These will be the people we will consider to have no time on their current job and hence not working. Note here that for the sake of simplicity we are assuming away some of the complexity of the actual data set. It is not strictly correct to eliminate some of these people from our data set. Doing so will not significantly affect our results though.
Once you correctly complete this process you will have 4,256 observations. Before you do anything else, save your data set with a unique name. You should always save this version as it is. Save the version you create in Part b under a different name.
b. A number of these variables are categorical variables. Convert the following into dummy variables as instructed:
i. "Head Marital Status" into a single binary variable of married = 1 for married and zero otherwise. Not married is the benchmark.
ii. "Race of Head-Mention 1" into a single binary variable of white or non-white with white as the benchmark. White=1 if race is white and zero otherwise.
iii. "Sex of Head" into a single binary variable of female=1 for female and zero for male. In other words male is the benchmark.
iii. "Highest College Degree Recd-HD" into dummy variables for Associate Degree, Bachelor's Degree and Graduate Degree using no degree (code 0) as your benchmark.
iv. Convert the categorical variable INDUSTRY into a series of dummy variables as indicated in the table below.
Dummy Variable Category
|
Variable Name
|
Codes in Original Data
|
Resources, Utilities and Construction
|
RUC
|
17-77
|
Manufacturing
|
MFCTR
|
107-399
|
Wholesale and Retail Trade, Transport and Warehousing
|
TRDTRSP
|
407-639
|
Information, Finance and Real Estate
|
INFR
|
647-719
|
Professional Services
|
PFSRVCS
|
727-847
|
Hospitality and Entertainment
|
HSPENT
|
856-869
|
Other Services
|
OSRVCS
|
877-929
|
Public Admin and Military
|
PUBADMIN
|
937-987
|
Note that for convenience you are creating a dummy variable for every category. Clearly, you can't include all of them in a regression as they will then be perfectly multi-collinear. Whichever variable we leave out of a regression will be the benchmark for that particular regression.
c. You will also need to combine years and months of employment. To do this, you will need to divide months of employment by 12 and add the resulting decimal number to years of employment. You can complete this step either in Excel or in Eviews. Note that PSID also includes length of employment in days. To get an accurate number for length of employment we would combine years, months and days. For simplicity, we are only using years and months. Label this new variable "Tenure"
d. Name your variables in a logical and intuitive manner (you are welcome to use my names, but are not required to) and compute summary statistics for each. Report only the mean, median and standard deviation for each variable. Report the sample size for your overall data set. For each of your dummy variable, interpret the mean. What is it telling you? The table below provides a check value for the mean some variables.
Variable
|
Mean
|
Variable
|
Mean
|
Variable
|
Mean
|
AGE
|
41.73167
|
INFR
|
|
WHITE
|
|
FEMALE
|
|
PFSRVCS
|
|
ASSOC
|
|
CHILDREN
|
|
HSPENT
|
|
BACHELORS
|
|
MARRIED
|
|
OSRVCS
|
|
GRADUATE
|
|
RUC
|
|
PUBADMIN
|
|
LABOR INCOME OF HD
|
44,256.72
|
MFCTR
|
|
TENURE
|
8.2883
|
WAGE_RATE
|
|
TRDTRSP
|
0.20113
|
WAGES
|
|
YEARS_EDUCATION
|
13.32
|
Save all three versions of your data set. You should have the original data from PSID, the data set that you have cleaned up in part a, and the data set with all the dummy variables as separate files.