Project: Multiple Linear Regression
Use the dataset m2project2016.dta for this project. The dataset is a sample from the 1999-2002 National Health and Nutrition Examination Survey (NHANES). A race-stratified random sample was selected from a complete dataset for the variables included, and then modified to remove missing data. The random sampling based on race was set up to result in a race distribution similar to that of the US population. Also, variable names have been modified (so you won't find the variable names on the NHANES website), and in some cases continuous variables have been categorized, or categories combined for categorical variables. The data are restricted to participants with no reported history of heart disease and no reported use of prescription drugs for hyperlipidemia or inflammation. The dietary intakes are based on a 24-hour diet recall interview.
The objective is to examine the nature of the relationship between fish intake and blood HDL-cholesterol levels (if any), and evaluate the confounding influence of other variables. Submit the following items for this project using the Dropbox:
1. A project report organized according to the steps listed below. The insertion of program output into the report should be limited and clearly tied to the discussion of results. Figures (graphs) can be inserted into the report document or posted as separate files. All tables should be incorporated into the report document.
2. A program file with all the program commands used to complete the project (do-file in Stata). The commands should be ordered according to the list of tasks below and comments included describing the data analysis step the commands are used for.
Data Analysis Steps:
1. Create a new variable that categorizes blood HDL into quartiles: <=40 mg/dL, 41 to 48 mg/dL, 49 to 59 mg/dL, and >59 mg/dL. To make sure you get the proper HDL categories, you can use the following Stata code:
gen hdlcat = 1 if hdl>0 & hdl<=40
replace hdlcat = 2 if hdl>40 & hdl<=48
replace hdlcat = 3 if hdl>48 & hdl<=59
replace hdlcat = 4 if hdl>59 & hdl<.
2. Conduct descriptive analyses and bivariate association tests between HDL quartiles and other variables so as to complete Tables 1 and 2. Note that the categorized variable for blood HDL needs to be used for Tables 1 and 2, but the continuous HDL variable is used for the multiple regression analyses. Conduct the appropriate tests to evaluate bivariate associations of characteristics with HDL quartile (as a categorical variable). When the characteristic is numerical (e.g., BMI), test for linear and quadratic trends across the HDL quartiles. See the completed analyses for fish consumption and race in Table 1 below as examples.
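The tests in step 2 might be run along the following lines. This is only a sketch: it assumes hdlcat has been created as in step 1, and "race" is a placeholder because the dataset's actual name for the race/ethnicity variable is not listed in this handout.

```stata
* Assumes the project data are in memory and hdlcat exists (step 1).

* Numerical characteristic: mean, SD, and n of BMI within each HDL quartile
tabstat bmi, by(hdlcat) statistics(mean sd n)

* One-way ANOVA F-test, followed by orthogonal polynomial contrasts
* across the four quartiles (linear, quadratic, and cubic terms)
anova bmi hdlcat
contrast p.hdlcat, noeffects

* Categorical characteristic: column percentages and chi-square test.
* "race" is a placeholder variable name.
tabulate race hdlcat, column chi2
```

The column option reports percentages within each HDL quartile, which is the layout used for the race rows of Table 1.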
3. Create new variables to mean-center the following continuous variables: age, bmi, tkcal, tprot, tcarb, tchol, tfat, tfibe, tvc, tsele, tg, and ldl. Keep the original variables.
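One way to do the centering in step 3 is a short loop, sketched below. The c_ prefix for the new variables is an assumption made for illustration; any naming scheme that keeps the originals intact would work.

```stata
* Mean-center each continuous variable; the originals are left unchanged.
* The c_ prefix is a hypothetical naming choice.
foreach v of varlist age bmi tkcal tprot tcarb tchol tfat tfibe tvc tsele tg ldl {
    quietly summarize `v', meanonly
    generate double c_`v' = `v' - r(mean)
}

* Check: each centered variable should have a mean of essentially zero
summarize c_*
```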
4. Use a saturated model with no interactions (a saturated model includes all predictors) and perform an initial evaluation of collinearity using the variance inflation factor.
Saturated Model
Dependent Variable: HDL
Predictors: gender, centered age, race, education, centered BMI, centered dietary energy, centered dietary protein, centered dietary carbohydrate, centered dietary cholesterol, centered dietary fat, centered dietary fiber, centered dietary vitamin C, centered dietary selenium, alcohol, physical activity, dietary fish, centered blood triglyceride, centered blood LDL, smoking status
Identify the predictors that, when removed from the model, eliminate the collinearity. First remove predictors other than tfish, gender, age, bmi, pactive, and smoke to resolve collinearity (these predictors are in the first group for the Allen-Cady procedure; see below). Establish a model containing as many of the original variables as possible that does not exhibit collinearity. This model is called model 1.
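The collinearity check in step 4 could be sketched as follows. The sketch assumes centered variables named with a c_ prefix and uses race, educ, and alcohol as placeholder names, since those variable names are not listed in this handout. Dropping dietary energy in the second fit is purely an illustration of the refit-and-compare cycle, not a statement about which predictor causes the problem.

```stata
* Saturated model with no interactions.
* race, educ, alcohol, and the c_ prefix are placeholder names.
regress hdl i.gender c_age i.race i.educ c_bmi c_tkcal c_tprot c_tcarb ///
    c_tchol c_tfat c_tfibe c_tvc c_tsele i.alcohol i.pactive tfish     ///
    c_tg c_ldl i.smoke
estat vif

* Refit without one suspect predictor (c_tkcal here, for illustration only)
* and compare the variance inflation factors
regress hdl i.gender c_age i.race i.educ c_bmi c_tprot c_tcarb ///
    c_tchol c_tfat c_tfibe c_tvc c_tsele i.alcohol i.pactive tfish ///
    c_tg c_ldl i.smoke
estat vif
```

A VIF above about 10 is a commonly used rule of thumb for problematic collinearity; use whatever threshold the course materials specify.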
5. Use model 1 from step 4 above and evaluate the linearity assumption for BMI, age, fish consumption, and blood triglyceride. There should be evidence of nonlinearity for two of these four predictors. Create quadratic terms for the two variables and include them in model 1. Then re-evaluate linearity for the two variables (just the linear terms). Include in the report a description of how linearity was assessed and any graphs and relevant statistical output used both before and after inclusion of quadratic terms in the model. The model from this step is called model 2.
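A minimal sketch of the linearity check in step 5 is shown below, using centered BMI only as an example; it does not imply that BMI is one of the two nonlinear predictors. The predictor list stored in model1 and the c_ variable names are placeholders.

```stata
* Placeholder for whatever predictor list model 1 ends up containing
local model1 i.gender c_age c_bmi i.pactive i.smoke tfish c_tg

regress hdl `model1'
predict double res1, residuals

* Residuals against one predictor with a lowess smooth;
* a clearly curved smooth suggests nonlinearity
twoway (scatter res1 c_bmi, msize(tiny)) (lowess res1 c_bmi), ///
    yline(0) name(lin_before, replace)

* Add a quadratic term, test it, and repeat the plot
generate double c_bmi2 = c_bmi^2
regress hdl `model1' c_bmi2
test c_bmi2
predict double res2, residuals
twoway (scatter res2 c_bmi, msize(tiny)) (lowess res2 c_bmi), ///
    yline(0) name(lin_after, replace)
```

Run the block as a whole from a do-file, because local macros such as model1 do not persist between separately executed selections.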
6. Use model 2 from step 5 above and evaluate the normality and homogeneity of variance assumptions for the dependent variable. A problem with normality and homogeneity of variance should be found that is mostly fixed by transforming the dependent variable (use the natural log). Include in the report a description of how the assumptions are evaluated and include any graphs and relevant statistical output used. The model from this step is called model 3.
Normally, the linearity for numerical variables would be rechecked (BMI, age, fish consumption, and blood triglyceride). This step will be skipped for the project. The two variables whose nonlinearity was fixed by inclusion of quadratic terms remain fixed when the transformed HDL is used as the dependent variable instead of HDL.
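The checks in step 6 could look something like the sketch below, run before and after the log transformation. The contents of model2 and the name lnhdl are placeholders chosen for illustration.

```stata
* Placeholder for the predictor list of model 2
local model2 i.gender c_age c_bmi c_bmi2 i.pactive i.smoke tfish c_tg

* Untransformed HDL
regress hdl `model2'
* Residuals vs fitted values: a fan shape suggests unequal variance
rvfplot, yline(0) name(rvf_raw, replace)
* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest
* Normal quantile plot of the residuals
predict double res_raw, residuals
qnorm res_raw, name(qn_raw, replace)

* Natural-log-transformed HDL (model 3)
generate double lnhdl = ln(hdl)
regress lnhdl `model2'
rvfplot, yline(0) name(rvf_log, replace)
estat hettest
predict double res_log, residuals
qnorm res_log, name(qn_log, replace)
```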
7. Use the DFBETA statistic with model 3 (log HDL) to evaluate and document the presence of influential data, but do not delete any data. Make a list of the influential observations (ID numbers and associated influence statistics; graphs are also useful).
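A sketch of the DFBETA step for the fish coefficient follows. The variable name id, the contents of model3, and the cutoff of 2/sqrt(n) are assumptions: the cutoff is a widely used size-adjusted convention, so substitute the one given in the course materials if it differs.

```stata
* Placeholder for the predictor list of model 3; id is a placeholder
* for the observation identifier in the dataset.
local model3 i.gender c_age c_bmi c_bmi2 i.pactive i.smoke tfish c_tg

regress lnhdl `model3'

* DFBETA for the fish consumption coefficient
predict double dfb_tfish, dfbeta(tfish)

* Conventional size-adjusted cutoff (an assumption, see the note above)
local cut = 2/sqrt(e(N))

* List and plot observations exceeding the cutoff; nothing is deleted
list id dfb_tfish if abs(dfb_tfish) > `cut' & !missing(dfb_tfish)
scatter dfb_tfish id, yline(`cut' -`cut') name(dfb_fish, replace)
```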
8. Use the transformed form of HDL and fish consumption (with or without extra terms for nonlinearity, as determined in step 5) and examine the confounding influence of each covariate (again with or without extra terms for nonlinearity) on the association between fish consumption and blood HDL. Do this using regression models for HDL containing only fish consumption and one potential confounder at a time. Complete Table 3 and describe the findings.
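The percent change in the fish coefficient for one potential confounder could be computed as sketched below. The names lnhdl and c_age are placeholders for the log-transformed outcome and centered age.

```stata
* Fish-only model: save the crude coefficient
regress lnhdl tfish
scalar b_crude = _b[tfish]

* Fish plus one potential confounder (centered age as an example)
regress lnhdl tfish c_age
scalar b_adj = _b[tfish]

* Percent change in the fish coefficient, as defined in Table 3 footnote a
display "Percent change in b: " %5.1f 100*(b_adj - b_crude)/b_crude

* The same arithmetic applied to the two coefficients already shown in Table 3
display %5.1f 100*(0.554 - 0.551)/0.551
```

The last line works out to about 0.5, which matches the Age row of Table 3.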
9. Use the Allen-Cady Modified Backwards Selection procedure with Model 3 (log HDL) to reduce the number of predictors in the regression model. For the first group of predictors that are always in the models, use the following predictors: total fish consumed, gender, centered age, centered BMI, physical activity, and smoker status (exclude any of these variables that induce collinearity, except the predictor of interest, total fish consumption). These predictors were chosen for the first group based on one of two criteria: 1) a predictor of interest (total fish consumption), or 2) some documentation in the literature for an association with blood HDL. There may be documentation for some of the other predictors having an association with blood HDL, but for this project the predictors previously listed will be used to simplify the possible final models. The ranking of covariates that is required for the second group is left to each student to perform independently. Then carry out the backward selection using p=0.1 as the retention criterion. Use Table 4 to present the stepwise results and use Table 5 to summarize the model resulting from the backward selection (include all predictors from both groups and add rows to the table or modify as needed). Write a summary that includes interpretations of the regression coefficients in terms of the association between predictor and blood HDL-cholesterol (see page 129 of VGSM for interpreting coefficients when the dependent variable is log transformed). Evaluate any ordinal predictors in the model for trends (linear and quadratic when justified), and if there is no trend, adjust p-values for multiple comparisons for multilevel categorical variables.
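The selection in step 9 can be carried out by hand, one refit at a time, as in the sketch below. The ranking shown for the second group is hypothetical (each student supplies their own), and lnhdl and the c_ names are placeholders.

```stata
* First group: always retained
local group1 tfish i.gender c_age c_bmi i.pactive i.smoke

* Step 1: first group plus the second group, listed from most to least
* important (hypothetical ranking). Test the lowest-ranked covariate.
regress lnhdl `group1' c_tg c_ldl c_tfibe c_tkcal
testparm c_tkcal

* Step 2: only if the p-value above exceeded 0.1, drop that covariate
* and test the next lowest-ranked one.
regress lnhdl `group1' c_tg c_ldl c_tfibe
testparm c_tfibe

* Continue in the same way and stop at the first covariate with p <= 0.1;
* the model at that step is the selected model.

* With ln(HDL) as the outcome, the percent difference in HDL per one-unit
* increase in a predictor is 100*(exp(b) - 1)
display %6.2f 100*(exp(_b[tfish]) - 1)
```

For a multilevel categorical covariate, testparm with the factor-variable form (for example, testparm i.race with a placeholder name) tests all of its indicator terms jointly.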
10. Using the regression model selected by the Allen-Cady Modified Backwards Selection procedure, evaluate the interaction between fish consumption and gender. Use Table 6 to summarize the final model with the interaction added (add rows to the table or modify as needed), and write a summary that includes interpretations of the regression coefficients for the interaction in terms of the association between predictor and blood HDL-cholesterol. Make a graph that illustrates the interaction (or absence of interaction). Indicate whether the inclusion of the interaction modified any association of HDL-cholesterol with the other predictors in the model.
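The interaction in step 10 might be fit and graphed as in the sketch below. The adjustment list in adj, the name lnhdl, and the plotted range of 0 to 30 fish meals are all assumptions chosen for illustration.

```stata
* Placeholder for the other predictors retained by the selection procedure
local adj c_age c_bmi i.pactive i.smoke

* Main effects of fish and gender plus their interaction
regress lnhdl c.tfish##i.gender `adj'

* Test of the interaction term(s)
testparm c.tfish#i.gender

* Predicted ln(HDL) by gender across a range of fish intake
margins gender, at(tfish=(0(5)30))
marginsplot, xtitle("Fish meals per 30 days") ytitle("Predicted ln(HDL)")
```

Roughly parallel lines in the plot suggest no interaction; lines that diverge or cross suggest that the fish association differs by gender.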
Table 1. Characteristics of the study sample by Blood HDL Quartiles

| Characteristic | HDL <= 40 mg/dL (n=439): Mean or % | SD | HDL 41 to 48 mg/dL (n=458): Mean or % | SD | HDL 49 to 59 mg/dL (n=433): Mean or % | SD | HDL > 59 mg/dL (n=436): Mean or % | SD | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Fish Consumption (meals/30 days) | 1.9 | 3.4 | 1.5 | 3.1 | 2.1 | 3.0 | 3.0 | 4.6 | < 0.001 (a); 0.002 (c) |
| Age (years) | | | | | | | | | |
| BMI (kg/m2) | | | | | | | | | |
| Gender (% female) | | | | | | | | | |
| Smoker (% yes) | | | | | | | | | |
| Race/Ethnicity (%) | | | | | | | | | 0.005 (d) |
| White | 63.8 | | 58.3 | | 61.2 | | 65.1 | | |
| Black | 8.9 | | 13.1 | | 15.2 | | 14.9 | | |
| Hispanic | 27.3 | | 28.6 | | 23.6 | | 20.0 | | |
| Physical Activity (%) | | | | | | | | | |
| Low | | | | | | | | | |
| Low-Moderate | | | | | | | | | |
| High-Moderate | | | | | | | | | |
| High | | | | | | | | | |
| Education Level (%) | | | | | | | | | |
| Less than HS | | | | | | | | | |
| HS/GED | | | | | | | | | |
| Some college | | | | | | | | | |
| College or more | | | | | | | | | |

a. ANOVA F-test.
b. Test for linear trend after ANOVA.
c. Test for quadratic trend after ANOVA.
d. Chi-square test.

Table 2. 24-Hour diet intake profile of the study sample by Blood HDL Quartile

| Dietary Factor | HDL <= 40 mg/dL (n=439): Mean or % | SE | HDL 41 to 48 mg/dL (n=458): Mean or % | SE | HDL 49 to 59 mg/dL (n=433): Mean or % | SE | HDL > 59 mg/dL (n=436): Mean or % | SE | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Energy (kcal) | | | | | | | | | |
| Protein (gm) | | | | | | | | | |
| Carbohydrate (gm) | | | | | | | | | |
| Fat (gm) | | | | | | | | | |
| Cholesterol (gm) | | | | | | | | | |
| Fiber (gm) | | | | | | | | | |
| Vitamin C (mg) | | | | | | | | | |
| Selenium (mcg) | | | | | | | | | |
| Alcohol (% yes) | | | | | | | | | |

a. ANOVA F-test.
b. Test for linear trend after ANOVA.
c. Test for quadratic trend after ANOVA.
d. Chi-square test.

Table 3. Confounding Influence of Covariates on Fish Consumption Regression Coefficient

| Potential Confounder | b | % Change in b (a) | p-value (b) |
|---|---|---|---|
| None | 0.551 | | < 0.001 |
| Age | 0.554 | + 0.5 | < 0.001 |
| BMI | | | |
| Gender | | | |
| Smoker | | | |
| Race/Ethnicity | | | |
| Physical Activity | | | |
| Education Level | | | |
| Dietary Energy | | | |
| Dietary Protein | | | |
| Dietary Carbohydrate | | | |
| Dietary Fat | | | |
| Dietary Cholesterol | | | |
| Dietary Fiber | | | |
| Dietary Vitamin C | | | |
| Dietary Selenium | | | |
| Alcohol | | | |

a. (b(confounder) - b(fish))/b(fish) as %, where b(fish) is the fish coefficient from the model with no confounder and b(confounder) is the fish coefficient from the model that also contains the potential confounder.
b. P-value for the fish beta coefficient in a model also containing the potential confounder.

Table 4. Allen-Cady Procedure Results

| Predictor in Rank Order (a) | Coefficient Estimate P-Value (b): Step 1 | Step 2 | Step 3 | Step 4 |
|---|---|---|---|---|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |

a. Most important to least important.
b. P-value for the beta coefficient of each predictor at each step in the backward selection, with p=0.1 as the retention criterion.

Table 5. Regression model for the association of blood HDL-cholesterol with fish consumption adjusting for confounding by demographic characteristics and dietary factors.

| Predictor | b | 95% CI | p-value |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |

Table 6. Regression model for the association of blood HDL-cholesterol with fish consumption and the interaction with gender, with adjustment for confounding by demographic characteristics and dietary factors.

| Predictor | b | 95% CI | p-value |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |