STAT6000 Health Research Methods


Scenario:

A researcher randomly recruited a group of 63 children from an elementary school in southwest Western Australia and followed them for a period of 12 years. At the beginning of the study, each child was given an identification number, and their gender, the area where they lived, their daily energy intake, their fibre intake and the time they spent on physical activities were recorded by a research assistant, who compiled all the information into the dataset '2016'. After 12 years, the amount of time these teenagers (the former children) spent playing sports was measured and recorded in the same dataset. The body mass index of the participants was also measured, both as children and as teenagers. All of the children were measured independently of one another. The variables of the dataset are listed in Table 1.

Table 1: Variables and their descriptions as collected in the study

Variable   Label
ID         Identification number of the children
GENDER     Gender of the children (1 = Girl; 2 = Boy)
AREA       Area where the children live (1 = Country; 2 = City)
ENERGY     Daily energy intake of the children (in kJ)
FIBRE      Daily fibre intake of the children (in g)
TIME1      Time the children spent playing sports per day (in min)
TIME2      Time the teenagers spent playing sports per day (in min)
BMIC       Body mass index of the children (in kg/m2)
BMIT       Body mass index of the teenagers (in kg/m2)

Open the 2016 dataset. Use the following questions to guide you as you run the descriptive and inferential statistics and prepare your interpretations or conclusions for the researcher.

It is recommended that you first assign the variable 'labels' and 'values' according to the Table above. This will enable you to read the outputs easily.

1. Which of the following would be appropriate to describe the frequency distribution for gender (boys and girls)?       

a. Frequency and percentage

b. Mean and standard deviation

c. Median and interquartile range

d. Variance and standard error

2. The appropriate statistics to describe the body mass index (BMI) of the children would be: _______

a. The percentage of BMI of the children is 20.37 kg/m2.

b. The range of BMI of the children population is 17.26 kg/m2.

c. The BMI of the children is much lower than the BMI of the teenagers in the population.

d. In this sample, the average BMI of the children is 20.27 kg/m2, with a standard deviation of 3.67 kg/m2.

3. The researcher is interested to know if variable ENERGY has a Normal distribution. Use the following table as a guide.                                                                               

Measure                 Criteria/cut-off points
Histogram               Symmetrical, bell-shaped curve
Boxplot                 Median in the centre of the box, whiskers of equal length at both ends of the box, and no outliers
Normal Q-Q plot         Most observations lie on the straight line
Skewness coefficient    Between -1 and 1
Kurtosis coefficient    Between -1 and 1

[Stata users: subtract 3 from the reported kurtosis coefficient.]
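The two numeric criteria in the table can be sketched in a few lines of Python. This is only an illustration: the function names and the energy values are hypothetical (not the study data), and the simple moment formulas used here can differ slightly from the sample-adjusted coefficients that statistical packages report.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from simple moment formulas."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    # Subtracting 3 gives "excess" kurtosis, so a Normal distribution scores 0.
    excess_kurt = m4 / m2 ** 2 - 3
    return skew, excess_kurt


def within_cutoffs(values, limit=1.0):
    """Check both coefficients against the 'between -1 and 1' criterion."""
    skew, kurt = skewness_and_excess_kurtosis(values)
    return -limit < skew < limit and -limit < kurt < limit


# Hypothetical daily energy intakes in kJ (illustration only).
energy = [4300, 4550, 4700, 4800, 4900, 4950, 5000, 5100, 5250, 5400, 5600]
skew, kurt = skewness_and_excess_kurtosis(energy)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
print("within cut-offs:", within_cutoffs(energy))
```

The subtraction of 3 in the code is the same adjustment the Stata note describes: some packages report raw kurtosis (3 for a Normal distribution), others report excess kurtosis (0 for a Normal distribution).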

a. Do you think transformation is required for variable ENERGY?                               

i. Yes, natural logarithm of the variable should be done and assessment needs to be carried out in full to assess the Normality of the newly transformed variable.

ii. Yes, variable ENERGY has a Normal distribution and should be transformed to ensure the distribution remains Normal.

iii. No, variable ENERGY already has a Normal distribution.

iv. No, skewed variables (including variable ENERGY) should never be transformed.

b. Now that you know the distribution of variable ENERGY, what are the most appropriate measures of centrality and variability to report for it? *Hint: Different measures of centrality and variability need to be reported for data that display a Normal or a skewed distribution.*

i. Mean and standard deviation. The reason is that variable ENERGY has a normal (symmetric) distribution.

ii. Median and interquartile range. The reason is that variable ENERGY does not have a normal distribution but a skewed distribution.

c. A practical interpretation of the measure of variability of variable ENERGY within this sample of southwest Western Australian children for the dietician, as referred to by the 68-95-99.7% rule, would be: _______

i. Approximately 4573.04 to 5291.15 kJ is consumed by approximately 68% of the children in this population.

ii. Approximately 95% of the children in this sample consumed 4213.99 kJ to 5650.20 kJ per day.

iii. In this population, approximately 99% of the children consumed 3854.94 kJ to 6009.25 kJ a day.

iv. The average daily energy intake of the children should be between 4841.67 kJ and 5022.52 kJ in this population, as estimated with 95% confidence.
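As a rough sketch of how the 68-95-99.7% rule turns a mean and standard deviation into ranges, the snippet below uses a hypothetical mean of 5000 kJ and SD of 400 kJ (these are not the study's values). Note that these ranges describe individual observations in a roughly Normal sample; a confidence interval for the mean is a different quantity based on the standard error.

```python
from statistics import NormalDist


def empirical_rule_ranges(mean, sd):
    """Return mean +/- 1, 2 and 3 standard deviations as (low, high) pairs."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Hypothetical summary statistics in kJ (illustration only).
mean_kj, sd_kj = 5000, 400

for k, (low, high) in empirical_rule_ranges(mean_kj, sd_kj).items():
    # Share of a Normal distribution lying within k SDs of the mean.
    coverage = NormalDist().cdf(k) - NormalDist().cdf(-k)
    print(f"mean +/- {k} SD: {low} to {high} kJ (about {coverage:.1%})")
```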

4. The researcher now wants to investigate the levels of energy intake of the boys and girls who spent various amounts of time playing sports per day. You first need to recode the variables ENERGY and TIME1 as follows. *Hint: Give the recoded variables new names and assign value labels to them.*

ENERGY: values to be recoded into the following levels

Code   Level
1      Less than 4500 kJ (< 4500 kJ)
2      Greater than or equal to 4500 kJ but less than 5000 kJ (4500 - 5000 kJ)
3      Greater than 5000 kJ (> 5000 kJ)

TIME1: values to be recoded into the following levels

Code   Level
1      Less than 45 minutes (< 45 min)
2      Equal to or more than 45 minutes (>= 45 min)
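The recoding rules above could be expressed as follows. This is a minimal sketch with hypothetical function and label names. One assumption is made explicit: the table does not say where a value of exactly 5000 kJ belongs, so the code places it in level 3.

```python
def recode_energy(kj):
    """Map daily energy intake (kJ) to the three-level code."""
    if kj < 4500:
        return 1
    if kj < 5000:
        return 2
    # Assumption: exactly 5000 kJ is grouped with "> 5000 kJ",
    # because the recoding table leaves that boundary unassigned.
    return 3


def recode_time(minutes):
    """Map daily sports time (min) to the two-level code."""
    return 1 if minutes < 45 else 2


# Value labels for the recoded variables (names are illustrative).
ENERGY_LABELS = {1: "< 4500 kJ", 2: "4500 - 5000 kJ", 3: "> 5000 kJ"}
TIME_LABELS = {1: "< 45 min", 2: ">= 45 min"}

# Hypothetical child: 4720 kJ per day and 50 minutes of sport.
print(ENERGY_LABELS[recode_energy(4720)], "/", TIME_LABELS[recode_time(50)])
```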

a. Obtain a cross-tabulation with the appropriate statistics for energy intake and the time the children played sports. *Hint: cross-tabulation is for categorical variables.* Which of the following statement(s) is/are appropriate to describe the levels of energy intake of the children who spent less than 45 minutes and of those who spent 45 minutes or more playing sports?

i. Most of the girls (53% of them) spent less than 45 minutes playing sports, while fewer of the boys (45% of them) spent less than 45 minutes playing sports.

ii. There are 32% of the children who consumed between 4500 and 5000 kJ per day spent less than 45 minutes playing sports.

iii. There is not much difference between the percentage of children who spent less than 45 minutes playing sports (54%) and the percentage who spent more than 45 minutes playing sports (46%).

iv. Of the children who consumed more than 5000 kJ, more of them also tend to spend more than 45 minutes playing sports (60%).

b. Assuming the assumptions are met, how would you test if there is any association between the levels of energy intake of children and the levels of time they spent playing sports?                                      

i. Use Pearson Correlation Coefficient, with significance level set at 5% level.

ii. Use Chi-square test, with significance level set at 5% level.

iii. Use an independent samples t-test, with significance level set at 5% level.

iv. None of the above is suitable for this research hypothesis.

c. What can you conclude about the relationship between levels of energy intake and levels of time the children played sports?

i. The chi-square statistic is 12.12 with a p-value of less than 0.05. Assuming the assumptions are met, it can be concluded that there is an association between levels of energy intake and levels of time the children played sports.

ii. The p-value from the test is 0.02. Assuming the assumptions are met, it can be concluded that there is no association between levels of energy intake and levels of time the children played sports when the significance level is set at 5%.

iii. The p-value of the t-test was found to be 0.88. Assuming the assumptions (including the Levene's test) are met, it can be concluded that the energy intake between those who played less than 45 minutes of sports is not significantly different from those who played more than 45 minutes of sports.

iv. None of the above.
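For reference, the mechanics of a chi-square test of association on a 3 x 2 cross-tabulation can be sketched as below. The counts are hypothetical and only chosen to sum to 63; the function names are illustrative. The p-value helper relies on the fact that, for 2 degrees of freedom, the upper-tail probability of the chi-square distribution is exp(-x/2), so it applies only to tables with df = 2.

```python
import math


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df2(stat):
    """Upper-tail p-value, valid only when df = 2."""
    return math.exp(-stat / 2)


# Hypothetical counts: rows = energy levels 1-3, columns = time levels 1-2.
counts = [[12, 6], [10, 9], [8, 18]]
stat, df = chi_square_independence(counts)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value_df2(stat):.3f}")
```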

5. Confidence intervals (CI) are used to estimate the population parameters as it is impossible to reach everyone in the population.

a. Which of the following statements is correct about the estimation of the average time the population of teenagers spent playing sports?

i. The average time the population of teenagers spent playing sports is estimated to be between 19.7 and 24.8 minutes.

ii. We are 95% confident that the mean time the children spent playing sports lies between 19.7 and 24.8 minutes in this population.

iii. The higher the confidence level (e.g. from 90% to 95% to 99%), the more confident we are about capturing the actual population parameter and therefore the corresponding lengths of the CIs tend to be shorter.

iv. None of the above is correct.

b. If the sample size of this study increased from 63 to 630, we would expect: ______

i. The range of values that are captured within the 90% CI, 95% CI and 99% CI to become shorter as we can be more confident about our estimation now with a larger sample size.

ii. The length of the 95% CI to remain the same but the 95% CI is now a more reliable estimation than the 99% CI as the larger sample size warrants a higher level of precision.

iii. The length of the 99% CI to be shorter and be more precise than when the sample size was 63.

iv. Statements 'i' and 'iii' are both correct.
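How the length of a confidence interval responds to the confidence level and the sample size can be explored with a small sketch. It assumes a hypothetical standard deviation of 10 minutes and uses the Normal approximation (z critical values) for simplicity; statistical packages typically use the t distribution, which gives slightly wider intervals for small samples.

```python
import math
from statistics import NormalDist


def ci_half_width(sd, n, confidence):
    """Half-width of a CI for a mean using the Normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * sd / math.sqrt(n)


sd_minutes = 10  # hypothetical standard deviation (illustration only)
for n in (63, 630):
    for level in (0.90, 0.95, 0.99):
        half = ci_half_width(sd_minutes, n, level)
        print(f"n = {n:3d}, {level:.0%} CI: mean +/- {half:.2f} min")
```

Because the standard error is the SD divided by the square root of n, a tenfold increase in sample size shrinks every interval by a factor of about 3.16 (the square root of 10), whatever the confidence level.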

6. The researcher wants to test a research hypothesis that the mean body mass index (BMI) of the teenagers in this population is 22 kg/m2.  

a. The correct hypothesis statement(s) for this research objective would be: ___

i. H0: μ = 22 years old.

ii. H0: μ = 22 kg/m2, H1 (or Ha): μ ≠ 22 kg/m2

iii. Null hypothesis: the mean BMI of the teenagers is 22 kg/m2; Alternative hypothesis: the mean BMI of the teenagers is not 22 kg/m2.

iv. Null hypothesis: the population mean BMI of the teenagers is 22 kg/m2; Alternative hypothesis: the population mean BMI of the teenagers is not 22 kg/m2.

v. Statements 'ii' and 'iii' are both correct.

vi. Statements 'ii' and 'iv' are both correct.

b. The appropriate statistical test to test this hypothesis and the results would be: ___

i. One sample t-test with 5% level of significance; t-value = 3.84, p-value < 0.001.

ii. Two samples (independent samples) t-test with 5% level of significance; t-value = 3.84, p-value < 0.001.

iii. Paired-sample t-test with the 'alpha' set at 5%; t-value = 0.64, p-value = 0.527.

iv. Pearson correlation coefficient with 5% level of significance; r = 0.083, p-value = 0.520.

v. One-way ANOVA with 5% level of significance; t-value = -5.57, p-value < 0.001.

vi. Chi-square test with 5% level of significance; χ2 = 3906, p-value = 0.239.

c. An appropriate conclusion about the research hypothesis would therefore be: ____

i. There is no significant mean difference between the population BMI of the teenagers and the test value, 22 kg/m2, as the p-value is close to zero.

ii. In this population, it is estimated that the mean BMI of the teenagers is 2.21571 kg/m2 less than the hypothesized 22 kg/m2, and therefore the null hypothesis has to be rejected (p<0.05).

iii. The p-value is not much different from the set level of significance. In addition, the 95% confidence interval of the difference does not include the hypothetical value '22', therefore supporting the decision to accept the null hypothesis and conclude that the population mean BMI of the teenagers is 22 kg/m2.

iv. In this population, it is estimated that the mean BMI of the teenagers is significantly 2.21571 kg/m2 higher than the hypothesized 22 kg/m2. In addition, the estimated 95% confidence interval does not include the hypothetical value '22' and therefore the null hypothesis has to be rejected (p<0.05).
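The test statistic behind a one-sample comparison of a mean with a hypothesised value can be sketched as follows. The BMI values are hypothetical and the function name is illustrative; the p-value would normally be read from the t distribution with n - 1 degrees of freedom in your statistical package.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return (mean difference, t statistic, degrees of freedom)."""
    n = len(values)
    diff = mean(values) - mu0            # sample mean minus hypothesised mean
    se = stdev(values) / math.sqrt(n)    # standard error of the mean
    return diff, diff / se, n - 1


# Hypothetical teenage BMI values in kg/m2 (illustration only).
bmi = [21.4, 23.9, 25.1, 22.8, 26.3, 24.0, 23.2, 27.5]
diff, t, df = one_sample_t(bmi, 22)
print(f"mean difference = {diff:.2f}, t = {t:.2f}, df = {df}")
```

The sign of the mean difference shows whether the sample mean lies above or below the hypothesised value, which matters when wording the conclusion.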

7. The researcher now wishes to test the hypothesis that the population mean fibre intake is the same for the boys and girls.  You will test the hypothesis by following the steps of hypothesis testing.

a. State the hypotheses.                                                                                                             

b. State which statistical test you plan to use, and the level of significance (α) you are using.       

c. In addition to 'random sampling' and 'independent observations', state the other two assumptions for the statistical test you decided to use, and test if these two assumptions are met.                               

Assumption      Evidence of assumption being met      Biostatistical remedy if assumption is not met (where applicable)
1. _______      _______                               _______
2. _______      _______                               _______

d. After you run the statistical analyses, what can you conclude about the research hypothesis?                               

i. The test statistic is 2.10, the p-value is 0.04, the 95% CI of the difference is (0.02, 0.93) and does not include '0', suggesting that we have to reject the null hypothesis and conclude that the population mean fibre intake is different between the boys and the girls.

ii. The test statistic is 2.54, the p-value is 0.01, the 95% CI of the difference is (0.90, 7.58) and does not include '0', suggesting that we have to reject the null hypothesis and conclude that the population mean fibre intake is different between the boys and the girls.

iii. The test statistic is 2.44, the p-value is 0.02, the 95% CI of the difference is (0.77, 7.71) and does not include '0', suggesting that we have to reject the null hypothesis and conclude that the population mean fibre intake is different between the boys and the girls.

iv. The test statistic is 2.10, the p-value is 0.98, the 95% CI of the difference is (0.02, 0.93) and does not include '0', suggesting that we have to accept the null hypothesis and conclude that the population mean fibre intake is the same between the boys and the girls.

v. None of the above is correct.
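Statistical packages usually print two versions of the independent samples t-test: one that assumes equal variances (pooled) and one that does not (Welch). A minimal sketch of both, using hypothetical fibre intakes and illustrative function names, shows why the two rows of output can differ:

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Equal-variances t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2


def welch_t(a, b):
    """Unequal-variances t statistic and Welch-Satterthwaite df."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical daily fibre intakes in g (illustration only).
boys = [18.2, 21.5, 25.0, 19.8, 23.1, 27.4]
girls = [15.9, 17.2, 20.3, 16.8, 18.5, 19.0]
print("pooled:", pooled_t(boys, girls))
print("Welch: ", welch_t(boys, girls))
```

Which row to report depends on the equal-variances check (for example Levene's test), which is one of the assumptions asked for in part c.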

8. The researcher wants to know if the average time the children played sports (in minutes) is the same as the average time they spent playing sports (in minutes) when they became teenagers.

a. State the hypotheses.                                                                                                             

b. State which statistical test you plan to use, and the level of significance (α) you are using.                       

c. Assuming the assumptions for the statistical test you chose to do are met, what can you conclude about the research hypothesis after you run the statistical analyses?             

i. It is found that the r-value is -0.27 and the p-value is 0.03, suggesting that the time the children spent playing sports is only mildly related to the time the teenagers spent playing sports.

ii. The mean difference between the time the children played sports and the time the teenagers played sports is 22.16 minutes. The t-value is 14.07, p-value is <0.001, 95% CI of the difference is (19.01, 25.31) minutes and does not include '0', suggesting that, in this population, there is a significant difference between the time the children spent playing sports and the time they played sports when they became teenagers on average.

iii. The mean difference between the time the teenagers played sports and the time they played sports while they were children is -22.16 minutes. The t-value is -14.07, p-value is <0.001, 95% CI of the difference is (-25.31, -19.01) minutes and does not include '0', suggesting that, in this population, there is a significant difference between the time the children spent playing sports and the time they played sports when they became teenagers.

iv. Only statement 'i' is incorrect.
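A paired comparison works on the within-person differences, so the order of subtraction only flips the sign of the results. The sketch below uses hypothetical minutes for five participants and an illustrative function name:

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (mean difference, t statistic, df) for first minus second."""
    diffs = [x - y for x, y in zip(first, second)]
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs), mean(diffs) / se, len(diffs) - 1


# Hypothetical minutes of sport per day for the same five people.
as_children = [50, 42, 61, 38, 47]
as_teenagers = [30, 25, 33, 20, 28]

print("child - teen:", paired_t(as_children, as_teenagers))
print("teen - child:", paired_t(as_teenagers, as_children))
```

Both calls describe the same evidence: the mean difference and t statistic have the same magnitude with opposite signs.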

9. The researcher wishes to know if population mean energy intake amongst the children is related to the BMI of the teenagers.

a. Assuming all the assumptions are met, the appropriate statistical analysis would be____________

i. One sample t-test with 5% level of significance.

ii. Two samples (independent samples) t-test with 5% level of significance.

iii. Paired-sample t-test with the 'alpha' set at 5%.

iv. Pearson's correlation coefficient with 5% level of significance.

v. One-way ANOVA with 5% level of significance.

vi. Chi-square test with 5% level of significance.

b. Based on the analyses you conducted, is there any relationship between energy intake and BMI of the teenagers?

i. Yes, the p-value is larger than 0.05 from the one sample t-test so we can conclude that there is a relationship between energy intake and BMI of the teenagers.

ii. The p-value of the independent samples t-test is in agreement with the 95% CI of population mean difference ('0' is included in the 95% CI), suggesting that there is no relationship between energy intake and BMI of the teenagers.

iii. Yes, the p-value from the paired-sample t-test is p<0.001, suggesting that there is a significant relationship between energy intake and the BMI of the children and the teenagers.

iv. The correlation coefficient (-0.09) indicates that there is a weak negative linear relationship between the children's energy intake and BMI of the teenagers, suggesting that there is no significant linear relationship between energy intake and the BMI of the teenagers in this population (p = 0.506).

v. The p-value of the one-way ANOVA test is 0.94, suggesting that there is no significant relationship between children's energy intake and BMI of the teenagers in this population.

vi. The chi-square statistic is 126 with a p-value of 0.43, suggesting that there is no relationship between energy intake and BMI of the teenagers in this population.
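For two continuous variables, the strength of a linear relationship is summarised by Pearson's r, and its significance is tested with a t statistic on n - 2 degrees of freedom. A minimal sketch, with hypothetical data and illustrative function names:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic and df for testing that the population correlation is 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2)), n - 2


# Hypothetical energy intake (kJ) and teenage BMI (kg/m2), illustration only.
energy = [4300, 4700, 4900, 5100, 5400, 5600]
bmi_teen = [24.1, 22.5, 25.3, 23.0, 24.8, 22.9]
r = pearson_r(energy, bmi_teen)
print(f"r = {r:.3f}, t and df = {r_to_t(r, len(energy))}")
```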

10. The researcher wishes to test if the BMI of the children varies across the three levels of daily energy they consumed.

a. The appropriate test to use would be ___________ 

i. One sample t-test with 5% level of significance.

ii. Two samples (independent samples) t-test with 5% level of significance.

iii. Paired-sample t-test with the 'alpha' set at 5%.

iv. Pearson correlation coefficient with 5% level of significance.

v. One-way ANOVA with 5% level of significance.

vi. Chi-square test with 5% level of significance.

b. What can you conclude about this research hypothesis?                         

i. The t-statistic is 109.03, p-value is <0.001, the 95% CI is (4841.67, 5022.52) kJ and does not include '0', suggesting that in this population, there is a significant difference between the BMI of the children and their daily energy intake.  

ii. The p-value of the multiple comparison groups is larger than 0.05, suggesting that the null hypothesis should be accepted.

iii. The t-statistic is 108.64, p-value is <0.001, the 95% CI is (4821.45, 5002.20) kJ and does not include '0', suggesting that in this population, there is a significant difference between the BMI of the children and their daily energy intake.

iv. The r-value is 0.06, p-value is 0.640, suggesting that there is no strong variation between the BMI of the children in this population and their daily energy intake.

v. The F test-statistic is 1.08, p-value is 0.347, suggesting that there are no significant differences between the population mean BMI of the children who had different levels of energy intake in this population.

vi. The χ2 is 124, p-value = 0.433, suggesting that there are no significant differences between the levels of energy intake and the BMI of the children in this population.
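When a continuous outcome is compared across three or more groups, the F statistic is the ratio of between-group to within-group variability. The sketch below uses hypothetical BMI values for three energy-intake levels and an illustrative function name:

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical child BMI (kg/m2) by recoded energy level (illustration only).
level_1 = [19.5, 21.0, 18.7, 22.3]
level_2 = [20.1, 19.4, 23.0, 21.6]
level_3 = [20.8, 22.9, 19.9, 24.1]
print(one_way_anova_f([level_1, level_2, level_3]))
```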

11. The researcher wants to know if the time the children spent playing sports is related to the area where they lived as children. Note that two different statistical tests can be used to test this hypothesis, so you need to decide whether to use the outcome variable in its continuous form or as the recoded categorical variable. State the variables clearly in the hypothesis, name the one statistical test you plan to carry out and the significance level, justify your choice of test and significance level, state the conclusion and provide the output (0.5 mark). You can assume that the assumptions for the test you choose are met.

12. Lastly, the researcher wants to test if the BMI (kg/m2) of the teenagers can be predicted by the time these teenagers spent playing sports (in minutes), the time they spent playing sports as children (in minutes), and their BMI (kg/m2) when they were children. *Hint: you only need to consider one of the independent variables that is significant.* You will need to (i) state the statistical approach you plan to carry out to answer the stated research question, including the significance level, (ii) justify your choice (1 mark) of statistical approach and significance level, (iii) state the regression equation that the researcher can use to predict mean values of BMI of the teenagers, (iv) interpret the estimated regression coefficient, (v) comment on the fit of the regression model and (vi) provide the output. You can assume that the assumptions for the statistical approach you choose are met.
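With a single predictor, as the hint suggests, the regression equation has the form predicted BMIT = intercept + slope x predictor. A minimal least-squares sketch with hypothetical data (the choice of predictor and all values here are illustrative, not a result from the dataset):

```python
def simple_linear_regression(x, y):
    """Return (intercept, slope, R-squared) from ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-squared: share of the outcome's variation explained by the model.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical predictor and outcome values in kg/m2 (illustration only).
predictor = [17.5, 19.0, 20.2, 21.8, 23.1, 25.4]
bmi_teen = [21.0, 22.4, 23.9, 24.1, 26.5, 28.0]
b0, b1, r2 = simple_linear_regression(predictor, bmi_teen)
print(f"predicted BMIT = {b0:.2f} + {b1:.2f} * predictor (R-squared = {r2:.2f})")
```

The slope is read as the expected change in mean teenage BMI for a one-unit increase in the predictor, and R-squared is one common way to comment on the fit of the model.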
