What does the p-value tell you in statistical significance


Assignment

Background:

This assignment will put into practice some skills you have developed in the BIOM2020 BioStats classes and is designed to help you cement basic anatomical data analysis and interpretation skills. This includes data entry, analysis, p-value interpretation, and drawing of scientific inferences. These skills are foundational to science and can be utilized for any data analysis you undertake in the future, including cross-platform approaches.

General Instructions:

• This assignment requires you to work through the exercises described below and submit two files: a .doc file of your R console history AND an .html web report (also known as a notebook). These notebooks are a very popular way of sharing your code with other researchers (for discussion see: "Interactive Notebooks" Nature, 2015, Volume 515, Pages 151-2).

You will work iteratively in the base R program, keeping a record of your rough console work and maintaining a separate text file of your working code. That code can then be cut and pasted into the pre-supplied RStudio Markdown template (.rmd) to generate a notebook, just as real scientists do!

• To complete the assignment simply work sequentially through the exercises below. To avoid problems, it is recommended that you use the base R program on a Windows platform, before moving to RStudio (also in Windows) to generate your notebook.

• DO ALL of your rough work iteratively in the base R console. Save out ALL of your console work after each work session, as a text file, using R's "Save to File..." option from the "File" Menu.

When you have completed your R coding, compile your rough work into a single Word document, in sequential order, beginning each exercise on a new page, and submit electronically using Turnitin. Your Word document should be under 460 pages. If it is longer, ensure that you have not repeated exercises multiple times in your rough work. Name your rough work file using the following convention: "Lastname_student#_BioStats2018.doc". Note: rough work is only required for exercises requiring R code. General written responses - marked with an * - do not require any rough work.

• Maintain a separate text file of your working code. After you have worked through the entirety of this assignment, cut and paste your working code into the pre-supplied RStudio markdown template (.rmd file, available for download from Blackboard).

• Open the .rmd template in RStudio and replace the regions marked by the square brackets "[ ]" with your information / working code as instructed. E.g., where it asks for "Student Last Name: [insert last name here]" replace "[insert last name here]" with your last name. That is, if your last name was "Bloggs", the updated markdown template would read: "Student Last Name: Bloggs". Remove any "dev.new()" or "dev.off()" function calls from your code to ensure a valid .rmd template knit.

• After you have populated the markdown template, click "Knit HTML" from the RStudio File Menu or Menu bar. The .html document will then be generated and presented in a new window. If your R code worked, you should see valid entries/graphs for each exercise, including plots. The knitted .html file will be automatically saved to your working directory. Do not, under any circumstances, click the "PUBLISH" icon in RStudio.

• Electronically submit your .doc rough work (with an eCoversheet) and your .html notebook on the Blackboard "Assessment" page using the corresponding submission links.

Aim: To compare measured body statures to self-reported body statures on Qld drivers' licences.

EXERCISE 1 - Hypothesis Formulation & Statistical Principles

a) Formulate a scientific hypothesis relevant to the above aim. Describe in your own words:

b) How does a scientific hypothesis differ from the various statistical hypotheses (i.e., alternate and null hypotheses)?

c) How do statistical hypotheses differ between Fisherian and Neyman-Pearson statistical tests?

d) What does the p-value tell you in statistical significance testing?

e) What is the difference between practical and statistical significance?

EXERCISE 2 - Basic Data Screening

Undertake this exercise in Microsoft Excel (as available on UQ computers).

a) Open the file "ClassData2018.xlsx" in Excel.

In cells B2 and B3 you will see formulae that calculate the mean and standard deviation for column B.

b) Enter similar formulae for finding the minimum and maximum values in cells B4 and B5 respectively. Use Excel® help to find these formulae as needed.

c) Click cell B2 and, whilst holding the shift key, B5. B2 to B5 should now be selected. Copy these cells and paste to neighbouring cells in columns C-F.

You should now have mean, standard deviation (SD), minimum (min), and maximum (max) values for each column at the top of the data sheet.

d) Because heights are meant to be recorded in millimetres, Seca, Chad and DL columns should contain cells with four digits only and no decimal places. Using the summary statistics you just calculated, identify the incorrect data entries in the rows below them and correct these typos. (NB. Normally original data records should also be re-examined to confirm typos, but we will skip that here due to logistical complexities.)

There are 10 errors in total to be fixed across these three columns. Did you find them all? You can check by re-examining the summary statistics to ensure they make sense. If not, go back to d) and find the remaining errors.

If yes, examine the Alog data...

e) Is it legitimate to report analogue scale data to one decimal place as some students have done for this data? If not what precision is reasonable and why?

Let's leave the decimal places for now. We will come back to them for the alog data in R shortly.

f) Examine the Alog summary statistics to identify any typos for these data (not counting decimal places), then make the necessary corrections (here you can use the dig data to help guide your judgement).

g) Examine the dig summary statistics to identify any typos for these data, then make the corrections in the data (here you can use the Alog data to help guide your judgement). Which cell needed adjustment?

You should now have corrected two data errors in the mass data. Now your data should be typo free.

h) Delete the top six rows of the data sheet, and save as a comma delimited file (.csv) named "ClassData2018.csv" to a working directory of your choice, ready for import to R.

EXERCISE 3 - R Data Entry

Open the base R console.

a) Without generating an error message, type into the R console "#EXERCISE 3".

b) Appropriately set your working directory.

c) Load the "ClassData2018.csv" file and assign it the object name "data2018". DO NOT make any changes to the data beyond those you already made in Exercise 2. Simply load the data and follow the instructions below.

d) Use the class() function to ensure your data table is of the class "data frame".

e) If you have not already done so, call "data2018" to the R console so that you can see the data table. Did you remember to set the header option to FALSE in the file load call? If not, repeat steps c) to e).

f) Create a new vector called "headers" and using the concatenate function, assign it the column names in sequence listed in "ClassData2018.xlsx".

g) Now, use the colnames() function to assign the corresponding elements of the headers vector to the data2018 data.frame.

h) Call back data2018 to check your commands worked and your R data now has the correct column headers.
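
As an illustration of the load-and-label pattern in this exercise, here is a minimal sketch on a throwaway file. The temporary file, the object name "toy" and the column names "A", "B", "C" are hypothetical stand-ins, not the class data or its real headers.

```r
# Toy example only: the file, object and column names are hypothetical.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1701,1699,1700", "1655,1657,1650"), tmp)

# header = FALSE tells read.csv() that the first row is data, not column names
toy <- read.csv(tmp, header = FALSE)
class(toy)

# Build a character vector of names with c(), then assign it to the columns
headers <- c("A", "B", "C")
colnames(toy) <- headers
toy
```

With header = FALSE, R assigns placeholder names (V1, V2, ...) until colnames() replaces them.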

EXERCISE 4 - Exploratory Data Analysis

a) Without generating an error message, type into the R console "#EXERCISE 4".

b) Write an R function call to generate a scatter plot for the Seca vs Chad data. Do not use any dev.off() or graphics.off() calls. Add a line through the origin using the abline() function.

Identify and note the two outliers below the main data cloud, representing differences of > 95 mm between the two measurements.

c) What is likely responsible for the differences here? Is it a problem with instrument, subject or practitioner error?

d) Using the which() function, identify the rows in data2018 corresponding to these outlying points. Seek R help if necessary using ?which().

e) Assign the result the object name "del2018". (Re type the command if necessary.)

f) Now, use this object to delete the two rows in the table that hold these outlying values, remembering to overwrite the old data2018 with the new values.

g) Use the length() function to calculate the new sample size of data2018. The number of individuals in data2018 should now be 217.

h) Regenerate the scatter plot for the Seca vs Chad data. Do not use any dev.off() or graphics.off() calls.
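
The plot, identify and delete sequence in this exercise can be sketched on made-up numbers. Everything below (the "toy" data frame, its columns "m1" and "m2", and the object "drop_rows") is a hypothetical stand-in for the class data, and the 95 mm cut-off is borrowed from the text purely for illustration.

```r
# Toy data: hypothetical paired measurements in mm; rows 3 and 5 are deliberately off.
toy <- data.frame(m1 = c(1700, 1650, 1800, 1720, 1600),
                  m2 = c(1702, 1648, 1680, 1719, 1490))

plot(toy$m1, toy$m2, xlab = "m1 (mm)", ylab = "m2 (mm)")
abline(a = 0, b = 1)   # intercept 0, slope 1: a line of equality through the origin

# which() returns the positions where a logical condition is TRUE
drop_rows <- which(abs(toy$m1 - toy$m2) > 95)

# Negative indices drop those rows; reassigning overwrites the old object
toy <- toy[-drop_rows, ]
nrow(toy)
```

Two points are worth checking in your own work. Negative indexing with an empty result from which() returns zero rows rather than the full table, so confirm that rows were actually found before deleting. Also, length() applied to a whole data frame counts its columns, so a row count needs length() of a single column, or nrow().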

EXERCISE 5 - Some More Exploratory Data Analysis

a) Without generating an error message, type into the R console "#EXERCISE 5".

b) Scatterplot the Seca and DL data. Do not use any dev.off() or graphics.off() calls.

c) Scatterplot the Chad and DL data. Do not use any dev.off() or graphics.off() calls.

Think about these plots. What do they tell you? (No need to record your thoughts in the .rmd template - just think about the question.)

d) Scatterplot the Alog and Dig data. Do not use any dev.off() or graphics.off() calls. You should now be satisfied that all erroneous data have been removed from the dataset.

e) Reset the row labels in data2018 using the command: rownames(data2018) <- NULL.

You should now have a data.frame with 217 rows, matching your sample size calculation. Reapply the headers vector to label the columns, then call back data2018 to check everything is in order.

f) Determine how many students entered decimal places for the Alog data by counting how many entries are not whole numbers (i.e., leave a non-zero remainder when divided by 1). Use the following code to do this: "count <- length(which(data2018[,5]%%1!=0))" and enter this call into your working code.

g) Calculate the percentage of students that this represents.

h) What does the "!=" call, used above, mean in R language?
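
To see how the remainder-based count in f) behaves, here is a small sketch on a hypothetical vector "x" (made-up values, not the Alog column).

```r
# Hypothetical values: two of the five carry a decimal part
x <- c(70, 65.5, 80, 72.3, 90)

x %% 1                             # remainder after dividing each value by 1

has_decimal <- x %% 1 != 0         # TRUE where a fractional part remains
n_decimal <- length(which(has_decimal))
n_decimal

100 * n_decimal / length(x)        # share of entries with decimals, as a percentage
```

The same count could also be written as sum(has_decimal), since TRUE is treated as 1 when summed.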

EXERCISE 6 - Measurement Error

a) Without generating an error message, type into the R console "#EXERCISE 6".

b) Calculate the Technical Error of Measurement (TEM) for height using the Seca and Chad stadiometers as repeated measurements.

Hint: make a simple addition and adjustments to the following skeleton code to calculate the TEM: sum((data2018[,x]-data2018[,x])^2)/ (2*(length(data2018[,1]))).

Refer back to your BioStats notes if necessary.

c) Using a modification of the code you derived above, calculate the TEM for repeated body mass measurements (i.e., the Alog and Dig data).

d) What do these TEMs tell you about the accuracy of the height and mass data?

e) Why are measurement errors important to consider when interpreting data?

f) Why is it important to conduct exploratory data analysis prior to calculating P-values?
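
For orientation, a commonly used form of the TEM for two repeated measurements is the square root of the summed squared differences divided by twice the number of subjects. The sketch below assumes that form and uses made-up vectors "a" and "b"; check it against the definition in your BioStats notes before relying on it.

```r
# Hypothetical repeated measurements on four subjects (mm)
a <- c(1700, 1650, 1800, 1720)
b <- c(1702, 1648, 1801, 1719)

d <- a - b                                # paired differences
tem <- sqrt(sum(d^2) / (2 * length(d)))   # assumed two-measurement TEM form
tem
```

The result is in the same units as the measurements, which makes it directly comparable with a mean difference between two methods.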

EXERCISE 7 - Dispersion Plots

a) Without generating an error message, type into the R console "#EXERCISE 7".

b) Generate a boxplot that includes the following height data in sequence on the same plot:

• Seca; Chad; DL

Ensure you:

• use a ylim value from zero to 2500;
• assign an axis label, including units, to the y-axis.

c) Interpret the results. What is this graph telling you in regards to:

• repeatability of body height measurement using the stadiometers?

• stadiometer measurements compared to self-reported driver's license height?

(No need to record your thoughts in the .rmd template - just consider the questions.)

d) Calculate the Pearson Product Moment Correlation Coefficient (yes! That's the one derived by Karl Pearson) for the Seca and DL height data using the cor() function.

e) What does the correlation coefficient tell you in this instance? Does this agree with the TEM results?
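
The boxplot and correlation calls in this exercise can be tried first on simulated data. The vectors "h1", "h2" and "h3" below are randomly generated and hypothetical; they only mimic two close repeat measurements and one noisier series.

```r
set.seed(1)                                   # make the simulation repeatable
h1 <- rnorm(50, mean = 1700, sd = 90)         # hypothetical heights (mm)
h2 <- h1 + rnorm(50, sd = 3)                  # a close repeat of h1
h3 <- h1 + rnorm(50, mean = 10, sd = 20)      # a noisier, shifted series

# Several vectors passed to boxplot() are drawn side by side, in the order given
boxplot(h1, h2, h3,
        names = c("h1", "h2", "h3"),
        ylim = c(0, 2500),
        ylab = "Height (mm)")

# cor() computes the Pearson coefficient by default (method = "pearson")
r <- cor(h1, h3)
r
```

Note that a correlation coefficient describes how closely two series vary together; a constant offset between them leaves it unchanged.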

EXERCISE 8 - t-tests

a) Without generating an error message, type into the R console "#EXERCISE 8".

b) Use the function t.test() to conduct a paired, two-sided (or tailed), t-test between Chad and DL for all 217 individuals in data2018. Look up the necessary options for the function call in R help if required.

c) Why is a paired test used in this instance?

d) Why does Student's t-test go by the name of "Student"? Who formulated this test initially? Who helped correct the mathematics of this test?

e) Discuss the statistical significance of the test results:

• What is the P-value result and what is it telling you in this instance?

• What is the mean difference between the groups? Is this larger or smaller than the TEM?

• Based on the results generated in this assignment as a whole, is driver's licence height an accurate indicator of body height? How do these data fit with your hypothesis? Explain.

• Did the test you just conducted follow a Fisherian Statistical Significance test method or a Neyman-Pearson Null-Hypothesis Test? Why?

f) Now, let's take a smaller sample of n = 30 that includes approximately equal numbers of males and females and rerun the t-test (use rows 110:135). Note: this smaller sample is more applicable to the t-test's original formulation for small (rather than large) samples.

g) Now, what is the P-value result and how big is the mean difference between the groups? (No need to record your thoughts in the .rmd template - just consider the questions.)

h) What drives the difference between the P-value results in this instance?

(No need to record your thoughts in the .rmd template - just consider the question.)

i) What good are statistical significance tests, if everything becomes significant with large sample sizes?
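
A paired, two-sided t-test on a full sample and on a subset can be sketched with simulated data. The vectors "measured" and "reported", the sample size of 200 and the simulated offset are all hypothetical and carry no information about the class results.

```r
set.seed(42)                                          # make the simulation repeatable
n <- 200
measured <- rnorm(n, mean = 1700, sd = 90)            # hypothetical measured heights (mm)
reported <- measured + rnorm(n, mean = 5, sd = 25)    # hypothetical self-reports

# paired = TRUE tests the mean of the within-person differences
full <- t.test(reported, measured, paired = TRUE, alternative = "two.sided")

# Same test on the first 30 pairs only
small <- t.test(reported[1:30], measured[1:30], paired = TRUE)

full$estimate    # mean difference, full sample
full$p.value
small$estimate   # mean difference, subset
small$p.value
```

The t statistic divides the mean difference by its standard error, and the standard error shrinks with the square root of the sample size, so comparing the two outputs shows how sample size alone can move a P-value.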

EXERCISE 9 - Time Taken

Report the approximate time in hours that you spent completing the rough work, then cut and paste your working code into the .rmd template in RStudio to knit your .html notebook.
