Programming Assignment: Cancer Genome Identification Tool
I. Learner Objectives:
At the conclusion of this programming assignment, participants should be able to:
Implement pointers and/or arrays
Apply parallel arrays
Compare and contrast pointers and arrays
Pass output parameters to functions
Apply repetition structures within algorithms
Compose C programs consisting of sequential, conditional, and iterative statements
Create structure charts for a given problem
Determine an appropriate functional decomposition or top-down design from a structure chart
II. Prerequisites:
Before starting this programming assignment, participants should be able to:
Analyze a basic set of requirements and apply top-down design principles for a problem
Apply repetition structures within an algorithm
Construct while (), for (), or do-while () loops in C
Compose C programs consisting of sequential, conditional, and iterative statements
Eliminate redundancy within a program by applying loops and functions
Create structure charts for a given problem
Open and close files
Read, write to, and update files
Manipulate file handles
Apply standard library functions: fopen (), fclose (), fscanf (), and fprintf ()
Compose decision statements ("if" conditional statements)
Create and utilize compound conditions
Summarize topics from Hanly & Koffman Chapter 6 including:
o What is a pointer?
o What is an output parameter?
III. Overview & Requirements:
One person dies from cancer every minute in the U.S. (https://cancergenome.nih.gov/). DNA is the chemical responsible for carrying instructions that control cells. When the instructions are not recognized by the cells because of mutations, cells do not function properly. Improper functioning of cells can lead to cancer.
If mutations can be identified, then cancer treatments can be applied. Software may be used to identify mutations in the genome. The genome is the collection of DNA instructions in your cells. Most cells contain two sets of chromosomes, one from your father and one from your mother. Each chromosome is a long DNA molecule built from nucleotide bases. The four bases are A, C, G, and T. In the double helix structure of DNA in a normal cell, A pairs with T and C pairs with G.
For this assignment we simplify our model of the genome. Our goal is to identify mutations in a DNA sequence. We will place our normal DNA sequences and "test" sample sequences in a file called "sequences.txt". The section of the file that holds the normal sequences will be identified by an 'N' on its own line, and the section that holds the "test" sample sequences will be identified by an 'S' on its own line.
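Mutations will be identified in two ways. A mismatched pair is a sample pair whose bases do not pair correctly, such as A-C, A-G, T-C, T-G, C-A, G-A, C-T, and G-T. A flipped pair is a sample pair in which a base has changed (flipped) from the normal sequence to the sample sequence. In our file, a sequence is a run of base pairs stored on two consecutive lines: the first line holds one strand, and the second line holds the base paired with each position of the first. For example: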
N
ATGGAATTCTCGCTC
TACCTTAAGAGCGAG
CGGTCA
GCCAGT
S
TTGGAATTCTAGCTC
AACCTTAAGAGCGCG
CGATGA
GCCACT
The file may contain an unknown number of sequences. However, you may assume that each sequence will not exceed 15 base pairs, as shown above.
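The file layout described above can be read into parallel arrays with fscanf (). The following is only a minimal sketch under stated assumptions, not a required design: the function name read_sequences, the array names, and the cap MAX_SEQUENCES are hypothetical (the assignment does not limit the number of sequences), and the sketch assumes every line holds either a lone 'N' or 'S' marker or a strand of at most 15 bases.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_BASES 15       /* from the assignment: at most 15 bases per line */
#define MAX_SEQUENCES 100  /* assumed cap; the assignment leaves the count open */

/* Reads "sequences.txt"-style input from `in` into four parallel arrays.
 * Index i of normal_top/normal_bottom holds the two lines of normal
 * sequence i; sample_top/sample_bottom do the same for the sample section.
 * The two counts are output parameters. */
void read_sequences(FILE *in,
                    char normal_top[][MAX_BASES + 1],
                    char normal_bottom[][MAX_BASES + 1],
                    char sample_top[][MAX_BASES + 1],
                    char sample_bottom[][MAX_BASES + 1],
                    int *normal_count, int *sample_count)
{
    char token[MAX_BASES + 1];
    char (*top)[MAX_BASES + 1] = NULL;    /* arrays of the current section */
    char (*bottom)[MAX_BASES + 1] = NULL;
    int *count = NULL;                    /* counter of the current section */
    int expecting_top = 1;                /* 1: next strand starts a new pair */

    assert(in != NULL && normal_count != NULL && sample_count != NULL);
    *normal_count = 0;
    *sample_count = 0;

    /* %15s reads one whitespace-delimited token of at most 15 characters. */
    while (fscanf(in, "%15s", token) == 1) {
        if (strcmp(token, "N") == 0) {
            top = normal_top;
            bottom = normal_bottom;
            count = normal_count;
            expecting_top = 1;
        } else if (strcmp(token, "S") == 0) {
            top = sample_top;
            bottom = sample_bottom;
            count = sample_count;
            expecting_top = 1;
        } else if (count != NULL && *count < MAX_SEQUENCES) {
            if (expecting_top) {
                strcpy(top[*count], token);
            } else {
                strcpy(bottom[*count], token);
                (*count)++;               /* both lines of the pair are stored */
            }
            expecting_top = !expecting_top;
        }
        /* Strands before any marker, or beyond the cap, are ignored. */
    }
}
```

A caller would open "sequences.txt" with fopen (), declare the four arrays with MAX_SEQUENCES rows each, pass the addresses of two int counters, and close the file with fclose () afterwards. Because a strand contains only A, C, G, and T, a token that is exactly "N" or "S" can safely be treated as a section marker.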
Your program must identify each mutation by reporting the sequence in which it is found and its position within that sequence. Sequences and positions are both numbered starting at 1. The results must be written to a file called "mutations.txt". Using the example above, your program would write the following to the file:
Mutation(s) found in sequence 1
Pair 1 flipped pair
Pair 11 mismatched pair
Pair 14 mismatched pair
Mutation(s) found in sequence 2
Pair 3 mismatched pair
Pair 5 flipped pair
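One way to produce the report above is to classify each sample pair and print a line only when a mutation is found. The sketch below is an illustration under assumptions, not the required solution: the names PairStatus, is_valid_pair, classify_pair, and report_sequence are hypothetical. It also assumes that a pair is checked for a mismatch first and is called flipped only when it is correctly paired but differs from the normal pair, which is consistent with the sample output (pair 11 of sequence 1 has a changed base yet is reported as mismatched).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { PAIR_OK, PAIR_MISMATCHED, PAIR_FLIPPED } PairStatus;

/* Returns 1 when the two bases pair correctly (A-T, T-A, C-G, G-C). */
int is_valid_pair(char top, char bottom)
{
    return (top == 'A' && bottom == 'T') || (top == 'T' && bottom == 'A') ||
           (top == 'C' && bottom == 'G') || (top == 'G' && bottom == 'C');
}

/* Classifies one sample pair against the normal pair at the same position.
 * Assumption: a mismatch takes precedence over a flip. */
PairStatus classify_pair(char normal_top, char normal_bottom,
                         char sample_top, char sample_bottom)
{
    if (!is_valid_pair(sample_top, sample_bottom)) {
        return PAIR_MISMATCHED;
    }
    if (sample_top != normal_top || sample_bottom != normal_bottom) {
        return PAIR_FLIPPED;
    }
    return PAIR_OK;
}

/* Writes the report lines for one sequence to `out` and returns the
 * number of mutations found. Nothing is written for a clean sequence. */
int report_sequence(FILE *out, int sequence_number,
                    const char *normal_top, const char *normal_bottom,
                    const char *sample_top, const char *sample_bottom)
{
    int length = (int)strlen(sample_top);
    int mutations = 0;
    int i;

    /* This sketch assumes all four strands have the same length. */
    assert(strlen(normal_top) == (size_t)length);
    assert(strlen(normal_bottom) == (size_t)length);
    assert(strlen(sample_bottom) == (size_t)length);

    for (i = 0; i < length; i++) {
        PairStatus status = classify_pair(normal_top[i], normal_bottom[i],
                                          sample_top[i], sample_bottom[i]);
        if (status == PAIR_OK) {
            continue;
        }
        if (mutations == 0) {
            /* Print the heading once, before the first mutation. */
            fprintf(out, "Mutation(s) found in sequence %d\n", sequence_number);
        }
        fprintf(out, "Pair %d %s\n", i + 1,
                status == PAIR_FLIPPED ? "flipped pair" : "mismatched pair");
        mutations++;
    }
    return mutations;
}
```

A caller would open "mutations.txt" for writing with fopen (), call report_sequence () once per sequence with sequence numbers starting at 1, and then close the file with fclose ().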
BONUS
Generate randomly paired bases to form random DNA sequences that are written to your "sequences.txt" file.
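For the bonus, one possible starting point is to pick each base of the first strand at random and derive its partner. This is a rough sketch under assumptions: the function names complement, generate_sequence, and write_sequence are hypothetical, and the sketch produces only correctly paired (normal) sequences. How mutations are introduced into the sample section is left open by the assignment and is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_BASES 15  /* from the assignment: at most 15 bases per line */

/* Returns the base that pairs with `base` in a normal cell (A-T, C-G),
 * or '?' if `base` is not one of A, C, G, T. */
char complement(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return '?';
    }
}

/* Fills the parallel arrays top and bottom with `length` random,
 * correctly paired bases. Each array needs room for length + 1 chars. */
void generate_sequence(char top[], char bottom[], int length)
{
    const char bases[] = "ACGT";
    int i;

    assert(length >= 1 && length <= MAX_BASES);
    for (i = 0; i < length; i++) {
        top[i] = bases[rand() % 4];       /* random base for the first line */
        bottom[i] = complement(top[i]);   /* its partner for the second line */
    }
    top[length] = '\0';
    bottom[length] = '\0';
}

/* Writes one sequence as two lines to an open file such as "sequences.txt". */
void write_sequence(FILE *out, const char top[], const char bottom[])
{
    fprintf(out, "%s\n%s\n", top, bottom);
}
```

A caller would typically seed the generator once with srand (), write the 'N' marker line with fprintf (), and then call generate_sequence () and write_sequence () in a loop for as many sequences as desired.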