Assignment: Data Structures and Algorithms
The final exam is time-sensitive: you MUST turn it in by the end of the week in which it was assigned. If you need additional time for any reason, contact your instructor before the end of that week to make arrangements.
1. Given the following DNA sequence: GGTGTAAAGAATCTT
a. Construct a keyword tree
b. Construct a suffix tree
2. How many different nucleotide sequences may code for the following protein sequence:
Arg-Lys-Pro-Val-Ser-Ile-Ala?
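One way to approach a count like this is to multiply the number of synonymous codons for each residue. The R sketch below assumes the standard genetic code and counts only the seven listed residues (no start or stop codon); the names `codons_per_aa`, `protein`, and `n_sequences` are introduced here for illustration only.

```r
# Codons per amino acid in the standard genetic code (assumption: no
# alternative translation table is intended by the question).
codons_per_aa <- c(Arg = 6, Lys = 2, Pro = 4, Val = 4, Ser = 6, Ile = 3, Ala = 4)

# Split the protein sequence from the question into its residues.
protein <- strsplit("Arg-Lys-Pro-Val-Ser-Ile-Ala", "-")[[1]]

# Each position is chosen independently, so the total is the product
# of the per-residue codon counts.
n_sequences <- prod(codons_per_aa[protein])
print(n_sequences)
```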
3. Given the following MSA (Multiple Sequence Alignment), describe (in pseudocode) how you would determine which positions are informative sites:
AT-ACGCCGATGCAT
ATTACGACGATGCTT
ATTACGACGAAGCTT
AT-ACGACGATGCAT
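As one possible reading of "informative site", the R sketch below uses the parsimony definition: a column is informative when at least two different nucleotides each occur in at least two sequences. Treating gaps as missing data rather than as a fifth character state is an assumption, and the names `msa`, `aln`, `is_informative`, and `informative_positions` are illustrative only.

```r
# Aligned sequences copied from the question; all rows have equal length.
msa <- c("AT-ACGCCGATGCAT",
         "ATTACGACGATGCTT",
         "ATTACGACGAAGCTT",
         "AT-ACGACGATGCAT")

# Build a character matrix: one row per sequence, one column per position.
aln <- do.call(rbind, strsplit(msa, ""))

# A column counts as informative when at least two distinct nucleotides
# each appear in two or more sequences. Assumption: gaps ("-") are ignored.
is_informative <- function(column) {
  counts <- table(column[column != "-"])
  sum(counts >= 2) >= 2
}

# Apply the test to every column and report the 1-based positions.
informative_positions <- which(apply(aln, 2, is_informative))
print(informative_positions)
```

If gaps were instead counted as a character state, the gapped column would also satisfy the test, so the choice should be stated explicitly in any answer.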
4. Describe how gene-finding algorithms work. Include a description of all the elements they search for to help determine whether or not a sequence is a protein-coding gene.
5. What is BLAST? Describe how the algorithm works. Be sure to include any statistical measures used to assess the strength of a BLAST result.
6. A graduate student has written part of an R script to perform an analysis. It is listed below.
• Describe what each line does by adding comments to the script as appropriate.
• Execute the script and show all of the output it generates.
• Modify the script so that there are 3 centroids displayed.
• Provide the final modified script and its output.
library(stats)
mydata <- iris
# Round 1
set.seed(101)
km <- kmeans(mydata[,1:4], 10)
plot(mydata[,1], mydata[,2], col=km$cluster)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)
#-------------------------------------------------------
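A note on how the script's pieces relate: the second argument of `kmeans()` is the number of clusters, and the fitted object's `centers` matrix has one row per cluster, so `points(km$centers[, c(1, 2)], ...)` draws as many centroids as clusters were requested. The sketch below is not part of the graduate student's script; it is a minimal check of that relationship, and the name `km3` is introduced here for illustration only.

```r
# Fit k-means on the four numeric iris columns, requesting 3 clusters.
set.seed(101)
km3 <- kmeans(iris[, 1:4], centers = 3)

# One row per cluster, one column per measured variable.
print(dim(km3$centers))

# Cluster labels run from 1 to the number of clusters, so they can be
# used directly as colour indices in plot() and points().
print(table(km3$cluster))
```

Because the cluster labels and the rows of `centers` share the same indexing, a colour vector of the same length as the number of clusters keeps each centroid's colour consistent with its points.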