Need help for Q4
1. Define zero-order and first-order Markov models for the sequence (sequence1_A2) provided in the course content. Sequence1_A2 is the Mycobacterium tuberculosis gene mtb48.
Hints:
- A zero-order Markov model is defined by P(i), where i ∈ {A, T, G, C}.
- A first-order Markov model is defined by P(i|j), where i, j ∈ {A, T, G, C}. For example, P(A|T) is the probability of observing A immediately after T in the DNA sequence.
- For this and higher-order Markov models, read section 3.2.1 of Borodovsky and Ekisheva.
- The easiest way to implement this is to write a small script in R using the alphabetFrequency function of the Biostrings package you have already installed, or in Perl or any other language of your choice. Otherwise, if you have exhausted all other options, see no other way, and are hopelessly behind schedule, you can use the SUBSTITUTE function in Microsoft Excel or find/replace in Microsoft Word.
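The two hints above can be sketched in a few lines of code. This is a minimal illustration in Python rather than R/Biostrings, and the `toy` sequence is a hypothetical stand-in, since sequence1_A2 itself is not shown here. The function names `zero_order` and `first_order` are made up for this example.

```python
from collections import Counter

ALPHABET = "ACGT"

def zero_order(seq):
    """Return P(i): the relative frequency of each nucleotide in seq."""
    seq = seq.upper()
    counts = Counter(seq)
    total = sum(counts[b] for b in ALPHABET)  # ignores non-ACGT characters
    return {b: counts[b] / total for b in ALPHABET}

def first_order(seq):
    """Return P(i|j) as model[j][i]: probability of i given the preceding j."""
    seq = seq.upper()
    # Count every adjacent pair (j, i), where j is followed by i.
    pair_counts = Counter(zip(seq, seq[1:]))
    model = {}
    for j in ALPHABET:
        row_total = sum(pair_counts[(j, i)] for i in ALPHABET)
        model[j] = {
            i: (pair_counts[(j, i)] / row_total if row_total else 0.0)
            for i in ALPHABET
        }
    return model

toy = "ATGCGATTGCA"  # hypothetical stand-in, NOT sequence1_A2
p0 = zero_order(toy)
p1 = first_order(toy)
print(p0)
print(p1["T"])  # distribution of the nucleotide that follows T
```

Each row `p1[j]` sums to 1 whenever j is followed by at least one nucleotide in the sequence. For the real assignment, replace `toy` with the full gene sequence.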
2. Using the models you derived in (1), determine the probability of the DNA fragment AGTAGCTTCCAG (this fragment was also used in A1).
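As a rough sketch of how the fragment probability could be computed: under a zero-order model it is the product of P(x_k) over all positions, and under a first-order model it is P(x_1) times the product of P(x_k|x_{k-1}). The Python below works in log space to avoid underflow. The uniform models are placeholders, not values derived from sequence1_A2, and using the zero-order P(i) for the first nucleotide of the first-order model is an assumption.

```python
import math

def log_prob_zero(fragment, p0):
    """Log-probability of fragment under a zero-order model p0[i] = P(i)."""
    return sum(math.log(p0[b]) for b in fragment)

def log_prob_first(fragment, p0, p1):
    """Log-probability under a first-order model p1[j][i] = P(i|j).

    Assumption: the first nucleotide is scored with the zero-order P(i).
    """
    logp = math.log(p0[fragment[0]])
    for prev, cur in zip(fragment, fragment[1:]):
        # math.log raises ValueError if a transition has probability 0
        logp += math.log(p1[prev][cur])
    return logp

# Placeholder models (uniform); substitute the ones estimated in (1).
uniform0 = {b: 0.25 for b in "ACGT"}
uniform1 = {j: dict(uniform0) for j in "ACGT"}

fragment = "AGTAGCTTCCAG"
print(math.exp(log_prob_zero(fragment, uniform0)))
print(math.exp(log_prob_first(fragment, uniform0, uniform1)))
```

With uniform placeholders both models give the same value, 0.25 raised to the fragment length. The two answers diverge once real frequencies are plugged in.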
3. Given the hidden Markov model framework:
a. What is hidden?
b. What is emitted?
Feel free to use examples
4. a) Define a zero-order Markov model for sequence2_A2, which represents a portion of non-coding sequence of Mycobacterium tuberculosis (refer to the course content).
b) Use the zero-order Markov models defined for sequence1_A2 and sequence2_A2 and apply the Viterbi algorithm to find the most likely path for the sequence CGCGTTCATTCAATG in frame 1 only.
Assume:
Initial transition probabilities
a0c = a0n = 0.5
ann = anc = 0.5
acc = 0.55, acn = 0.45
where aij is the transition probability from state i to state j; c = coding, n = non-coding, and 0 is the initial state.
Note that this problem is for exercise purposes. As a result, for this short sequence you may observe even shorter coding/non-coding regions.
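One possible way to set up part (b) is sketched below in Python, using the transition probabilities given above. Two things are assumptions: the emission probabilities are made-up placeholders (the real ones are the zero-order models of sequence1_A2 and sequence2_A2), and "frame 1" is read as splitting the sequence into codons starting at position 1, with each codon scored as the product of its three nucleotide probabilities.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_logp):
    """Return (most likely state path, its log-probability)."""
    # v[s]: best log-prob of any path that ends in state s at this position
    v = {s: math.log(start_p[s]) + emit_logp(s, observations[0]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_v, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: v[r] + math.log(trans_p[r][s]))
            new_v[s] = v[best_prev] + math.log(trans_p[best_prev][s]) + emit_logp(s, obs)
            ptr[s] = best_prev
        v = new_v
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

states = ["c", "n"]                      # c = coding, n = non-coding
start = {"c": 0.5, "n": 0.5}             # a0c, a0n from the assignment
trans = {"c": {"c": 0.55, "n": 0.45},    # acc, acn
         "n": {"c": 0.5, "n": 0.5}}      # anc, ann

# PLACEHOLDER emissions; replace with the two zero-order models.
emissions = {
    "c": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "n": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def emit_logp(state, chunk):
    """Log-probability of a chunk (e.g. a codon) under a zero-order model."""
    return sum(math.log(emissions[state][b]) for b in chunk)

seq = "CGCGTTCATTCAATG"
codons = [seq[k:k + 3] for k in range(0, len(seq), 3)]  # frame 1 (assumed reading)
path, logp = viterbi(codons, states, start, trans, emit_logp)
print(codons)
print(path, logp)
```

If the course intends a per-nucleotide path instead, passing `list(seq)` in place of `codons` runs the same function on single bases.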