Bioinformatics Assignment
In this assignment, you should check the following sequence and test whether it contains the following restriction cut sites. The search should be global, that is, it should find all possible restriction sites. If restriction sites are present, print out the regex, the substring that matched the regex, and the position where the cut begins.
Hint: the pos function returns the offset at which the last m//g match on a string left off, that is, the position just past the end of the matched substring. Play around with it to see how it works.
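The hint about pos can be explored with a small sketch like the one below. The toy sequence and the GAATTC motif are illustrative assumptions and are not part of the assignment. The sketch reports positions as 0-based offsets, which is also an assumption, since the text does not say whether positions should be 0-based or 1-based.

```perl
use strict;
use warnings;

# Toy sequence and toy motif; neither comes from the assignment.
my $seq = 'TTGAATTCAAGAATTCGG';

my @starts;
while ($seq =~ /GAATTC/g) {
    my $end   = pos($seq);           # offset just past the end of this match
    my $start = $end - length($&);   # 0-based offset where this match began
    push @starts, $start;
    print "matched '$&' starting at $start, pos() is $end\n";
}
# Prints:
# matched 'GAATTC' starting at 2, pos() is 8
# matched 'GAATTC' starting at 10, pos() is 16
```

The built-in array element $-[0] holds the same start offset directly, so the subtraction is only one of several ways to get it.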
Construct regular expressions for the two restriction enzyme motifs. Each restriction enzyme motif should be represented by one regular expression:
CACNNN/GTG (so CACNNN or CACGTG), where N represents A, C, T, or G
GCCWGG, where W represents A or T
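Ambiguity codes such as N and W translate naturally into regex character classes. The sketch below shows the idea on a different, purely illustrative motif (GTYRAC, with R standing for A or G and Y for C or T), so it is not a solution for either enzyme above. The helper name motif_to_regex and the lookup table are hypothetical.

```perl
use strict;
use warnings;

# Ambiguity codes used in this sketch only: R = A or G, Y = C or T.
my %class_for = (R => '[AG]', Y => '[CT]');

# Hypothetical helper: expand each ambiguity code in a motif into a
# character class and leave plain bases unchanged.
sub motif_to_regex {
    my ($motif) = @_;
    return join '',
        map { exists $class_for{$_} ? $class_for{$_} : $_ }
        split //, $motif;
}

my $regex = motif_to_regex('GTYRAC');   # illustrative motif, not from the assignment
print "$regex\n";                        # prints GT[CT][AG]AC
print "match\n" if 'AAGTCGACTT' =~ /$regex/;
```

Writing the character classes by hand in a single regular expression, as the assignment asks, is equally valid; the table only makes the mapping explicit.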
The DNA sequence you will be searching in is this one, which you will paste into your program:
$dna = 'AACAGCACGGCAACGCTGTGCCTTGGGCACCATGCAGTACCAAACGGAACGATAGTGAAAACAATCACGA
ATGACCAAATTGAAGTTACTAATGCTACTGAGCTGGTTCAGAGTTCCTCAACAGGTGAAATATGCGACAG
TCCTCATCAGATCCTTGATGGAGAAAACTGCACACTAATAGATGCTCTATTGGGAGACCCTCAGTGTGAT
GGCTTCCAAAATAAGAAATGGGACCTTTTTGTTGAACGCAGCAAAGCCTACAGCAACTGTTACCCTTATG
ATGTGCCGGATTATGCCTCCCTTAGGTCACTAGTTGCCTCATCCGGCACACTGGAATTTAACAATGAAAG
CTTCAATTGGACTGGAGTCACTCAAAATGGAATCAGCTCTGCTTGCAAAAGGAGATCTAATAACAGTTTC
TTTAGTAGATTGAATTGGTTGACCCACTTAAAATTCAAATACCCAGCATTGAACGTGACTATGCCAAACA
ATGAAAAATTTGACAAATTGTACATTTGGGGGGTTCACCACCCGGGTACGGACAATGACCAAATCTTCCT
GTATGCTCAAGCATCAGGAAGAATCACAGTCTCTACCAAAAGAAGCCAACAGACTGTAATCCCGAATATC
GGATCTAGACCCAGAGTAAGGAATATCCCCAGCAGAATAAGCATCTATTGGACAATAGTAAAACCGGGAG
ACATACTTTTGATTAACAGCACAGGGAATTTAATTGCTCCTAGGGGTTACTTCAAAATACGAAGTGGGAA
AAGCTCAATAATGAGATCAGATGCACCCATTGGCAAATGCAATTCTGAATGCATCACTCCAAATGGAAGC
ATTCCCAATGACAAACCATTTCAAAATGTAAACAGGATCACATATGGGGCCTGGCCCAGATATGTTAAGC
AAAACACTCTGAAATTGGCAACAGGGATGCGAAATGTACCAGAGAAACAAACTAGAGGCATATTTGGCGC
AATCGCGGGTTTCATAGAAAATGGTTGGGAAGGAATGGTGGATGGTTGGTACGGTTT';
If you print out $dna, you may notice that the sequence is wrapped at about 70 characters per line. This means that $dna currently contains \n characters, which will affect how a regex matches against the string: a site that spans a line break will not be found. To correctly identify all possible restriction sites, you first need to remove those newline characters. This can be done by adding the substitution operator after the variable declaration (similar to what we did in the vim writing exercises):
$dna =~ s/\s//g; # What would happen if the 'g' modifier is removed?
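To see why the embedded newlines matter, here is a minimal sketch on a toy two-line string (an assumption for illustration, not the assignment's sequence). A motif that straddles the line break is missed until the whitespace is stripped.

```perl
use strict;
use warnings;

# Toy sequence wrapped across two lines; not the assignment's sequence.
my $wrapped = "ACGT\nACGT";

# "TA" straddles the line break, so it is not found in the wrapped string.
my $found_before = ($wrapped =~ /TA/) ? 1 : 0;

# Copy the string, then strip every whitespace character from the copy.
(my $clean = $wrapped) =~ s/\s//g;

# With the newline gone, T and A are adjacent and the motif is found.
my $found_after = ($clean =~ /TA/) ? 1 : 0;

print "before: $found_before, after: $found_after, length: ", length($clean), "\n";
# Prints: before: 0, after: 1, length: 8
```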
The following is the expected output. Instead of "$pattern1" and "$pattern2", you should be printing out the actual regular expression that you used to match the restriction enzyme motif. I did not print it out because that would give you part of the answer.
The program should include:
two regular expressions, one for each enzyme
one variable that contains the DNA sequence
Optional: if you would like to challenge yourself, include some code that will accept one command-line argument. If one is given, replace $dna above with the sequence provided by the user. Ensure that the provided sequence is a DNA sequence; otherwise, end the program and print a helpful message back to the user.
One subroutine called find_cut_sites that will accept 2 parameters: a DNA sequence and a regular expression. The subroutine should match the regular expression against the sequence and print the positions of all found cut sites. The position printed should be the starting position of where the site was found. Nothing should be explicitly returned by this subroutine. A subroutine that does not explicitly return anything is known as a void subroutine (void because no result is provided back to the caller).
(There should be two subroutine calls for find_cut_sites(): one for each regular expression.)
Comments describing your subroutine (what it accepts, what it returns, what it does) and any other ambiguous code.
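The requirements above could be arranged roughly as in the following sketch. It is one possible shape under stated assumptions, not the expected solution: the toy sequence and the GAATTC pattern stand in for the real data and regular expressions, the helper is_dna is hypothetical, the output layout is a guess, and positions are printed as 0-based offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string consists only of
# A, C, G, or T (case-insensitive), otherwise 0.
sub is_dna {
    my ($seq) = @_;
    return $seq =~ /\A[ACGT]+\z/i ? 1 : 0;
}

# find_cut_sites
#   Accepts: a DNA sequence and a regular expression.
#   Does:    prints the regex, the matched substring, and the 0-based
#            start offset of every match (tab-separated, assumed layout).
#   Returns: nothing (a void subroutine).
sub find_cut_sites {
    my ($seq, $regex) = @_;
    while ($seq =~ /$regex/g) {
        print "$regex\t$&\t$-[0]\n";   # $-[0] is the start offset of the match
    }
    return;
}

my $seq = 'TTGAATTCAAGAATTCGG';        # toy default, not the assignment's $dna
if (@ARGV) {
    # Optional part: use the sequence supplied on the command line.
    $seq = uc $ARGV[0];
    die "Usage: $0 [sequence containing only A, C, G, T]\n" unless is_dna($seq);
}

find_cut_sites($seq, 'GAATTC');        # one call per regular expression
```

Run without arguments, this prints one line for each of the two toy matches (offsets 2 and 10). A bare return makes the void behaviour explicit, although simply omitting a return value would also satisfy the description.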