Assignment
The genome of an organism can be expressed as some number G of "base pairs". Typical sizes of various genomes are given in.
String matching can be used to find particular sequences in a genome. Several string matching algorithms are described in.
Consider a program that determines whether a particular sequence of base pairs occurs in a genome and, if so, where and how many times.
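As a point of reference, the sequential version of this task can be sketched in a few lines. This is only a minimal baseline, assuming the genome and the sequence both fit in memory as strings; the function name find_all is illustrative and not part of the assignment. It reports every starting position, including overlapping matches.

```python
def find_all(genome: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) match."""
    positions = []
    if not pattern:
        return positions  # an empty pattern is treated as "no matches"
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are kept.
        start = genome.find(pattern, start + 1)
    return positions


hits = find_all("ACGTACGTAC", "ACGT")
print(len(hits), hits)
```

The length of the returned list answers "how many times", and the list itself answers "where".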
Your program will run on a cluster with the following properties:
Number of nodes - 20
Processors per node - 16 (2.6 GHz Xeon)
Memory per node - 16 GB
GPUs per node - 2 (NVIDIA CUDA), 1024 stream processors and 4 GB RAM, running at 1.5 GHz
Local drives - 1 TB SATA, 6 GB/sec
NFS drive - 10 TB RAID, bandwidth limited by network
Network - switched Ethernet
Latency L = 20 microseconds
Bandwidth B = 1 Gb/sec (about 100 Mbytes/sec) for messages larger than 32 Kbytes
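One common way to use the network figures above is the linear cost model T(n) = L + n/B for sending an n-byte message. The sketch below assumes that model and the stated values of L and B; it ignores the reduced bandwidth for messages under 32 Kbytes, and the names transfer_time, LATENCY_S and BANDWIDTH_BYTES_PER_S are illustrative only.

```python
LATENCY_S = 20e-6              # L = 20 microseconds
BANDWIDTH_BYTES_PER_S = 100e6  # B = 100 Mbytes/sec (large messages)


def transfer_time(n_bytes: float) -> float:
    """Estimated time in seconds to send one n-byte message: L + n/B."""
    return LATENCY_S + n_bytes / BANDWIDTH_BYTES_PER_S


# Compare a small, a medium and a large message (sizes chosen for illustration).
for size in (32 * 1024, 10**6, 10**9):
    print(f"{size:>12} bytes: {transfer_time(size):.6f} s")
```

Under this model latency dominates for very small messages and bandwidth dominates for large ones, which is relevant when deciding how finely to split the genome.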
You may not need all of the above information. If you feel you need some other system property, feel free to assume a reasonable value (try Wikipedia).
Assume the genome you are exploring and the sequence you are trying to find are both initially files on the NFS disk.
1. Parallel String Match algorithm - in MPI, OpenMP, CUDA or some combination of these. Description in English and/or pseudocode is sufficient. Is your algorithm data parallel, task parallel, or both?
Describe data transfer during computation (disk to program, process to process, CPU - GPU and node - node). Describe how data is partitioned between processes, shared between processes, or replicated at each process.
2. You may not need all the hardware available for your algorithm. You may use the entire cluster or any part of it. Describe what resources your algorithm will use to execute. Explain your choice.
3. Estimate how your algorithm would perform on the computer system described above. Consider:
a. Complexity; communication costs.
b. Is there some file size (in bytes, number of elements, or both) that is too small for your algorithm to work efficiently? Given the wide range of genome sizes, is there some range of sizes that you expect would be best for your algorithm?
c. How much speedup would you expect on the given hardware as compared to running on a single CPU?
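One point that the data partitioning question in item 1 raises is that a match can straddle the boundary between two partitions. The sketch below shows one possible approach, not a required one: each chunk is extended by (pattern length - 1) characters so boundary-spanning matches are found, and each match is still reported exactly once. The workers are simulated with a plain loop; in an MPI program each iteration would be a separate rank. The names chunk_bounds and find_all_partitioned are hypothetical.

```python
def chunk_bounds(genome_len: int, pattern_len: int, num_workers: int):
    """Return (start, end) per worker; each chunk overlaps the next by pattern_len - 1."""
    base, extra = divmod(genome_len, num_workers)
    bounds = []
    start = 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)  # spread the remainder evenly
        end = min(start + size + pattern_len - 1, genome_len)
        bounds.append((start, end))
        start += size
    return bounds


def find_all_partitioned(genome: str, pattern: str, num_workers: int) -> list[int]:
    """Search each overlapping chunk independently and merge global positions."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    for start, end in chunk_bounds(len(genome), len(pattern), num_workers):
        chunk = genome[start:end]  # in MPI, this slice would be one rank's data
        i = chunk.find(pattern)
        while i != -1:
            positions.append(start + i)  # convert local index to global index
            i = chunk.find(pattern, i + 1)
    return positions


print(find_all_partitioned("ACGTACGTACGTAAAA", "ACGT", 3))
```

Because a chunk only extends pattern length - 1 characters past its own range, any match it finds must start inside that range, so no match is counted twice.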
Format your assignment according to the following formatting requirements:
1. The answer should be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
2. The response should also include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
3. Also include a reference page. The citations and references should follow APA format. The reference page is not included in the required page length.