Purpose and Objective
The aim of this assignment is to affirm your knowledge and capabilities in drafting network and storage requirements based on a given design scenario. You should be able to identify the main traffic sources, map storage needs to data flow requirements, and estimate capacity from an offered load derived from data and user requirements.
You should also be able to recognize and classify users, data, loads, and behaviours of IP flows.
Keywords
Traffic sources, traffic flow, traffic load, traffic behavior, data sinks, user communities, QoS
Data Communications Network Design
Enterprise Description:
Genome4U is a scientific research project at a large university in the United States. Genome4U has recently started a large-scale project to sequence the genomes of 250,000 volunteers with a goal of creating a set of publicly accessible databases with human genomic, trait, and medical data.
The project's founder, a brilliant man with many talents and interests, tells you that public databases would provide information to the world's scientific community in general, not just those interested in medical research. Genome4U is trying not to prejudge how the data would be used as there may be opportunities for interconnections and correlations which computers could find that people may have missed.
The founder envisions clusters of servers that would be accessible by researchers all over the world. The databases would be used by end users to study their own genetic heritage, with the help of their doctors and genetic counsellors. Additionally, the data would be used by computer scientists, mathematicians, physicists, social scientists, and other researchers.
The genome for a single human consists of complementary DNA strands wound together in a double helix. The strands hold about 6 billion base pairs of nucleotides connected by hydrogen bonds. To store the research data, 1 byte of capacity is used for each base pair. As a result, 6 gigabytes (GB) of data capacity is needed to store the genetic information of just one person. The project plans to use network-attached storage (NAS) clusters. A system has been prototyped using the current version of FreeNAS software (from: www.freenas.org). The production software is expected to migrate to an HBase and Hadoop cloud computing infrastructure.
Genome4U has developed new techniques to sequence a person’s genome quickly, accurately, and, most importantly, at low cost. The research group is a contestant for the $10,000,000 X-Prize offered by Archon-Genomics (see https://genomics.xprize.org for details). With their current funding, they expect to complete the sequencing of 25,000 individuals by December 2012 and can sequence 5,000 individuals every month thereafter with the equipment that they are currently using.
In addition to genetic information, the project would ask volunteers to give detailed information about their traits so that researchers could find correlations between traits and genes. Volunteers will also provide their medical records. Storage will be required for these data sets and the raw nucleotide data. This detailed medical information is expected to require not more than 100 megabytes (MB) of storage for each individual.
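As a rough illustration of how these figures combine, the following Python sketch projects the number of sequenced genomes and the raw storage they need, month by month. It assumes decimal units (1 GB = 1,000 MB), treats month 0 as December 2012, uses the 100 MB upper bound for each individual's medical data, and stops growth at the 250,000-volunteer target. Trait data size, replication, and file system overhead are not given in the scenario and are left out. The function names are hypothetical.

```python
# Figures taken from the scenario; units are an assumption (decimal, 1 GB = 1000 MB).
GENOME_MB = 6_000          # 6 billion base pairs x 1 byte each
MEDICAL_MB = 100           # stated upper bound per individual
INITIAL_GENOMES = 25_000   # sequenced by December 2012 (treated as month 0)
GENOMES_PER_MONTH = 5_000
TARGET_GENOMES = 250_000   # assumed ceiling: the project's volunteer target


def genomes_sequenced(month: int) -> int:
    """Cumulative genomes sequenced `month` months after December 2012."""
    return min(INITIAL_GENOMES + GENOMES_PER_MONTH * month, TARGET_GENOMES)


def storage_tb(month: int) -> float:
    """Raw storage in terabytes, before any replication or overhead."""
    total_mb = genomes_sequenced(month) * (GENOME_MB + MEDICAL_MB)
    return total_mb / 1_000_000


if __name__ == "__main__":
    for month in (0, 12, 24, 36):
        print(f"month {month:2d}: {genomes_sequenced(month):7d} genomes, "
              f"{storage_tb(month):8.1f} TB")
```

Under these assumptions, month 0 works out to 152.5 TB and month 36 to 1,250.5 TB of raw data. A production estimate would also need a replication factor, which the scenario does not state.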
Since the data is to be publicly shared, an initial community of 25,000 active users is expected, and this community is expected to double every 18 months. Active users are expected to access 10% of the entire database daily, which is expected to create huge demand on the networking infrastructure.
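The scenario leaves open whether the 10% figure applies to the user community as a whole or to each active user, so the sketch below exposes that choice as a parameter. It assumes user growth is a smooth exponential with an 18-month doubling period, that demand is spread evenly over 24 hours, and that units are decimal. Real traffic would be burstier, so the result is an average rather than a peak. Function and parameter names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def active_users(month: float, initial: int = 25_000,
                 doubling_months: float = 18.0) -> float:
    """Active users after `month` months, doubling every `doubling_months`."""
    return initial * 2 ** (month / doubling_months)


def average_gbps(database_tb: float, daily_fraction: float = 0.10,
                 users: float = 1.0) -> float:
    """Average offered load in Gbit/s.

    users=1 models the collective reading (the whole community reads 10% of
    the database per day); passing the active user count models the per-user
    reading (each user reads 10% per day). Both readings are assumptions.
    """
    bytes_per_day = database_tb * 1e12 * daily_fraction * users
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


if __name__ == "__main__":
    db_tb = 152.5  # example size: 25,000 individuals at 6.1 GB each
    print(f"collective reading: {average_gbps(db_tb):.3f} Gbit/s")
    print(f"per-user reading:   "
          f"{average_gbps(db_tb, users=active_users(0)):.0f} Gbit/s")
```

For a 152.5 TB database, the collective reading gives an average of about 1.4 Gbit/s, while the per-user reading multiplies that by the number of active users. The gap between the two is large enough that it is worth confirming the intended meaning before sizing any links.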
You have been brought in as a network design consultant to help the Genome4U project, and the management team has asked you to help them organize their requirements.
Answer the following questions:
1. List the major user communities.
2. List the major data stores and the user communities for each data store.
3. Draw a graph of the project's monthly storage requirements for the next 3 years.
4. Based on the size of the database, and the demands of the active users, what is the expected network capacity required to support the growing community of users? Add this capacity demand to the storage graph you drew above.
5. Can you find the relationship between the storage size, number of genomes, number of users, and network capacity requirements? If possible, express this as an equation.
6. Review the capabilities of FreeNAS software. Will the FreeNAS software scale to the projected requirements of this application? If you find limits to its scalability, what other solutions are possible?
7. Characterize the network traffic in terms of flow, load, behavior, and QoS requirements. You will not be able to characterize the traffic precisely, but provide some theories about it and document the types of tests you would conduct to prove your theories right or wrong.
8. What additional questions would you ask Genome4U's founder about this project? Who besides the founder would you talk to and what questions would you ask them?