Case Study:
As you learned in an earlier case study, Amazon.com processed more than 306 order items per second on its peak day of the 2012 holiday sales season. To do that, it processed customer transactions on tens of thousands of servers. With that many computers, failure is inevitable. Even if the probability of any one server failing is 0.0001, and assuming that servers fail independently, the likelihood that not one of 10,000 servers fails is 0.9999 raised to the 10,000th power, which is about 0.37. Thus, under these assumptions, the likelihood of at least one failure is about 63 percent. For reasons that go beyond the scope of this discussion, the likelihood of failure is actually much greater.
Amazon.com must be able to thrive, even in the presence of such constant failure. Or, as Amazon.com engineers stated: “Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.” The only way to deal with such failure is to replicate the data on multiple servers. When a customer stores a Wish List, for example, that Wish List needs to be stored on several physically separate servers. Then, when (notice when, not if) a server holding one copy of the Wish List fails, Amazon.com applications obtain it from another server.
Such data replication solves one problem but introduces another. Suppose that the customer’s Wish List is stored on servers A, B, and C, and server A fails. While server A is down, server B or C can provide a copy of the Wish List, but if the customer changes it, the new version can be written only to servers B and C. It cannot be written to A, because A is not running. When server A comes back into service, it will still hold the old copy of the Wish List. The next day, when the customer reopens his or her Wish List, two different versions exist: the most recent one on servers B and C and an older one on server A. The customer wants the most current one. How can Amazon.com ensure that it will be delivered?
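The failure arithmetic given earlier in this case can be checked in a few lines of Python. This is a minimal sketch that assumes servers fail independently of one another; the function name prob_at_least_one_failure is made up for this illustration and does not come from any Amazon.com system.

```python
def prob_at_least_one_failure(p_fail: float, n_servers: int) -> float:
    """Chance that at least one of n_servers fails.

    Assumes every server fails independently with the same
    probability p_fail (a simplifying assumption).
    """
    p_none_fail = (1.0 - p_fail) ** n_servers
    return 1.0 - p_none_fail


# Figures used in the case: 10,000 servers, each failing with probability 0.0001.
p_none = (1.0 - 0.0001) ** 10_000
p_any = prob_at_least_one_failure(0.0001, 10_000)

print(f"P(no server fails)        = {p_none:.2f}")
print(f"P(at least one failure)   = {p_any:.2f}")
```

Running the sketch gives roughly 0.37 and 0.63, which matches the figures in the case. Raising n_servers to the tens of thousands pushes the second number very close to 1.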
Keep in mind that 15.6 million orders are being shipped while this goes on. None of the current relational DBMS products was designed for problems like this. Consequently, Amazon.com engineers developed Dynamo, a specialized data store for reliably processing massive amounts of data on tens of thousands of servers. Dynamo provides an always-open experience for Amazon.com’s retail customers; Amazon.com also sells Dynamo store services to others via its S3 Web Services product offering.
Meanwhile, Google was encountering similar problems that commercially available relational DBMS products could not solve. In response, Google created Bigtable, a data store for processing petabytes of data on hundreds of thousands of servers. Bigtable supports a richer data model than Dynamo, which means that it can store a greater variety of data structures. Both Dynamo and Bigtable are designed to be elastic; this term means that the number of servers can dynamically increase and decrease without disrupting performance.
In 2007, Facebook encountered similar data storage problems: massive amounts of data, the need to be elastically scalable, tens of thousands of servers, and high volumes of traffic. In response to this need, Facebook began development on Cassandra, a data store that provides storage capabilities like Dynamo’s with a richer data model like Bigtable’s. Initially, Facebook used Cassandra to power its Inbox Search. By 2008, Facebook realized that it had a bigger project on its hands than it wanted and gave the source code to the open source community. As of 2012, Cassandra is used by Facebook, Twitter, Digg, Reddit, Cisco, and many others.
Cassandra, by the way, is a fascinating name for a data store. In Greek mythology, Cassandra was so beautiful that Apollo fell in love with her and gave her the power to see the future. Alas, Apollo’s love was unrequited, and he cursed her so that no one would ever believe her predictions. The name was apparently a slam at Oracle.
Cassandra is elastic and fault-tolerant; it supports massive amounts of data on thousands of servers and provides durability, meaning that once data is committed to the data store, it won’t be lost, even in the presence of failure. One of the most interesting characteristics of Cassandra is that clients (meaning the programs that run Facebook, Twitter, etc.) can select the level of consistency that they need. If a client requests that all servers always be current, Cassandra will ensure that this happens, but performance will be slow. At the other end of the trade-off spectrum, clients can require no consistency, in which case performance is maximized. In between, clients can require that a majority of the servers that store a data item be consistent. For workloads of this kind, Cassandra’s performance is vastly superior to that of relational DBMS products. In one comparison, Cassandra was found to be 2,500 times faster than MySQL for write operations and 23 times faster for read operations on massive amounts of data on hundreds of thousands of possibly failing computers!
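The consistency trade-off described above can be illustrated with a toy model of the Wish List scenario on servers A, B, and C. The sketch below is not Cassandra’s API or its actual algorithm: the Replica class, the write and read functions, the level names ONE, QUORUM, and ALL, and the use of a simple version number to pick the newest copy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One server holding copies of data items (hypothetical model)."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (version, value)


def required(level: str, n: int) -> int:
    """How many replicas must take part for the chosen consistency level."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]


def write(replicas, key, value, version, level):
    """Store the value on every running replica, if enough are running."""
    live = [r for r in replicas if r.up]
    if len(live) < required(level, len(replicas)):
        raise RuntimeError(f"not enough live replicas for level {level}")
    for r in live:
        r.data[key] = (version, value)


def read(replicas, key, level):
    """Ask the required number of running replicas; return the newest copy."""
    live = [r for r in replicas if r.up]
    need = required(level, len(replicas))
    if len(live) < need:
        raise RuntimeError(f"not enough live replicas for level {level}")
    answers = [r.data[key] for r in live[:need] if key in r.data]
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])[1]  # highest version wins


a, b, c = Replica("A"), Replica("B"), Replica("C")
servers = [a, b, c]

write(servers, "wishlist", ["book"], version=1, level="ALL")
a.up = False                                   # server A fails
write(servers, "wishlist", ["book", "bike"], version=2, level="QUORUM")
a.up = True                                    # A returns with the old copy

stale = read(servers, "wishlist", level="ONE")     # asks only A
fresh = read(servers, "wishlist", level="QUORUM")  # asks A and B
print("ONE   :", stale)
print("QUORUM:", fresh)
```

In this toy model, a read that consults only server A returns the old Wish List, while a majority read must include B or C and therefore sees the newer version. Writing at the ALL level while A was down would have been refused, which is the availability cost of the strictest setting.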
Q1. Clearly, Dynamo, Bigtable, and Cassandra are critical technology to the companies that created them. Why did they allow their employees to publish academic papers about them? Why did they not keep them as proprietary secrets?
Q2. What do you think this movement means to the existing DBMS vendors? How serious is the NoSQL threat? Justify your answer. What responses by existing DBMS vendors would be sensible?
Q3. Is it a waste of your time to learn about the relational model and Microsoft Access? Why or why not?
Q4. Given what you know about AllRoad, should it use a relational DBMS, such as Oracle Database or MySQL, or should it use Cassandra?
Q5. Suppose that AllRoad decides to use a NoSQL solution, but a battle emerges among the employees in the IT department. One faction wants to use Cassandra, but another faction wants to use a different NoSQL data store, named MongoDB. Assume that you’re Kelly, and Lucas asks for your opinion about how he should proceed. How do you respond?
Your answer must be typed and double-spaced in 12-point Times New Roman, with one-inch margins on all sides, and must follow APA format.