Know Thy CAP Theorem for NoSQL

I’ve been attending some NoSQL events of late, and the same questions keep coming up. There is usually at least one person in the room who wants to debunk NoSQL technologies for no other reason than that they are more comfortable working with traditional data stores. That’s not a problem unless they manage to derail an otherwise informative presentation. Also usually in attendance is the person who wants NoSQL systems to be interchangeable and ‘one size fits all’, with very little tolerance for the design trade-offs scalable systems make to provide certain features or to be more devops friendly. Then there are people like me, who know how long it takes to introduce new technology into an established IT/operations regime in a large organisation. The chosen technology absolutely needs to be the right tool for the job, and objectivity is key no matter how cool the tech is.

Recently I was consulting on a large-scale message store project. The data profile was as follows: a heavy write load (90%), infrequent reads (10%), and data that was almost immutable, changing very rarely after the initial write. Furthermore, the service fee was calculated from the number of messages stored per day, meaning availability was tied directly to the bottom line. How did I help the organisation make the choice? I used the CAP theorem.

The definition of CAP is as follows (Wikipedia):

In theoretical computer science, the CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

A great visual reference to illustrate which technologies fit where is below:

[Image: the CAP triangle, with NoSQL systems placed along the CA, CP and AP sides]

From: http://blog.nahurst.com/visual-guide-to-nosql-systems

So, for example, in my client’s application Partition Tolerance was an absolute must. Furthermore, we knew consistency could be sacrificed in favour of availability; in this case, lower consistency meant eventual consistency.

So our choices came down to the AP side of the triangle. Maturity of the technology was a factor, as was supportability. The client in question, being a large, ponderous organisation, was not going to install a technology into a production environment without a suitable support package from a reputable company. They also weren’t interested in cloud-based hosting.

The second driver for technology selection was the schema. Each message stored had about ten fields, keyed off id and type. There were no requirements for facet-based searching or complex queries on the messages, beyond being able to pull out a list of messages of a particular type for a particular day, plus an edge case where someone might want to look at their backed-up messages immediately after they had been stored.

Cassandra was the technology we went with, primarily because of the volume of writes involved and its ability to right itself (via hinted handoff and repair) should part of the cluster fall behind for a period of time. With its tuneable consistency level set to QUORUM (or LOCAL_QUORUM for multi-DC deployments), if Cassandra doesn’t get two successful responses from replica nodes for a write (with RF=3), it will throw an error. That means every acknowledged write is guaranteed to have been successfully written on at least two nodes and replicated to a third (at some point). This tuneability satisfied the per-write consistency requirement in our case.
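As a minimal sketch of what this looks like in practice (assuming the DataStax Java driver 3.x API; the keyspace, table and values here are illustrative, not the client’s actual schema), a QUORUM write against a table partitioned by (type, day) might look like this:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;
    import com.datastax.driver.core.utils.UUIDs;

    public class MessageStoreSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // Illustrative data model: partitioning by (type, day) makes
            // "all messages of a type for a day" a single-partition read.
            session.execute("CREATE KEYSPACE IF NOT EXISTS msgstore WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS msgstore.messages ("
                    + " type text, day text, id timeuuid, body text,"
                    + " PRIMARY KEY ((type, day), id))");

            // QUORUM with RF=3: the write is only acknowledged once at least
            // two replicas have confirmed it; otherwise the driver throws.
            Statement insert = new SimpleStatement(
                    "INSERT INTO msgstore.messages (type, day, id, body) VALUES (?, ?, ?, ?)",
                    "sms", "2013-05-01", UUIDs.timeBased(), "hello")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            try {
                session.execute(insert);
            } catch (WriteTimeoutException e) {
                // Fewer than two replicas acknowledged in time; the write may still
                // land on some replicas, so the caller decides whether to retry.
            }
            cluster.close();
        }
    }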

In this case, why wouldn’t an RDBMS have sufficed? If we look at CAP, an RDBMS falls into the consistent and available category. An RDBMS can be scaled by sharding or full replication, but these strategies come with complexities and limits. The major one is that you will always need redundancy per shard, which increases your operational complexity, and on top of that the application tier has to be made aware of your sharding scheme. In my opinion, the increase in application and operational complexity doesn’t justify the cost, not to mention that licensing and support for an enterprise RDBMS are typically priced per server.
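To make that last point concrete, here is a hypothetical sketch (the class and routing scheme are illustrative, not from the project) of the kind of shard-routing logic the application tier ends up owning when an RDBMS is sharded:

    import java.util.List;

    // Illustrative only: the application embeds knowledge of the sharding
    // scheme, and every shard still needs its own redundant pair behind it.
    class ShardRouter {
        private final List<String> shardJdbcUrls; // one entry per shard

        ShardRouter(List<String> shardJdbcUrls) {
            this.shardJdbcUrls = shardJdbcUrls;
        }

        // Changing the shard count changes where keys map, which is part of
        // why re-sharding a live RDBMS cluster is operationally painful.
        String urlFor(String messageId) {
            int shard = Math.floorMod(messageId.hashCode(), shardJdbcUrls.size());
            return shardJdbcUrls.get(shard);
        }
    }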

Why not MongoDB? I will say first that MongoDB is a guilty pleasure of mine, mainly because it took less than five minutes to get an application going with it using Spring Data (http://www.springsource.org/spring-data/mongodb). There are many use cases where Mongo fits the bill, but again, availability was of paramount importance under heavy write loads. With MongoDB there are failure cases where the primary for a shard is unavailable, meaning the data in that shard is unavailable; this can occur during primary re-election.
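As a rough illustration of how that failure mode surfaces (assuming the MongoDB Java driver’s 2.x API; hosts and names here are examples, not the client’s code), the application sees the election window as a write that throws and must be retried or buffered:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;
    import com.mongodb.MongoException;

    public class MongoWriteSketch {
        public static void main(String[] args) throws Exception {
            // Connect via a mongos router in a sharded setup (host is illustrative).
            MongoClient client = new MongoClient("mongos.example.com");
            DBCollection messages = client.getDB("store").getCollection("messages");
            try {
                messages.insert(new BasicDBObject("type", "sms").append("body", "hello"));
            } catch (MongoException e) {
                // If the primary for the target shard is mid-election, the write
                // fails; the application has to retry, buffer, or drop the message.
            }
            client.close();
        }
    }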

The above are just examples of how the CAP theorem can be applied to a concrete use case. Had the use case been more focused on consistency and partition tolerance, giving me the consistency of an RDBMS without the complexity at scale, MongoDB may have been suitable. If I had known the data set would never grow beyond a certain size, I may not have required partitioning at all. I’m not saying that CAP is the be-all and end-all of choosing a NoSQL solution, but it is certainly a starting point in deciding which technologies should be considered for your use case.

Also worth a browse:

http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing