In general, weakening the consistency guarantees is not done to improve scalability, but rather to improve the partition tolerance. There are replication techniques (e.g. chain replication) that allow very high performance without sacrificing consistency.
But can it also manage simultaneous writes (of possibly conflicting data) to multiple masters, to achieve write scalability? Won't it get slowed down, checking every master for conflicts before committing?
The point of database partitioning is that a transaction that doesn't span partitions only touches a constant number of servers (usually 2 or 3).
There are at least two problems to solve on opposite ends of scale: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. They may be special cases of a more general problem, and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.
That last sentence is a very strong put-down of NoSQL; if it wasn't published on the ACM website, it should have a "zing!" at the end of it.
Hey Michael,
Interesting read. I'm with you on most points. I do have a couple comments though...
Failure mode 1 is an application error.
Failure mode 2 should result in a failed write. It shouldn't be too hard to trap errors programmatically and handle them intelligently / not propagate them. Of course the devil's in the details and hardware / interpreter / other problems in components that are outside of the DBs control can make things more difficult. These are the sorts of issues that tend to stick around until a system is "battle tested" and run in a couple of large / high volume operations.
Failure modes 3, 4, 5, 6 (partition in a local cluster) - this seems to be where the meat of your argument is... but you totally gloss over your solution. I'm not sure I believe that network partitions (even single node failures) are easily survived by lots of algorithms... Or, more specifically, I don't believe that they can be survived while maintaining consistency (in the CAP sense, not the ACID sense). I threw together a super simplified "proof" of why consistency is essentially impossible in this situation in a recent talk. See http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrel... - slides 16 through 20. What algorithms are there to get around this? If a replica is partitioned you either can't replicate to it and have to fail (unavailable) or you can't replicate to it and succeed anyways (replica is inconsistent).
I also don't buy the argument that partitions (LAN or WAN) are rare and therefore we shouldn't worry about them. For a small operation this may be true, but when you're doing a million operations a second then a one-in-a-million failure scenario will happen every second.
Failure mode 7 will probably result in some data loss unless (as you mention) you're willing to live with the latency of waiting for durable multi-datacenter writes to occur. But having that option is definitely nice, and that's a trade off that I'd like to be able to make on a per-write basis. I may choose to accept that latency when I'm recording a large financial transaction, for example. Another thought related to this issue - in a lot of ways writing something to memory on multiple nodes is more "durable" than writing it to disk on one. So you may be able to do multi-DC replicated writes in memory with tolerable latency assuming your DCs are close enough that the speed of light isn't limiting. That should get you durability up to the point where the entire eastern seaboard disappears, at least.
Failure mode 8 is another core issue that I think you're glossing over. WAN failures (particularly short-lived ones) can and do happen on a regular basis. It's true that routing issues are typically resolved quickly, but it's another law-of-large-numbers thing. Amazon AWS had an issue that took an entire data center offline a while back, for example. Shit happens. In CAP terms this is really the same thing as a failure modes 3, 4, 5, 6, and 7 though. So the same arguments apply. Re: your argument that only a small segment splits - what happens when a read comes into the small split segment (maybe from a client in the same datacenter)? If data has been updated on the larger segment it couldn't have been replicated, so again you're either serving stale data or your data store is unavailable.
Thanks for putting this together, it was an interesting read. Looking forward to hearing more about some of these issues!
There are however several issues with this. First, eventual consistency means eventual consistency of data on the replicas; it doesn't mean that a client can't be presented with a consistent view. Dynamo-pattern systems use quorums to provide this. Even when a quorums aren't used (R + W < N), at least with Voldemort, the weak (no read-your-writes) form of eventual consistency is still only a failure condition: when the "coordinator" (first in the preference list) node for the specific key fails after a write has happened, but before the next read is made.
Second, a CA system means that the core switch (and there may be multiple within a datacenter, especially in the case of startups leasing colocation space or using a managed/"cloud" hosting provider) is now a single point of failure. The way to build around this is by building an "AP" layer on top of the "CA" layer, spanning multiple core switches. This is similar to Yahoo's PNUTS and Google's Spanner. Both of these had been multi-year projects, by companies with expertise in distributed computing, with very specific and limited requirements (i.e., they were building this for themselves, not selling them as general purpose solution). Which brings me to the next point:
"CA" consistency is much more difficult (that is, error prone) to implement than "AP" consistency. Two phase commit is one way to do so, but doesn't provide fault tolerance (it won't withstand the failure of the coordinator/master node). Paxos is one way to do so, but a high performance Paxos implementation requires leases and is still very tricky to implement. Again, it took years for Google to build, trouble shoot and perfect the Chubby/GFS/BigTable stack; the first version did not have a fault tolerant master and the query model is still much simpler than SQL.
That's why I am skeptical when people claim to be able to deliver to market (within months, not years) a commercial solution that provides strong consistency, fault tolerance, horizontal scalability (without a hard upper limit), supports multiple datacenters and still allows execution of SQL queries (even if without certain types of JOINs) with OLTP-suitable performance. That's not to say it's logically impossible, it's just a very bold claim to make.