Errors in Database Systems, Eventual Consistency, and the CAP Theorem (opens in new tab)

(cacm.acm.org)

41 pointsdm_mongodb16y ago28 comments

28 comments

11 comments · 4 top-level

kennu16y ago· 5 in thread

I always figured that the C in CAP is more about scalability. If you keep adding servers indefinitely, it becomes slower and slower to commit every transaction to every server. But if you ditch the C, you can write fully in parallel to all servers.

cx0116y ago

ACID transactions require consistency (C in CAP) in order to guarantee atomicity. So if you ditch C, you'll just be unable to perform transactions at all.

In general, weakening the consistency guarantees is not done to improve scalability, but rather to improve the partition tolerance. There are replication techniques (e.g. chain replication) that allow very high performance without sacrificing consistency.

kingkilr16y ago

The C in CAP and the C in ACID are not the same C. Consistency in the ACID sense means that all constraints are met, Consistency in the CAP sense means all servers have the same data.

2 more replies

kennu16y ago

Without knowing much about it, I can imagine that chain replication could reduce the problem of copying data to every server from O(n) to something like O(log n).

But can it also manage simultaneous writes (of possibly conflicting data) to multiple masters, to achieve write scalability? Won't it get slowed down, checking every master for conflicts before committing?

1 more reply

wmf16y ago

If you keep adding servers indefinitely, it becomes slower and slower to commit every transaction to every server.

The point of database partitioning is that a transaction that doesn't span partitions only touches a constant number of servers (usually 2 or 3).

kennu16y ago

I would interpret that as a compromise where C, A and P are divided into some ratio (you still don't get 100% of each). I understand Cassandra does that well if the application fits the model.

justinsb16y ago· 2 in thread

I think Stonebraker is right: CAP has been dramatically over-applied.

aristus16y ago

It's probably to do with differences in scale. At the LAN level, he's right: partitioning is rare and C/A might make the most sense. At the WAN level, partitioning is a fact of life.

There are at least two problems to solve on opposite ends of scale: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. They may be special cases of a more general problem, and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.

justinsb16y ago

Agreed, but he's addressed WAN partitioning very well. Editing: "[Consider] a partition in a WAN network. There is enough redundancy engineered into today’s WANs that a partition is quite rare ... the most likely WAN failure is to separate a small portion of the network from the majority. In this case, the majority can continue with straightforward algorithms, and only the small portion must block. Hence, it seems unwise to give up consistency all the time in exchange for availability of a small subset of the nodes in a fairly rare scenario."

That last sentence is a very strong put-down of NoSQL; if it wasn't published on the ACM website, it should have a "zing!" at the end of it.

1 more reply

mmalone16y ago

I posted this comment to the ACM article, but they're pre-moderating and apparently someone's shirking their responsibilities... so I'm reposting here. Hope you guys don't mind:

Hey Michael,

Interesting read. I'm with you on most points. I do have a couple comments though...

Failure mode 1 is an application error.

Failure mode 2 should result in a failed write. It shouldn't be too hard to trap errors programmatically and handle them intelligently / not propagate them. Of course the devil's in the details and hardware / interpreter / other problems in components that are outside of the DBs control can make things more difficult. These are the sorts of issues that tend to stick around until a system is "battle tested" and run in a couple of large / high volume operations.

Failure modes 3, 4, 5, 6 (partition in a local cluster) - this seems to be where the meat of your argument is... but you totally gloss over your solution. I'm not sure I believe that network partitions (even single node failures) are easily survived by lots of algorithms... Or, more specifically, I don't believe that they can be survived while maintaining consistency (in the CAP sense, not the ACID sense). I threw together a super simplified "proof" of why consistency is essentially impossible in this situation in a recent talk. See http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrel... - slides 16 through 20. What algorithms are there to get around this? If a replica is partitioned you either can't replicate to it and have to fail (unavailable) or you can't replicate to it and succeed anyways (replica is inconsistent).

I also don't buy the argument that partitions (LAN or WAN) are rare and therefore we shouldn't worry about them. For a small operation this may be true, but when you're doing a million operations a second then a one-in-a-million failure scenario will happen every second.

Failure mode 7 will probably result in some data loss unless (as you mention) you're willing to live with the latency of waiting for durable multi-datacenter writes to occur. But having that option is definitely nice, and that's a trade off that I'd like to be able to make on a per-write basis. I may choose to accept that latency when I'm recording a large financial transaction, for example. Another thought related to this issue - in a lot of ways writing something to memory on multiple nodes is more "durable" than writing it to disk on one. So you may be able to do multi-DC replicated writes in memory with tolerable latency assuming your DCs are close enough that the speed of light isn't limiting. That should get you durability up to the point where the entire eastern seaboard disappears, at least.

Failure mode 8 is another core issue that I think you're glossing over. WAN failures (particularly short-lived ones) can and do happen on a regular basis. It's true that routing issues are typically resolved quickly, but it's another law-of-large-numbers thing. Amazon AWS had an issue that took an entire data center offline a while back, for example. Shit happens. In CAP terms this is really the same thing as a failure modes 3, 4, 5, 6, and 7 though. So the same arguments apply. Re: your argument that only a small segment splits - what happens when a read comes into the small split segment (maybe from a client in the same datacenter)? If data has been updated on the larger segment it couldn't have been replicated, so again you're either serving stale data or your data store is unavailable.

Thanks for putting this together, it was an interesting read. Looking forward to hearing more about some of these issues!

strlen16y ago

It's an interesting article, as fundamentally sound systems have been built which are "CA" within a single datacenter. BigTable is the premier example.

There are however several issues with this. First, eventual consistency means eventual consistency of data on the replicas; it doesn't mean that a client can't be presented with a consistent view. Dynamo-pattern systems use quorums to provide this. Even when a quorums aren't used (R + W < N), at least with Voldemort, the weak (no read-your-writes) form of eventual consistency is still only a failure condition: when the "coordinator" (first in the preference list) node for the specific key fails after a write has happened, but before the next read is made.

Second, a CA system means that the core switch (and there may be multiple within a datacenter, especially in the case of startups leasing colocation space or using a managed/"cloud" hosting provider) is now a single point of failure. The way to build around this is by building an "AP" layer on top of the "CA" layer, spanning multiple core switches. This is similar to Yahoo's PNUTS and Google's Spanner. Both of these had been multi-year projects, by companies with expertise in distributed computing, with very specific and limited requirements (i.e., they were building this for themselves, not selling them as general purpose solution). Which brings me to the next point:

"CA" consistency is much more difficult (that is, error prone) to implement than "AP" consistency. Two phase commit is one way to do so, but doesn't provide fault tolerance (it won't withstand the failure of the coordinator/master node). Paxos is one way to do so, but a high performance Paxos implementation requires leases and is still very tricky to implement. Again, it took years for Google to build, trouble shoot and perfect the Chubby/GFS/BigTable stack; the first version did not have a fault tolerant master and the query model is still much simpler than SQL.

That's why I am skeptical when people claim to be able to deliver to market (within months, not years) a commercial solution that provides strong consistency, fault tolerance, horizontal scalability (without a hard upper limit), supports multiple datacenters and still allows execution of SQL queries (even if without certain types of JOINs) with OLTP-suitable performance. That's not to say it's logically impossible, it's just a very bold claim to make.

j / k navigate · click thread line to collapse

28 comments

11 comments · 4 top-level

kennu16y ago· 5 in thread

cx0116y ago

ACID transactions require consistency (C in CAP) in order to guarantee atomicity. So if you ditch C, you'll just be unable to perform transactions at all.

kingkilr16y ago

The C in CAP and the C in ACID are not the same C. Consistency in the ACID sense means that all constraints are met, Consistency in the CAP sense means all servers have the same data.

2 more replies

kennu16y ago

Without knowing much about it, I can imagine that chain replication could reduce the problem of copying data to every server from O(n) to something like O(log n).

1 more reply

wmf16y ago

If you keep adding servers indefinitely, it becomes slower and slower to commit every transaction to every server.

The point of database partitioning is that a transaction that doesn't span partitions only touches a constant number of servers (usually 2 or 3).

kennu16y ago

I would interpret that as a compromise where C, A and P are divided into some ratio (you still don't get 100% of each). I understand Cassandra does that well if the application fits the model.

justinsb16y ago· 2 in thread

I think Stonebraker is right: CAP has been dramatically over-applied.

aristus16y ago

It's probably to do with differences in scale. At the LAN level, he's right: partitioning is rare and C/A might make the most sense. At the WAN level, partitioning is a fact of life.

justinsb16y ago

That last sentence is a very strong put-down of NoSQL; if it wasn't published on the ACM website, it should have a "zing!" at the end of it.

1 more reply

mmalone16y ago

I posted this comment to the ACM article, but they're pre-moderating and apparently someone's shirking their responsibilities... so I'm reposting here. Hope you guys don't mind:

Hey Michael,

Interesting read. I'm with you on most points. I do have a couple comments though...

Failure mode 1 is an application error.

Thanks for putting this together, it was an interesting read. Looking forward to hearing more about some of these issues!

strlen16y ago

It's an interesting article, as fundamentally sound systems have been built which are "CA" within a single datacenter. BigTable is the premier example.

j / k navigate · click thread line to collapse