The Limits of the CAP Theorem

(cockroachlabs.com)

182 points · bdarnell · 9y ago · 44 comments

32 comments · 9 top-level

falcolas · 9y ago

> In the event that the leaseholder is partitioned away from the other replicas, it will be allowed to continue to serve reads (but not writes) until its lease expires (leases currently last 9 seconds by default), and then one of the other two replicas will get a new lease (after waiting for the first replica’s lease to expire).

So, what happens to readers who are partitioned away from the node which holds that data? Can they not read the data for that lease duration? If they can't, then yeah, CP is a good description.

...

So the design doc seems to hold this up - reads must go to the lease holder, until the lease expires. Nice.

EDIT: Design doc link:

https://github.com/cockroachdb/cockroach/blob/master/docs/de...

bdarnell (OP) · 9y ago

Yes, that's correct. If you are partitioned away from the lease holder, then you can't read until either the partition heals, or the lease expires and a new lease is granted to a node that you can reach.

Writes require the lease too, so it is not possible for a quorum of nodes on one side of a partition to serve writes while a lease holder on the other side serves stale reads. This is a degenerate case of quorum leases (https://www.cs.cmu.edu/~dga/papers/leases-socc2014.pdf) for a single lease holder; in the future we're interested in supporting multiple lease holders to improve read latency (at the expense of write performance and availability).
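
The rule described here (one lease holder serves reads; writes need both the lease and a majority) can be sketched as a toy model. This is an illustration only, not CockroachDB's code: `Replica`, `grant_lease`, `attempt`, and `partition_scenario` are hypothetical names, time is a plain number passed in by the caller, and clock skew and actual replication between replicas are ignored. Only the 9-second default comes from the article.

```python
from dataclasses import dataclass, field

LEASE_SECONDS = 9.0  # default lease length quoted from the article


class Unavailable(Exception):
    """The replica refuses the request rather than risk inconsistency."""


@dataclass
class Replica:
    # Hypothetical toy replica; not CockroachDB's real data structures.
    name: str
    lease_expiry: float = 0.0  # holds the lease while now < lease_expiry
    data: dict = field(default_factory=dict)

    def holds_lease(self, now: float) -> bool:
        return now < self.lease_expiry

    def read(self, key, now: float):
        # Only the lease holder may serve reads.
        if not self.holds_lease(now):
            raise Unavailable(f"{self.name}: no lease")
        return self.data.get(key)

    def write(self, key, value, now: float, reachable: int, total: int = 3):
        # Writes need the lease AND a majority of replicas (self included).
        if not self.holds_lease(now):
            raise Unavailable(f"{self.name}: no lease")
        if reachable <= total // 2:
            raise Unavailable(f"{self.name}: no quorum")
        self.data[key] = value


def grant_lease(new_holder: Replica, old_holder: Replica, now: float) -> None:
    # A new lease is granted only after the old one has expired.
    if old_holder.holds_lease(now):
        raise Unavailable("previous lease still valid")
    new_holder.lease_expiry = now + LEASE_SECONDS


def attempt(op):
    """Run op() and report either its result or why it was refused."""
    try:
        return op()
    except Unavailable as exc:
        return f"unavailable ({exc})"


def partition_scenario():
    # A holds the lease and is cut off alone; B can still reach C (2 of 3).
    a = Replica("A", lease_expiry=LEASE_SECONDS, data={"k": 1})
    b = Replica("B", data={"k": 1})
    return [
        attempt(lambda: a.read("k", now=5.0)),                   # lease holder still reads
        attempt(lambda: a.write("k", 2, now=5.0, reachable=1)),  # lease but no quorum
        attempt(lambda: b.write("k", 2, now=5.0, reachable=2)),  # quorum but no lease
        attempt(lambda: grant_lease(b, a, now=9.5)),             # old lease has expired
        attempt(lambda: b.write("k", 2, now=9.5, reachable=2)),  # now allowed
        attempt(lambda: a.read("k", now=9.5)),                   # A's lease is gone
        attempt(lambda: b.read("k", now=9.5)),
    ]
```

In this sketch, no write is accepted anywhere while the isolated lease holder can still serve reads, so the value it returns cannot be stale; the cost is that both sides refuse writes until the lease runs out.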

YZF · 9y ago

If a partitioned node can serve reads and the other nodes can serve writes then you must be reading stale data though.

falcolas · 9y ago

Writes also require the lease, so no data can be written until the lease expires.

Writes becoming unavailable during a partition is a reasonable solution for a CP system.

Groxx · 9y ago

Yeah, that outcome seems pretty straightforward...

It's probably not hard to require that writes (which require a majority) also require the lease-holder to ack the write, which seems like it'd solve this. It's a bit odd that they don't mention anything like this, but it is a fairly short blog post.

A bit of lazy browsing didn't lead me to any more detailed descriptions of how it handles partitions. Anyone else know?

readams · 9y ago

Allowing stale reads is a pretty weak version of consistency.

falcolas · 9y ago

The reads, per the design doc, can't get stale, because the values can't be updated until the lease ends.

zimbatm · 9y ago

I have a new definition:

* CP is a database

* AP is a cache

Anyone else pretending AP is a database is lying (unless it's a content-addressable store) :p

jedberg · 9y ago

Cassandra and Riak are AP, and both can certainly be used as sources of truth. You just have to move the "C" part up into your app, which may actually be a better place for it, since what is "consistent" can be dependent on the data and application of that data.

irfansharif · 9y ago

Here is what Google has to say about 'moving the "C" part up into your application':

“We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.”[1]

[1]: https://yokota.blog/2017/02/17/dont-settle-for-eventual-cons...

zimbatm · 9y ago

Just :) Dealing with merge conflicts is not trivial. So far I know of three different strategies:

* Automatic with CRDTs, but there is a price to pay in storage and in the data structures available to play with.

* Manual, like Git.

* Denial, aka Last Write Wins, which is probably good enough for a cache but has been used in other contexts.
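
To make the difference between the first and third strategies concrete, here is a minimal sketch comparing a grow-only counter (G-Counter, one of the simplest CRDTs) with a last-write-wins register after a partition. The function names and the `(timestamp, value)` tuple layout are assumptions made for illustration; they are not taken from any particular database.

```python
def lww_merge(a, b):
    """Last-write-wins: each side is (timestamp, value); keep the newer one."""
    return a if a[0] >= b[0] else b


def gcounter_merge(a, b):
    """G-Counter CRDT: per-node counts, merged by element-wise maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def gcounter_value(counter):
    """The counter's value is the sum of every node's own count."""
    return sum(counter.values())


def partition_example():
    # Before the partition both sides agree the counter is 5 (n1: 3, n2: 2).
    side1 = {"n1": 3, "n2": 2}
    side2 = {"n1": 3, "n2": 2}

    # During the partition, n1 records two increments and n2 records one.
    side1["n1"] += 2
    side2["n2"] += 1

    # CRDT merge keeps all three increments: 5 + 2 + 1 = 8.
    crdt_total = gcounter_value(gcounter_merge(side1, side2))

    # LWW on a plain integer: side 1 wrote 7 at t=10, side 2 wrote 6 at t=11.
    # The later write wins, so side 1's two increments are silently dropped.
    lww_total = lww_merge((10, 7), (11, 6))[1]

    return crdt_total, lww_total
```

The storage price mentioned above shows up even in this toy: the G-Counter keeps one entry per node that ever incremented it, where the LWW register keeps a single value and timestamp.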

urethrafranklin · 9y ago

Riak is configurable. It can be used as CP.

d_t_w · 9y ago

I have a system which computes over an immutable time series and stores the results in Cassandra, which I treat as the source of truth. I think it's fair to call it a DB.

zimbatm · 9y ago

What is the merge strategy after a partition?

The initial comment was mainly intended as a joke since a lot of AP systems use the last-write-wins strategy.

wwilson · 9y ago

It's kind of amazing how we have to have this discussion again every time somebody designs a CP system with excellent availability.

I'll just come out and say it: the 'A' in CAP is boring. It does not mean what you think it means. Lynch et al. probably chose the definition because it's one for which the 'theorem' is both true and easy to prove. This is not the impossibility result with which designers of distributed systems should be most concerned.

My heuristic these days is that worrying about the CAP theorem is a weak negative signal. (EDIT: This is not a statement about CockroachDB's post, which doubtless is designed to reassure customers who are misinformed on the topic. I'm familiar with that situation, and it makes me feel a deep sympathy for them.)

(Disclosure: I work on a CockroachDB competitor. Also none of this is Google's official position, etc., etc. For that, here's the whitepaper by Eric Brewer that we released along with the Cloud Spanner beta launch https://static.googleusercontent.com/media/research.google.c...).

oillio · 9y ago

I read the paper and I don't understand this passage:

> For example, any database cannot provide availability if all of its replicas are offline, which has nothing to do with partitions. Such a multi-replica outage should be very rare, but if partitions are significantly more rare, then you can effectively ignore partitions as a factor in availability. For Spanner, this means that when there is an availability outage, it is not in practice due to a partition, but rather some other set of multiple faults (as no single fault will forfeit availability).

My understanding was that 'Partition' in CAP was a bit of a misnomer. To a running node, a partition of half the cluster is indistinguishable from half the nodes failing. So, partition tolerance really covers partitions as well as multi-node failures. Brewer wrote the original paper, so I will trust his definitions. However, if P doesn't cover multi-node failures, it seems to weaken the usefulness of CAP considerably. As is mentioned, in my experience, partitions are very rare. Multi-node failures on the other hand are the primary failure case I worry about.

(edit): I have thought about it some more, and this article really annoys me. It reads like marketing material: "CAP doesn't apply to us because we are Google, bitches."

There is an argument there, but I think the way Brewer makes the argument is really weak. I would much rather they say: "We have built a really great CP system. Also, because we are Google we are capable of 99.99958% uptime, so you really don't need to worry too much about tiny edge cases where you will lose A."

joncrocks · 9y ago

From what I understand of CAP (and I'm no expert by any means), 'partition tolerance' is the ability of a system to reconcile itself in the event of the system being split into partitions.

In these types of scenarios, different sections of the system are still working as normal, and each has a different view of the network of available nodes.

In a scenario where a set of homogeneous nodes in a system is split in two, both are equally 'available' and so the system as a whole has to decide what to do in that scenario. If both sides present themselves as available then they will be making decisions based only on interactions with half of the nodes in the system and their view of the system as a whole will start to drift apart.

This is bad because at the point where they get reconnected again they may well realise that the system as a whole is now not internally consistent. If you think about a distributed database, then you can start having conflicting commits and now your DB is FUBAR.

You are right in thinking that from the point of view of a partition that it can't know if the rest of the system is just partitioned away or has crashed and will never come back.

But what that simplifying assumption is saying is that if you can ensure that you are much much more likely to have the nodes go down completely rather than actually be partitioned, then things are easy because you don't have to consider diverging system views and how you might re-integrate them.

Or something like that, I ended up writing more than I was planning!

tracked24x7 · 9y ago

Your comment mostly reads like signaling.

It is not "kind of amazing" considering Eric B. felt required to write a follow up to his CAP paper.

mcguire · 9y ago

I think the definition of availability is due to Brewer. The weird part of the 'theorem' due to Lynch and Gilbert is the definition of consistency: if I recall correctly, it's linearizability, which is somewhat stronger than any guarantee that any DB actually makes.

dragonwriter · 9y ago

> it's linearizability, which is somewhat stronger than any guarantee that any DB actually makes.

Isn't that equivalent to serializability, which is a guarantee that lots of DBs offer (though you can choose weaker ones instead)?

ainar-g · 9y ago

Does anybody have experience with CockroachDB in production? Is it ready to replace PostgreSQL as "the default database"? How does it handle querying and updating big (>10Gb) collections of data?

kevan · 9y ago

It only hit 1.0 six weeks ago [1]. I don't think we'll have a good sample size of prod usage until the end of 2017 at the earliest.

[1] https://www.cockroachlabs.com/blog/cockroachdb-1-0-release/

greggyb · 9y ago

If you're seeing trouble at 10G of data, you've got bigger problems than finding the right distributed database.

YZF · 9y ago

I like to look at multi-core CPUs as examples. While in theory cores can partition from each other or fail in a myriad of ways, the system is engineered such that the probability of these failures is low enough that it doesn't matter. If you lose a core or you lose an interconnect between the cores, you lose the chip. Really you can look at each transistor on a chip (any chip) as a node in a distributed system; as long as the system is engineered not to fail, you don't really think about CAP.

The more interesting trade-off is using consensus algorithms for availability and durability. You can keep going as long as you have a quorum of nodes, but you pay an extra RTT (at least). Having multiple replicas (in either consistent or eventually consistent systems) makes writes and storage linearly more expensive (typically, unless you use some sort of erasure coding).
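
The quorum and storage arithmetic behind that trade-off is small enough to write down. The sketch below assumes simple majority quorums and, for the erasure-coding comparison, a generic scheme with `k` data fragments and `m` parity fragments; the function names are made up for this example and the specific (6, 3) layout is only an illustrative choice.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many replicas can be lost (or partitioned away) while a majority remains."""
    return n - majority(n)


def can_make_progress(n: int, reachable: int) -> bool:
    """A majority-quorum system keeps going only while a majority is reachable."""
    return reachable >= majority(n)


def replication_overhead(n: int) -> float:
    """Full replication stores n complete copies: cost grows linearly with n."""
    return float(n)


def erasure_overhead(k: int, m: int) -> float:
    """k data + m parity fragments store (k + m) / k times the original size."""
    return (k + m) / k


def summary():
    return {
        "3 replicas tolerate": tolerated_failures(3),      # 1 failure
        "5 replicas tolerate": tolerated_failures(5),      # 2 failures
        "4 replicas tolerate": tolerated_failures(4),      # still only 1
        "3x replication storage": replication_overhead(3), # 3.0
        "6+3 erasure storage": erasure_overhead(6, 3),     # 1.5
    }
```

Note that going from 3 to 4 replicas adds cost without adding fault tolerance under majority quorums, which is one reason odd replica counts are the usual choice.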

xfer · 9y ago

In that sense multi-core CPUs/"logic cells" in FPGAs are not really partition tolerant (i.e. CA).

closeparen · 9y ago

This is just choosing CP: "In the event of a partition, it's okay to lose availability, and this is okay because the likelihood of a partition is small."

AP would be to keep trying to run the chip with severed connections between cores.

marknadal · 9y ago

Key quote:

"The only time that a CAP-Available system would be available when a CAP-Consistent one would not is when one of the datacenters can’t talk to the other replicas, but can talk to clients, and the load balancer keeps sending it traffic. By considering the deployment as a whole, high availability can be achieved without the CAP theorem’s requirement of responses from a single partitioned node."

It is true that if you assume your client app is not important that a CP system is the right choice. And I would also say this /was/ true up till about 2004 when Gmail was released. But it definitely stopped being true in 2007 when the iPhone was released and you started having installed apps.

Since then, users have slowly grown to expect both mobile apps and SPAs to work regardless of whether the servers work, regardless of load balancers, regardless of connectivity.

If you look at the market trends, things are increasingly going in this direction. From self-driving cars, to IoT devices, to drone delivery, to even traditionally server-dependent productivity tools like gDocs and others - people need to get work done even if the internet to your server doesn't exist.

Will banking applications still need mostly server-dependent behavior? Yes. Is CP still important? Yes. But it is biased to say that CP systems are better. Choose the right tool for the right job. CockroachDB and RethinkDB are definitely the right choice for a strongly consistent database, but they aren't the right choice for everything. My database is an AP system, but it should not be used for many apps out there. Neither of these is "better"; they are just tradeoffs you have to decide upon.

bdarnell (OP) · 9y ago

That's an important point. With mobile applications that support offline usage, you can no longer assume a single global source of truth, and the application as a whole is AP.

However, I'd argue that this tilts the balance even more in favor of a CP database on the backend. Even when the client application is not executing transactions on the database, consistency at the database level is what makes it possible to support secondary SQL indexes that work without surprises. An offline-capable mobile app buffers writes, moving the write to the server out of the critical path so server-side write-latency is not as visible to the user.

marknadal · 9y ago

Yes, again if you are doing some transactional behavior like two users buying the last concert seat. However, making those apps be offline-first is kinda silly in the first place.

The types of apps that naturally fit with mobile apps, client-facing behavior, are ones that have more append-only data structures (twitter, snapchat, messaging, etc.). Those apps benefit much more from an AP system rather than a CP system, because it makes the end user's (the client) life better/more-available.

Again, the right tool for the job. And CockroachDB is certainly the right choice for the right problem. Well written article, keep it up!

thraxil · 9y ago

Martin Kleppmann's "A Critique of the CAP Theorem" lays this all out very nicely and goes further, providing a better conceptual framework for discussing the tradeoffs: https://arxiv.org/abs/1509.05393

One of the best papers I've come across in the last few years.

Dave_Rosenthal · 9y ago

An older piece from FoundationDB (archived by odbms.org) that talks about the same issues and comes to many of the same conclusions: http://www.odbms.org/wp-content/uploads/2013/11/cap-theorem....

I think the overloaded term "availability" has been a big source of confusion for many trying to understand the implications of the CAP theorem at a simple level.

For example, a simple Paxos implementation is "high availability" (continues working even when individual machines fail) but sacrifices "availability" in the CAP sense.

itcmcgrath · 9y ago

It is refreshing to see an article from a distributed database vendor that gives a reasonably good description of the trade-offs they make and why - without all the nonsense hyperbole claiming they're the best for everything without any trade-offs.*

* I've reviewed ~400 databases over the last month and it's surprising (?) how many of them are all the best at every use case and are the [fastest|first|only|best].
