Neat Algorithms: Paxos (opens in new tab)

(harry.me)

331 pointshbrundage11y ago38 comments

38 comments

28 comments · 10 top-level

ahelwer11y ago· 6 in thread

Software engineers love Paxos because it takes something very complex (a distributed system) and makes it equivalent to working with a single machine: you only ever talk to the leader. It gives you redundancy at the expense of performance.

Paxos is used to achieve something called Strong Consistency, where each node sees the same message in the same order. If you think of each node as a deterministic state machine, they are guaranteed to end up in the same state after responding to the same sequence of messages. It's nice and intuitive, but requiring global synchronization on every write is terrible for performance.

Other consistency schemes exist. A popular one is Eventual Consistency, where writes are made immediately at any node (not just the leader) and the system is expected to synchronize in the background and "converge" to the same state. However, this can result in merge conflicts: if you're editing a document in collaboration with other users, what if you edit a word in a paragraph while another user deletes that entire paragraph? Does the system resolve this automatically, or require user assistance? The answer to this question varies according to system requirements. I think most HN users have experienced the joys of resolving merge conflicts.

A newer model is something called Strong Eventual Consistency, which is similar to Eventual Consistency but merge conflicts are impossible by design: every update to the system must be commutative, associative, and idempotent with other updates. It is not always possible to design your system this way. These systems are implemented with Conflict-Free Replicated Data Types (or ad-hoc equivalents) and have excellent liveness/throughput/performance characteristics compared to Strong Consistency.

CRDTs are not as simple as Paxos. You're forced out of the cozy one-system world and your system must deal with two nodes concurrently holding different values. For most applications, magic Paxos dust is all you need. For others, CRDTs are an excellent tool.

steventhedev11y ago

I strongly suggest Shapiro's paper[0] on CRDTs. In a nutshell, the only really problematic data type are sequences such as arrays or strings. There are some specialized approaches specifically for those, and you can always fall back on a LWW conflict resolution.

In general, I like to think of Paxos as an approach that uses LWW for all value types.

[0] - http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-695...

ahelwer11y ago

Shapiro's name is on about seven different CRDT papers :) it makes citations difficult for the Wikipedia article. Personal opinion, this[0] 2011 paper is probably the best one to read, where Shapiro's and Baquero's teams finally joined forces and put out a good comprehensive paper on the subject. The one you linked focuses a bit too heavily on TreeDoc in lieu of good treatment of CRDT theory. Their survey of known CRDTs[1] is also worth reading.

[0] https://hal.inria.fr/file/index/docid/609399/filename/RR-768...

[1] https://hal.inria.fr/inria-00555588/document

1 more reply

otis_inf11y ago

IMHO 'Eventual consistency' and related 'consistencies' are not really giving a consistent state, as they're more about a promise that the system will reach a consistent state 'eventually', but when that happens is unknown: in a volatile database, the system could never reach a consistent state, due to the lack of linearizability. See: http://hackingdistributed.com/2013/03/23/consistency-alphabe...

ahelwer11y ago

Wow! Great blog post. I've been looking for something along these lines for a while. The author is correct that Wikipedia's coverage of consistency is difficult to follow, and I don't yet have a deep enough understanding to contribute.

nwmcsween11y ago

Can you prove authorship with CRDTs? scuttlebutt[0] moved from a CRDT implementation to a log based structure of signed messages[1].

[0]: https://github.com/dominictarr/scuttlebutt

[1]: https://github.com/ssbc/secure-scuttlebutt

ahelwer11y ago

I'm not sure what you mean by proving authorship, could you explain?

krat0sprakhar11y ago· 4 in thread

This is up for debate but this[0], IMHO, is pretty much the gold standard of explaining distributed algorithms.

[0] - http://thesecretlivesofdata.com/raft/

nacs11y ago

Thanks the presentation, helps a lot.

Quick question though. On this slide ( http://i.imgur.com/m02CMxx.png ) it shows a network split condition and shows how the 2 split networks will eventually negotiate and the 3 node split wins because it had a majority while the 2 node side's uncommitted changes are thrown out.

What happens if the split happens right down the middle (3 active nodes on each side instead of the 2 and 3)? Wouldn't both sides elect leaders that both have majorities with committed data?

1 more reply

DAddYE11y ago

I agree, I would also add that Paxos seems exceptionally hard to get it right or at least this was what I saw so far in popular projects (i.e. Cassandra) that use it.

atmosx11y ago

That's impressive, thanks for sharing.

jackmaney11y ago

Neat. Thanks!

lordnacho11y ago· 3 in thread

Having looked for a few minutes, it really reminds me of the routing protocols used for distributing routes in networks. (Also Layer 2 stuff IIRC). There you also find heartbeats, elections, etc.

Is there a connection?

Also, does it have anything to do with the Byzantine Generals Problem?

ahelwer11y ago

Heartbeats (is this system up?) and leader election (which server should I talk to?) are common components of any distributed system. Byzantine generals is a much different problem. Whereas Paxos gets the majority of nodes (a quorum) to agree on a certain value, Byzantine Generals deals with unanimous agreement in a system with network link failures. Furthermore, in Byzantine Generals, some of the generals can be traitors actively working against the others. Unanimous agreement isn't all that useful or interesting, but production distributed systems should handle Byzantine failures. This covers innocent events such as corrupted data/messages and software bugs all the way up to hackers trying to take over your system through a compromised node. Variants of paxos exist which handle or mitigate Byzantine failures.

neilc11y ago

> production distributed systems should handle Byzantine failures

I don't think this is true, at least not as stated. You should certainly be thinking about more than crash-stop failures, but full-blown Byzantine fault tolerance is rarely warranted in practice. Empirically, the number of systems that use non-Byzantine vs. Byzantine agreement is probably on the order of 100:1.

1 more reply

sinwave11y ago

Well for one, both Paxos and "Byzantine Generals" were authored by Leslie Lamport (latest Turing award winner IIRC).

But Paxos has gentler assumptions about adversaries than in the Byzantine Generals problem.

emin-gun-sirer11y ago· 2 in thread

This well-illustrated post is technically about the core Synod agreement protocol in Paxos. Building a consistent distributed service on top requires additional scaffolding and infrastructure. Typically, people layer on a system that implements a "replicated state machine (RSM)" on top, which maintains the illusion of a single consistent object, even though it is composed of distributed replicas.

Also keep in mind that Raft, Zab, and View-Stamped replication (in reverse chronological order) are alternatives to the Synod protocol in Paxos. These protocols differ from Paxos by employing a different leader-election mechanism and slightly different way of maintaining their invariants.

There have been many Paxos variants. This site [1] shows the various Paxos variants over a timeline and points out their contributions.

Those of you interested in building replicated state machines using Paxos should take a look at OpenReplica [2]. It is a full Multi-Paxos implementation that takes any Python object and makes it distributed and fault-tolerant, like an RPC package on steroids.

[1] http://paxos.systems/

[2] http://openreplica.org/faq/

ahelwer11y ago

It looks like you are one of the developers of OpenReplica?

emin-gun-sirer11y ago

Yes, it's an open-source project from my research group at Cornell.

amelius11y ago· 2 in thread

Also interesting: [1]

> Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

[1] https://raftconsensus.github.io/

SamReidHughes11y ago

However, see http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf about Raft's pitfalls in environments where the network is being difficult.

samirmenon11y ago

This is a great walkthrough of Raft:

http://thesecretlivesofdata.com/raft/

_almosnow11y ago· 1 in thread

> Side note: it’s important that no two proposers ever use the same sequence number, and that they are sortable, so that they truly reference only one proposal, and precedence between proposals can be decided using a simple comparison.

They are moving the core problem into a different domain. Worst explanation of PAXOS ever... nice animations though.

Edit: 'Worst explanation' is just an exaggeration, obv. It is nice, but doesn't explain really important issues.

zerker200011y ago

The requirement is that they are sortable, nothing about the numbers reflecting the actual order of proposals (inasmuch as uncommitted actions in a distributed system can be considered to be ordered). Concatenating system time and node number upholds this property: 4:01-A and 4:01-B produced by nodes A and B respectively are distinct and numerically sortable, and both in turn "predate" 4:13-C attempted later.

ash11y ago

> Honest-to-goodness real-life implementations of Paxos can be found at the heart of … Google’s magnificent Spanner database…

I'm not sure about Spanner and Paxos. Sebastian Kanthak said during his Google Spanner talk:

"If you've been to the Raft talk this morning, our Paxos implementation is actually closer to the Raft algorithm than to what you'd read in the Paxos paper, which is… if you haven't read it, don't read it, it's horrible." (at 7:43)

http://www.infoq.com/presentations/spanner-distributed-googl...

Animats11y ago

Nice animations.

How do you keep a broken or hostile node from advancing the sequence number to the end of the sequence number space?

There's an algorithm from one of Butler Lamson's grad students at MIT which fixes this, but it seems to require one more message per cycle. (http://pmg.csail.mit.edu/~castro/thesis.pdf) That paper later appears as a Microsoft Research paper on how to make an NFS-like file system with this consensus properly. Did Microsoft ever put that in a product?

subbu11y ago

The animations are a bit fast. Would've been great if the reader could control them. Easier to learn.

fitshipit11y ago

> Know Paxos? Stealth-mode big data startup is hiring founding engineer

lol

j / k navigate · click thread line to collapse

38 comments

28 comments · 10 top-level

ahelwer11y ago· 6 in thread

steventhedev11y ago

In general, I like to think of Paxos as an approach that uses LWW for all value types.

[0] - http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-695...

ahelwer11y ago

[0] https://hal.inria.fr/file/index/docid/609399/filename/RR-768...

[1] https://hal.inria.fr/inria-00555588/document

1 more reply

otis_inf11y ago

ahelwer11y ago

nwmcsween11y ago

Can you prove authorship with CRDTs? scuttlebutt[0] moved from a CRDT implementation to a log based structure of signed messages[1].

[0]: https://github.com/dominictarr/scuttlebutt

[1]: https://github.com/ssbc/secure-scuttlebutt

ahelwer11y ago

I'm not sure what you mean by proving authorship, could you explain?

krat0sprakhar11y ago· 4 in thread

This is up for debate but this[0], IMHO, is pretty much the gold standard of explaining distributed algorithms.

[0] - http://thesecretlivesofdata.com/raft/

nacs11y ago

Thanks the presentation, helps a lot.

What happens if the split happens right down the middle (3 active nodes on each side instead of the 2 and 3)? Wouldn't both sides elect leaders that both have majorities with committed data?

1 more reply

DAddYE11y ago

I agree, I would also add that Paxos seems exceptionally hard to get it right or at least this was what I saw so far in popular projects (i.e. Cassandra) that use it.

atmosx11y ago

That's impressive, thanks for sharing.

jackmaney11y ago

Neat. Thanks!

lordnacho11y ago· 3 in thread

Having looked for a few minutes, it really reminds me of the routing protocols used for distributing routes in networks. (Also Layer 2 stuff IIRC). There you also find heartbeats, elections, etc.

Is there a connection?

Also, does it have anything to do with the Byzantine Generals Problem?

ahelwer11y ago

neilc11y ago

> production distributed systems should handle Byzantine failures

1 more reply

sinwave11y ago

Well for one, both Paxos and "Byzantine Generals" were authored by Leslie Lamport (latest Turing award winner IIRC).

But Paxos has gentler assumptions about adversaries than in the Byzantine Generals problem.

emin-gun-sirer11y ago· 2 in thread

There have been many Paxos variants. This site [1] shows the various Paxos variants over a timeline and points out their contributions.

[1] http://paxos.systems/

[2] http://openreplica.org/faq/

ahelwer11y ago

It looks like you are one of the developers of OpenReplica?

emin-gun-sirer11y ago

Yes, it's an open-source project from my research group at Cornell.

amelius11y ago· 2 in thread

Also interesting: [1]

[1] https://raftconsensus.github.io/

SamReidHughes11y ago

However, see http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf about Raft's pitfalls in environments where the network is being difficult.

samirmenon11y ago

This is a great walkthrough of Raft:

http://thesecretlivesofdata.com/raft/

_almosnow11y ago· 1 in thread

They are moving the core problem into a different domain. Worst explanation of PAXOS ever... nice animations though.

Edit: 'Worst explanation' is just an exaggeration, obv. It is nice, but doesn't explain really important issues.

zerker200011y ago

ash11y ago

> Honest-to-goodness real-life implementations of Paxos can be found at the heart of … Google’s magnificent Spanner database…

I'm not sure about Spanner and Paxos. Sebastian Kanthak said during his Google Spanner talk:

http://www.infoq.com/presentations/spanner-distributed-googl...

Animats11y ago

Nice animations.

How do you keep a broken or hostile node from advancing the sequence number to the end of the sequence number space?

subbu11y ago

The animations are a bit fast. Would've been great if the reader could control them. Easier to learn.

fitshipit11y ago

> Know Paxos? Stealth-mode big data startup is hiring founding engineer

lol

j / k navigate · click thread line to collapse