Prove Raft Correct

(github.com)

173 points · jtgi · 10y ago · 27 comments

quibit10y ago

What exactly is a consensus algorithm and how do you prove linearizability?

macintux10y ago

Kyle Kingsbury has a good look at linearizability and other forms of consistency: https://aphyr.com/posts/313-strong-consistency-models

r0naa10y ago

Kyle Kingsbury is the man, and his blog is a must-read for anyone interested in or curious about distributed systems.

quibit10y ago

That was a great read. Thanks for a new blog to put on my RSS feed

r0naa10y ago

Distributed consensus is about having a group of processes agree on a single data value. For example, imagine you have a cluster of servers. Each node has a replicated database.

(1) You want all your nodes to have the exact same replica of the database, i.e., consistency across your cluster.

You would need to reach consensus before any node actually adds anything to its local database to make sure that property (1) is fulfilled.

Linearizability is just a consistency model, i.e., a variant of property (1) with stronger/weaker constraints.
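
To make the "agree before you apply" idea concrete, here is a minimal sketch. It is not Raft: there is no leader election, no terms, and no log repair. The `Node` class and `replicate` function are hypothetical names invented for this illustration; the only rule shown is that a value is applied to local databases once a strict majority of the cluster has acknowledged it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One replica holding a local copy of the database (toy model)."""
    name: str
    up: bool = True
    db: list = field(default_factory=list)

    def accept(self, value) -> bool:
        # A crashed node cannot acknowledge a proposal.
        return self.up


def replicate(nodes, value) -> bool:
    """Apply `value` only if a strict majority of ALL nodes acknowledges it."""
    acks = [n for n in nodes if n.accept(value)]
    if len(acks) * 2 <= len(nodes):
        return False  # no majority: nobody applies the value
    for n in acks:
        n.db.append(value)
    return True


cluster = [Node("a"), Node("b"), Node("c")]
committed = replicate(cluster, "x=1")   # all three up: majority reached
cluster[1].up = False
cluster[2].up = False
rejected = replicate(cluster, "x=2")    # only one node up: no majority
```

In this run the first value reaches every replica and the second is refused, so no node ever holds a value that the majority did not agree on. A real protocol also has to bring lagging nodes back up to date, which this sketch leaves out.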

quibit10y ago

Oh wow, that's actually really powerful stuff! Thanks for the explanation.

r0naa10y ago

Amazing job! It had been mentioned on the Raft mailing list that a proof was in progress, but to be honest I did not expect anything to come up before a year or so. Forgive my lack of faith! :-)

Well done!

edit: to provide some context

- Raft is a distributed consensus algorithm that is seen by many as a viable alternative to *-Paxos because of its relative simplicity. It was created by D. Ongaro and J. Ousterhout (Tcl/Tk!)

- D. Ongaro's dissertation includes a TLA+ specification of Raft.

- TLA+ is a model checker; Coq is a proof assistant. See https://stackoverflow.com/questions/22418448/can-coq-be-used... for a more detailed explanation.

- Verdi is a Coq framework for writing formal proofs about distributed systems.

- Doug Woos and James Wilcox made a proof of Raft using Verdi; Verdi helps you figure out whether your implementation of X (in this case X=Raft) meets the specifications (Leader Safety, strong consistency, etc.)

Link to the Verdi website: http://verdi.uwplse.org/

edit: as noted by ahelwer, TLA+ is not a model checker but a language to describe a distributed system's specifications. I was referring to TLC which can work with TLA+.

dougws10y ago

Hey, I'm Doug Woos--thanks for this excellent summary! It's worth noting that the Raft proof was completed by a team of people, including me, my research partner James Wilcox (http://homes.cs.washington.edu/~jrw12/), and the other folks listed on our web page at http://verdi.uwplse.org/

r0naa10y ago

Apologies to your research partner for leaving him out; it's edited! Congratulations on the proof, by the way. I am looking forward to the weekend so I can take some time to appreciate it in more depth! :-)

ahelwer10y ago

TLA+ is a formal specification language, not a model checker. There exists a model checker called TLC which works with TLA+. There also exists an associated proof system called TLAPS; TLAPS has been used to formally prove correctness of Byzantine Paxos.
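
For readers unsure what a model checker does with a specification, the sketch below shows roughly the kind of exhaustive search an explicit-state checker performs: enumerate every reachable state and test an invariant in each one. It is written in Python rather than TLA+, and the two-process mutual exclusion "spec" and all function names are hypothetical, chosen only for illustration. It makes no claim about how TLC is implemented.

```python
from collections import deque


def check_invariant(initial, next_states, invariant):
    """Breadth-first search over all reachable states.

    Returns None if the invariant holds everywhere, otherwise a
    counterexample trace (list of states) ending in a violating state.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for succ in next_states(state):
            if succ not in parent:
                parent[succ] = state
                queue.append(succ)
    return None


def unsafe_next(state):
    # Either process may enter or leave the critical section at any time.
    for i in (0, 1):
        flipped = "critical" if state[i] == "idle" else "idle"
        yield tuple(flipped if j == i else state[j] for j in (0, 1))


def safe_next(state):
    # A process may enter only while the other one is idle.
    for i in (0, 1):
        if state[i] == "critical":
            yield tuple("idle" if j == i else state[j] for j in (0, 1))
        elif state[1 - i] == "idle":
            yield tuple("critical" if j == i else state[j] for j in (0, 1))


def mutual_exclusion(state):
    return state != ("critical", "critical")


init = ("idle", "idle")
bad_trace = check_invariant(init, unsafe_next, mutual_exclusion)
good_result = check_invariant(init, safe_next, mutual_exclusion)
```

The unsafe transition relation yields a three-state counterexample trace, while the safe one yields `None`. Note the limitation that motivates proof systems such as TLAPS or Coq: this search only covers the finite state space it is given, whereas a proof covers every instance.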

r0naa10y ago

Right, I have edited my original comment.

Confusion10y ago

What isn't yet clear to me about these kinds of proofs: how can you verify that this proof is correct? How can you know there isn't a wrong assumption or wrong bit of 'code' in those 5500 lines? Would it have been impossible for Voevodsky to make his original mistake in such a formalization? Obviously the point of his entire homotopy type theory project is that he thinks he couldn't have. But why are mistakes (near) impossible here?

wilcoxjay10y ago

This is a great question that permeates all verification work. In the lingo, we call the set of assumptions a proof makes the "trusted computing base" (see https://en.wikipedia.org/wiki/Trusted_computing_base ).

I listed some of the assumptions we made in another comment https://news.ycombinator.com/item?id=10018985 .

The question you ask was believed for a long time to be the death knell of formal verification. It was persuasively argued in "Social Processes and Proofs of Theorems and Programs" https://www.cs.umd.edu/~gasarch/BLOGPAPERS/social.pdf that any proof of a complex system is necessarily at least as complex. Thus there would be no reason to trust the proof any more than the original code.

The breakthrough in verification came when we started using machine-checked proofs. In this approach, you write a short, simple, and trusted proof checker. You can then verify complex systems using complex proofs that are checked by the simple checker. Then the only possible errors in the proof can come from the proof checker being wrong. Because the checker is simple and general (ie, it can check all kinds of proofs, not just proofs about a particular program), it is more trustworthy.

Just to reiterate (including some content from my other comment): we can be wrong only if we have written the wrong definition of linearizability, misstated the correctness theorem, or if there is a bug in the proof checker or in Coq's logic itself.
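
As a rough analogy for the "small checker, large proof" argument, here is a sketch of a proof checker for a toy logic whose only inference rule is modus ponens. The formula encoding and the `check_proof` function are assumptions made for this example; Coq's kernel checks terms of a far richer type theory and works nothing like this in detail. The point is only that the checker stays the same size no matter how long the proof it checks becomes.

```python
def implies(a, b):
    """Build the formula `a -> b` (hypothetical encoding: a tagged tuple)."""
    return ("->", a, b)


def check_proof(premises, steps, goal):
    """Return True iff every step is justified and the last line equals `goal`.

    Each step is either ("premise", formula) or ("mp", i, j), where line i
    must be some formula A and line j must be the implication A -> B.
    """
    lines = []
    for step in steps:
        if step[0] == "premise":
            if step[1] not in premises:
                return False
            lines.append(step[1])
        elif step[0] == "mp":
            _, i, j = step
            if not (0 <= i < len(lines) and 0 <= j < len(lines)):
                return False
            a, imp = lines[i], lines[j]
            is_implication = isinstance(imp, tuple) and len(imp) == 3 and imp[0] == "->"
            if not (is_implication and imp[1] == a):
                return False
            lines.append(imp[2])  # conclude B
        else:
            return False  # unknown rule: reject
    return bool(lines) and lines[-1] == goal


premises = ["p", implies("p", "q"), implies("q", "r")]

valid_proof = [
    ("premise", "p"),
    ("premise", implies("p", "q")),
    ("mp", 0, 1),                    # q
    ("premise", implies("q", "r")),
    ("mp", 2, 3),                    # r
]
bogus_proof = [
    ("premise", "p"),
    ("premise", implies("q", "r")),
    ("mp", 0, 1),                    # invalid: p does not match q
]

valid_ok = check_proof(premises, valid_proof, "r")
bogus_ok = check_proof(premises, bogus_proof, "r")
```

The checker accepts the first proof and rejects the second. Whoever wrote the proof, and however they found it, does not need to be trusted; only these few lines do, plus the statement of the goal.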

pavpanchekha10y ago

To add on to wilcoxjay's excellent response, this sort of proof development is exactly what Voevodsky would like to (eventually) see happen in mathematics. The linked proof of Raft uses Coq, a proof assistant whose small proof checker makes sure that no mistakes were made in the proof. Simpler, "paper" proofs had already been done for Raft.

ongardie10y ago

I included the paper proof for Raft in my PhD dissertation, but there's a good chance it contains errors. Here's a quote from the intro: "The proof shows that the specification preserves the State Machine Safety property. The main idea of the proof is summarized in Section 3.6.3, but the detailed proof is much more precise. We found the proof useful in understanding Raft’s safety at a deeper level, and others may find value in this as well. However, the proof is fairly long and difficult for humans to verify and maintain; we believe it to be basically correct, but it might include errors or omissions. At this scale, only a machine-checked proof could definitively be error-free." From Appendix B of my dissertation: https://github.com/ongardie/dissertation/#readme

This Coq proof is an enormous step forward. It's that machine-checked proof I was hoping someone would do when I wrote the above paragraph. 1) Assuming the TCB is correct, we know the Verdi implementation satisfies linearizability. 2) Assuming the TCB is correct and the Verdi implementation and the Raft paper/dissertation/spec are equivalent, then the Raft algorithm satisfies linearizability too.

I'm more willing to believe that the Verdi implementation implements Raft faithfully than I am willing to believe that my hand proof is correct. If you're really worried about (2), you have the option of using the Verdi generated implementation. If you're less worried about (2), you can now have more confidence in the safety of the Raft algorithm in general.

mjb10y ago

Extremely cool stuff.

Questions for the researchers, if they are reading:

- Does this proof cover any liveness concerns (weak fairness, deadlock freedom, etc) in addition to the safety property of linearizability?

- If not, what would it take to extend this model to cover liveness? Is this even a good starting point?

dougws10y ago

We are reading! This is a great question. We currently don't prove anything about liveness. We'd love to work on this.

As you probably know, Raft and other consensus algorithms are not guaranteed to be live in all situations. But subject to some assumptions about the frequency of failure, they are guaranteed to make progress.

In Verdi, systems are verified with respect to semantics for the network they are running on. Our semantics currently don't include any restrictions about how often failures can happen; a failure "step" can occur at any time. We're not sure what the best way is to introduce this kind of restriction, but we've got a couple ideas. One would be to guarantee that the total number of failures has some finite bound which is unknown to the system itself but which is available during verification. Another would be to model failure probabilistically. We will probably end up doing at least one of these things in the next year or so :).
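
The first idea (a finite failure bound that the system cannot see) can be sketched as a toy simulation. Everything here is hypothetical: the `Network` class, the retry loop, and the coin flip standing in for an adversary are illustrations, not Verdi's network semantics. The bound lives inside the network model, the system code never reads it, and the progress argument ("delivery takes at most bound + 1 attempts") is made from the outside.

```python
import random


class Network:
    """Toy network: a failure step may happen at any time, but at most
    `failure_bound` times in total. The bound is private to the network;
    the system under test never reads it."""

    def __init__(self, failure_bound, seed=0):
        self._remaining = failure_bound
        self._rng = random.Random(seed)

    def send(self, msg):
        # While budget remains, the environment may choose to drop the message.
        if self._remaining > 0 and self._rng.random() < 0.5:
            self._remaining -= 1
            return None  # message dropped
        return msg


def deliver_with_retry(net, msg):
    """System code: retry until delivered; returns the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        if net.send(msg) is not None:
            return attempts


# "Verification side": for every bound and seed tried, progress is made
# within bound + 1 attempts, because each failure consumes one unit of budget.
results = [
    (bound, deliver_with_retry(Network(bound, seed), "hello"))
    for bound in range(5)
    for seed in range(20)
]
```

Removing the budget check gives a version of the second idea: each attempt then fails independently with probability 0.5, and delivery is no longer bounded, only guaranteed with probability 1.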

mjb10y ago

Thanks. That sounds like extremely interesting research. Being able to say things about liveness relative to probability of failure (or the distribution of probability of failure) would be very interesting.

nernst10y ago

This is interesting work. I'm curious how confident we can be that the TLA+ proof from Diego Ongaro was correctly represented in Verdi/Coq. This still seems like a manual, hard-to-verify process.

wilcoxjay10y ago

Hey, I'm James Wilcox, another member of the Verdi team.

This is a good question. More broadly, what do you have to trust in order to believe our claims?

In addition to all the usual things (like the soundness of Coq's logic and the correctness of its proof checker), the most important thing you need to trust is the top-level specification. In our case, this is linearizability (have a look at https://github.com/uwplse/verdi/blob/master/raft/Linearizabi... for the key definition; the key theorem statement is at https://github.com/uwplse/verdi/blob/master/raft-proofs/EndT... ). If you can convince yourself that these correspond to your intuitive understanding of linearizability, then you don't need to trust any other theorem statement or definition in the development.
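
For readers who want an intuition to compare against, here is a brute-force sketch of the textbook notion of linearizability for a single read/write register. This is not the Verdi definition linked above; the `Op` type and the function names are assumptions made for this example, and the search is exponential in the history length. A history is accepted if some total order of its operations both respects real time (an operation that finished before another started comes first) and is a legal sequential run of a register.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    start: int             # invocation time
    end: int               # response time
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned


def respects_real_time(order):
    # If b finished before a started, b must not be ordered after a.
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b.end < a.start:
                return False
    return True


def legal_register_run(order, initial=None):
    # Sequential register semantics: a read returns the latest written value.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history):
    return any(
        respects_real_time(order) and legal_register_run(order)
        for order in permutations(history)
    )


# A read that overlaps a write may observe the new value.
overlapping = [Op(0, 10, "write", 1), Op(2, 4, "read", 1)]

# A read that starts after write(2) completed must not return the old value 1.
stale_read = [Op(0, 1, "write", 1), Op(2, 3, "write", 2), Op(4, 5, "read", 1)]

overlapping_ok = is_linearizable(overlapping)
stale_ok = is_linearizable(stale_read)
```

The first history is linearizable and the second is not. Checking a finite history this way is a test; the theorem in the Coq development is a statement about every execution of the system.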

If you actually want to run our code, then you need to make several other assumptions. Our system runs via extraction to OCaml, so you must trust the extractor and the OCaml compiler and runtime. In addition we have a few hundred lines of OCaml to hook up the Coq code to the real world (eg, writing to disk and putting bits on the network).

To respond more directly to your question about Diego's proof, I can tell you we referred to it often to get the high-level idea. But the TLA model differs in several respects from our implementation of Raft in Verdi. Most importantly, our code is an implementation in the sense that you can run it. This means that it resolves all nondeterminism in the specification. Furthermore, there is no need to manually check that what we implemented matches the TLA model, unless that is your preferred means for convincing yourself that we really did implement Raft.
