Why are consensus algorithms always developed as systems rather than libraries? ZooKeeper, etcd, and LogCabin all run as a cluster of processes that other nodes connect to through a client library.
I can imagine the distributed state-machine replication mechanism of Raft or ZAB being implemented as a library where the user provides an implementation of the communication layer. Such a library could serve as a starting point for building other, more complex systems that aim for a low-friction install experience. For example, one good thing about both Cassandra and ElasticSearch is that they have homogeneous clusters where all nodes play the same role. Incidentally, from what I understand, both embed a consensus implementation internally.
Similarly, a membership service (and failure detector) built on gossip protocols would also be very useful.
An installation guide that starts with "First, install a ZooKeeper cluster. Then, install a SWARM cluster. Then ..." is not very appealing. That being the case, I wonder why there is no mature OSS library that provides these services. What does HN think about this situation?
Part of the issue is that we want to have libraries with small simple interfaces, and a consensus library is probably going to be on the large side, as it has to interface with the disk, the network, and the state machine, plus membership changes, log compaction, who's the leader, configuration settings, debug logging, etc.
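To put rough shape on that, here is a hypothetical Go sketch of just two of those interfaces plus a config struct (all names are invented for illustration, not taken from etcd, LogCabin, or any real library), and it still omits the transport, membership changes, leader notifications, and logging:

```go
package main

import "fmt"

// Entry is one record in the replicated log.
type Entry struct {
	Index, Term uint64
	Data        []byte
}

// Storage is the disk-facing interface. A real one must also
// cover snapshots (log compaction) and the persisted term/vote
// needed for safe elections.
type Storage interface {
	Append(entries []Entry) error
	Entries(lo, hi uint64) ([]Entry, error)
}

// StateMachine is the application-facing interface: committed
// entries get applied here.
type StateMachine interface {
	Apply(e Entry) error
}

// Config hints at the knob surface: timeouts, batch sizes, etc.
type Config struct {
	ElectionTimeoutTicks int
	HeartbeatTicks       int
	MaxEntriesPerAppend  int
}

// memStorage is a toy in-memory Storage for illustration only
// (a real implementation must fsync).
type memStorage struct{ log []Entry }

func (s *memStorage) Append(entries []Entry) error {
	s.log = append(s.log, entries...)
	return nil
}

func (s *memStorage) Entries(lo, hi uint64) ([]Entry, error) {
	if lo > hi || hi > uint64(len(s.log)) {
		return nil, fmt.Errorf("range [%d, %d) out of bounds", lo, hi)
	}
	return s.log[lo:hi], nil
}

func main() {
	var s Storage = &memStorage{}
	s.Append([]Entry{{Index: 1, Term: 1, Data: []byte("x")}})
	got, _ := s.Entries(0, 1)
	fmt.Println(len(got), string(got[0].Data)) // prints "1 x"
}
```

Even this stripped-down surface already forces decisions (indexing scheme, error semantics, ownership of buffers) that a standalone service gets to hide behind its wire protocol.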
Another issue is that, as a library, it has to be in the language you're using. And if that language is C++ or similar, it has to be compatible/convenient with whatever threading approach you're using.
Then there's performance/memory. Some Raft implementations are designed to keep relatively small amounts of data in memory (like LogCabin or etcd), and others are meant to store large amounts on disk (like HydraBase). Some are optimized more for performance, others for safety.
I think we'll get libraries eventually. Keep in mind Raft is still very young. Paxos is around 26 years old, Raft is only .5 to 3 years old (depending on when you start counting). I like to think that Raft lowered the cost of developing a consensus system/library significantly, but it still takes time to develop mature implementations. Right now we have a lot of implementations in early stages; I wonder if some of these efforts will consolidate over time into really flexible libraries.
I work on CockroachDB, which uses etcd's Raft implementation, so we are an existence proof that consensus libraries are possible. That said, the library we share was the authors' (Xiang Li and Yicheng Qin) second attempt at a Raft implementation. A first attempt at a "library form" of a consensus algorithm is likely to hit a dead end, but with iteration a library-style implementation is achievable.
Distributing something as a service pushes you toward the lowest-common-denominator solution to this problem, which is "OS process". Everyone can talk to an OS process.
(This is an explanation, not advocacy or celebration.)
Writing it as a service lets you implement it once in a high-level language like Java or Go and use it from all languages. To get the same portability in a library you'd have to write it in C. It's hard enough to maintain consistent data without also worrying about memory corruption bugs.
The protocol implementation is only part of durable consensus: you also need durable storage, sensible network timeouts, shutdown handling, etc.
You usually want different configurations for your application and your quorum. For the quorum you usually want 3, 5 or 7 members, while your application could be anywhere from one to hundreds of instances. Your quorum members must always know how many other members are supposed to be in the quorum, while your application could remove or add instances on the fly. For the quorum you need low latency, while you may want to optimise your application for throughput. (e.g. disks, garbage collection, swapping)
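The 3/5/7 guidance above comes straight from majority arithmetic: a quorum of n members needs ⌊n/2⌋+1 votes and therefore tolerates ⌊(n-1)/2⌋ failures, so even cluster sizes add cost without adding fault tolerance:

```go
package main

import "fmt"

// majority returns the votes needed for a quorum of n members.
func majority(n int) int { return n/2 + 1 }

// tolerated returns how many member failures a quorum of n survives.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("n=%d majority=%d tolerates=%d\n", n, majority(n), tolerated(n))
	}
	// n=4 tolerates only 1 failure, the same as n=3 -- the extra
	// member adds latency and cost without adding fault tolerance.
}
```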
Databases can be hosted on the framework. DocumentDB is, for example. While at MS, I wrote a near-real-time metrics system using Fabric and worked on a tunable-consistency distributed cache built on Fabric.
Summary of Service Fabric here: http://daprlabs.com/blog/blog/2015/04/30/service-fabric-2/
[0] https://github.com/logcabin/logcabin/tree/master/Event
Did you consider libuv?
1) Ah, cool, creator of Raft algo, so some of the 'obvious' mistakes in an implementation should've been resolved by now (though if ppl weren't trying to use it in production.... who knows).
2) Great, C++, it should be efficient and fast with consistent RAM usage (Go's GC is a bit.... eh... still).
3) Oh, you need a C++ client library. :(
I would love to say that APIs don't matter, but they do. So, so, so much. If they didn't, etcd would never have had a chance against ZooKeeper. The ZooKeeper folks are looking at adding RESTful APIs to allow etcd-style functionality, because it's obvious a convenient API is a huge win. Any distributed system solution attempting to gain steam should consider this from the beginning now, as the bar has been set.
It's an ugly cousin of premature optimization that you have to deal with if you're producing APIs for your software. If the API requires anything that isn't braindead simple, you are going to lose out to competitors due to the learning curve.
I have a question for you, though. Why is Raft not concerned with Byzantine failure? The focus on Byzantine fault tolerance in the Paxos family of algorithms (and in a lot of the literature/educational material on distributed consensus) makes me feel like it's important, but your approach suggests it perhaps isn't. Do you think this focus is a side effect of the ubiquity of Paxos, which is disproportionately concerned with it due to its roots in academia?
Byzantine consensus is more complex, and most people in industry aren't doing it: there are a lot of Byzantine papers out there but few real-world implementations. I think it's important where the nodes really can't be trusted for security reasons, but when the entire system is within one trust domain, such as a single company, there are probably easier fault-tolerance payoffs elsewhere.
Byzantine consensus is slower and requires more servers.
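Concretely: to tolerate f faulty servers, crash-fault-tolerant protocols like Raft need 2f+1 replicas, while Byzantine protocols like PBFT need 3f+1 (plus extra message rounds). A sketch of the arithmetic, not of any particular implementation:

```go
package main

import "fmt"

// Server counts needed to tolerate f faulty members:
// crash faults (Raft, Multi-Paxos) need a majority out of 2f+1;
// Byzantine faults (e.g. PBFT) need 3f+1 -- hence "slower and
// requires more servers".
func crashFaultServers(f int) int     { return 2*f + 1 }
func byzantineFaultServers(f int) int { return 3*f + 1 }

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("tolerate f=%d: crash=%d byzantine=%d\n",
			f, crashFaultServers(f), byzantineFaultServers(f))
	}
}
```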
If you don't have independent implementations running on each of your servers, the same software bug could still take out your entire cluster. You get some benefit if the hardware fails independently, but you don't get protection from correlated software outages. Maybe the difficulty in characterizing which faults a particular deployment can handle makes it harder to sell to management.
With Raft, we were just trying to solve non-Byzantine consensus in a way people could understand, and we think it's still a useful thing to study even if your ultimate goal is Byzantine consensus. You might be interested in Tangaroa, from the CS244b class at Stanford, where Christopher Copeland and Hongxia Zhong did some work towards a Byzantine version of Raft [1][2] and Heidi Howard's blog post on it [3]. But really, Castro and Liskov's PBFT is a must read here [4].
[1] http://www.scs.stanford.edu/14au-cs244b/labs/projects/copela...
[2] https://github.com/chrisnc/tangaroa
[3] http://hh360.user.srcf.net/blog/2015/04/conservative-electio...
Why do people need something like ZooKeeper or LogCabin?
How does a coordinator come into play?
I don't know much about distributed systems, but I would love to learn more...
Another way to think about it is that consensus gets you the equivalent of a compare-and-swap operation in a distributed setting. Just as compare-and-swap is useful for building synchronization primitives with shared memory, consensus is useful for building synchronization primitives across a network.
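To make the shared-memory half of that analogy concrete, here is compare-and-swap building a toy lock in Go. A consensus service plays the same role across machines, e.g. an atomic "create this key if it doesn't exist" for leader election (an illustrative sketch only):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// A spinlock built on compare-and-swap: the same primitive a
// consensus service gives you across a network.
type spinlock struct{ held int32 }

func (l *spinlock) Lock() {
	// Atomically flip held from 0 to 1; retry until we win.
	for !atomic.CompareAndSwapInt32(&l.held, 0, 1) {
	}
}

func (l *spinlock) Unlock() { atomic.StoreInt32(&l.held, 0) }

func main() {
	var (
		lock    spinlock
		counter int
		wg      sync.WaitGroup
	)
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				lock.Lock()
				counter++
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // prints 4000
}
```

The distributed analogue replaces the atomic instruction with a consensus round, which is why a linearizable key-value store is enough to build locks, leases, and leader election on top.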
In the case of ZooKeeper you can use various recipes, like the ones here: http://curator.apache.org/curator-recipes/index.html
You can even build more advanced systems on top of it:
https://bookkeeper.apache.org/index.html
https://cwiki.apache.org/confluence/display/BOOKKEEPER/HedWi...
I come from an academic lineage of log-based projects, from log-structured filesystems [1] which structure disks as a log, to RAMCloud [2][3][4] whose durability/recovery aspects are a distributed and partially in-memory extension of that, to Raft and LogCabin that are built around the concept of a replicated log for consensus.
LogCabin used to export a log-oriented data model, by the way, where the name even made a bit more sense. There was some talk of renaming it to TreeHouse now that it exports a key-value tree, but that one didn't really catch on.
[1] https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf
[2] http://ramcloud.stanford.edu
[3] https://www.usenix.org/conference/fast14/technical-sessions/...
[4] https://web.stanford.edu/~ouster/cgi-bin/papers/RumblePhd.pd...
https://github.com/logcabin/logcabin/blob/master/Event/Loop....
An event can be a read or a write, and each fd:event_type pair should be able to map to a different Event::file.
In its current form, if you receive 2000 events, you'll context switch 2000 times unnecessarily.
I do think it's an interesting (dare I say) limitation of epoll, and something I'd think about if I were redesigning epoll.
Please file an issue on github if you have ideas on how to improve this without adding significant complexity.
Really though, it's probably not a feature list that'll make people want to use LogCabin. In Scale Computing's case, it's that they have a C++ code base that they could integrate LogCabin well with, and they know they can work with and maintain the code if they need to.
RAMCloud used to depend on an earlier version of LogCabin (before the data model was a key-value tree), then John rewrote the RAMCloud end of that and switched it to using ZooKeeper, but he made it pluggable on the RAMCloud side. So we just need an "ExternalStorage" implementation for LogCabin in RAMCloud to make that work again. It should be relatively straightforward and I'm confident they'd welcome a well-written and tested patch, but no one has volunteered yet.
LogCabin does currently keep all data in memory, so it never touches disk for read requests. It also keeps a copy of the entire log in memory for now, but I hope to address that eventually [2].