So there's no cost to open sourcing. The benefit comes from being known as technically innovative in general, and for recruiting, being known as having interesting, meaty, challenging projects to work on.
The impetus usually comes from team members who want to do the work. It could be to become known for having worked on the project, or a sense of giving back to the community, or a hope that you'll get bug fixes & features from outside contributors. In my (very limited) experience, managers "passively encourage" it -- they generally don't push the team to do it, but when the team asks, they encourage it.
The cost of maintaining an open source project is real, but when it is a world-class piece of infrastructure, open sourcing it helps keep it world-class.
I imagine you looked at other solutions before starting this. A distributed log is a fairly simple idea to understand (hard to implement) but what pain point is being solved?
Seeing that it is written in C/C++ - would it be that logdevice is optimised purely for speed and responsiveness?
It seems very similar.
In terms of function. LogDevice is similar to the core of Apache Kafka.
In Kafka bulk reading is very cheap, the broker basically just calls sendfile() to send a file segment with compressed message chunks. On the other hand only the leader of a partition can serve requests, so you are often limited by bandwidth. It looks like LogDevice has to do a bit more work server side, but may be able to read from all servers with a replica.
Kafka stores more metadata in the record wrapper, like client and server timestamps and partition key.
There are client libraries for C++ and Python.
Operationally they look similar - both require a Zookeeper cluster, and both require assigning permanent ids to nodes.
It would be interesting to see some benchmarks comparing LogDevice with Kafka and Pulsar. That said, I suspect from the lack of buzz around Pulsar that Kafka isn't a performance bottleneck for most people using it.
The announcement does not clarify the reason they use this over kafka. Is it because Kafka doesn't scale to millions of logs on a single cluster or is it because kafka is not sympathetic to heterogeneous disk arrays containing SSD and HDD. I strongly suspect it may be latency of writes at scale but this is pure speculation.
I don't know. If I understand why anyone might use this I'd contribute to building language bindings for the APIs.
- It's designed to work with a large number of logs (roughly equivalent to partitions in Kafka), hundreds of thousands per cluster is common.
- Sequencer failover is very quick, typical failover time when a sequencer node fails is less than a second.
- It supports location awareness and can place data according to replication constraints specified (e.g. replicate it in 3 copies across 2 different regions and 3 racks).
- Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.
- If a node/shard fails, it detects the failure and rebuilds the data that was replicated to failed nodes/shards automatically
I am happy to expand more on this point.
We have this concept of "node set" of a log which is the set of storage nodes available to receive record copies sent by the sequencer. It is typically made of 20-30 nodes in typical deployments at Facebook. Write availability is maintained as long as enough storage nodes in the node set are available to accept copies. When storage node failures are detected, the sequencer can just exclude these nodes from the list of potential recipients for new records. It does not need to update a view that needs to be synchronized with readers, which is a heavy-weight operation. This model allows preserving high write availability even if many nodes in the node set are unhealthy.
Additionally, this record copy placement flexibility allows the sequencer to quickly route around latency spikes on individual storage nodes, which helps guarantee low append latency.
I doubt that's it, since Kafka can certainly do that.
Try benchmarking Kafka from 0 partitions to a few thousand partitions in 100 partition increments. The benchmark only needs to write to a single topic, using their provided producer perf tool while all other topics are inactive with zero data.
As the partitions increase there is a very noticeable drop in throughout that looks to be linear.
Kafka does not handle a large number of partitions well currently, large even being low thousands. It's easy to hit with just a few hundred topics.
Reading between the lines ehen Linkdin and Netflix advertise several clusters, i am predicting/guessing they shard the data.
Kafka has done well so far, especially in making streaming systems more common, but it's about time for the next-gen systems.
Meanwhile the compute layer becomes very lightweight and almost stateless, which is easy to scale. In LogDevice, the Sequencers are potential bottlenecks but generating a series of incrementing numbers is about the fastest thing you can do so it'll outpace any actual data ingest to a single log, while giving you a total order of all entries within that log. The numbers (LSNs) follow the Hi/Lo sequence pattern so if a Sequencer fails, another one takes its place with a greater "High" number, so it's guaranteed that all of its LSNs will be greater than the previous Sequencer as a result. This also provides a built-in buffer to still accept messages and assign the permanent LSNs to them after recovery in case a Sequencer fails.
Apache Pulsar is similar to LogDevice but goes further where brokers manage connections, routing and message acknowledgements while data is sent to a separate layer of Apache Bookkeeper nodes which store the data in append-optimized log files.
I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.
So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.
Scribe isn’t the only place where LogDevice is used though — Facebook has documented using it for TAO as well (as part of the secondary indices)
I think it's fairly simple and might be enough. Can't comment on storage requirements thou.
Do people just point their system journal at Kafka and wait for something to break?
At my previous job we built something similar to this out of rabbitmq and mongodb. I always wondered what the other big log companies used. Mongodb seemed like a pretty good fit, but a pure append only database might be even better. Trimming performance in MongoDB was subpar so we worked around it by creating a new collection for each day, trimming became a simple operation of dropping a collection at the end of each day.
Kafka can be used as a data store if you like, so long as you're happy with the data management and access patterns it gives you - it is, after all, optimised for large sequential reads.
LogDevice looks to be very similar for most use cases to Kafka, hell, they even use RocksDB, which is used by stateful operations in Kafka Streaming, and of course, Zookeeper.
Where it differs is that it looks like it was designed for you to be able to work against a single "cluster" that could well be running across multiple data-centres. Which is very much a Facebook problem to solve.
So yeah, Kafka was a distributed log built for LinkedIn size problems, LogDevice is a distributed log built for Facebook sized problems.
Most of us don't have Facebook sized problems.
MongoDB is a full OLTP document store so it won't match the write throughput and pubsub features of these focused systems. RabbitMQ on the other hand has performance limits but is meant for complex service-bus style routing and RPC uses, but I recommend using NATS for that now.
Scalable
Store up to a million logs on a single cluster. ?
This sounds pretty confusing / low volume.
Since it doesn't say anything about trustlessness, I assume that it assumes that all nodes are trusted.
All daemons and system administration utilities belong into sbin, because bin is for end-user applications.
Historically, the "s" in sbin meant something else, but it always contained applications and scripts only root could run.
When I see these examples, it's depressing to see just how much understanding of UNIX is missing.
That's not the point. The point is that all these generation Y kids grew up on PC buckets and still don't understand UNIX and the concepts behind it, and yet they use it to power their applications. This can only end badly unless they start making an effort to understand the concepts behind the substrate they are writing software for.
Could a LogDevice give a bit of informations about the scale they use that at facebook ?
- How many record this thing can injest per day ? - Any limitations on the maximum number of storage nodes ? - What would be your maximum and advise size of record for a production usage ? - ZooKeeper seems to be the center point used as epoch provider. Did you encounter any scaling limitations or max number of client due to that ?
Hope that helps.
I like the idea of decoupling compute from storage for streaming/log data.
I wonder if it would be easy to make it run under Consul, instead of ZooKeeper.
If anyone from the FB team or anyone using LogDevice wants to test performance with Optane SSDs (and compare to a NAND SSD), make a request by submitting an issue on our GitHub page: https://github.com/AccelerateWithOptane/lab/issues. I'll hook you up with a server hosted by Packet.
https://logdevice.io/docs/Concepts.html#consistency-guarante...
And it uses RocksDB under the hood:
https://logdevice.io/docs/Concepts.html#logsdb-the-local-log...
Previous discussion in HN: https://news.ycombinator.com/item?id=15142266
- Cross vendor replication which makes migration much easier.
- No dependency on vendor provided replication protocols.
- Ability to use in-app databases such RocksDB, SQLite, ...
- Upgrading DB nodes becomes way easier since they are totally separated from each other.