Facebook open-sources LogDevice, a distributed storage for sequential data

(logdevice.io)

336 points · cedricvg · 7y ago · 118 comments

65 comments · 23 top-level

tryptophan7y ago· 7 in thread

What benefit is there to Facebook from open-sourcing technology they have developed?

martincmartin7y ago

Facebook's competitive advantage doesn't come from having the best reliable streaming data store at scale, or from its software in general. Even if MySpace, Friendster, or Google+ got their hands on the whole software stack and started running it, people would stick with Facebook.

So there's no cost to open sourcing. The benefit comes from being known as technically innovative in general, and for recruiting, being known as having interesting, meaty, challenging projects to work on.

The impetus usually comes from team members who want to do the work. It could be to become known for having worked on the project, or a sense of giving back to the community, or a hope that you'll get bug fixes & features from outside contributors. In my (very limited) experience, managers "passively encourage" it -- they generally don't push the team to do it, but when the team asks, they encourage it.

thrusong7y ago

If that's true, why haven't they open sourced Haystack? Clearly they're holding onto it due to competitive advantage.

adtac7y ago

>So there's no cost to open sourcing

Not true, there's a legal cost associated with making sure something is really ready for public eyes.

lacker7y ago

Look at React - if it had never been open sourced, Facebook might still be using it, but it wouldn't be the same thing it is today. For one, basically all of the current excellent React team probably wouldn't be working at Facebook. And it would be far, far harder for Facebook to recruit engineers for product teams who were proficient in React. Since it is open source and very popular, the odds of a browser introducing a change that unfixably hurts React's performance are now very low. Et cetera.

The cost of maintaining an open source project is real, but when it is a world-class piece of infrastructure, open sourcing it helps keep it world-class.

bwy7y ago

https://www.quora.com/Why-do-huge-profit-oriented-software-c...

ksec7y ago

Branding, marketing, and more importantly, attracting talent to work for them.

barbecue_sauce7y ago

Institutional dependency?

AhmedSoliman7y ago· 5 in thread

Happy to finally see LogDevice open. We have been working on this for years now.

thinkersilver7y ago

Hi Ahmed,

I imagine you looked at other solutions before starting this. A distributed log is a fairly simple idea to understand (hard to implement) but what pain point is being solved?

Seeing that it is written in C/C++ - would it be that logdevice is optimised purely for speed and responsiveness?

the_duke7y ago

Can you give an overview of the differences from, e.g., Apache Kafka?

It seems very similar.

AhmedSoliman7y ago

It's a very different architecture and design. You can head to https://logdevice.io/docs/Concepts.html to learn more about how LogDevice works.

In terms of function, LogDevice is similar to the core of Apache Kafka.

tveita7y ago

From what I can see this doesn't have built-in consumer balancing and offset storage, like Kafka does. It also lacks more exotic Kafka features like topic compaction and exactly-once processing.

In Kafka bulk reading is very cheap, the broker basically just calls sendfile() to send a file segment with compressed message chunks. On the other hand only the leader of a partition can serve requests, so you are often limited by bandwidth. It looks like LogDevice has to do a bit more work server side, but may be able to read from all servers with a replica.

Kafka stores more metadata in the record wrapper, like client and server timestamps and partition key.

There are client libraries for C++ and Python.

Operationally they look similar - both require a Zookeeper cluster, and both require assigning permanent ids to nodes.

It would be interesting to see some benchmarks comparing LogDevice with Kafka and Pulsar. That said, I suspect from the lack of buzz around Pulsar that Kafka isn't a performance bottleneck for most people using it.

martincmartin7y ago

Congrats Ahmed!

thinkersilver7y ago· 5 in thread

The use cases overlap neatly with Kafka's. Everything from its usage of ZooKeeper to its time- and storage-based retention tuning is similar.

The announcement does not clarify why they use this over Kafka. Is it because Kafka doesn't scale to millions of logs on a single cluster, or because Kafka is not sympathetic to heterogeneous disk arrays containing SSDs and HDDs? I strongly suspect it may be the latency of writes at scale, but this is pure speculation.

I don't know. If I understood why anyone might use this, I'd contribute to building language bindings for the APIs.

sh00s7y ago

Some strengths of LogDevice include:

- It's designed to work with a large number of logs (roughly equivalent to partitions in Kafka), hundreds of thousands per cluster is common.

- Sequencer failover is very quick, typical failover time when a sequencer node fails is less than a second.

- It supports location awareness and can place data according to replication constraints specified (e.g. replicate it in 3 copies across 2 different regions and 3 racks).

- Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.

- If a node/shard fails, it detects the failure and rebuilds the data that was replicated to failed nodes/shards automatically
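To make the location-awareness bullet concrete, here is a minimal sketch of checking whether a set of record copies meets a constraint such as "3 copies across 2 regions and 3 racks". This is not LogDevice's API: the `Node` type, the `satisfies` function, and its parameters are hypothetical names invented for illustration.

```python
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    """Hypothetical storage node with a location (not a LogDevice type)."""
    name: str
    region: str
    rack: str


def satisfies(copyset: Iterable[Node], copies: int,
              min_regions: int, min_racks: int) -> bool:
    """Return True if the copies span enough distinct nodes, regions, and racks."""
    nodes = set(copyset)
    regions = {n.region for n in nodes}
    # A rack name is only unique within its region, so key racks by (region, rack).
    racks = {(n.region, n.rack) for n in nodes}
    return (len(nodes) >= copies
            and len(regions) >= min_regions
            and len(racks) >= min_racks)


a = Node("a", "region1", "rackA")
b = Node("b", "region1", "rackB")
c = Node("c", "region2", "rackA")
d = Node("d", "region1", "rackC")

spread_ok = satisfies([a, b, c], copies=3, min_regions=2, min_racks=3)
one_region = satisfies([a, b, d], copies=3, min_regions=2, min_racks=3)
```

In this sketch the first placement spans two regions and three racks, while the second keeps all three copies in one region, so only the first would be accepted.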

adrienconrath7y ago

> Because of non-deterministic data placement, it is very resilient to failures in terms of write availability.

I am happy to expand more on this point.

We have this concept of the "node set" of a log, which is the set of storage nodes available to receive record copies sent by the sequencer. It typically consists of 20-30 nodes in deployments at Facebook. Write availability is maintained as long as enough storage nodes in the node set are available to accept copies. When storage node failures are detected, the sequencer can just exclude these nodes from the list of potential recipients for new records. It does not need to update a view that needs to be synchronized with readers, which is a heavy-weight operation. This model allows preserving high write availability even if many nodes in the node set are unhealthy.

Additionally, this record copy placement flexibility allows the sequencer to quickly route around latency spikes on individual storage nodes, which helps guarantee low append latency.
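The node-set idea described above can be sketched roughly as follows. This is an illustration under assumptions, not LogDevice's implementation: `pick_copyset`, its parameters, and the uniform random choice are all hypothetical, and a real sequencer would also weigh location constraints and node latency.

```python
import random


def pick_copyset(node_set, unavailable, replication, rng=random):
    """Choose `replication` distinct recipients from the healthy part of the node set.

    Hypothetical helper: excludes nodes the sequencer currently considers
    unavailable and picks the rest at random (non-deterministic placement).
    """
    candidates = [n for n in node_set if n not in unavailable]
    if len(candidates) < replication:
        # Writes stall only when fewer healthy nodes remain than copies needed.
        raise RuntimeError("not enough healthy nodes to place all copies")
    return rng.sample(candidates, replication)


node_set = [f"N{i}" for i in range(20)]
unavailable = set(node_set[:15])  # 15 of 20 nodes are down or slow
copyset = pick_copyset(node_set, unavailable, replication=3,
                       rng=random.Random(0))
```

Even with 15 of the 20 nodes excluded, the append can still be placed, which is the write-availability property the comment describes. Nothing has to be synchronized with readers to make that choice.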

otterley7y ago

> Is it because Kafka doesn't scale to millions of logs on a single cluster

I doubt that's it, since Kafka can certainly do that.

manigandham7y ago

Millions of separate topics on a single Kafka cluster? The way it's designed requires opening files for all of those topics and their partitions so good luck if you're trying that. You'll run out of file handles, then memory, and then the disk access will completely freeze up.

beepbeepbeep17y ago

It does not. I've lost a lot of time profiling Kafka performance issues on clusters with the exact same hardware and the exact same traffic but a 3000% throughput difference. The root cause was that one cluster had a lot of empty test topics.

Try benchmarking Kafka from 0 partitions to a few thousand partitions in 100 partition increments. The benchmark only needs to write to a single topic, using their provided producer perf tool while all other topics are inactive with zero data.

As the number of partitions increases, there is a very noticeable drop in throughput that looks to be linear.

Kafka does not handle a large number of partitions well currently, large even being low thousands. It's easy to hit with just a few hundred topics.

Reading between the lines when LinkedIn and Netflix advertise several clusters, I am guessing they shard the data.

manigandham7y ago· 5 in thread

Great to see this released. It makes some architecture decisions similar to Apache Pulsar's as well, with the separation of compute (in this case the sequencer) from storage.

Kafka has done well so far, especially in making streaming systems more common, but it's about time for the next-gen systems.

ashu7y ago

How does LogDevice differ from Kafka?

manigandham7y ago

Kafka brokers handle both the computation (partition/topic management, sequencing, assignments, etc) and storage together. This coupling creates scaling and operational challenges which LogDevice removes by separating the layers. Storage nodes can be as simple as object stores (but optimized for appending files) and use multiple non-deterministic locations for a given piece of data to randomize placement. They read, write and recover data very quickly by working together in a mesh.

Meanwhile the compute layer becomes very lightweight and almost stateless, which is easy to scale. In LogDevice, the Sequencers are potential bottlenecks but generating a series of incrementing numbers is about the fastest thing you can do so it'll outpace any actual data ingest to a single log, while giving you a total order of all entries within that log. The numbers (LSNs) follow the Hi/Lo sequence pattern so if a Sequencer fails, another one takes its place with a greater "High" number, so it's guaranteed that all of its LSNs will be greater than the previous Sequencer as a result. This also provides a built-in buffer to still accept messages and assign the permanent LSNs to them after recovery in case a Sequencer fails.

Apache Pulsar is similar to LogDevice but goes further: brokers manage connections, routing, and message acknowledgements, while data is sent to a separate layer of Apache BookKeeper nodes, which store it in append-optimized log files.
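The Hi/Lo numbering described above can be sketched as follows. This is a toy model, not LogDevice code: the `Sequencer` class and `make_lsn` helper are hypothetical, and the 32-bit split between the "High" part (epoch) and the "Low" part (per-epoch counter) is an assumption made for illustration.

```python
ESN_BITS = 32  # assumed width of the "Low" (per-epoch sequence number) part


def make_lsn(epoch: int, esn: int) -> int:
    """Pack (epoch, esn) into one integer so that LSNs compare like tuples."""
    assert 0 <= esn < (1 << ESN_BITS)
    return (epoch << ESN_BITS) | esn


class Sequencer:
    """Toy sequencer: hands out increasing LSNs within a single epoch."""

    def __init__(self, epoch: int):
        self.epoch = epoch
        self.next_esn = 1

    def next_lsn(self) -> int:
        lsn = make_lsn(self.epoch, self.next_esn)
        self.next_esn += 1
        return lsn


old = Sequencer(epoch=5)
old_lsns = [old.next_lsn() for _ in range(1000)]

# After a failure, the replacement starts in a higher epoch, so every LSN it
# issues is greater than anything the old sequencer could have handed out.
new = Sequencer(epoch=6)
first_new_lsn = new.next_lsn()
```

Because the epoch occupies the high bits, the replacement does not need to know how far the failed sequencer got before it starts numbering new records.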

martincmartin7y ago

I worked on LogDevice at FB until about 6 months ago.

I'm not that familiar with Kafka, but in general LogDevice emphasizes write availability over read availability. There are many applications where data is being generated all the time, and if you don't write it, it will be lost. However, if reading is delayed, it just means readers are a little behind and will need to catch up.

So, when a sequencer node dies and we need to figure out what happened to the records that were in flight -- which ones ended up on disk & can be replicated, what the last record was -- LogDevice still accepts new writes. However, to ensure ordering, these new writes aren't visible to readers until the earlier writes are sorted out.
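One way to picture "accepted but not yet visible" is a release point that readers never read past. The sketch below is a simplified model under that assumption: the `Log` class and its `release_point` field are hypothetical names, not LogDevice's actual data structures.

```python
class Log:
    """Toy log where writes are stored immediately but released to readers later."""

    def __init__(self):
        self.records = {}       # lsn -> payload
        self.release_point = 0  # readers may only see LSNs <= this value

    def append(self, lsn: int, payload: str) -> None:
        self.records[lsn] = payload

    def release(self, lsn: int) -> None:
        # The release point only moves forward.
        self.release_point = max(self.release_point, lsn)

    def read(self) -> list:
        visible = sorted(l for l in self.records if l <= self.release_point)
        return [self.records[l] for l in visible]


log = Log()
log.append(1, "a")
log.release(1)
log.append(2, "b")  # in flight when the old sequencer died, fate not yet known
log.append(3, "c")  # accepted by the new sequencer while recovery is running
during_recovery = log.read()

log.release(3)      # recovery has sorted out the earlier writes
after_recovery = log.read()
```

Writers are never blocked in this model; readers simply lag until the release point advances, which matches the trade-off of favoring write availability over read availability.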

pdpi7y ago

Scribe is the Facebook-internal Kafka equivalent. LogDevice is the storage layer used by Scribe.

Scribe isn’t the only place where LogDevice is used though — Facebook has documented using it for TAO as well (as part of the secondary indices)

mdasen7y ago

I’m wondering this as well. In the description it says that it ensures total ordering while Kafka only ensures partition ordering. I haven’t read enough to say more.

akavel7y ago· 4 in thread

Can someone from FB chime in with some info on how much storage is needed for the logs/data? Say, for 1 GB of raw input logs from an HTTP server (nginx/apache), when stored in LogDevice would they take notably less space on disk (compression), or more (overhead)? This interests me for evaluating the resources/costs I'd need to prepare if I were to deploy it...

cedricvgOP7y ago

These numbers really depend on the compressibility of the content, compression scheme and the type of batching used. The metadata overhead is fairly minimal. LogDevice allows you to configure this on either the client, sequencer or rocksdb level.

akavel7y ago

Is some form of compression enabled by default, without having to tweak options?

SirMonkey7y ago

have you had a look at https://github.com/oklog/oklog https://www.youtube.com/watch?v=gWWK2eyZ-sc

I think it's fairly simple and might be enough. Can't comment on storage requirements though.

are5957y ago

Dang, I was looking at oklog earlier, but looks like it is archived now...

fullmetaleng7y ago· 3 in thread

Martin Kleppmann seems to point out technologies for problems of similar patterns already exist - https://twitter.com/martinkl/status/1039938408393662465

tinco7y ago

Those are streaming/pubsub services though, this actually claims to be a store. I feel that's an important difference.

Do people just point their system journal at Kafka and wait for something to break?

At my previous job we built something similar to this out of rabbitmq and mongodb. I always wondered what the other big log companies used. Mongodb seemed like a pretty good fit, but a pure append only database might be even better. Trimming performance in MongoDB was subpar so we worked around it by creating a new collection for each day, trimming became a simple operation of dropping a collection at the end of each day.
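The per-day collection workaround can be illustrated without a database. In the sketch below each day's entries live in their own bucket, so trimming means dropping whole buckets rather than deleting individual entries. The `DailyBuckets` class is hypothetical and in-memory only; in the setup described above each bucket would be a separate MongoDB collection.

```python
from collections import defaultdict
from datetime import date, timedelta


class DailyBuckets:
    """Toy append-only store partitioned by day (one bucket per day)."""

    def __init__(self, retention_days: int):
        self.retention_days = retention_days
        self.buckets = defaultdict(list)  # date -> list of log entries

    def append(self, day: date, entry: str) -> None:
        self.buckets[day].append(entry)

    def trim(self, today: date) -> int:
        """Drop every bucket older than the retention window; return how many."""
        cutoff = today - timedelta(days=self.retention_days)
        expired = [d for d in self.buckets if d < cutoff]
        for d in expired:
            del self.buckets[d]  # one cheap drop instead of many row deletes
        return len(expired)


store = DailyBuckets(retention_days=2)
for offset in range(4):
    store.append(date(2018, 1, 1) + timedelta(days=offset), f"entry {offset}")

dropped = store.trim(today=date(2018, 1, 4))
remaining_days = sorted(store.buckets)
```

The cost of a trim here depends on the number of expired days, not on the number of entries, which is why the workaround helped.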

EdwardDiego7y ago

> Those are streaming/pubsub services though, this actually claims to be a store. I feel that's an important difference.

> Do people just point their system journal at Kafka and wait for something to break?

Kafka can be used as a data store if you like, so long as you're happy with the data management and access patterns it gives you - it is, after all, optimised for large sequential reads.

LogDevice looks very similar to Kafka for most use cases. Hell, they even use RocksDB, which Kafka Streams uses for stateful operations, and of course ZooKeeper.

Where it differs is that it looks like it was designed for you to be able to work against a single "cluster" that could well be running across multiple data-centres. Which is very much a Facebook problem to solve.

So yeah, Kafka was a distributed log built for LinkedIn size problems, LogDevice is a distributed log built for Facebook sized problems.

Most of us don't have Facebook sized problems.

manigandham7y ago

All of those are similar systems and have persistence. I'm not sure what distinction "streaming" makes but they also all support multiple publishers and subscribers. Some only use local storage on the nodes while others can tier out to cold storage like S3.

MongoDB is a full OLTP document store so it won't match the write throughput and pubsub features of these focused systems. RabbitMQ on the other hand has performance limits but is meant for complex service-bus style routing and RPC uses, but I recommend using NATS for that now.

posnet7y ago· 2 in thread

Awesome, I have been waiting for this since seeing the @scale talk about it. https://atscaleconference.com/videos/logdevice-a-file-struct...

Rafuino7y ago

Is there supposed to be a replay of that talk on the site you link to or is it just not loading for me?

manigandham7y ago

The event is hosted by Facebook and there is an embedded Facebook video player on that page. Here's the direct link: https://www.facebook.com/atscaleevents/videos/19602876909109...

remh7y ago· 2 in thread

Am I the only one puzzled by this?

> Scalable

> Store up to a million logs on a single cluster.

This sounds pretty confusing / low volume.

manigandham7y ago

logs = topics, so they mean 1M separate topics on a single cluster.

remh7y ago

Makes more sense. Thank you!

jMyles7y ago· 2 in thread

I don't see anything about trust requirements or verification. Does LogDevice assume that all devices in my cluster are trusted?

cedricvgOP7y ago

LogDevice uses SSL for authentication. This can be enabled for both clients and servers [1].

[1] https://logdevice.io/docs/Settings.html#security

jMyles7y ago

That's not what I mean though. What if I have a cluster with devices I don't trust, but I want to let them emit logs if they conform to a particular protocol. Like, will this thing check signatures for me and such?

Since it doesn't say anything about trustlessness, I assume that it assumes that all nodes are trusted.

Annatar7y ago· 2 in thread

"bin/logdeviced"

All daemons and system administration utilities belong in sbin, because bin is for end-user applications.

Historically, the "s" in sbin meant something else, but it always contained applications and scripts only root could run.

When I see these examples, it's depressing to see just how much understanding of UNIX is missing.

AhmedSoliman7y ago

Maybe sending a PR would help?

Annatar7y ago

It's for Linux only, and I run illumos-based SmartOS on my own infrastructure.

That's not the point. The point is that all these generation Y kids grew up on PC buckets and still don't understand UNIX and the concepts behind it, and yet they use it to power their applications. This can only end badly unless they start making an effort to understand the concepts behind the substrate they are writing software for.

mmcclellan7y ago· 1 in thread

I had just stumbled across https://github.com/facebookincubator/python-nubia and am anxious to try it out. Was wondering about the internal project it was factored out from. This appears to be it.

AhmedSoliman7y ago

Correct. LDShell in logdevice was the starting point of python-nubia.

adev_7y ago· 1 in thread

Thanks for open-sourcing this; it looks like a great project.

Could someone from the LogDevice team share some information about the scale at which it is used at Facebook?

- How many records can it ingest per day?

- Are there any limitations on the maximum number of storage nodes?

- What are the maximum and recommended record sizes for production use?

- ZooKeeper seems to be the central point, used as the epoch provider. Did you encounter any scaling limitations or a maximum number of clients because of that?

AhmedSoliman7y ago

I cannot give you exact numbers, but here is some information that might be useful:

- LogDevice ingests over 1TB/s of uncompressed data at Facebook. This was already highlighted in last year's talk at the @Scale conference.

- The default maximum number of storage nodes in a cluster, as defined in the code, is 512. However, you can use --max-nodes to change that; there is no theoretical limit. Each LogDevice storage daemon can handle multiple physical disks (we call them shards). So if you have 15 disks per box and 512 servers, that's 7680 total disks in a single cluster.

- The maximum record size is 32MB. However, in practice, payloads are usually much smaller.

- ZooKeeper is not (currently) a scaling limitation, as we don't connect to ZooKeeper from clients (as long as you are sourcing the config file from the filesystem and not using ZooKeeper for that as well).

Hope that helps.

sandstrom7y ago· 1 in thread

Very interesting!

I like the idea of decoupling compute from storage for streaming/log data.

I wonder if it would be easy to make it run under Consul, instead of ZooKeeper.

AhmedSoliman7y ago

We use Zookeeper primarily for the EpochStore. This is the abstraction that you can use if you want to replace Zookeeper. It shouldn't be that hard as long as Consul offers the same guarantees as Zookeeper.

cardosof7y ago· 1 in thread

How does that fit in a ML training pipeline? (this is mentioned on the page)

manigandham7y ago

It's just streaming data but more scalable and with total ordering which can be important for ML.

senderista7y ago· 1 in thread

Sounds like it might have been influenced by the MSR CORFU project (separate sequencer, write striping). Can anyone confirm?

noahdesu7y ago

It's hard to deny that there is at least some influence there. Like LogDevice, the zlog project [0] is influenced by CORFU (separate sequencer, write striping), but the two use different storage interfaces / strategies.

[0]: https://github.com/cruzdb/zlog

Rafuino7y ago

Very interesting! I hadn't heard of this before but I'd love to see it in action.

If anyone from the FB team or anyone using LogDevice wants to test performance with Optane SSDs (and compare to a NAND SSD), make a request by submitting an issue on our GitHub page: https://github.com/AccelerateWithOptane/lab/issues. I'll hook you up with a server hosted by Packet.

StreamBright7y ago

The number of great-quality open source projects from Facebook just keeps growing. I really like the consistency guarantees:

https://logdevice.io/docs/Concepts.html#consistency-guarante...

And it uses RocksDB under the hood:

https://logdevice.io/docs/Concepts.html#logsdb-the-local-log...

javiermaestro7y ago

Awesome to see this finally happening :)

Previous discussion in HN: https://news.ycombinator.com/item?id=15142266

majidazimi7y ago

External logging service is my favorite way of doing replication. It provides nice features. Specifically:

- Cross vendor replication which makes migration much easier.

- No dependency on vendor provided replication protocols.

- Ability to use in-app databases such RocksDB, SQLite, ...

- Upgrading DB nodes becomes way easier since they are totally separated from each other.

pedrorijo917y ago

Is there any comparison with other similar storages?

SkyRocknRoll7y ago

This is a lot more similar to Apache BookKeeper.

silur7y ago

this is like....the harder half of a whole blockchain project :D super interesting

polskibus7y ago

Is this a Kafka competitor?
