How RocksDB Works (opens in new tab)

(artem.krylysov.com)

285 pointsDAlperin3y ago69 comments

69 comments

58 comments · 13 top-level

nitinreddy883y ago· 19 in thread

I am looking for optimal storage engine(KV) which can store operational telemetry (temporarily) at source node. As we know, operational telemetry is generated frequently and need to merge similar operations frequently (little compaction). Once it reaches good amount of size (100mb), we can transfer it to dedicated time series database engines through various mechanisms. I am struggling to find a fast, write heavy, memory optimal storage for this.

RocksDB seems to fit few boxes but there could be much better solution as we don't need deletes/range scans sort of operations.

Any suggestions?

DocSavage3y ago

If you don't need range scans, have you looked at WiscKey approaches? https://www.usenix.org/system/files/conference/fast16/fast16...

There are a number of implementations including Badger (used in dgraph) and a variant that's "RocksDB for large-value use cases" (https://rocksdb.org/blog/2021/05/26/integrated-blob-db.html)

howerj3y ago

You could store it in a hash and flush it to disk using something like https://en.wikipedia.org/wiki/Cdb_(software), there are a few variants and implementations that might do what you want.

liketochill3y ago

Victoria metrics can store data at local node and forward to central when network is functioning

bbss3y ago

Might want to look at LMDB (https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa...)

dboreham3y ago

Also WiredTiger https://source.wiredtiger.com/11.1.0/index.html

seedless-sensat3y ago

Deletes (tombstones) shouldn't really get in your way if you don't need them. Similarly, range scans just come for free from SSTables being sorted. Archiving RocksDB SSTable files can be a decent strategy.

One thing to pay attention to is if your telemetry data is indexed by timestamp (i.e. you're writing to the keyspace in order), the compaction of immutable SSTables layers could be wasteful? Although, the author's nice example of non-overlapping SSTables key ranges suggests there may be minimal write amplification here too.

winrid3y ago

Pipe it to a write-ahead file on the source host, and read it back and store an offset for reading when you commit. That will be a very optimal solution.

But really, I would benchmark sqlite first...

dkarpas3y ago

Speedb is a great option Please check us out and join the most active RocksDB community - https://discord.gg/5fVUUtM2cG You can share your logs there or send them directly to us for analysis Contact me if you have any questions - dan@speedb.io

didgetmaster3y ago

I am building my own KV store (www.Didgets.com) that can store 10 million KV pairs in about 4 seconds (using my Ryzen 5950X machine). Those pairs take up about 300MB of disk space and have about the same memory footprint when loaded. The values have a variety of data types (strings, integers, doubles, datetime, etc.) The software is still in beta but is available for free download.

dikei3y ago

I think Rocksdb would be a good choice, you really can't beat a LSM design for write-heavy workload. Depending on how the mem-table is implemented, they can write at practically RAM-speed. And although you might think that you don't need range scans now, it's a very useful for any kind of time series data.

jlokier3y ago

> you really can't beat a LSM design for write-heavy workload

Depending on the write pattern, you actually can, because standard LSM-trees write the same data repeatedly, into each layer, and again if recompacting a layer. They make up for it by writing so much sequentially that the gains can outweigh the cost of writing the same data many times, and sorting having a cost no matter how you do it. However, if data is being written in batches in mostly sequential key order, then LSM-trees have a type of write amplification that's worse than efficient B-trees.

However, RocksDB deviates from LSM-trees when writing in bulk (or can be made to), to reduce or avoid this form of write amplification.

The optimal balance depends on the write pattern, but neither standard LSM-trees nor B-trees are consistently better than the other for all write patterns.

vlovich1233y ago

> Depending on how the mem-table is implemented, they can write at practically RAM-speed.

Proof needed I think. When last I looked I could get it to just under a gib/s. The disk itself could do 2-3 and ram is 20.

It’s fast but it’s definitely a long long way off from RAM speed. The reason is that the memtable is quite pricey to maintain - you’re having to constantly keep a non trivial amount of data sorted and that sort is expensive.

foota3y ago

What kind of telemetry exactly?

Maybe I don't understand the problem, but can you not just store it in memory (e.g., with a map from key to current value), update (for instance, increment) it as you go, and whenever you want to take a timeseries value just push the set of current values back to a vector?

wnolens3y ago

Likely needs to survive a crash (can't lose 100MB chunks)

digikata3y ago

Write Apache arrow parquet files directly? A lot of other dbs can ingest those directly anyway.

ddorian433y ago

Check how https://github.com/netdata/netdata does it.

RhodesianHunter3y ago

Any reason you can't shove it into Kafka?

nitinreddy883y ago

Too many network calls. Technically it's feasible, operationally it's expensive for Telemetry usecase. Ex: Imagine we are capturing API telemetry. If there are 1000 API calls per minute per node, then we will end up somewhere 1000*10 calls per minute to Kafka. It's not efficient.

2 more replies

dboreham3y ago

Parent wants storage at the source node.

jlokier3y ago· 7 in thread

One thing about LSM trees that are implemented with large numbers of large files in a filesystem, such as RocksDB, is that they defer to the filesystem to deal with fragmentation and block lookup isues. That's not actually free.

LSM tree descriptions typically imply or say outright that each layer is laid out linearly, written sequentially, and read sequentally for merging. And that looking up a block within a layer is an O(1) operation, doing random access I/O to that location.

But really, the underlying filesystem is doing a lot of heavy lifting. It's maintaining the illusion of linear allocation by hiding how the large files are fragmented. That sequential writing is mostly sequential, but typically becomes more fragmented in the filesystem layer as the disk gets closer to full, and over time as various uses of the filesystem mean there are fewer large contiguous regions. More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

Lookup of a block inside a layer requires the filesystem to lookup in its extent tree or, with older filesystems, through indirect block lookups. Those are hidden from the LSM tree database, but are not without overhead.

Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks.

Even a simple operation like the LSM tree database deleting a layer file it has finished with, is not necessarily simple and quick at the filesystem layer.

In other words, when analysing performance, filesystems are the unsung heroes underlying some LSM tree databases. Their algorithmic overhead is often not included in the big-O analysis of LSM tree algorithms running over them, but should be, and their behaviour changes as disk space shrinks and over time due to fragmentation.

eclark3y ago

>But really, the underlying filesystem is doing a lot of heavy lifting.

I think that's vastly under selling what's done to ensure that each block is written linearly, blocks are structured, sized, written and accessed in a way that the filesystem does very little (directio, fadvise, droping caches on writes, etc). I was in total agreement with you, for a long time. The rocksdb devs have put in the work, and tuning rocksdb usually gets faster the less the FS does.

Lately linear reads and writes are not why one is choosing LSM's in a datacenter setting. Access times of even cheap slow ssd's are amazing.

They are used for controlling write amplification with tunable known costs. That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

jlokier3y ago

I agree that linear reads and writes are not relevant these days in most settings.

I've worked on my own DB engine that uses a structure similar to LSM (but it's not an LSM tree), where the highest possible performance (millions of TPS) for random-writes mixed with semi-sorted writes mixed with random-reads on current SSDs was the target. There's no need for any data to be allocated sequentially on those, other than just enough aggregation to ensure a sufficiently large block size to reduce IOPS during streaming and merging operations, when IOPS-bound. Indeed it's better to fragment to reuse already filesystem-allocated space where possible - that lowers overhead on the filesystem.

I also agree that a well tuned RocksDB can perform very well, and that the authors have done the work, and that it has methods to reduce avoidable write amplification.

However, the RocksDB applications I've seen haven't use the fancy APIs to get the most out of it. They just used it as a plain k-v store with little understanding of what makes a DB sing, and got not great performance as a result.

1 more reply

stingraycharles3y ago

> That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

What would you consider a well-tuned rocksdb? My understanding is that, due to level-based compaction, there is always a decent amount of write amplification that is unavoidable -- i.e. for one modification to eventually end up in the bottommost level (e.g. L6), it would need to be (re-)written to disk 5 or 6 times. That's quite heavy an amount of write amplification, but maybe my expectations are off?

2 more replies

vlovich1233y ago

I agree that DB papers will typically overlook the impact the filesystem has on the database (not just rocksdb - what you wrote is true for everything except something like BlueStore). It’s particularly depressing when you look at how they measure write amplification which tends to ignore things they’re just offloading to the filesystem.

However, I think you’re making a mistake on a core part of your argument:

> More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

The file system in no way needs to guarantee on-disk contiguity for read or write performance, nor does any online defrag need to happen. Indeed, the whole premise behind LSM trees is to try to optimize around solid state storage. AFAIK if the filesystem can only find 1 MiB blocks it will allocate them at the cost of a larger set of extents (there’s also defrag happening). Typically the filesystems do a fantastic job of defrag too. That’s certainly an important part but I’d say those parts of the filesystem are likely the first things implemented and never/rarely changed (just a hunch - I haven’t actually bothered looking at the Linux changelog).

Also no one is really going to care about performance on an almost full filesystem (kind full like 75% but old so lots of fragments is valid but I doubt it’s actually a problem because of how good filesystems are).

jlokier3y ago

> However, I think you’re making a mistake on a core part of your argument:

No, that part has been misunderstood so I guess I didn't write it clearly enough.

I'm not saying the filesystem defragments, or does any particular effort to ensure on-disk contiguous storage. I'm saying that as a result of the presence of other data on the filesystem and historic accumulating entropy in layout (sometimes caused by an LSM tree DB!), the filesystem ends up keeping track of appended data discontiguously on disk. In keeping track, the filesystem has to consult its free-space structures to find new free areas, with a heuristic or two to decide whether to allocate a large or small one, write updates to the free-space structures, and keep track of the resulting discontiguous mapping for the file by writing to those structures too. Even when it's contiguous on-disk, versions of those metadata writes are needed for transactional, durable DB writes. They're simpler during bulk non-durable writes of course.

These are the "more work, sometimes more I/O, just to allocate space".

Good filesystems are efficient, but that activity does add overhead (especially when durable transactions are involved) and big-O algorithmic factors (those 1 MiB extents add up), and the picture painted of linear-time operations in LSM papers is inaccurate as a result. I don't think this overhead is necessarily large in practice most of the time. However it's where much of the complexity and corner cases lie, if honestly analysing the performance range and big-O scaling factors.

You make an interesting point about write amplification at the filesystem layer not being accounted for. In addition, classic LSM trees also have significant write amplification (regardless of filesystem or even block device) due to the simple act of writing all data to every layer. This is well known by LSM designers, and there are mitigations, but it's somehow left out of the public image of LSM trees. Classic LSM trees are excellent for random-access writes that need to be gradually sorted, but for some other write patters, are slower (more I/O) than good quality B-trees and some other structures.

> no one is really going to care about performance on an almost full filesystem (kind full like 75% but old so lots of fragments is valid but I doubt it’s actually a problem because of how good filesystems are).

Heh. In practice, every RocksDB or LevelDB I've seen in production is hundreds of GB or several TB on filesystems that have run low on space or even run out, mainly due to the size of the DB :-) They also have thousands or tens of thousands of files in each DB, which they have to open for random-access queries. This is what motivated me to recognise the filesystem can be quite involved.

1 more reply

l05k3y ago

I would recommend a related paper, "Building an Efficient Key-Value Store in a Flexible Address Space", which focused on the interaction problem between extent-based filesystem and key-value store.

However, in scenarios that really care about performance, memory-mapped files seem to be able to bypass the I/O stack and optimize performance. RocksDB also has corresponding optimizations: https://github.com/facebook/rocksdb/wiki/PlainTable-Format

midom3y ago

this misses multiple points - lots of block access happens via internal caches (hyperclock/lru/etc) and filesystem isn't in the critical path as much as you'd think.

files are mapped out in relatively large chunks, especially compaction outputs - there's prealloaction, and usually you will just have a flat file->block conversion without huge trees or anything.

based on performance profiles filesystem doesn't do any heavy lifting, there's not that much fragmentation (and you usually keep some free space for flash GC anyway),

compaction output write is one logical operation on filesystem for tens of megabytes of data.

> filesystems are the unsung heroes underlying some LSM tree databases

meh

zip12343y ago· 7 in thread

Well written article--clarification on how Meta uses it though. It is not Tao it is ZippyDb: https://engineering.fb.com/2021/08/06/core-data/zippydb/

ipozgaj3y ago

I am lucky enough to have worked on all three of these systems (TAO, ZippyDB, and currently MySQL) so can shed some light here.

Both MySQL and ZippyDB are datastores that use RocksDB under the hood, in a slightly different way and with different querying capabilities exposed to the end user. ZippyDB uses it exclusively, but MySQL uses both the traditional InnoDB and RocksDB (MyRocks). TAO is in memory graph database, layer above both of these, and doesn't persist anything by itself - it talks to the database layer (MyRocks).

eclark3y ago

/wave

1 more reply

KAdot3y ago

Author here.

> Well written article

Thanks!

> It is not Tao it is ZippyDb

I don't work for Meta, so might have made a mistake. There is an old blog post[1] about Tao and there is a recent paper[2] mentioning that the graph database is powered by MyRocks, which runs on RocksDB.

[1]: https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

[2]: https://www.vldb.org/pvldb/vol13/p3217-matsunobu.pdf

zip12343y ago

TIL, didn't realize that it was used as the storage engine for MySQL.

midom3y ago

rocksdb is used in many more systems, but the storage layer behind TAO (UDB) is using MyRocks for quite a few years by now - migration was done in 2016 or so - https://engineering.fb.com/2016/08/31/core-data/myrocks-a-sp...

zip12343y ago

TIL, didn't realize that it was used as the storage engine for MySQL.

ckwalsh3y ago

Zippy does use it, but I think the author was specifically referring to MyRocks

https://myrocks.io/

dikei3y ago· 3 in thread

RocksDB is awesome, though don't use it with regular glibc malloc because it can cause extreme memory fragmentation. Use jemalloc, tcmalloc, or mimalloc: basically any other advance malloc libraries that can effectively reuse memory.

jeffbee3y ago

This goes for pretty much every C++ program. I doubt there are any useful programs for which the GNU allocator is optimal.

citrin_ru3y ago

globc malloc works reasonably well if threads are not used (nginx and postgres are examples of apps which don’t rely on threads), but if an app uses many threads on multi core CPU shortcomings of glibc malloc (or advantages of jemalloc) become more obvious, especially if you use some LTS Linux distro with an old glibc.

dikei3y ago

For databases maybe.

However, most programs do not need to allocate so much memory, so frequently that an allocator become an issue.

polishdude203y ago· 3 in thread

How does flushing in a background process work if it says that it's an embeddable database that's in your application? It says there is no external process so how is there a background process that performs compaction and flushing?

remram3y ago

It's a thread: https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works...

> RocksDB runs a dedicated background thread that persists immutable memtables to disk.

They are using "process" to mean "mechanism", something that happens, not a literal OS process. I agree that it's a bit confusing to use the word both ways.

KAdot3y ago

I rephrased some sentences to avoid confusion. Thanks for the feedback!

adev_3y ago

To be perfectly exact, it is even a thread pool and it is tunable.

schwartzie3y ago· 2 in thread

Congratulations to the author for a remarkably clear, easy-to-follow, and informative post. Excellent technical writing!

tmulcahy3y ago

Agreed. This is the best explanation of Log-Structured Merge Trees I've seen. I finally feel confident that I understand the concept.

benjaminwootton3y ago

I was about to post the same. One of the best technical articles I've ever read.

willvarfar3y ago· 2 in thread

A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :)

A long time ago we had a big MySQL tokudb db and were keen to migrate to myrocks. But myrocks put every table into a single big file, rather than a file per partition.

The partition-per-file is a big deal if you are retaining N days of data in a DB and every night will be dropping some old day. If your DB stores each partition in separate files, the DB can simply delete them. But if your DB stores all the partitions in a single file, then it will end up having to compact your absolutely massive big dataset. It was completely unworkable for us.

Has this changed?

jorangreef3y ago

Hey Will! Joran from TigerBeetle here.

Partitioning data across files (or LSM trees) can be a remarkable win. For data retention policies, as well as for exploiting immutability in different workloads to reduce write amplification.

For example, in TigerBeetle, a DB that provides double-entry financial accounting primitives, our secondary indexes mutate, but half of our ingest volume, all the transactions themselves are immutable, and inserted in chronological order.

We therefore designed our local storage engine as an LSM-forest, putting different key/value types in their own tree, so that mutable data wouldn't compact immutable data. This turns our object tree for primary keys into essentially an append-only log.

I did a lightning talk on this, and a few of our other LSM optimizations, at Jamie Brandon's HYTRADBOI conference last year: https://www.youtube.com/watch?v=yBBpUMR8dHw

RocksDB also allows you to do this, with its concept of column families, if I am not mistaken. However, we wanted more memory efficiency with static memory allocation, deterministic execution and deterministic on disk storage for faster testing (think FoundationDB's simulator but with storage fault injection) and faster recovery (thanks to smaller diffs, with less randomness in the data files being recovered), and also an engine that could solve our storage fault model.

All details in the talk. Or ping me if you have questions.

midom3y ago

something doesn't make sense here - MySQL/InnoDB does put tables into files, but partitions get separate file.

MyRocks has a collection of files per each column family, and when you drop data it can quickly expunge files that don't contain data for other tables/partitions - and trigger compaction on neighbors, if needed.

jamesatmeetapro3y ago· 1 in thread

Great writing. Looks like it is used extensively in Meta. I heard they even wanted to use it in Cassandra.

midom3y ago

https://developers.facebook.com/videos/f8-2018/cassandra-on-...

(there's not much Cassandra now, though)

kilotaras3y ago· 1 in thread

> To find a specific key, we could use binary search on the SST file blocks.

I don't think it's possible except when both key and value are fixed size (which is not the case in the example shown).

KAdot3y ago

Binary searching SST file blocks is a stretch, I agree. The key-value pairs would need to have a specific shape. Enabling compression makes it completely impossible. I'll remove this from the article, thanks for the feedback!

adev_3y ago

RocksDB is an amazing piece of engineering that deserve to be more known.

It is battle tested. It does one job and does it well.

I have used it in the past as a middleware database taking an average of 2-3k req/sec with over 400 GB of data stored. It works like a charm.

If I had a single reproach to do to it, it would be around the instrumentation. It is not that straightforward to get proper metrics and reporting of the internals.

almog3y ago

You might want to also listen to the SE Daily episode with Dhruba and Igor of RockDB, they cover similar aspects of RocksDB in details: https://softwareengineeringdaily.com/2019/02/05/rocksdb-with...

attrutt3y ago

Damn, there's some astonishing writing skills going on here

Tim256593y ago

Great post on rocksdb

j / k navigate · click thread line to collapse

69 comments

58 comments · 13 top-level

nitinreddy883y ago· 19 in thread

RocksDB seems to fit few boxes but there could be much better solution as we don't need deletes/range scans sort of operations.

Any suggestions?

DocSavage3y ago

If you don't need range scans, have you looked at WiscKey approaches? https://www.usenix.org/system/files/conference/fast16/fast16...

There are a number of implementations including Badger (used in dgraph) and a variant that's "RocksDB for large-value use cases" (https://rocksdb.org/blog/2021/05/26/integrated-blob-db.html)

howerj3y ago

You could store it in a hash and flush it to disk using something like https://en.wikipedia.org/wiki/Cdb_(software), there are a few variants and implementations that might do what you want.

liketochill3y ago

Victoria metrics can store data at local node and forward to central when network is functioning

bbss3y ago

Might want to look at LMDB (https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa...)

dboreham3y ago

Also WiredTiger https://source.wiredtiger.com/11.1.0/index.html

seedless-sensat3y ago

winrid3y ago

Pipe it to a write-ahead file on the source host, and read it back and store an offset for reading when you commit. That will be a very optimal solution.

But really, I would benchmark sqlite first...

dkarpas3y ago

didgetmaster3y ago

dikei3y ago

jlokier3y ago

> you really can't beat a LSM design for write-heavy workload

However, RocksDB deviates from LSM-trees when writing in bulk (or can be made to), to reduce or avoid this form of write amplification.

The optimal balance depends on the write pattern, but neither standard LSM-trees nor B-trees are consistently better than the other for all write patterns.

vlovich1233y ago

> Depending on how the mem-table is implemented, they can write at practically RAM-speed.

Proof needed I think. When last I looked I could get it to just under a gib/s. The disk itself could do 2-3 and ram is 20.

foota3y ago

What kind of telemetry exactly?

wnolens3y ago

Likely needs to survive a crash (can't lose 100MB chunks)

digikata3y ago

Write Apache arrow parquet files directly? A lot of other dbs can ingest those directly anyway.

ddorian433y ago

Check how https://github.com/netdata/netdata does it.

RhodesianHunter3y ago

Any reason you can't shove it into Kafka?

nitinreddy883y ago

2 more replies

dboreham3y ago

Parent wants storage at the source node.

jlokier3y ago· 7 in thread

Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks.

Even a simple operation like the LSM tree database deleting a layer file it has finished with, is not necessarily simple and quick at the filesystem layer.

eclark3y ago

>But really, the underlying filesystem is doing a lot of heavy lifting.

Lately linear reads and writes are not why one is choosing LSM's in a datacenter setting. Access times of even cheap slow ssd's are amazing.

They are used for controlling write amplification with tunable known costs. That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

jlokier3y ago

I agree that linear reads and writes are not relevant these days in most settings.

I also agree that a well tuned RocksDB can perform very well, and that the authors have done the work, and that it has methods to reduce avoidable write amplification.

1 more reply

stingraycharles3y ago

> That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

2 more replies

vlovich1233y ago

However, I think you’re making a mistake on a core part of your argument:

> More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

jlokier3y ago

> However, I think you’re making a mistake on a core part of your argument:

No, that part has been misunderstood so I guess I didn't write it clearly enough.

These are the "more work, sometimes more I/O, just to allocate space".

1 more reply

l05k3y ago

I would recommend a related paper, "Building an Efficient Key-Value Store in a Flexible Address Space", which focused on the interaction problem between extent-based filesystem and key-value store.

midom3y ago

this misses multiple points - lots of block access happens via internal caches (hyperclock/lru/etc) and filesystem isn't in the critical path as much as you'd think.

files are mapped out in relatively large chunks, especially compaction outputs - there's prealloaction, and usually you will just have a flat file->block conversion without huge trees or anything.

based on performance profiles filesystem doesn't do any heavy lifting, there's not that much fragmentation (and you usually keep some free space for flash GC anyway),

compaction output write is one logical operation on filesystem for tens of megabytes of data.

> filesystems are the unsung heroes underlying some LSM tree databases

meh

zip12343y ago· 7 in thread

Well written article--clarification on how Meta uses it though. It is not Tao it is ZippyDb: https://engineering.fb.com/2021/08/06/core-data/zippydb/

ipozgaj3y ago

I am lucky enough to have worked on all three of these systems (TAO, ZippyDB, and currently MySQL) so can shed some light here.

eclark3y ago

/wave

1 more reply

KAdot3y ago

Author here.

> Well written article

Thanks!

> It is not Tao it is ZippyDb

[1]: https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

[2]: https://www.vldb.org/pvldb/vol13/p3217-matsunobu.pdf

zip12343y ago

TIL, didn't realize that it was used as the storage engine for MySQL.

midom3y ago

zip12343y ago

TIL, didn't realize that it was used as the storage engine for MySQL.

ckwalsh3y ago

Zippy does use it, but I think the author was specifically referring to MyRocks

https://myrocks.io/

dikei3y ago· 3 in thread

jeffbee3y ago

This goes for pretty much every C++ program. I doubt there are any useful programs for which the GNU allocator is optimal.

citrin_ru3y ago

dikei3y ago

For databases maybe.

However, most programs do not need to allocate so much memory, so frequently that an allocator become an issue.

polishdude203y ago· 3 in thread

remram3y ago

It's a thread: https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works...

> RocksDB runs a dedicated background thread that persists immutable memtables to disk.

They are using "process" to mean "mechanism", something that happens, not a literal OS process. I agree that it's a bit confusing to use the word both ways.

KAdot3y ago

I rephrased some sentences to avoid confusion. Thanks for the feedback!

adev_3y ago

To be perfectly exact, it is even a thread pool and it is tunable.

schwartzie3y ago· 2 in thread

Congratulations to the author for a remarkably clear, easy-to-follow, and informative post. Excellent technical writing!

tmulcahy3y ago

Agreed. This is the best explanation of Log-Structured Merge Trees I've seen. I finally feel confident that I understand the concept.

benjaminwootton3y ago

I was about to post the same. One of the best technical articles I've ever read.

willvarfar3y ago· 2 in thread

A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :)

A long time ago we had a big MySQL tokudb db and were keen to migrate to myrocks. But myrocks put every table into a single big file, rather than a file per partition.

Has this changed?

jorangreef3y ago

Hey Will! Joran from TigerBeetle here.

Partitioning data across files (or LSM trees) can be a remarkable win. For data retention policies, as well as for exploiting immutability in different workloads to reduce write amplification.

I did a lightning talk on this, and a few of our other LSM optimizations, at Jamie Brandon's HYTRADBOI conference last year: https://www.youtube.com/watch?v=yBBpUMR8dHw

All details in the talk. Or ping me if you have questions.

midom3y ago

something doesn't make sense here - MySQL/InnoDB does put tables into files, but partitions get separate file.

jamesatmeetapro3y ago· 1 in thread

Great writing. Looks like it is used extensively in Meta. I heard they even wanted to use it in Cassandra.

midom3y ago

https://developers.facebook.com/videos/f8-2018/cassandra-on-...

(there's not much Cassandra now, though)

kilotaras3y ago· 1 in thread

> To find a specific key, we could use binary search on the SST file blocks.

I don't think it's possible except when both key and value are fixed size (which is not the case in the example shown).

KAdot3y ago

adev_3y ago

RocksDB is an amazing piece of engineering that deserve to be more known.

It is battle tested. It does one job and does it well.

I have used it in the past as a middleware database taking an average of 2-3k req/sec with over 400 GB of data stored. It works like a charm.

If I had a single reproach to do to it, it would be around the instrumentation. It is not that straightforward to get proper metrics and reporting of the internals.

almog3y ago

You might want to also listen to the SE Daily episode with Dhruba and Igor of RockDB, they cover similar aspects of RocksDB in details: https://softwareengineeringdaily.com/2019/02/05/rocksdb-with...

attrutt3y ago

Damn, there's some astonishing writing skills going on here

Tim256593y ago

Great post on rocksdb

j / k navigate · click thread line to collapse