RocksDB – A persistent key-value store for fast storage environments

(rocksdb.org)

196 points | MadeInSyria | 12y ago | 72 comments


Very nice work, and the wiki is also quite nice -- I wish more projects had a page like https://github.com/facebook/rocksdb/wiki/Rocksdb-Architectur.... It's really nice to see a clear, terse summary of what makes this project interesting relative to its predecessors.

At my company (scalyr.com), we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase). I'll be digging through rocksdb to see what ideas we might borrow. A few things we've implemented that might be interesting for rocksdb:

* The application can force segments to be split at specified keys. This is very helpful if you write a block of data all at once and then don't touch it for a long time. The initial memtable compaction places this data in its own segment and then we can push that segment down to the deepest level without ever compacting it again. It can also eliminate the need for bloom filters for many use cases, as you often wind up with only one segment overlapping a particular key range.

* The application can specify different compression schemes for different parts of the keyspace. This is useful if you are storing different kinds of data in the same database.

* We don't use timestamps anywhere other than the memtable. This puts some constraints on snapshot management, but streamlines get/scan operations and reduces file size for small values.
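To make the first two bullets concrete, here is a minimal sketch in Python (not the Java implementation described above, which is not public). It assumes a memtable flush that already has its entries sorted, and every name in it (`split_into_segments`, `codec_for`, the `logs/` and `meta/` prefixes) is hypothetical and chosen only for illustration.

```python
import bisect
import zlib

def split_into_segments(sorted_items, split_keys):
    """Group sorted (key, value) pairs into segments, starting a new
    segment at every application-specified split key."""
    boundaries = sorted(split_keys)
    segments = {}
    for key, value in sorted_items:
        # Range index = number of split keys that are <= this key.
        idx = bisect.bisect_right(boundaries, key)
        segments.setdefault(idx, []).append((key, value))
    # Return only the non-empty ranges, in key order.
    return [segments[i] for i in sorted(segments)]

# Hypothetical codec table: the codec name is stored per segment.
CODECS = {"none": lambda data: data, "zlib": zlib.compress}

def codec_for(key, rules, default="none"):
    """Pick a codec name from (key prefix, codec) rules; first match wins."""
    for prefix, codec in rules:
        if key.startswith(prefix):
            return codec
    return default

# A block of log data written all at once, plus some unrelated metadata.
memtable = {"logs/0002": "b", "logs/0001": "a",
            "meta/host": "web-1", "logs/0003": "c"}
rules = [("logs/", "zlib")]

segments = split_into_segments(sorted(memtable.items()), ["meta/"])
encoded = []
for segment in segments:
    name = codec_for(segment[0][0], rules)
    payload = "\n".join(f"{k}={v}" for k, v in segment).encode()
    encoded.append((name, CODECS[name](payload)))
```

Because the `logs/` keys end up alone in their own segment, a compactor could move that segment to a deeper level without merging it with anything, which is the effect the first bullet describes.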

Do you have benchmarks for scan performance? This is an important area for us. I don't have exact figures handy, but we get something like 2GB/second (using 8 threads) on an EC2 h1.4xlarge, uncached (reading from SSD) and decompressing on the fly. This is an area we've focused on.

I'd enjoy getting together to compare notes -- send me an e-mail if you're interested. steve @ (the domain mentioned above).

hyc_symas 12y ago

SkyDB using LMDB gets 3GB/sec on a standalone PC. https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1...

bjconlan 12y ago

Wow, awesome link. LMDB always seems to fly under the radar. SkyDB+LMDB: genius. (And written in Go! I'm sold... well, I will at least give it a bash.)

dhruba_b 12y ago

Thanks for your comments Steve.

1. RocksDB has a feature that allows an application to determine when to close a file (i.e. a segment). You can write your compaction code via compaction_filter_factory, defined in https://github.com/facebook/rocksdb/blob/master/include/rock...

2. RocksDB also has a feature that allows an application to close a block inside a segment. https://github.com/facebook/rocksdb/commit/fd075d6edd68ddbc1...

3. RocksDB has a feature to use different compression algorithms for different parts of the database. In Level Style Compaction, you can configure a different compression algorithm for each level. In Universal Style Compaction, you can specify that you want compression only for the earliest x% of the data in your database.

4. We have internal benchmarks for scan performance, but because of a lack of developer resources, we might not be able to open-source those numbers.
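As a rough illustration of point 3, the two policies can be modelled in a few lines of Python. This is only a model of the behaviour described above, not the RocksDB API; the function names and the codec labels are made up for the example. In the C++ options I believe the corresponding settings are `compression_per_level` and the universal-compaction `compression_size_percent`, but treat that mapping as an assumption and check the headers.

```python
def compression_for_level(level, per_level, default="none"):
    """Level-style model: use the codec configured for this level,
    falling back to `default` for levels without an entry."""
    return per_level[level] if level < len(per_level) else default

def files_to_compress(file_sizes_oldest_first, percent):
    """Universal-style model: flag the oldest files for compression
    until at least `percent` of the total bytes is covered."""
    total = sum(file_sizes_oldest_first)
    target = total * percent / 100
    covered = 0
    flags = []
    for size in file_sizes_oldest_first:
        compress = covered < target
        flags.append(compress)
        if compress:
            covered += size
    return flags

# Hypothetical setup: leave the two hottest levels uncompressed.
per_level = ["none", "none", "zlib"]
level_choices = [compression_for_level(n, per_level) for n in range(4)]

# Four files, oldest first; compress roughly the oldest half of the bytes.
flags = files_to_compress([400, 300, 200, 100], 50)
```

Here `flags` marks the two oldest files, since the first file alone covers only 400 of the 500 bytes targeted.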

It would be great to catch up in person.

rsynnott 12y ago

> we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase).

This sounds quite interesting; have you considered open-sourcing it?

snewman 12y ago

Yes, we'd like to release it someday, but unfortunately it won't be any time soon. There are a lot of dependencies on other parts of our codebase, e.g. for configuration and monitoring, which would need to be cleaned up.

We will probably at least publish a report describing the work in more detail, some time in the next few months, on our blog (http://blog.scalyr.com).