MDBM – High-speed database

(yahooeng.tumblr.com)

159 points · threepointone · 11y ago · 111 comments

57 comments · 18 top-level

justin66 · 11y ago · 10 in thread

This looks interesting. At this stage of the game a more meaningful benchmark might involve LMDB, WiredTiger, and, yes, LevelDB.

hendzen · 11y ago

I don't think it's meaningful to benchmark MDBM against LMDB or WiredTiger: MDBM does not keep keys in sorted order (so no range queries), has no support for transactions, and does not offer durability in the event of power loss.

MDBM is pretty much an optimized persistent hash table. LMDB and WiredTiger aim to be full-fledged ACID compliant database storage engines with functionality similar to that of BerkeleyDB or InnoDB.

hyc_symas · 11y ago

You make some good points. We benchmark LMDB against LevelDB and its derivatives even though none of the LevelDB family offer ACID transactions. (http://symas.com/mdb/ondisk/ ) Despite this fact, people will ask the question and try to make the comparison, so we run those tests. It's silly, but most people seem to pay attention to performance more than to safety/reliability.

From my totally biased perspective, MDBM is utter garbage. They use mmap but make absolutely zero effort to use it safely. This was the biggest obstacle to overcome in developing LMDB; I had a few lengthy conversations with the SleepyCat guys about it as well. It's the reason it took 2 years (from 2009 when we first started talking about it, to 2011 first code release) to get LMDB implemented. If you want to call something a "database" you have to do more than just mmap a file and start shoving data into it - you have to exert some kind of control over how and when the mapped data gets persisted to disk. Otherwise, if you just let the OS randomly flush things, you'll wind up with garbage. As Keith Bostic said to me (private email):

"The most significant problem with building an mmap'd back-end is implementing write-ahead-logging (WAL). (You probably know this, but just in case: the way databases usually guarantee consistency is by ensuring that log records describing each change are written to disk before their transaction commits, and before the database page that was changed. In other words, log record X must hit disk before the database page containing the change described by log record X.)

In Berkeley DB WAL is done by maintaining a relationship between the database pages and the log records. If a database page is being written to disk, there's a look-aside into the logging system to make sure the right log records have already been written. In a memory-mapped system, you would do this by locking modified pages into memory (mlock), and flushing them at specific times (msync), otherwise the VM might just push a database page with modifications to disk before its log record is written, and if you crash at that point it's all over but the screaming."
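The ordering rule in the quote (log record durable before the changed page) can be sketched as follows. This is a minimal illustration using ordinary files, not Berkeley DB's or LMDB's actual code; the `wal_update` and `replay` helpers and the `offset:hex` record format are hypothetical, and a real log would also need checksums and truncation handling.

```python
import os
import tempfile


def wal_update(log_path, data_path, offset, new_bytes):
    """Apply one change with write-ahead ordering: log first, data second."""
    # 1. Append a record describing the change and force it to disk.
    record = f"{offset}:{new_bytes.hex()}\n".encode()
    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())  # the log record must be durable first

    # 2. Only now modify the data file and force that to disk.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(new_bytes)
        data.flush()
        os.fsync(data.fileno())


def replay(log_path, data_path):
    """Re-apply every logged change; intended to run after a crash."""
    with open(log_path, "rb") as log, open(data_path, "r+b") as data:
        for line in log:
            offset, hexbytes = line.decode().strip().split(":")
            data.seek(int(offset))
            data.write(bytes.fromhex(hexbytes))


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "wal.log")
data_path = os.path.join(workdir, "data.db")
with open(data_path, "wb") as f:
    f.write(b"\x00" * 16)

wal_update(log_path, data_path, 4, b"abcd")
with open(data_path, "rb") as f:
    contents = f.read()
```

With a plain writable mmap there is no equivalent of step 2's explicit ordering unless the pages are pinned and flushed by hand, which is the difficulty the quote describes.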

The harsh realities of working with mmap are what dictated LMDB's copy-on-write design - it's the only way to ensure consistency with an mmap without losing performance (due to multiple mlock/msync syscalls). None of these design considerations are evident in MDBM.

LMDB's mmap is read-only by default, because otherwise it's trivial to permanently corrupt a database by overwriting a record, writing past the end, etc. MDBM's mmap is read-write, and the only "protection" you get is a doc that tells you "be Vewwy vewwy careful!" Ridiculously sloppy.
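To illustrate the read-only versus read-write distinction, here is a small sketch using Python's standard `mmap` module rather than LMDB or MDBM themselves. The file name and record contents are made up for the example. In Python the stray write raises an exception; in C, writing through a read-only mapping would typically fault instead.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.db")
with open(path, "wb") as f:
    f.write(b"record-0001")

with open(path, "rb") as f:
    # Map the file read-only: lookups work, stray writes are rejected.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = view[:6]
    try:
        view[0:6] = b"XXXXXX"   # an accidental overwrite
        write_blocked = False
    except TypeError:
        write_blocked = True
    view.close()

with open(path, "rb") as f:
    on_disk = f.read()          # the file is unchanged
```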

LMDB's design and implementation are proven incorruptible. MDBM (and LevelDB and all its derivatives) are proven to be quite fragile. https://www.usenix.org/conference/osdi14/technical-sessions/...

Leaving reliability aside for a moment, there's also the issue of performance and efficiency. We used to use DBM-style hashes for the indexes in OpenLDAP, up to release 2.1. We abandoned them in favor of B-trees in OpenLDAP 2.2 because extensive benchmarking showed that BDB's B-trees were faster than its hash implementation at very large data sizes. The fundamental problem is that hash data structures are only fast when they are sparsely populated. When the number of data records you need to work with increases to fill the table, you start getting more and more hash collisions that result in lots of linear probes (or whatever other hash recovery strategy you're using). The other problem is that the very sparse/unordered nature of hashes makes them extremely cache unfriendly - you get zero locality-of-reference for groups of related queries. So as your data volumes increase, you get less and less benefit from the amount of RAM you have available. When the data exceeds the size of RAM, the number of disk seeks required for an arbitrary lookup is enormous, and every read is a random access. Using a hash for a large-scale data store is just horrible. (We tested this extensively a decade ago http://www.openldap.org/lists/openldap-devel/200401/msg00077... )
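The load-factor effect described above can be seen in a toy open-addressing table. This is only a sketch of the general phenomenon with linear probing; it does not model BDB's or MDBM's hash implementations, and the `average_probes` helper, table size, and key distribution are assumptions chosen for the illustration.

```python
import random


def average_probes(capacity, load_factor, seed=0):
    """Fill a linear-probing table and return mean probes per successful lookup.

    load_factor must be below 1.0 so that every insert finds a free slot.
    """
    rng = random.Random(seed)
    table = [None] * capacity
    keys = rng.sample(range(10 * capacity), int(capacity * load_factor))

    for key in keys:
        slot = key % capacity
        while table[slot] is not None:      # collision: try the next slot
            slot = (slot + 1) % capacity
        table[slot] = key

    total = 0
    for key in keys:
        slot = key % capacity
        probes = 1
        while table[slot] != key:
            slot = (slot + 1) % capacity
            probes += 1
        total += probes
    return total / len(keys)


sparse = average_probes(10007, 0.50)
dense = average_probes(10007, 0.95)
```

Comparing `sparse` and `dense` shows how the number of probes per lookup grows as the table fills; on disk, each extra probe that lands on a different page can become an extra seek.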

vinkelhake · 11y ago

Hey, I've been working on a graph db-like thing as a hobby project for the last six months and I'm using LMDB as backend. I tried many alternatives (LevelDB, Sqlite4's LSM, BDB etc.) before settling on LMDB. The alternatives all had some quirk that stopped me from using them.

Among other things, I like that LMDB has zero-copy reads and that's something I've taken care to preserve all the way through my layers.

Just wanted to say thanks for the great work. LMDB is a joy to work with.

1 more reply

Donch · 11y ago

"We abandoned them in favor of B-trees in OpenLDAP 2.2 because extensive benchmarking showed that BDB's B-trees were faster than its hash implementation at very large data sizes."

Didn't bdb's linear hashing scheme extend the size of the hash table enough to keep it at the required loadfactor?

1 more reply

justin66 · 11y ago

Howard, since you're here and taking questions... :-)

Do you have any idea if a sqlite 4 release is imminent? Will lmdb work with it right out of the gate?

Thanks.

1 more reply

kitsune_ · 11y ago

Thanks for your insight and the in-depth benchmarks you provide.

1 more reply

luckydude · 11y ago

You pretty clearly haven't used MDBM because the MDBM I worked on at SGI (and still use to this day) gets to any key in two page faults (aka 2 disk seeks) at the most. That was the whole point of it.

If you want I'll go shove a few GB into an mdbm, drop caches, and time a lookup.

2 more replies

swah · 11y ago

Isn't LevelDB a building block of Google's distributed file systems?

2 more replies

justin66 · 11y ago

If you look at the MDBM page you'll notice that they currently benchmark against LevelDB, BerkeleyDB, and Kyoto Cabinet. What I was suggesting involves the same theme but better, newer competition.

I agree that it's an apples to oranges comparison in any case.

beagle3 · 11y ago

The timings they give there can only make sense on a fast SSD or when the database they benchmark on is completely cached.

It's an apples-to-oranges comparison only if MDBM wins significantly against LMDB. If they are comparable in timing, or e.g. MDBM is 20% faster, then it would be an apples-to-apples comparison, with MDBM having a 20% speed advantage and LMDB having every other possible advantage (memory safety, ACIDity, ordered retrieval, multiple databases, etc.)

LMDB is truly, incredibly, really marvelous. On 64-bit it comes close to being the end-all-be-all local KV-store. If your databases are not more than a few tens of megs each, the same is true for 32-bit processors as well.

2 more replies

coreymgilmore · 11y ago · 7 in thread

Thoughts on using this as a cache instead of memcache or redis? Yes, it does not have nearly as many features or functions but when raw performance is needed I could see this working (given an api for using this via Node.JS, PHP, etc.).

hyc_symas · 11y ago

You'd be better off using MemcacheDB/LMDB http://symas.com/mdb/memcache/

swah · 11y ago

You seem to have written a great piece of software that you're very proud of. I don't understand the ownership very well, but if you're allowed, why don't you promote your database with its own site, like every little javascript library out there?

Good examples: http://duktape.org/ (it might seem silly but that right column makes people want to try it!), http://redis.io (i bet this page wins many folks http://redis.io/topics/twitter-clone)

hendzen · 11y ago

why even pay the cost of memory mapping if it's a transient embedded cache not shared between servers?

just use a std::unordered_map, or better yet a tbb::concurrent_unordered_map or whatever the equivalent is for your language

otterley · 11y ago

Because it's shared between processes on the same server.
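A rough sketch of why a file-backed mapping can be shared where a private in-process hash map cannot: two independent mappings of the same file see each other's writes. For brevity both mappings live in one process here, standing in for two separate processes on the same server; the file name is made up, and this shows only the sharing mechanism, not MDBM's API or its locking.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)

writer_file = open(path, "r+b")
reader_file = open(path, "rb")

# Default access is a shared read-write mapping; the reader maps read-only.
writer = mmap.mmap(writer_file.fileno(), mmap.PAGESIZE)
reader = mmap.mmap(reader_file.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_READ)

writer[0:5] = b"hello"      # store through one mapping
seen = reader[0:5]          # observe it through the other

writer.close()
reader.close()
writer_file.close()
reader_file.close()
```

Real concurrent use would still need locking between the processes, which is a separate question raised further down the thread.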

nly · 11y ago

In theory STL implementations, if used with a custom allocator, should be able to pull this off... that's why the STL containers all have internal 'pointer' typedefs.

Practically speaking, Boost.Interprocess includes a shared memory hash table implementation. Boost Multi Index, which is a further generalisation of containers to allow the construction of database-like indexes, is also Interprocess compatible.

http://www.boost.org/doc/libs/1_57_0/doc/html/interprocess/a...

clutchski · 11y ago

Because it will persist between restarts?

hendzen · 11y ago

then it seems strange to call it a cache.

1 more reply

philliphaydon · 11y ago · 6 in thread

Do people get annoyed by all the JavaScript frameworks and databases coming out, with regard to adoption from a company point of view? I mean every other day a new database comes out and claims to be better in one way or another than something else and then it's like "fuck I picked X when now there's Y"

It seems over the last year technology has been growing more rapidly than any other period.

Fun times but so hard to keep track of everything!

tacos · 11y ago

For those old timers who did distributed systems work there's not that much new under the sun. I look at something like this and say "ah, a quirky and somewhat dangerous cache layer." What's different is bloggers promoting it as a "database."

While I'm sure someone out there will see this and say "wow, that's exactly what I need!" chances are that if you have these sorts of scale issues you're going to have to figure it out on your own.

I'd rather see a write-up of how they arrived at this particular conclusion than another non-database.

kokey · 11y ago

I have found the middle way is to stick to the traditional things for the most critical work, things whose abilities and problems are well understood, until some of the new things mature. Then, for less critical things that can run as separate services, try out the new things. That way people in a team can also get exposure to the new technologies, and you can experience first hand the pros and cons of these technologies.

Some people are generally more optimistic about new things that solve some of the frustrations they have with the old things (or perhaps reflect their lack of experience with and understanding of the old things), without realising there will be new problems that come with the new solutions and that haven't even been identified yet, never mind workarounds or solutions to these new, as yet unknown, problems.

However, I think the biggest risk, while also avoidable, is usually adopting the new thing for a problem you don't even have. A lot of distributed databases exist because you are going to have a lot of concurrent users accessing similar data, mostly related to themselves, at the same time, or have a lot of user data that you want to do big, ad-hoc searches and analysis across. If you don't have the scale of 'a lot' or 'concurrent' or 'across all the data' that these things are designed to address, you might not even need this solution, and especially not all the tradeoffs you have to make along with it.

I agree it's an exciting time, and a time of discovery for everyone, but there will be many things learned the hard way, and it's tricky to position yourself here, especially when there's pressure around you.

optimusclimb · 11y ago

While I understand that feeling, I've come to realize it's an inevitability that shouldn't affect your work.

At any given time, you either have a need/problem, or you don't. If you DO, you evaluate the current tech available, and hopefully select something that fits your needs. You build out around said tech, and if your choice was correct, that means it's either solving your problem, or on its way to.

If something comes along while you're implementing with your chosen solution, that looks similar, but better, it's only noise - because hey, you found a solution.

Just as we don't all re-write all of our code whenever a new language comes along (unless the thing in question was desperately in need of a re-write anyway) even if newer languages are nicer, we needn't switch DBs or frameworks for the same reasons.

unclebucknasty · 11y ago

Yes, I find it annoying because:

1) it's non-stop

2) there seldom seems to be anything truly novel in a broadly meaningful way (i.e. esoteric, if anything)

3) there is rarely an objective improvement on existing options

I no longer feel compelled to replace or adopt though, precisely for those reasons.

lsen001 · 11y ago

On the contrary, I find it awesome. Having choices is much better than not. It just shows the maturity of the entire tech ecosystem. Just take everything with a grain of salt, evaluate if any new tool/library matches your particular need/use case, do a quick proof of concept if so, and adapt if successful. Rinse and repeat. Change is good. Embrace it.

akbar501 · 11y ago

> It seems over the last year technology has been growing more rapidly than any other period.

I tend to agree with this statement. The entire stack appears to be going through a revolution.

The data layer in particular is seeing very rapid change after being largely (not entirely) static for decades.

otterley · 11y ago · 5 in thread

I'm so excited that they finally open-sourced this. It's relatively old tech at Yahoo, stuff folks outside never got to see. It was difficult to explain to later colleagues the stuff I knew about shared-memory databases because I couldn't give them a frame of reference.

mdbm performance is even better on FreeBSD than Linux because FreeBSD supports MAP_NOSYNC, which causes the kernel not to flush dirty pages to disk until the region is unmapped. Perhaps mdbm's release will finally get the Linux kernel team to provide support for that flag.

jzawodn · 11y ago

Same here. I remember wishing we could Open Source it back in the early 2000s. Good to see this coming out so people can take a little credit for work they did back in the day.

luckydude · 11y ago

I've been providing people with source all along. SGI was pretty pleasant about letting me retain copyright on stuff like that.

cbsmith · 11y ago

Didn't Yahoo have copyright on a bunch of the modifications to mdbm?

1 more reply

hyc_symas · 11y ago

How is your MAP_NOSYNC any different from just using mlock?

otterley · 11y ago

mdbm uses a file-backed mmap region so you're implicitly saying you do want some sort of persistence. Also, mlock locks the heap in RAM to keep it from being swapped - mdbm's regions aren't heap space.

There's a mmap flag on Linux called MAP_LOCKED but I'm not sure how it behaves with MAP_SHARED, which mdbm uses (the man page isn't clear).

remon · 11y ago · 3 in thread

I'm not very comfortable with storage engines that build directly on memory-mapped files. MongoDB's current storage engine is mmap-based and it's suboptimal at best, which is undoubtedly part of the reason they're building a completely new storage engine now (WiredTiger).

hyc_symas · 11y ago

Using mmap well takes great care. MongoDB was careless. There's good reason to believe the MDBM designers were careless too.

luckydude · 11y ago

MDBM guy here. Care to elaborate on what we got wrong?

hyc_symas · 11y ago

See below. https://news.ycombinator.com/item?id=8734356

1 more reply

mbrzusto · 11y ago · 3 in thread

How similar in performance is MDBM to GDBM (the GNU DBM)? They appear to be similar (if not identical) in functionality.

api · 11y ago

Not sure, but in my experience GDBM is a bit on the slow side. MDBM uses mmap(), so for that reason alone it should be faster.

luckydude · 11y ago

MDBM was designed to be fast with special care taken on the lookup path. The goal was to do lookups with as few cache misses as possible. You can get to any key with at most two page faults.
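One way to picture a bounded "two touches per lookup" design is a small directory that maps a hash bucket to exactly one data page. The sketch below is a toy under that assumption and is not MDBM's actual on-disk layout; the `TwoLevelStore` class, its fixed page count, and the use of `zlib.crc32` as the hash are all hypothetical.

```python
import zlib


class TwoLevelStore:
    """Toy store: one directory touch plus one data-page touch per lookup."""

    def __init__(self, num_pages):
        self.directory = list(range(num_pages))        # bucket -> page number
        self.pages = [dict() for _ in range(num_pages)]
        self.touched = 0                               # touches in the last get()

    def _page_for(self, key):
        bucket = zlib.crc32(key) % len(self.directory)
        self.touched += 1                              # first touch: the directory
        return self.pages[self.directory[bucket]]

    def put(self, key, value):
        self._page_for(key)[key] = value

    def get(self, key):
        self.touched = 0
        page = self._page_for(key)
        self.touched += 1                              # second touch: the data page
        return page.get(key)


store = TwoLevelStore(num_pages=64)
store.put(b"user:42", b"alice")
value = store.get(b"user:42")
```

The point of such a layout is that the lookup cost does not grow with a probe chain: the hash picks the page directly, so a cold lookup costs at most the directory page plus one data page.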

cbsmith · 11y ago

They might appear similar, but that's just because they share the same DBM interface heritage.

polskibus · 11y ago · 1 in thread

Can anyone say whether it would be hard to port it to Windows? Maybe there already is something for Windows that is as good as this?

luckydude · 11y ago

I can't speak to the yahoo version, they've wacked it, but the base mdbm that we still use today works fine on windows, has for years.

discardorama · 11y ago · 1 in thread

How is MDBM for concurrent access? How does it handle locking (i.e., one big lock that blocks everyone else, or key-level locking)?

luckydude · 11y ago

So the SGI-owned code, which I don't have, did page-level locking. There are two kinds of locks: rd/wr on the directory, and rd/wr on a page. If you are inserting a key you get a read lock on the directory and a write lock on the page. If it fits in the page then you are done. So you can have lots of concurrent writers until a page is full and you have to split it. Bob Mende did that work, I think; you might track him down for details.

i_am_ralpht · 11y ago · 1 in thread

Where is the original open source release from Silicon Graphics which Yahoo based this work on? Did they ever make one?

luckydude · 11y ago

Nah, they didn't care and I didn't want to piss them off so I just handed the code to anyone who asked for it.

swah · 11y ago · 1 in thread

Where does it say that this database is persistent?

t1m · 11y ago

It is memory mapped, which means that it is persisted to disk, perhaps confusingly if you aren't familiar with mmap.
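For readers unfamiliar with mmap, here is a minimal sketch of that persistence using Python's standard `mmap` module, not MDBM itself; the file name and contents are invented for the example. Writes go to mapped memory, and `flush()` asks the OS to write the dirty pages back to the file.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store.db")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)     # the file must have a size before mapping

with open(path, "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0)    # shared, read-write mapping of the file
    mapped[0:9] = b"key=value"           # an ordinary memory write
    mapped.flush()                       # push the dirty page to the file
    mapped.close()

# Reopen the file normally: the data written through the mapping is there.
with open(path, "rb") as f:
    persisted = f.read(9)
```

Without the explicit flush the OS still writes the pages back eventually, but at a time of its own choosing, which is the durability concern raised earlier in the thread.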

EGreg · 11y ago · 1 in thread

How is this different than memcache?

pjscott · 11y ago

Memcache is an in-memory cache. This is an on-disk key-value store.

PhuFighter · 11y ago

I'm curious to see what the total timings would be like to get the data in a useable form - as opposed to just fetching a record from a data store. As noted - these data stores just store and retrieve data and don't do things like joins or ordering, etc.

Could there be a comparison between these datastores and the traditional ACID compliant databases when it comes to retrieving actual data in a useful format? E.g. perhaps doing a join or an ordering of some sort? I don't expect databases (e.g. Oracle, MS SQL Server, DB2) to be faster in raw performance, but I do expect them to be faster in terms of total development time and bug fixing since the application developer wouldn't have to do the locking, page pinning/unpinning, etc. manually.

chatman · 11y ago

Let the horrors of MDBM not get to you. I've used it when I worked at Yahoo, and the client support for Java etc. sucks.

swah · 11y ago

Could not install this in Ubuntu 12.04 - basic commands are failing. I think they tested only in BSD?

ln -s -f -r /tmp/install/lib64/libmdbm.so.4 /tmp/install/lib64/libmdbm.so
ln: invalid option -- 'r'
Try `ln --help' for more informatio
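The error suggests the installed `ln` does not recognise the `-r` (relative) option. One workaround is to compute the relative target yourself. The sketch below does that in Python; the `relative_symlink` helper is hypothetical, it is not part of the MDBM build, and it uses a temporary directory in place of `/tmp/install`.

```python
import os
import tempfile


def relative_symlink(target, link_path):
    """Create link_path pointing at target via a relative path (like ln -s -f -r)."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    rel_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    if os.path.lexists(link_path):
        os.remove(link_path)             # mimic -f: replace an existing link
    os.symlink(rel_target, link_path)
    return rel_target


libdir = os.path.join(tempfile.mkdtemp(), "lib64")
os.makedirs(libdir)
real_lib = os.path.join(libdir, "libmdbm.so.4")
open(real_lib, "wb").close()             # stand-in for the built library

link = os.path.join(libdir, "libmdbm.so")
rel = relative_symlink(real_lib, link)
```

Because both files sit in the same directory, the stored link target is just the file name, so the link keeps working if the whole directory is moved.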

qwerta · 11y ago

I don't want to brag, but there is also a DBM-inspired Java port, and its in-memory mode outperforms Java heap collections such as j.u.HashMap.

jwr · 11y ago

This is a very big deal, especially because of the BSD licensing.

extralam · 11y ago

yahoo back to IT company ?

extralam · 11y ago

interesting. follow
