Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on the machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user-level process has accurate information about the total state of the machine; only the kernel (or hypervisor) does, and it's an exercise in futility to try to manage paging etc. at the user level.
As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.
This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those three, you're going to need some monumental evidence of your credibility.
Additionally, what you link here:
> ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in memory! These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMSs.
All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.
(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer for BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments.)
Andy and I have had this debate going for a long time already.
Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.
- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls
- mmap() complicates transactionality and error-handling
- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks
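The first bullet can be made concrete: with mmap the read I/O hides inside an ordinary memory access, while an explicit pread keeps it visible at a call site. A minimal Python sketch (the file and offsets are invented for illustration):

```python
import mmap, os, tempfile

# Hypothetical data file: four pages of known bytes.
path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"x" * mmap.PAGESIZE * 4)

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Implicit I/O: if this page has been evicted, the innocent-looking index
# blocks on a page fault -- the stall is invisible in the source code.
b1 = m[3 * mmap.PAGESIZE]

# Explicit I/O: pread makes the read (and any stall) visible and
# schedulable at the call site.
b2 = os.pread(fd, 1, 3 * mmap.PAGESIZE)[0]
```

Both reads return the same byte; the difference is only in who decides when the I/O happens.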
We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.
There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage people (oftentimes literally the same people).
As for the "single process" argument... well, if you run a database inside an OS then, obviously, that will never happen, as the OS has its own processes to run. But, ignoring that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations from their database anyway and would have no idea how to configure / tune it for high performance; it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.
This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or the database's own data.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
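That split (read-only map for reads, write syscalls plus explicit sync for writes) can be sketched in a few lines of Python -- this is the general pattern, not LMDB's actual code, and the record bytes are made up:

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, mmap.PAGESIZE)

# Read path: a read-only shared mapping gives zero-copy access to the
# page cache -- readers never copy data into user-space buffers.
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Write path: ordinary write syscalls, with an explicit fsync so the
# application controls exactly when the data becomes durable.
os.pwrite(fd, b"record-1", 0)
os.fsync(fd)
```

Because the mapping and the write syscalls share the same page cache, the read-only map observes the write immediately, without any copying.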
If you are developing a DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.
From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.
Not to be gatekeeping or anything, but it is a pretty well-studied field with lots of very knowledgeable people around, who are probably more than keen to help. There aren't too many qualified jobs around, and you probably have a budget if you are developing a database commercially.
It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().
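Yes -- in Python, `mmap.flush()` wraps that msync call. A small sketch with a throwaway file:

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, mmap.PAGESIZE)

m = mmap.mmap(fd, 0)          # MAP_SHARED, read/write by default
m[:5] = b"dirty"

# mmap.flush() wraps msync(MS_SYNC): push the dirty pages to storage now,
# rather than waiting for kernel writeback or munmap().
m.flush()

on_disk = os.pread(fd, 5, 0)
```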
That's patently false. There are about 8 billion people. Even if everyone has a smartphone or two, it's nothing compared to the total of all devices that can be called "computers". I think "smart TVs" alone will beat the number of smartphones. But even that is a drop in the bucket when it comes to the total of running programs on Earth / in its orbit.
But, that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant for this conversation because they are the wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hope of achieving good performance, and whether they use mmap or not is of no consequence -- they've lost the race before they even qualified for it.
> LMDB
Are we talking about this? https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa... If so, this is irrelevant for databases in general.
> LMDB databases may have only one writer at a time
(Taken from the page above) -- this isn't a serious contender for database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't.
In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of LMDB have relatively small data sets.
Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/
There are plenty of other larger-than-RAM benchmarks there.
It's also strange to me that there's no transition in performance when the data set size grows beyond cache.
The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.
Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).
Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.
But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.
Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, and that's beside all the system services... but I don't think the GP meant "one process" literally. Neither is that realistic, nor is it necessary.
The point was simply about other processes that could be competing for resources - CPU, memory, or I/O. It is expensive for a user-level process to perform accounting for all of these resources, and without such accounting you can't optimally allocate them.
If there are other apps that can suddenly spike memory usage then any careful buffer tuning you've done goes out the window. Likewise for any I/O scheduling you've done, etc.
(But just in containers, not in Kubernetes. I'm not crazy.)
And we are running them at the scale that most people can’t even imagine.
It is a realistic concern, I’ve lived it for more than a decade across many orgs, though I shared your opinion at one point. Storage density is massively important for both workload scalability and economic efficiency. Low storage density means buying a ton of server hardware that sits idle under max load and vastly larger clusters than would otherwise be necessary, which have their own costs.
When your database is sufficiently large, backup and restore often isn’t even a technical possibility so that requirement is a red herring. The kinds of workloads that can be recovered from backup at that scale on a single server, and some can, benefit massively from the economics of running it on a single server. A solution that has 10x the AWS bill for the same workload performance doesn’t get chosen.
At scale, hardware footprint economics is one of the central business decision drivers. Data isn’t getting smaller. It is increasingly ordinary for innocuous organizations to have a single table with a trillion records in it.
For better or worse, the market increasingly drives my technical design decisions to optimize for hardware/cloud costs above all else, and dense storage is a huge win for that.
Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good as or even better than explicit I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.
AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.
What did you do differently in your custom one that made it faster than Varnish?
Reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration for efficiency. The car industry started this way with Ford, moved away from it, and now Tesla is doing it again. There are lots of other examples in other industries.
You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.
And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)
There are some applications that require high throughput (usually write) but can be fine with read consistency.
A couple of examples:
- consumer-facing comment systems, where it's OK for other users to miss your comment by 30 seconds
- time-series logging, where you read infrequently but write heavily in a denormalized format, so joins aren't as critical
For general CRUD, ACID is important though.
Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)
Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)
You notice it when web servers are doing kernel bypass for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffer.
With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.
But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.
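There's also a middle ground: keep mmap but tell the kernel what you know about your access pattern via madvise. A hedged sketch (file contents and sizes are arbitrary; the hints are advisory, the kernel may ignore them):

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(os.urandom(mmap.PAGESIZE * 8))

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Hints, not commands: the kernel may use them to tune readahead/eviction.
m.madvise(mmap.MADV_SEQUENTIAL)                   # we'll scan front to back
m.madvise(mmap.MADV_WILLNEED, 0, mmap.PAGESIZE)   # prefetch the first page

checksum = sum(m[i] for i in range(0, len(m), 512))
```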
You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].
If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.
I do wonder what trend is going to win: bypass the kernel or embrace the kernel for everything?
The way I see it, latency decreases either way (as long as you don't have to switch back and forth between kernel and user space), but userspace seems better from a security standpoint.
Then again, everyone is doing eBPF, so probably the "embrace the kernel" approach is going to win. Who knows.
That may be acceptable for your purposes, or it may not.
Then I found out Apache supports it via the EnableSendfile directive. Nice.
>This directive controls whether httpd may use the sendfile support from the kernel to transmit file contents to the client. By default, when the handling of a request requires no access to the data within a file -- for example, when delivering a static file -- Apache httpd uses sendfile to deliver the file contents without ever reading the file if the OS supports it.
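The same zero-copy path is reachable from application code via the sendfile(2) syscall that this directive enables. A small Python sketch, using a socketpair to stand in for a real client connection:

```python
import os, socket, tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"static file contents")

# A connected socket pair stands in for a real client connection.
server, client = socket.socketpair()

with open(path, "rb") as f:
    # sendfile(2): the kernel moves file pages straight to the socket;
    # the bytes never pass through a user-space buffer.
    sent = os.sendfile(server.fileno(), f.fileno(), 0, 20)

received = client.recv(64)
```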
Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.
Then there's the part about writes being delayed. Be prepared to deal with blocks not necessarily being written to disk in the order they were dirtied, and as much as 10 seconds after the fact. This can make power failures cause inconsistencies.
This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
This is not true. It depends on how the file was opened. You may request O_DIRECT | O_SYNC when opening, and then writes are acknowledged only when they have actually been written. This is obviously a lot slower than writing to the cache, but it is the way for "simple" user-space applications to implement their own cache.
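For illustration, O_SYNC alone already changes what an acknowledged write() means. (O_DIRECT additionally bypasses the page cache, but it requires block-aligned buffers and filesystem support, so it's left out of this sketch.)

```python
import os, tempfile

path = tempfile.mkstemp()[1]

# With O_SYNC, write() returns only after the data is on stable storage:
# the acknowledgement means "durable", not "sitting in the page cache".
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
n = os.write(fd, b"acknowledged means durable")
os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
```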
In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).
> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.
What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.
Here is an LWN article discussing the whole problem as the Postgres team found out about it.
For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.
That flushes for everyone, not just the file you mapped to memory. I.e., the guarantee is that your file will be written, but there's no way to do that without affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.
> int msync(void addr[.length], size_t length, int flags);
> msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem
Is there a performance benefit to be had by managing the memory and paging yourself? Yes. But eventually you will also consider running processes next to your database, for logging, auditing, ingesting data, running backups, etc. Virtual memory across the whole system helps with that, especially if other people will be using your database in ways you can't predict. As for the efficiency of MMUs and the OS, seems like for almost all cases it's "satisfactory" enough[1].
[0] http://denninginstitute.com/pjd/PUBS/bvm.pdf
[1] From 1969! https://dl.acm.org/doi/pdf/10.1145/363626.363629
The reality is there will always be a hierarchy for storage, and paging will always be the best mechanism to deal with it. Primary memory will always be the most expensive, no matter what technology it's based on. There will always be something slower, cheaper, and denser used for secondary storage; its capacity will exceed primary, and it will always be most efficient to reference secondary storage in chunks - pages - not at individual byte addresses.
From reading the paper, most of the concerns are around the write side. LMDB is the primary implementation I know of that leans heavily into mmap, but it also comes with a number of constraints there (single writer; long-lived read transactions can block page reclamation and grow the database file without bound; etc.). As with any tech choice it's about knowing the constraints/trade-offs and making appropriate choices for your domain.
The alternative of doing actual file I/O sucks in terms of complexity. I get that you can write bespoke code that performs better, but mmap is a one-liner to turn a file into an array.
As for why disk reads fail, yes that's a thing. Less common on internal storage (bad sectors), but more common on removable USB devices or Network drives (especially on wifi).
There's so much you get "for free" and the UX/DX of reads/writes to it, especially if you're primarily operating on structs instead of raw byte/string data.
(Example, reading a file and "reinterpret_cast<>"'ing it from bytes to in-memory struct representations)
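The Python equivalent of that cast is `ctypes.Structure.from_buffer` over the mapping: the struct's fields read and write the mapped bytes directly, with no deserialization pass. The record layout here is invented for illustration, and native byte order is assumed:

```python
import ctypes, mmap, os, struct, tempfile

# Hypothetical fixed-size on-disk record (fields chosen so there is no
# padding: 4 + 4 + 8 = 16 bytes on common ABIs).
class Record(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint32),
                ("flags", ctypes.c_uint32),
                ("score", ctypes.c_double)]

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(struct.pack("=IId", 7, 0, 1.5))   # one 16-byte record

fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, 0)

# Zero-copy "cast": the struct's fields are views into the mapped pages.
rec = Record.from_buffer(m)
```

Assigning to `rec.score` would write straight through to the page cache, which is exactly the double-edged convenience the thread is debating.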
It's just that for the _particular_ case of a DBMS that relies on optimal I/O and transactionality, the general-purpose kernel implementation of mmap falls short of what you can implement by hand.
If you have the resources to write and maintain the bespoke method great. The large database developers probably have this. For others please don't take this link and go around claiming mmap is bad though. That gets tiresome and is misguided. Mmap is a shortcut to access large files in a non linear fashion. It's good at that too. Just not as good as a bespoke function.
This is an appeal to core database engineers to stop using the wrong tool for the job.
Another technique that can only be done with mmap is to map two contiguous regions of virtual memory to the same underlying buffer. This allows you to use a ring buffer but only read from/write to what looks like a contiguous region of memory.
Also, I've never tested this, but I believe mapped files will get flushed as long as the system stays running. So if you only need resilience against abnormal termination rather than system crashes, it seems like a good option?
However, Java can build a special library file of the core JRE classes that it can mmap into memory with the intent to speed up startup times, mostly for small Java programs.
Guile scheme will mmap files that have been compiled to byte code. You can visualize a contrived (especially today) scenario where Guile is used for CGI handlers, having the bulk of their code mapped, the overall memory impact of simultaneous handlers is much lower, as well as start up times.
The process model is less common today so the value of this goes down, but it can still have its place.
A very over-simplified and probably a bit incorrect description of what it did: it created a smaller version of the image - one that could fit in memory - by subsampling every nth pixel, which was addressed via mmap.
It actually dealt with jpegs so I have no idea how that bit worked as they are not bitmaps.