Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on the machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user-level process has accurate information about the total state of the machine; only the kernel (or hypervisor) does, and it's an exercise in futility to try to manage paging etc. at the user level.
As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.
This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those three, you're going to need some monumental evidence of your credibility.
Additionally, what you link here:
> ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in memory! These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMSs.
All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.
(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer for BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments.)
Andy and I have had this debate going for a long time already.
Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.
- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls
- mmap() complicates transactionality and error-handling
- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks
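The first bullet can be made concrete: with mmap the read I/O hides inside an ordinary memory access, while an explicit pread keeps it visible at a call site. A minimal Python sketch (the file and offsets are invented for illustration):

```python
import mmap, os, tempfile

# Hypothetical data file: four pages of known bytes.
path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"x" * mmap.PAGESIZE * 4)

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Implicit I/O: if this page has been evicted, the innocent-looking index
# blocks on a page fault -- the stall is invisible in the source code.
b1 = m[3 * mmap.PAGESIZE]

# Explicit I/O: pread makes the read (and any stall) visible and
# schedulable at the call site.
b2 = os.pread(fd, 1, 3 * mmap.PAGESIZE)[0]
```

Both reads return the same byte; the difference is only in who decides when the I/O happens.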
We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.
There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage people (oftentimes literally the same people).
As for the "single process" argument... well, if you run a database inside an OS then, obviously, that will never happen, as the OS has its own processes to run. But, ignoring that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations from their database anyway and would have no idea how to configure / tune it for high performance; it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.
This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or the database's own data.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
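That split (read-only map for reads, write syscalls plus explicit sync for writes) can be sketched in a few lines of Python -- this is the general pattern, not LMDB's actual code, and the record bytes are made up:

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, mmap.PAGESIZE)

# Read path: a read-only shared mapping gives zero-copy access to the
# page cache -- readers never copy data into user-space buffers.
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Write path: ordinary write syscalls, with an explicit fsync so the
# application controls exactly when the data becomes durable.
os.pwrite(fd, b"record-1", 0)
os.fsync(fd)
```

Because the mapping and the write syscalls share the same page cache, the read-only map observes the write immediately, without any copying.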
If you are developing a DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.
From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.
Not to be gatekeeping or anything, but it is a pretty well-studied field with lots of very knowledgeable people around, who are probably more than keen to help. There aren't too many qualified jobs around, and you probably have a budget if you are developing a database commercially.
It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().
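Yes -- in Python, `mmap.flush()` wraps that msync call. A small sketch with a throwaway file:

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, mmap.PAGESIZE)

m = mmap.mmap(fd, 0)          # MAP_SHARED, read/write by default
m[:5] = b"dirty"

# mmap.flush() wraps msync(MS_SYNC): push the dirty pages to storage now,
# rather than waiting for kernel writeback or munmap().
m.flush()

on_disk = os.pread(fd, 5, 0)
```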
That's patently false. There are about 8 billion people. Even if everyone has a smartphone or two, it's nothing compared to the total of all devices that can be called "computers". I think "smart TVs" alone will beat the number of smartphones. But even that is a drop in the bucket when it comes to the total of running programs on Earth / in its orbit.
But, that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant for this conversation because they are the wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hope of achieving good performance, and whether they use mmap or not is of no consequence -- they've lost the race before they even qualified for it.
> LMDB
Are we talking about this? https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa... If so, this is irrelevant for databases in general.
> LMDB databases may have only one writer at a time
(Taken from the page above) -- this isn't a serious contender for database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't.
In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of LMDB have relatively small data sets.
Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/
There are plenty of other larger-than-RAM benchmarks there.
It's also strange to me that there's no transition in performance when the data set size grows beyond cache.
The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.
Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).
Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.
But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.
Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, and that's beside all the system services... but I don't think the GP meant "one process" literally. Neither is that realistic, nor is it necessary.
The point was simply about other processes that could be competing for resources - CPU, memory, or I/O. It is expensive for a user-level process to perform accounting for all of these resources, and without such accounting you can't optimally allocate them.
If there are other apps that can suddenly spike memory usage then any careful buffer tuning you've done goes out the window. Likewise for any I/O scheduling you've done, etc.
(But just in containers, not in Kubernetes. I'm not crazy.)
And we are running them at the scale that most people can’t even imagine.
It is a realistic concern, I’ve lived it for more than a decade across many orgs, though I shared your opinion at one point. Storage density is massively important for both workload scalability and economic efficiency. Low storage density means buying a ton of server hardware that sits idle under max load and vastly larger clusters than would otherwise be necessary, which have their own costs.
When your database is sufficiently large, backup and restore often isn’t even a technical possibility so that requirement is a red herring. The kinds of workloads that can be recovered from backup at that scale on a single server, and some can, benefit massively from the economics of running it on a single server. A solution that has 10x the AWS bill for the same workload performance doesn’t get chosen.
At scale, hardware footprint economics is one of the central business decision drivers. Data isn’t getting smaller. It is increasingly ordinary for innocuous organizations to have a single table with a trillion records in it.
For better or worse, the market increasingly drives my technical design decisions to optimize for hardware/cloud costs above all else, and dense storage is a huge win for that.
Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good as or even better than explicit I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.
AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.
What did you do differently in your custom one that made it faster than Varnish?
Reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration for efficiency. The car industry started this way with Ford, moved away from it, and now Tesla is doing it again. There are lots of other examples in other industries.
You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.
And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)
There are some applications that require high throughput (usually write) but can be fine with read consistency.
A couple of examples:
- consumer-facing comment systems, where it's OK for other users to miss your comment by 30 seconds
- time-series logging, where you read infrequently but write heavily in a denormalized format, so joins aren't as critical
For general CRUD, ACID is important though.
Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)
Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)
You notice it when web servers are doing kernel bypass for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffer.
With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.
But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.
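There's also a middle ground: keep mmap but tell the kernel what you know about your access pattern via madvise. A hedged sketch (file contents and sizes are arbitrary; the hints are advisory, the kernel may ignore them):

```python
import mmap, os, tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(os.urandom(mmap.PAGESIZE * 8))

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

# Hints, not commands: the kernel may use them to tune readahead/eviction.
m.madvise(mmap.MADV_SEQUENTIAL)                   # we'll scan front to back
m.madvise(mmap.MADV_WILLNEED, 0, mmap.PAGESIZE)   # prefetch the first page

checksum = sum(m[i] for i in range(0, len(m), 512))
```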
You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].
If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.
I do wonder what trend is going to win: bypass the kernel or embrace the kernel for everything?
The way I see it, latency decreases either way (as long as you don't have to switch back and forth between kernel and user space), but userspace seems better from a security standpoint.
Then again, everyone is doing eBPF, so probably the "embrace the kernel" approach is going to win. Who knows.
That may be acceptable for your purposes, or it may not.
Then I found out Apache supports it via the EnableSendfile directive. Nice.
>This directive controls whether httpd may use the sendfile support from the kernel to transmit file contents to the client. By default, when the handling of a request requires no access to the data within a file -- for example, when delivering a static file -- Apache httpd uses sendfile to deliver the file contents without ever reading the file if the OS supports it.
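The same zero-copy path is reachable from application code via the sendfile(2) syscall that this directive enables. A small Python sketch, using a socketpair to stand in for a real client connection:

```python
import os, socket, tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"static file contents")

# A connected socket pair stands in for a real client connection.
server, client = socket.socketpair()

with open(path, "rb") as f:
    # sendfile(2): the kernel moves file pages straight to the socket;
    # the bytes never pass through a user-space buffer.
    sent = os.sendfile(server.fileno(), f.fileno(), 0, 20)

received = client.recv(64)
```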
Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.
Then there's the part about writes being delayed. Be prepared to deal with blocks not necessarily being written to disk in the order they were dirtied, and as much as 10 seconds after the fact. This can make power failures cause inconsistencies.
This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
This is not true. It depends on how the file was opened. You may request O_DIRECT | O_SYNC when opening, and then writes are acknowledged only when they have actually been written. This is obviously a lot slower than writing to the cache, but it is the way for "simple" user-space applications to implement their own cache.
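For illustration, O_SYNC alone already changes what an acknowledged write() means. (O_DIRECT additionally bypasses the page cache, but it requires block-aligned buffers and filesystem support, so it's left out of this sketch.)

```python
import os, tempfile

path = tempfile.mkstemp()[1]

# With O_SYNC, write() returns only after the data is on stable storage:
# the acknowledgement means "durable", not "sitting in the page cache".
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
n = os.write(fd, b"acknowledged means durable")
os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
```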
In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).
> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.
What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.
Here is an LWN article discussing the whole problem as the Postgres team found out about it.
For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.
That flushes for everyone, not just the file you mapped to memory. I.e., the guarantee is that your file will be written, but there's no way to do that without affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.
> int msync(void addr[.length], size_t length, int flags);
> msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem
Is there a performance benefit to be had by managing the memory and paging yourself? Yes. But eventually you will also consider running processes next to your database, for logging, auditing, ingesting data, running backups, etc. Virtual memory across the whole system helps with that, especially if other people will be using your database in ways you can't predict. As for the efficiency of MMUs and the OS, seems like for almost all cases it's "satisfactory" enough[1].
[0] http://denninginstitute.com/pjd/PUBS/bvm.pdf
[1] From 1969! https://dl.acm.org/doi/pdf/10.1145/363626.363629
The reality is there will always be a hierarchy for storage, and paging will always be the best mechanism to deal with it. Primary memory will always be the most expensive, no matter what technology it's based on. There will always be something slower, cheaper, and denser used for secondary storage; its capacity will exceed primary, and it will always be most efficient to reference secondary storage in chunks - pages - not at individual byte addresses.
From reading the paper, most of the concerns are around the write side. LMDB is the primary implementation I know of that leans heavily into mmap, but it also comes with a number of constraints there (single writer; long-lived read transactions can block page reclamation and grow the database file without bound; etc.). As with any tech choice it's about knowing the constraints/trade-offs and making appropriate choices for your domain.
The alternative of doing actual file I/O sucks in terms of complexity. I get that you can write bespoke code that performs better, but mmap is a one-liner to turn a file into an array.
As for why disk reads fail, yes that's a thing. Less common on internal storage (bad sectors), but more common on removable USB devices or Network drives (especially on wifi).
There's so much you get "for free" and the UX/DX of reads/writes to it, especially if you're primarily operating on structs instead of raw byte/string data.
(Example, reading a file and "reinterpret_cast<>"'ing it from bytes to in-memory struct representations)
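The Python equivalent of that cast is `ctypes.Structure.from_buffer` over the mapping: the struct's fields read and write the mapped bytes directly, with no deserialization pass. The record layout here is invented for illustration, and native byte order is assumed:

```python
import ctypes, mmap, os, struct, tempfile

# Hypothetical fixed-size on-disk record (fields chosen so there is no
# padding: 4 + 4 + 8 = 16 bytes on common ABIs).
class Record(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint32),
                ("flags", ctypes.c_uint32),
                ("score", ctypes.c_double)]

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(struct.pack("=IId", 7, 0, 1.5))   # one 16-byte record

fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, 0)

# Zero-copy "cast": the struct's fields are views into the mapped pages.
rec = Record.from_buffer(m)
```

Assigning to `rec.score` would write straight through to the page cache, which is exactly the double-edged convenience the thread is debating.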
It's just that for the _particular_ case of a DBMS that relies on optimal I/O and transactionality, the general-purpose kernel implementation of mmap falls short of what you can implement by hand.
If you have the resources to write and maintain the bespoke method great. The large database developers probably have this. For others please don't take this link and go around claiming mmap is bad though. That gets tiresome and is misguided. Mmap is a shortcut to access large files in a non linear fashion. It's good at that too. Just not as good as a bespoke function.
This is an appeal to core database engineers to stop using the wrong tool for the job.
Another technique that can only be done with mmap is to map two contiguous regions of virtual memory to the same underlying buffer. This allows you to use a ring buffer but only read from/write to what looks like a contiguous region of memory.
Also, I've never tested this, but I believe mapped files will get flushed as long as the system stays running. So if you only need resilience against abnormal termination rather than system crashes, it seems like a good option?
However, Java can build a special library file of the core JRE classes that it can mmap into memory with the intent to speed up startup times, mostly for small Java programs.
Guile scheme will mmap files that have been compiled to byte code. You can visualize a contrived (especially today) scenario where Guile is used for CGI handlers, having the bulk of their code mapped, the overall memory impact of simultaneous handlers is much lower, as well as start up times.
The process model is less common today so the value of this goes down, but it can still have its place.
A very over-simplified and probably a bit incorrect description of what it did: it created a smaller version of the image - one that could fit in memory - by subsampling every nth pixel, which was addressed via mmap.
It actually dealt with jpegs so I have no idea how that bit worked as they are not bitmaps.