Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on the machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user-level process has accurate information about the total state of the machine; only the kernel (or hypervisor) does, and it's an exercise in futility to try to manage paging, etc., at the user level.
As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.
This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those three, you're going to need some monumental evidence of your credibility.
Additionally, what you link here:
> ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in memory! These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMSs.
All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT pread()s versus mmap'ed I/O.
(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer of BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments.)
There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage (oftentimes literally the same people).
As for the "single process" argument... well, if you run a database inside an OS, then obviously that will never happen, as the OS has its own processes to run. But, if you ignore that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations of their database anyway and would have no idea how to configure / tune it for high performance. So it doesn't matter how they run it; they aren't the target audience -- they are light years behind what's possible to achieve with their resources.
This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons: mmap doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or the database's own data.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases -- it's all about "edge computing". There are far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons: mmap doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or the database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
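A minimal sketch of that split, assuming POSIX semantics (the file name and page size here are invented, and this is only the shape of the idea, not LMDB's actual code): reads go through a read-only shared mapping, while writes use pwrite() plus fsync() so the engine controls durability explicitly.

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "pages.db")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)                      # one 4 KiB "page"

# Read side: a read-only, shared mapping -- readers never copy page data.
view = mmap.mmap(fd, 0, mmap.MAP_SHARED, mmap.PROT_READ)

# Write side: an ordinary pwrite() through the file descriptor, then
# fsync() to control exactly when the update becomes durable.
os.pwrite(fd, b"record-1", 0)
os.fsync(fd)

# The shared mapping observes the write via the unified page cache.
print(view[:8])          # b'record-1'
view.close()
os.close(fd)
```

The point of the split is that readers get zero-copy access through the mapping, while the persistence-critical path never relies on the kernel deciding when to write dirty mapped pages back.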
In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.
Where did you look? This is a sample using DBs 5x and 50x larger than RAM: http://www.lmdb.tech/bench/hyperdex/
There are plenty of other larger-than-RAM benchmarks there.
The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true, I think.
Those who do that don't know what they are doing (even if they outnumber the other side a hundred to one, they "don't count" because they aren't aiming for good performance anyway).
Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container for the convenience of assembling all dependencies in a single "package"; however, they would never run the database on the same node as applications -- that's insanity.
But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.
Obviously, that says nothing about the number of processes that will run on the database node. At a minimum, you'd want to run some stuff for monitoring, besides all the system services... but I don't think GP meant "one process" literally. That is neither realistic nor necessary.
(But just in containers, not in Kubernetes. I'm not crazy.)
And we are running them at a scale that most people can't even imagine.
Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good or even better than I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.
AFAIK the persistent backend was dropped pretty early on (and eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became to just use malloc and hope you didn't swap.
What did you do differently in your custom one that made it faster than Varnish?
Reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration for efficiency. The car industry started this way with Ford, went away from it, and now Tesla is doing it again. Lots of other examples in other industries.
You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.
And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)
Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)
Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)
You notice it when web servers are doing kernel bypass for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffer.
With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.
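A minimal sketch of that convenience (file name and contents are invented): there's no read loop, no buffer sizing, no cache eviction logic -- the kernel pages data in on demand and evicts it under memory pressure.

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"a\nb\nc\n")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access at any offset with no seek/read bookkeeping;
    # caching and eviction are entirely the kernel's problem.
    third_record = mm[4:5]
    mm.close()

print(third_record)  # b'c'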
But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.
You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].
If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.
Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.
Then there's the part about writes being delayed. Be prepared to deal with blocks not necessarily reaching disk in the order they were written, and 10 seconds after the fact. This can make power failures cause inconsistencies.
This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
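A small sketch of the two durability points, assuming POSIX semantics (the file name is invented): data written either way sits in the page cache until you explicitly force it out.

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "journal.bin")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)

# write() path: the data lands in the page cache; it is not guaranteed
# durable until fsync() returns.
os.pwrite(fd, b"hello", 0)
os.fsync(fd)

# mmap path: a store through the mapping is just as volatile; msync()
# (exposed in Python as mmap.flush()) is the equivalent durability point.
mm = mmap.mmap(fd, 4096)
mm[0:5] = b"world"
mm.flush()
mm.close()
os.close(fd)
```

Without the fsync()/flush() calls, both updates could linger in memory for seconds and reach disk in either order.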
This is not true. It depends on how the file was opened. You may request O_DIRECT | O_SYNC when opening, and then writes are acknowledged only when they are actually written. This is obviously a lot slower than writing to cache, but this is the way for "simple" user-space applications to implement their own cache.
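A sketch of the synchronous-write half of that, assuming Linux/POSIX (the file name is invented): with O_SYNC, each write() returns only after the data, and the metadata needed to retrieve it, are on stable storage.

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "sync.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"durable")    # acknowledged only after hitting the device
os.close(fd)

# O_DIRECT would additionally bypass the page cache entirely, but it
# requires the buffer, offset, and length to be aligned (typically to
# 512 B or the filesystem block size), so it's omitted from this sketch.
```

The latency cost is why engines that open with these flags layer their own user-space cache on top.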
In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
Here is an LWN article discussing the whole problem as the Postgres team found out about it.
For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.
It flushes for everyone, not just the file you mapped to memory. I.e. the guarantee is that your file will be written, but there's no way to do that w/o affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.
Is there a performance benefit to be had by managing the memory and paging yourself? Yes. But eventually you will also consider running processes next to your database, for logging, auditing, ingesting data, running backups, etc. Virtual memory across the whole system helps with that, especially if other people will be using your database in ways you can't predict. As for the efficiency of MMUs and the OS, seems like for almost all cases it's "satisfactory" enough[1].
[0] http://denninginstitute.com/pjd/PUBS/bvm.pdf
[1] From 1969! https://dl.acm.org/doi/pdf/10.1145/363626.363629
The reality is there will always be a hierarchy for storage, and paging will always be the best mechanism to deal with it. Primary memory will always be the most expensive, no matter what technology it's based on. There will always be something slower, cheaper, and denser used for secondary storage, its capacity will always exceed primary, and it will always be most efficient to reference secondary storage in chunks -- pages -- and not at individual byte addresses.
From reading the paper, most of the concerns are around the write side. LMDB is the primary implementation I know of that leans heavily into mmap, but it also comes with a number of constraints there (single writer, long-lived read locks can lead to unbounded database file growth, etc.). As with any tech choice, it's about knowing the constraints/trade-offs and making appropriate choices for your domain.
The opposite, with actual file I/O, sucks in terms of complexity. I get that you can write bespoke code that performs better, but mmap is a one-liner to turn a file into an array.
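The one-liner in question, as a hedged sketch (file name and contents invented): once mapped, the file behaves like a mutable byte array, with none of the seek/read/write choreography.

```python
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "array.bin")
with open(path, "wb") as f:
    f.write(b"hello world")

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # the one-liner: file -> byte array
    mm[0:5] = b"HELLO"              # in-place edit at an arbitrary offset
    mm.close()

print(open(path, "rb").read())  # b'HELLO world'
```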
As for why disk reads fail, yes, that's a thing. It's less common on internal storage (bad sectors), but more common on removable USB devices or network drives (especially on wifi).
There's so much you get "for free" and the UX/DX of reads/writes to it, especially if you're primarily operating on structs instead of raw byte/string data.
(For example, reading a file and reinterpret_cast<>'ing it from bytes to in-memory struct representations.)
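The Python analogue of that reinterpret_cast<> trick is a typed memoryview over the mapping, which reads fixed-layout records in place without copying (this sketch assumes a native little-endian 4-byte unsigned int, as on x86/ARM; the record layout is invented):

```python
import mmap, os, struct, tempfile

path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    f.write(struct.pack("=3I", 10, 20, 30))   # three native uint32 "records"

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    records = memoryview(mm).cast("I")        # bytes -> uint32[] view, zero-copy
    second = records[1]
    records.release()                         # must release before closing the map
    mm.close()

print(second)  # 20
```

The appeal is the same as in C++: the on-disk bytes are the in-memory representation, so there's no deserialization step at all.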
It's just that for the _particular_ case of a DBMS that relies on optimal I/O and transactionality, the general-purpose kernel implementation of mmap falls short of what you can implement by hand.
If you have the resources to write and maintain the bespoke method, great. The large database developers probably do. For everyone else, please don't take this link and go around claiming mmap is bad. That gets tiresome and is misguided. mmap is a shortcut for accessing large files in a non-linear fashion, and it's good at that -- just not as good as a bespoke solution.
This is an appeal to core database engineers to stop using the wrong tool for the job.