You need to care if you use mmap directly to map files or other resources into the virtual address space. The default page size can be queried with, for example, sysconf() on Linux. I guess something like the garbage collector in a language runtime would also use mmap directly, since it most likely sidesteps malloc/new.
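For what it's worth, here is a minimal sketch of that query from Python, which wraps the same sysconf() call. The helper `round_up_to_page` is a made-up name for illustration, not something from the thread.

```python
import mmap
import os

# Page size as reported by sysconf(); "SC_PAGE_SIZE" is the POSIX name.
page_size = os.sysconf("SC_PAGE_SIZE")

# The mmap module exposes the page size as a constant as well.
print(f"sysconf page size: {page_size} bytes")
print(f"mmap.PAGESIZE:     {mmap.PAGESIZE} bytes")


def round_up_to_page(length: int, page: int = page_size) -> int:
    """Round a mapping length up to a whole number of pages (hypothetical helper)."""
    return -(-length // page) * page


print(round_up_to_page(1))  # one full page, whatever the size is on this machine
```

Anything that calls mmap directly has to do this kind of rounding itself, which is why a hard-coded 4096 tends to break on 16k or 64k systems.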
An application would normally not use madvise, unless also using mmap for some special purpose.
It depends on the CPU architecture how flexible it is with different page sizes. For example, from what I recall, MIPS was extremely flexible and allowed any even power of two size for any TLB entry.
x86_64 only supports three different page sizes (4 kB, 2 MB and 1 GB), and there are limitations on the number of TLB entries that can be used for the larger page sizes.
So, yea, there are bound to be regressions if just trying to switch to 2 MB as a default but I think it should be doable. Not all archs use 4 kB to begin with.
Just switching to a larger regular page size (e.g. 64k) on platforms that support it would not have the problems associated with THP.
Why did they assume that? 4k pages are a feature of the memory management unit of the cpu. Optional support for large pages came to x86 with pentium in the mid-1990’s. Presumably all x86 cpus out there today have large page support, but the assumption of the 4k default is deeply ingrained.
"fork() e.g. Redis: Calling fork marks all of the process's pages as copy-on-write. Then when a single byte on a page is modified, the page must be copied. Redis uses fork to create a read-only "snapshot" of memory, when writing a checkpoint to disk."
So before, a write after that fork forced a copy of only a 4 kB page at a time. With 2 MB pages, each such write obviously forces a much larger copy.
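A rough back-of-the-envelope sketch of that amplification, assuming each write lands on a different page. The workload (1,000 single-byte writes spaced 4 MiB apart) is invented purely for illustration and is not a Redis measurement.

```python
KIB = 1024
MIB = 1024 * KIB


def cow_bytes_copied(dirty_offsets, page_size):
    """Bytes copied if each offset is written once after fork()."""
    touched_pages = {offset // page_size for offset in dirty_offsets}
    return len(touched_pages) * page_size


# Hypothetical workload: 1,000 single-byte writes, 4 MiB apart.
offsets = [i * 4 * MIB for i in range(1000)]

small = cow_bytes_copied(offsets, 4 * KIB)  # 1,000 pages of 4 KiB each
huge = cow_bytes_copied(offsets, 2 * MIB)   # 1,000 pages of 2 MiB each

print(small // KIB, "KiB copied with 4 KiB pages")
print(huge // MIB, "MiB copied with 2 MiB pages")
print("amplification:", huge // small)
```

For scattered writes the copy cost scales with the page size, so the ratio here is simply 2 MiB / 4 KiB = 512.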
[0] The state of ASLR on Android 5 (2015), https://archive.is/ADx65 (copperhead.co).
But transparent hugepages continue to be a massive source of bugs, weird behaviors, and total system failures in my experience. I just got a bug report this week where a simple THP-enabled system spun out of control with a kernel task locking the system at 100% CPU for minutes, with a 10-line reproducer via mmap(2). This is in combination with qemu/libvirt in a virtual machine, and it's possible the virtualization stack is just exposing bugs, but, like. This is very well tested stuff! I'm not sure "Google enabled it fleet-wide, so it can be done" is very reassuring to me when most of us don't have fleetops/infra/kernel teams capable of dealing with this stuff. The person who reported this bug said they started seeing odd behavior a month ago, before boiling it down; it wasn't readily apparent at all. Is this just a massive footgun for our distro users? I dunno. Something that works in the p95 and then collapses horrifically in the p99 cases like this doesn't feel great. I try not to be superstitious about things like this but, yeah. It's weird.
Anyway. This reminds me I have to submit some patches to disable jemalloc in a few aarch64 packages so I can use them on Asahi Linux. 4k pages will haunt us until the end of time.
I think it is useful to distinguish just larger regular pages (i.e. 16k or 64k pages on arm64) and extraordinary huge pages (2M on systems where regular pages are 4k). In the first case, the system uses these larger pages uniformly, as there is no smaller page.
Linux has a MAP_HUGETLB constant that you can pass to mmap() to opt your application into 2MiB or 1GiB pages. Unfortunately, last time I tried it, it required the system admin to enable support for it, which wasn't on by default (on Debian at least). So from the perspective of an ordinary application developer that's useless.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ grep HUGEPAGE /boot/config-5.10.0-20-amd64
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
Why??
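In the transparent_hugepage/enabled output quoted above, the bracketed word is the active mode. A small sketch for reading it programmatically follows; the function name `active_thp_mode` is made up for illustration.

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) choice from e.g. '[always] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed mode in {text!r}")
    return match.group(1)


print(active_thp_mode("[always] madvise never"))  # always

# On a Linux machine with THP compiled in, report the live setting too.
try:
    print("this machine:", active_thp_mode(THP_ENABLED.read_text()))
except (OSError, ValueError):
    print("this machine: THP setting not readable")
```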
Going from 4k to 2M doesn't waste a fixed amount of memory, it wastes a percentage of memory. So the amount of memory lost continues to scale with memory size. Losing 10% of your memory when you only have 32MB hurts just as much as losing 10% when you have 32GB, because we didn't get a 1000x increase in free RAM. Applications just use more RAM than they used to.
Now an increase to 8k, 16k, or even 64k might be defensible. And even that last one will draw some scowls from very serious people. But going straight to 2M pages by default will not be a good tradeoff. Especially since the current system with differently sized pages gets you most of the performance gain for a much more reasonable cost.
Between a computer with 1GB and a computer with 64GB, any vaguely reasonable default page size you pick (whether that's 64k or 2M) is far from the total memory size, so the percentage wasted doesn't get much smaller as RAM continues increasing. I.e. the relative benefit of larger pages grows much slower than total system memory.
So in other words, my point is that default page size should not just grow linearly with memory size; the best tradeoff is something a little more complex. Saying that RAM grew 1000x, therefore page size should grow a lot, is aiming too high too fast. Especially given that hardware already supports mixed page sizes (despite lackluster software support at the moment).
Take bash for example: its RSS is about 4 MB, but it has 43 memory maps (on my system). One is > 1 MB, two are 512-1024 kB, and most are very small. Each requires at least one page, so with mandatory 2 MB pages its RSS would be 86 MB instead of 4 MB.
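The arithmetic can be sketched like this. The individual mapping sizes below are invented to roughly match the description (one above 1 MB, two between 512 and 1024 kB, the rest small); only the count of 43 comes from the comment.

```python
KIB = 1024
MIB = 1024 * KIB


def footprint(mapping_sizes, page_size):
    """Total memory if every mapping is rounded up to whole pages."""
    return sum(-(-size // page_size) * page_size for size in mapping_sizes)


# Hypothetical sizes for 43 mappings: 1 large, 2 medium, 40 small.
sizes = [1536 * KIB] + [768 * KIB] * 2 + [16 * KIB] * 40

print(len(sizes), "mappings")
print(footprint(sizes, 4 * KIB) // KIB, "KiB with 4 KiB pages")
print(footprint(sizes, 2 * MIB) // MIB, "MiB with 2 MiB pages")
```

With 2 MiB pages every mapping here rounds up to exactly one page, so the total is 43 x 2 MiB = 86 MiB regardless of how small the mappings actually are.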
Edit: I did a quick search. The upside is less TLB thrashing; the downside is potentially more disk thrashing as pages are paged in and out.
dang or hn mod might have to fix it
It's strange to me that this is an issue so late in the game.
It's a problem with NVMe drives, especially large ones: we're just starting to see 4kn.
Even for a 4kn drive, the block size they report (and the erase block size) are not the ones used internally. There are some tools to infer the correct size using breakpoint detection in latency measurements.
There are also some simple rules of thumb: 2^16 (64k) for Samsung drives >1TB.
This should all just be a single boolean flag in the database config telling it to use huge pages which it gets from mmap dynamically. Why is any of the filesystem, permission, static allocation malarkey necessary?
While I can see why special permissions are needed to grab them, the whole filesystem thingy is clunky as hell. I have no idea why they didn't put them by default in /sys or /proc.
FWIW, those bits shouldn't be necessary with postgres. If huge_pages is try or on, we'll specify MAP_HUGETLB (or (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT if huge_page_size is set). If mmap() fails, we'll either error out (=on) or fall back to a non-huge allocation (=try).
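To illustrate the try/on policy, here is a rough sketch in Python. This is not postgres's code. The function name `allocate` is made up, and the numeric fallback for MAP_HUGETLB is an assumption (it is the Linux value on x86-64 and arm64; some architectures differ, and not every Python version exports the constant).

```python
import mmap

# Assumption: 0x40000 is MAP_HUGETLB on Linux x86-64/arm64. Use the module's
# constant if this Python version exports one.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)


def allocate(length: int, huge_pages: str = "try"):
    """Return (mapping, used_huge) following a try/on/off style policy.

    length should be a multiple of the huge page size when huge pages are used.
    """
    base_flags = mmap.MAP_PRIVATE  # fileno=-1 makes the mapping anonymous
    if huge_pages in ("try", "on"):
        try:
            return mmap.mmap(-1, length, flags=base_flags | MAP_HUGETLB), True
        except OSError:
            if huge_pages == "on":
                raise  # "on": fail loudly instead of silently degrading
            # "try": fall through to a regular mapping
    return mmap.mmap(-1, length, flags=base_flags), False


region, used_huge = allocate(2 * 1024 * 1024, "try")
print("huge pages used:", used_huge)
region.close()
```

On a machine with no huge pages reserved, the first mmap() typically fails and the "try" path quietly returns a normal mapping.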
However, you do need to allocate huge pages at the system level for this to succeed. But it's indeed just a /proc setting (or /sys if you want more control): /proc/sys/vm/nr_hugepages, or /sys/kernel/mm/hugepages/hugepages-kB/nr_hugepages.
One of the more awkward bits about the kernel config is that they are calculated in pages, so you need to do the conversion yourself :(
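The conversion itself is just a rounded-up division. A small sketch, with `hugepages_needed` as a made-up helper name:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def hugepages_needed(total_bytes: int, huge_page_bytes: int = 2 * MIB) -> int:
    """Number of huge pages needed to cover total_bytes, rounded up."""
    return -(-total_bytes // huge_page_bytes)


print(hugepages_needed(32 * GIB))           # 16384 pages of 2 MiB
print(hugepages_needed(32 * GIB, 1 * GIB))  # 32 pages of 1 GiB
```

Note that the shell one-liner that follows adds one extra page on top of the exact quotient.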
echo $(( (32 * 1024*1024*1024) / (2 * 1024 * 1024) + 1)) | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages && cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
I remember (admittedly years ago) spending a lot of time trying to debug server crippling performance problems, ultimately learning that transparent huge pages were the cause.
Proceed with caution.
It might have been more straightforward to teach the problematic workloads to opt out of THP once the problems were discovered.
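The comment doesn't say how a workload would opt out. One mechanism on Linux is madvise() with MADV_NOHUGEPAGE on the affected mappings, sketched below under that assumption. `map_without_thp` is a made-up name, and the constant is only present in Python's mmap module on platforms that define it.

```python
import mmap


def map_without_thp(length: int):
    """Anonymous mapping that asks the kernel not to back it with THP.

    Returns (mapping, opted_out); opted_out is False if the hint is
    unavailable or rejected.
    """
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)
    advice = getattr(mmap, "MADV_NOHUGEPAGE", None)  # Linux-only constant
    opted_out = False
    if advice is not None:
        try:
            region.madvise(advice)  # applies to the whole mapping
            opted_out = True
        except OSError:
            pass  # e.g. a kernel built without THP support
    return region, opted_out


region, opted_out = map_without_thp(4 * 1024 * 1024)
print("opted out of THP:", opted_out)
region.close()
```

This only affects the one mapping, so the rest of the system can keep whatever THP mode it has.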
It was not at all a pleasant experience due to lack of documentation as well as the flaky implementation. But I was surprised by how much overhead the TLB accounted for.
Apart from using larger/huge pages, you may want to take steps to minimize TLB misses. The Goto BLAS paper talks about that.
https://docs.splunk.com/Documentation/Splunk/9.0.3/ReleaseNo...
I have notes-to-myself which I can't find at the moment, but the TL;DR is: know what you're doing. If you disable it, make a note so that if the server is re-purposed you don't hamstring an unwitting inheriting admin. I was up against the wall on a Linux indexer which was, for mysterious reasons, under-performing by a ridiculous percentage, with all kinds of crazy latency, until I disabled THP. This was years ago. If you disable it, make it a systemd service so it can be discovered; people don't check /sys/kernel/ as a frequent place to look.
Edited: THP not TLB (the translation look-aside buffer).