I also found a simple solution: call malloc_trim() after a GC. This reduces memory usage by 70%.
https://www.joyfulbikeshedding.com/blog/2019-03-14-what-caus...
This means there's a wave of new allocations moving through your address space, and it's leaving behind a fragmented mess. Calling malloc_trim() won't help with the address space fragmentation; it will only free memory pages caught up in the mess. At some point the allocation wave will hit the top of the address space and allocations will start to fail. Usually this is not a problem in 64-bit processes of course, because exhausting a 64-bit address space takes a very long time, but on 32-bit processes this was a real problem.
This is what the MMAP_THRESHOLD tunable solves. With it, every allocation larger than that many bytes is served by its own mmap, which can be munmapped independently of everything else.
I use env MALLOC_MMAP_THRESHOLD_=65536, which reduced the RAM my program wastes on memory fragmentation from 6.5 GB to 0.8 GB.
The benefit of this is that you don't have to decide at which points to call malloc_trim(). But it's expected to be a bit slower because mmap() takes a while. Choosing between malloc_trim() and MALLOC_MMAP_THRESHOLD_ is dual to choosing between GC and reference counting: higher memory use for a while plus having to choose when to clean up, versus a higher per-operation cost.
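For anyone who wants to try both approaches from inside the program rather than via the environment, here is a minimal sketch, assuming glibc on Linux. The two wrapper function names are made up for illustration; mallopt(), M_MMAP_THRESHOLD and malloc_trim() are the glibc calls documented in the man pages.

```c
#include <malloc.h>   /* glibc-specific: mallopt, malloc_trim, M_MMAP_THRESHOLD */
#include <stdlib.h>

/* Hypothetical helper: in-process equivalent of MALLOC_MMAP_THRESHOLD_=<bytes>.
 * Requests above the threshold that can't be served from the free list get
 * their own mmap() and are munmap()ed on free().
 * Returns 1 on success, 0 on error (mallopt's convention). */
int use_mmap_for_large_allocations(int bytes) {
    return mallopt(M_MMAP_THRESHOLD, bytes);
}

/* Hypothetical helper: the "clean up at a point I choose" alternative,
 * e.g. right after a GC cycle.
 * Returns 1 if memory was released to the OS, 0 otherwise. */
int trim_heap_now(void) {
    return malloc_trim(0);   /* 0 = keep no extra padding at the top of the heap */
}
```

You'd call use_mmap_for_large_allocations(65536) once at startup, or trim_heap_now() at whatever point your runtime considers quiet. Note that setting the threshold explicitly turns off glibc's dynamic adjustment of it, per the mallopt man page.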
According to the docs [1,2] it should be called automatically when the free space exceeds the default M_TRIM_THRESHOLD of 128 KiB.
Is it because of this bug [3] or for another reason?
[1] https://man7.org/linux/man-pages/man3/malloc_trim.3.html
[2] https://man7.org/linux/man-pages/man3/mallopt.3.html
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=14827
CPython is less affected than CRuby because CPython has a specialized allocator called obmalloc for small objects up to 512 bytes.
CRuby < 2.6 doesn't have an allocator like this and hits malloc for anything bigger than 24 bytes. 2.6+ can allocate using the "transient heap" which helps but isn't as effective as CPython's obmalloc.
Correct. Not only that, it also dramatically shortens the lifetime of many allocated objects, so the GHC garbage collector is designed for super-cheap allocation and to not be impacted by large amounts of garbage. The trade-off is that the default GC doesn't work very well with very large "live sets" (although there's a bunch of tools and workarounds to make it manageable).
For example, my Haskell data processing code averages an allocation rate of ~4 GB/s, despite never going over ~10 MB resident memory or below 98% productivity.
Linear types being added in GHC 8.12 would be a big deal because it would let programmers write allocation-free code that uses mutable data structures behind a pure API (as opposed to the ST state monad), much like how Rust solves this with the ownership system.
Why, though?
Is it because GHC's linear types are a superset of Rust's linear types? I guess that could rule out some features the compiler is able to prove (or not).
Sure, it wouldn't solve object-level fragmentation, but at least you'd get rid of block-level fragmentation.
It seems that the problem with pinning is that a megablock will end up containing only a single block, leaving the space for the other blocks unused. The block can't be moved to another megablock because it is pinned, so it can't free the entire megablock.
My suggestion is to unmap the pages corresponding to the empty space in the megablock. So if a megablock only contains a single block, unmap the 255 empty pages.
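To make the suggestion concrete, here is a rough sketch in plain C of what I mean. It is not GHC's actual RTS code. The 1 MiB megablock and 4 KiB block sizes are assumptions chosen to match the "255 empty pages" figure, the function names are made up, and it assumes the OS page size is also 4 KiB.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>

#define MEGABLOCK_SIZE (1024 * 1024)  /* assumed: 1 MiB per megablock */
#define BLOCK_SIZE     4096           /* assumed: 4 KiB block == one OS page */

/* Hypothetical helper: map one anonymous megablock. Returns NULL on failure. */
void *megablock_map(void) {
    void *p = mmap(NULL, MEGABLOCK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: unmap every page of the megablock except the one
 * holding the pinned block at index keep_block.
 * Returns 0 on success, -1 on a bad index or munmap failure. */
int megablock_release_except(void *mb, size_t keep_block) {
    char *base = mb;
    size_t nblocks = MEGABLOCK_SIZE / BLOCK_SIZE;   /* 256 */
    if (keep_block >= nblocks) return -1;

    size_t before = keep_block * BLOCK_SIZE;                  /* bytes below the pinned block */
    size_t after  = (nblocks - keep_block - 1) * BLOCK_SIZE;  /* bytes above it */

    if (before > 0 && munmap(base, before) != 0) return -1;
    if (after > 0 && munmap(base + before + BLOCK_SIZE, after) != 0) return -1;
    return 0;
}
```

The pinned block stays at the same address and remains readable and writable. A real runtime might prefer madvise() with MADV_DONTNEED over munmap(), so the address range stays reserved and nothing else gets mapped into the hole.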