Understanding Memory Fragmentation in Haskell

(well-typed.com)

67 points · tirumaraiselvan · 5y ago · 28 comments


22 comments · 5 top-level

FooBarWidget · 5y ago · 7 in thread

A year ago, I researched similar memory issues in Ruby. The established hypothesis in the community was that they were due to memory fragmentation. But I found that the largest culprit was actually the glibc memory allocator, which doesn't like to return memory to the OS. In multithreaded scenarios this issue is amplified even more, due to the use of separate heap arenas per thread.

I also found a simple solution: call malloc_trim() after a GC. This reduces memory usage by 70%.

https://www.joyfulbikeshedding.com/blog/2019-03-14-what-caus...
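The "trim after a GC" idea can be sketched outside Ruby as well. The following is a minimal Python illustration, not the patch from the linked post: `collect_and_trim` is a hypothetical helper name, and it assumes a glibc-based system where the C library exports `malloc_trim`. On other platforms it simply returns `None`.

```python
import ctypes
import ctypes.util
import gc


def collect_and_trim():
    """Run a full GC, then ask glibc to hand free heap pages back to the OS.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or None when the C library does not export malloc_trim (non-glibc).
    """
    gc.collect()

    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
    except OSError:
        return None

    # Looking up a missing symbol raises AttributeError, so getattr's
    # default covers C libraries without malloc_trim (e.g. musl, macOS).
    trim = getattr(libc, "malloc_trim", None)
    if trim is None:
        return None

    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    # pad=0: keep no extra free space at the top of the heap.
    return trim(0)


if __name__ == "__main__":
    print("malloc_trim result:", collect_and_trim())
```

Whether this releases anything depends on how much free memory the allocator is holding at that moment, so the return value will vary from run to run.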

flohofwoe · 5y ago

Hmm... if your process keeps grabbing new memory from the OS even though 70% of the memory it has already allocated is free to use, then that's a sure sign of rampant memory fragmentation. Even though there is a lot of free mapped memory in the process overall, there are no contiguous ranges of free memory big enough to fulfill at least some of the new allocations (so the allocator needs to grab fresh memory pages from the OS).

This means there's a wave of new allocations moving through your address space, and it's leaving behind a fragmented mess. Calling malloc_trim() won't help with the address space fragmentation; it will only free memory pages caught up in the mess. At some point the allocation wave will hit the top of the address space and allocations will start to fail. Usually this is not a problem in 64-bit processes, of course, because it takes a very long time to exhaust a 64-bit address space, but in 32-bit processes this was a real problem.

nh2 · 5y ago

This is correct. malloc_trim() can make the unused memory pages (within an mmap() that malloc did) available for use by other processes using madvise() (turning them from grey squares into white squares in the linked article's visualisation), but it does leave holes in the address space.

This is what the MMAP_THRESHOLD tunable solves. It causes allocations larger than that many bytes to be served via their own mmap(), which can then be munmap()ed independently.

I use env MALLOC_MMAP_THRESHOLD_=65536 to reduce the RAM my program wastes on memory fragmentation from 6.5 GB to 0.8 GB.

The benefit of this is that you don't have to decide at which points to call malloc_trim(). But it's expected to be a bit slower because mmap() takes a while. Choosing between malloc_trim() vs MALLOC_MMAP_THRESHOLD_ is dual to choosing between GC vs reference counting -- higher memory use for a while and having to choose when to clean up vs higher per-operation cost.


fluffy87 · 5y ago

This is precisely what happens if you have a dynamic array and you grow it with a growth factor of 2.
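A small simulation makes the growth-factor-2 case concrete. This is a sketch under simplifying assumptions: sizes are abstract units, the allocator has no per-block overhead, and `freed_before_each_growth` is a hypothetical helper name. It records, for each growth step, the size of the new request and the combined size of every buffer the array has already released.

```python
def freed_before_each_growth(initial=1, factor=2, steps=10):
    """Return (request, total_freed_so_far) pairs for a growing array.

    At each growth step the current buffer is still live (it is being
    copied from), so only the older, already-released buffers count as
    free space that could be reused.
    """
    results = []
    capacity = initial
    freed = 0  # combined size of all earlier, released buffers
    for _ in range(steps):
        new_capacity = capacity * factor
        results.append((new_capacity, freed))
        # After the copy, the old buffer is released as well.
        freed += capacity
        capacity = new_capacity
    return results


if __name__ == "__main__":
    for request, freed in freed_before_each_growth():
        print(f"request {request:5d}  previously freed {freed:5d}")
```

With a factor of 2, the freed total is always smaller than the next request (1 + 2 + ... + 2^(k-1) = 2^k - 1), so even perfectly adjacent freed buffers can never be reused for the array's next growth step.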

nh2 · 5y ago

Why do you have to call malloc_trim()?

According to the docs [1,2] it should be called automatically when the free space exceeds the default M_TRIM_THRESHOLD of 128 KiB.

Is it because of this bug [3] or for another reason?

    [1] https://man7.org/linux/man-pages/man3/malloc_trim.3.html
    [2] https://man7.org/linux/man-pages/man3/mallopt.3.html
    [3] https://sourceware.org/bugzilla/show_bug.cgi?id=14827

seppel · 5y ago

Replacing the glibc memory allocator with tcmalloc or jemalloc did wonders in many projects I worked on.

wozer · 5y ago

Interesting. I wonder if Python has the same problem. (Last time I observed serious fragmentation in Python, I was using Python 2, though.)

jashmatthews · 5y ago

Yup https://zapier.com/engineering/celery-python-jemalloc/

CPython is less affected than CRuby because CPython has a specialized allocator called obmalloc for small objects up to 512 bytes.

CRuby < 2.6 doesn't have an allocator like this and hits malloc for anything bigger than 24 bytes. 2.6+ can allocate using the "transient heap" which helps but isn't as effective as CPython's obmalloc.

brundolf · 5y ago · 5 in thread

Is this sort of multi-tier allocation used in any other garbage-collected languages? Or is it specifically used in Haskell because of immutability (which would presumably result in a higher-than-normal frequency of allocation/deallocation)?

merijnv · 5y ago

> which would presumably result in a higher-than-normal frequency of allocation/deallocation

Correct. Not only that, it also dramatically shortens the lifetime of many allocated objects, so the GHC garbage collector is designed for super-cheap allocation and to be unaffected by large amounts of garbage. The trade-off is that the default GC doesn't work very well with very large "live sets" (although there are a bunch of tools and workarounds to make it manageable).

For example, my Haskell data processing code averages an allocation rate of ~4 GB/s, despite never going over ~10 MB resident memory or below 98% productivity.

the8472 · 5y ago

Java's G1GC is similar. You have TLABs, regions and regions grouped into generations/special types and then there's the never-ending work on escape analysis to stack-allocate things.

eru · 5y ago

Erlang also has pervasive immutability. But not sure what they are doing for allocation.

jlouis · 5y ago

There is a whole memory allocation system used to combat fragmentation toward the OS level. Erlang also uses a multi-tiered approach, but note that a lot of things are easier because there is far less sharing going on and more things are isolated.

jashmatthews · 5y ago

IIUC it's enormous amounts of copying since Erlang processes are "shared nothing" and have separately collected private heaps. To reduce copying BEAM has a separate shared heap using refcounting to share big blobs.

siraben · 5y ago · 3 in thread

Having written Haskell for over a year now for personal projects, I find that understanding the memory model is one of the hardest aspects of Haskell, which can make it frustrating to write allocation-free code (although some techniques like deforestation are applied by the compiler to eliminate intermediate structures entirely).

Linear types being added in GHC 8.12 would be a big deal because they would allow programmers to write allocation-free code that uses mutable data structures behind a pure API (as opposed to the ST state monad), much like how Rust solves this with its ownership system.

platz · 5y ago

I don't understand how GHC's linear types allow allocation-free code. GHC's linear types don't give you uniqueness types in the same way Rust does.

zetalemur · 5y ago

> GHC's linear types don't give you uniqueness types in the same way Rust does

Why, though?

Is it because GHC's linear types are a superset of Rust's linear types? I guess that could rule out some features the compiler is able to prove (or not).


whateveracct · 5y ago

You can use them to write much safer off-heap memory management APIs in userspace. But GHC can't do anything for free, no.

crote · 5y ago · 2 in thread

Wouldn't the GC be able to `munmap` the space between blocks?

Sure, it wouldn't solve object-level fragmentation, but at least you'd get rid of block-level fragmentation.

chrisseaton · 5y ago

You can only unmap a full page, but there aren’t any full empty pages because the memory is fragmented.

crote · 5y ago

According to the link, GHC creates 1 MiB megablocks, each consisting of 4 KiB blocks. So each block would be a page, and a megablock would consist of 256 pages.

It seems that the problem with pinning is that a megablock can end up containing only a single live block, leaving the space for the other blocks unused. The block can't be moved to another megablock because it is pinned, so the GC can't free the entire megablock.

My suggestion is to unmap the pages corresponding to the empty space in the megablock. So if a megablock only contains a single block, unmap the 255 empty pages.
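The arithmetic behind that suggestion, as a rough sketch: it assumes the OS page size equals the 4 KiB block size and ignores any space the runtime reserves inside a megablock for its own bookkeeping. `reclaimable_pages` is a hypothetical helper, not part of GHC.

```python
MEGABLOCK_BYTES = 1024 * 1024  # 1 MiB megablock, as described above
BLOCK_BYTES = 4 * 1024         # 4 KiB block, assumed equal to one OS page


def reclaimable_pages(pinned_blocks):
    """Pages in one megablock that hold no pinned block and could be unmapped."""
    blocks_per_megablock = MEGABLOCK_BYTES // BLOCK_BYTES
    if not 0 <= pinned_blocks <= blocks_per_megablock:
        raise ValueError("pinned_blocks must be between 0 and 256")
    return blocks_per_megablock - pinned_blocks


if __name__ == "__main__":
    pages = reclaimable_pages(1)
    print(pages, "pages,", pages * BLOCK_BYTES, "bytes")
```

Under these assumptions, a megablock pinned by a single block keeps 4 KiB resident and leaves 255 pages, a little under 1 MiB, that could in principle be returned to the OS.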


tirumaraiselvan · OP · 5y ago

More discussion here: https://www.reddit.com/r/haskell/comments/id8m9w/welltyped_u...
