After almost six months, I finally found a spot where I could monkey patch a function to wrap it with a short circuit if the coordinates were out of bounds. Not only did it fix the bug, it made drag and drop several times faster. I couldn’t share this with the world because they weren’t accepting PRs against the old widgets.
I’ve worked harder on bug fixes, but I think that’s the longest I’ve worked on one.
Debugging errors in JS crypto and compression implementations was not fun: they only occurred at random, after at least some ten thousand iterations, on a mobile browser back when those were awful, and only if the debugger was closed/detached, because opening it disabled the JIT.
It taught me to go into debugging with no assumptions about what can and cannot be to blame, which has been very useful later in even trickier scenarios.
This is why D, by default, initializes all variables. Note that the optimizer removes dead assignments, so this is runtime cost-free. D's implementation of C, ImportC, also default initializes all locals. Why let that stupid C bug continue?
Another bug that repeatedly bit me was adding a field and neglecting to initialize it in all the constructors.
This is why D guarantees that all fields are initialized.
If native code calls back into Java, and the GC kicks in, all the objects the native code can see can be compacted and moved. So my implementation worked fine for all of the smaller test fixtures, and blew up half the time with the largest. Because I skipped a line to make it “go faster”.
I finally realized I was seeing raw Java objects in the middle of my “array” and changing the value of final fields into illegal pairings which blew everything the fuck up.
Level 2 systems programmer: "oh no, my memory allocator is a garbage collector"
As painful as that debugging story was, I have spent vastly more time working around garbage collectors to ship performant code.
( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what the title says, it's not Windows-specific.)
(*): The application uses libc malloc normally, but in some places it allocates pages using `mmap(non_anonymous_tempfile)` and then uses jemalloc to partition them. jemalloc has a feature called "extent hooks" where you can customize how jemalloc gets underlying pages for its allocations, which we use to give it pages via such mmaps. Then the higher layers of the code that just want to allocate don't have to care whether those allocations came from libc malloc or an mmap-backed disk file.
If there were 20 million rooms in the world, with a price for each day of the year, we’d be looking at around 7 billion prices per year. That’d be, say, 4 TB of storage without indexes.
The problem space seems to have a bunch of options to partition - by locality, by date etc.
I’m curious if there’s a commonly understood match for this problem?
FWIW, with that dataset size, my first experiments would be with a SQL server, because that data will fit in RAM. I don’t know if that’s where I’d end up, but I’m pretty sure it’s where I’d start my performance testing when grappling with this problem.
[1]: https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...
The underlying sys crate provides the binding for mimalloc API like `mi_collect`: https://docs.rs/libmimalloc-sys/0.1.39/libmimalloc_sys/fn.mi...
“C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”
But far better to just use integer cents.
Every OS will provide some mechanism to get more pages. But it turns out that managing the use of those pages requires specialized handling, depending on the use case, as well as a bunch of boilerplate. Hence, we also have malloc and its many, many cousins to allocate arbitrary size objects.
You're always welcome to use brk(2) or your OS's equivalent if you just want pages. The question is, what are you going to do with each page once you have it? That's where the next level comes in ...
For high performance stuff where you need low, predictable latency, you're probably not going to want to use dynamic memory at all.
The downside is that it makes things like "print" a pain in the ass.
The upside is that you can have multiple memory allocators with hugely different characteristics (arena for per frame resources, bump allocator for network resources, etc.).
Generally, given that page size isn't something you know at compile time (or even at install time), that it can vary between restarts, that it can be anything between ~4 KiB and 1 GiB, and that most natural memory objects are much smaller than 4 KiB while some are potentially much larger than 1 GiB, you kind of don't want to leak anything related to page sizes into your business logic if it can be helped. If you still need to, most languages have memory/allocation pools you can use to get a bit more control over memory allocation/free and reuse.
Also, the performance issues mentioned don't have much to do with memory pages or anything like that _instead they are rooted in the concurrency controls of a global resource (memory)_. I.e., thread-local concurrency synchronization vs. process-wide concurrency synchronization.
Mainly, instead of using a fully general-purpose allocator, they used an allocator which is still general-purpose but has a design bias that improves same-thread (de)allocation performance at the cost of cross-thread (de)allocation performance. And they were doing a ton of cross-thread (de)allocations, leading to noticeable performance degradation.
The thing is, even if you hypothetically only had allocations at sizes that are multiples of a memory page, or used a ton of manual mmap, you would still want to use an allocator and not always return freed memory directly to the OS, as doing so (a syscall on every allocation/free) tends to lead to major performance degradation in many use cases. So you still need concurrency controls, but they come at a cost, especially for cross-thread synchronization. Even lock-free controls based on atomics have a cost over thread-local controls, often caused largely by cache invalidation/synchronization.
The concept of memory that is allocated by a thread and can only be deallocated by that thread is useful and valid, but as TFA demonstrates, it can also cause problems if you're not careful with your overall architecture. If the language you're using even allows you to use this concept, it almost certainly will not protect you from having to get the architecture correct.
Interestingly, it would seem that Java programmers play with garbage collectors while Rust programmers play with memory allocators.
*system