Why does musl make my Rust code so slow?

(andygrove.io)

154 points · andygrove · 6y ago · 72 comments

BubRoss · 6y ago

This is actually someone asking a question, not an investigation and explanation. There isn't even much due diligence in figuring it out: no profiling or resource usage other than CPUs. Also, it is musl combined with Docker causing a 30x slowdown.

If something is running 30x slower from linking in a different libc, I'm guessing it should not be that difficult to narrow down the cause at least a little bit.

vardump · 6y ago

Yeah, upvoted. 30x slower on a 48-core (?) system sounds suspiciously like excessive lock contention (or contention on some other shared resource). A non-NUMA-aware allocator (or other code) might also contribute to the issue.

Should be fairly easy to investigate.

Animats · 6y ago

He has so many layers in there it's going to be tough to find the problem.

"Ballista is an experimental distributed compute platform, powered by Apache Arrow, with support for Rust and JVM (Java, Kotlin, and Scala)."

Plus he's got Docker, the Rust library, musl, and jemalloc sometimes. There's no application. All this is just infrastructure.

Musl doesn't do much on its own. But it does do stdio buffering. Could it be that the buffering system is making too many I/O calls, like flushing on every write?

rvz · 6y ago

Exactly, not sure why you're downvoted. His blog-post doesn't explain why it is slow but instead 'switches the allocator' and says he has "fixed" the issue. This is just a workaround, not a fix.

I'm quite surprised that there's no mention of profiling the actual allocator causing this regression to properly narrow the fault down to the source. Instead this blog-post encourages cargo-culting development to "fix slow code".

fluffything · 6y ago

Yeah, you should be upvoted.

The author put zero effort into figuring things out.

MiroF · 6y ago

Benchmarking in Docker in general is a mistake I believe.

mschuster91 · 6y ago

Why? The only overhead you have in Docker is on syscalls (due to permission checks, namespaces, ...), everything else runs at 100% native speed - unlike assisted virtualization (at least IOMMU overhead plus overhead for anything involving the filesystem) or emulated virtualization (obvious overheads here).

andygrove (OP) · 6y ago

I think it makes sense for software that is intended to run in Docker and frameworks like Kubernetes that use Docker.

nitrogen · 6y ago

You should probably be measuring your app's performance in a production-like environment though.

jessermeyer · 6y ago

For those curious, Musl's malloc implementation is currently being re-written for higher performance and robustness, see https://github.com/richfelker/mallocng-draft

scott_s · 6y ago

Curiously, it doesn't adopt the now-standard approach for multithreaded support: per-thread memory pools, allowing one thread allocating and deallocating the same memory to avoid synchronization. This uses one lock guarding allocation, which means that it can be a bottleneck in a multithreaded workload.
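
A rough sketch of the two locking patterns contrasted above, not musl's or jemalloc's actual code. `Pool`, `run_shared`, and `run_thread_local` are hypothetical names, and the "allocator" only hands out block ids. In the first variant every thread takes one global lock per operation; in the second each thread uses its own pool and never synchronizes.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};
use std::thread;

// Toy "allocator": a free list of block ids. Real allocators manage memory;
// only the locking pattern matters for this illustration.
struct Pool {
    free: Vec<usize>,
    next_id: usize,
}

impl Pool {
    fn new() -> Self {
        Pool { free: Vec::new(), next_id: 0 }
    }

    // Reuse a freed id if one exists, otherwise mint a new one.
    fn alloc(&mut self) -> usize {
        match self.free.pop() {
            Some(id) => id,
            None => {
                let id = self.next_id;
                self.next_id += 1;
                id
            }
        }
    }

    fn release(&mut self, id: usize) {
        self.free.push(id);
    }
}

// Variant 1: one pool behind one lock, shared by all threads.
// Every alloc and release serializes on the same mutex.
fn run_shared(threads: usize, iters: usize) -> usize {
    let pool = Arc::new(Mutex::new(Pool::new()));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let pool = Arc::clone(&pool);
            thread::spawn(move || {
                for _ in 0..iters {
                    let id = pool.lock().unwrap().alloc();
                    pool.lock().unwrap().release(id);
                }
                iters
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

thread_local! {
    // Variant 2: each thread lazily gets its own private pool.
    static LOCAL_POOL: RefCell<Pool> = RefCell::new(Pool::new());
}

// A thread that allocates and frees its own blocks never touches a lock.
fn run_thread_local(threads: usize, iters: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..iters {
                    let id = LOCAL_POOL.with(|p| p.borrow_mut().alloc());
                    LOCAL_POOL.with(|p| p.borrow_mut().release(id));
                }
                iters
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("shared pool ops:       {}", run_shared(4, 10_000));
    println!("thread-local pool ops: {}", run_thread_local(4, 10_000));
}
```

Timing the two functions with many threads is one way to see the contention effect on a given machine. The sketch deliberately ignores the hard part of per-thread pools: blocks freed by a different thread than the one that allocated them.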

dalias · 6y ago

The justifications are partly the same as what Daniel Micay has written extensively on in the rationale for hardened_malloc (https://github.com/GrapheneOS/hardened_malloc) - unsynchronized per-thread state inherently sacrifices global consistency for performance and makes it impossible to detect a lot of types of memory usage errors (DF/UAF, etc) that could otherwise be caught.

However musl has the additional constraint of being compatible with small/very-low-memory environments. Lack of global consistency inherently means you will end up using memory less efficiently and requesting significantly more from the system. The new malloc about to go upstream in musl is, to my knowledge, the first/only advanced hardened allocator using slab-type design rather than traditional dlmalloc type split/merge, but also designed for extremely low overhead/waste at low to moderate usage rather than extreme performance. And in the vast majority of applications, this is perfectly reasonable. Even Firefox for example does very well with it.

With that said, new malloc is expected to be somewhat faster than old on lots of workloads (and considerably faster than old would be if we fixed the flaws in old that motivated it), but it's not a performance-oriented allocator. If you really want/need that you should probably link jemalloc or similar (and accept all the tradeoffs that come with that). In Rust programs without "unsafe", it may make sense to do that by default.

liuliu · 6y ago

Do you have any extra readings on the rationale of building their own malloc rather than integrating mimalloc or jemalloc?

jessermeyer · 6y ago

Not any first hand, but reading their principles suggests that simplicity and ease of deployment are probably relevant. https://musl.libc.org/about.html

harrygeez · 6y ago

well one of the stated goals of musl is to be simple and correct, and all those mallocs are anything but simple

AndyKelley · 6y ago

This is a libc implementation. The null hypothesis is that it implements libc, rather than porting a different libc implementation.

iou · 6y ago

Swap out the allocator https://users.rust-lang.org/t/optimizing-rust-binaries-obser...

andygrove (OP) · 6y ago

Thanks! That really does seem to be the issue and I wouldn't have known about this, had I not asked. I will try this out and will update the blog post in ~8 hours time.

masklinn · 6y ago

IME allocations are one of the main things making Rust programs slow without diving into the more arcane stuff. So looking into unnecessary allocations and/or the performance of the allocator would be one of the first things to do (right after checking if you're compiling with optimisations).

Given your CPU graphs, and the large number of cores, I expect musl's allocator simply has very poor behaviour with respect to multithreading (e.g. limited or no threadlocal arenas, size-classing, etc…) leading to a lot of crosstalk, extreme contention on allocations, etc...

asveikau · 6y ago

Without familiarity with Rust, I wasn't sure what they meant by "system allocator". Apparently that means libc's malloc (or HeapAlloc on Windows).

So I guess they statically link jemalloc but can optionally use libc malloc.

oconnor663 · 6y ago

It's the other way around now, though the post that iou linked is from an earlier period when things worked differently. By default today, Rust programs use the same allocator that C programs use, which I think is provided by libc on Linux, and you have the option of using a custom allocator, like jemalloc. Historically however, all Rust programs used to use jemalloc by default.
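
A minimal sketch of the custom-allocator hook described above, using only the standard library. Rust picks the process-wide allocator from the one static marked `#[global_allocator]`. Here a hypothetical `CountingAlloc` wraps the default `System` allocator and counts calls, which is also a cheap way to look for unnecessary allocations. Swapping in jemalloc uses the same attribute, with an allocator type from a third-party crate in place of `CountingAlloc`; which crate to use is left as an assumption for the reader to check.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::fmt::Write;
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical wrapper: forwards to the system allocator (libc malloc on
// Linux) and counts how many allocation calls pass through it.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This attribute is what selects the allocator for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Number of allocation calls observed while `f` runs (other threads, if any,
// are counted too).
fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    // Builds a fresh String per iteration.
    let fresh = allocations_during(|| {
        for i in 0..1000 {
            let s = i.to_string();
            black_box(&s);
        }
    });

    // Reuses one pre-sized buffer across iterations.
    let reused = allocations_during(|| {
        let mut buf = String::with_capacity(16);
        for i in 0..1000 {
            buf.clear();
            write!(buf, "{}", i).unwrap();
            black_box(&buf);
        }
    });

    println!("fresh String per item: {} allocations", fresh);
    println!("reused buffer:         {} allocations", reused);
}
```

A count like this says how often the allocator is hit, not how slow each call is; for the latter, a profiler on both builds is still needed.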

termie · 6y ago

run a perf trace on both and see what jumps out

wyldfire · 6y ago

You may not even need to go that deep.

Just strace (follow forks) and look at what commands get exec'd.

"Why does musl make my Rust code so slow?" But he's measuring mostly the compiler performance in "cargo build". Is he writing the same amount of data to disk in the same experiments? Seems like there's a lot of opportunity for some shallow investigation to find out more.

MiroF · 6y ago

If he's benchmarking on docker, I'm not sure that perf works in docker.

ori_b · 6y ago

Then benchmark outside docker?

the8472 · 6y ago

Depends on the kernel.perf_event_paranoid sysctl.
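
A small sketch for checking that setting from inside the environment being benchmarked. On Linux the sysctl is exposed as a file under `/proc/sys`; higher values are more restrictive for unprivileged users. `parse_paranoid` is a hypothetical helper name, and the sketch only reports the value, it does not decide whether perf will work in a given container.

```rust
use std::fs;

// Parse the sysctl's textual value (for example "2\n") into an integer.
fn parse_paranoid(raw: &str) -> Option<i32> {
    raw.trim().parse().ok()
}

fn main() {
    // File-system view of the kernel.perf_event_paranoid sysctl on Linux.
    let path = "/proc/sys/kernel/perf_event_paranoid";
    match fs::read_to_string(path) {
        Ok(raw) => match parse_paranoid(&raw) {
            Some(level) => println!("kernel.perf_event_paranoid = {}", level),
            None => println!("unexpected value in {}: {:?}", path, raw),
        },
        // Not Linux, or /proc is not readable from this environment.
        Err(e) => println!("could not read {}: {}", path, e),
    }
}
```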

underdeserver · 6y ago

...Not the Intel guy, if anyone else had to pause for a second.

andygrove (OP) · 6y ago

I get that a lot!

biesnecker · 6y ago

You still being alive probably helps for disambiguation. :-)

hinkley · 6y ago

Are you familiar with SwiftOnSecurity on twitter?

Do you have any hobbies that would be out of character for Intel's Andy Grove? I think the world has room for a fictionalized Andy Grove talking about how to cook French pastries, train bonsai, do intermittent fasting, or prepare for a marathon.

pedrocr · 6y ago

Swapping out the allocator for jemalloc would be my first try. It's easy to do and often results in better performance. 30x requires some kind of pathological case though.

mrits · 6y ago

In a commercial product I worked on, I went against the vendor's advice and tried out jemalloc. The process went from holding 100GB of memory (which took 48 hours to happen) to staying steady at around 2-4GB, only peaking at 100GB for a few seconds a day.

Same exact code, just with jemalloc swapped in at the command line.

dalias · 6y ago

We'd be happy to address specific problems on the mailing list. I believe it's a known issue that the Rust compiler is making really heavy use of rapid allocation/freeing cycles, and would benefit from linking a performance-oriented malloc replacement. Doing so is inherently a tradeoff between many factors including performance, memory overhead, safety against erroneous usage by programs, etc.

One statement in your post, which some readers pointed out was apparently added later, "Others have suggested that the performance problems in musl go deeper than that and that there are fundamental issues with threading in musl, potentially making it unsuitable for my use case," seems wrong unless they just meant that the malloc implementation is not thread-caching/thread-local-arena-based. The threads implementation in musl is the only one I'm aware of that doesn't still have significant bugs in some of the synchronization primitives or in cancellation. It's missing a few optional and somewhat obscure features like priority-ceiling mutexes, and Linux doesn't even admit a fully correct implementation in some regards like interaction of thread priorities with some synchronization primitives, but all the basic functionality is there and was written with extreme attention to correctness, and musl aims to be a very good choice in situations where this matters.

pjc50 · 6y ago

.. where's the profile output?

renewiltord · 6y ago

Post was not very illuminating. Very little content. It's pretty much a "if musl is slow, it may be the allocator (eom)" which fits in the headline and would have saved me the click.
