If something is running 30x slower from linking in a different libc, I'm guessing it should not be that difficult to narrow down the cause at least a little bit.
Should be fairly easy to investigate.
"Ballista is an experimental distributed compute platform, powered by Apache Arrow, with support for Rust and JVM (Java, Kotlin, and Scala)."
Plus he's got Docker, the Rust library, musl, and jemalloc sometimes. There's no application. All this is just infrastructure.
Musl doesn't do much on its own. But it does do stdio buffering. Could it be that the buffering system is making too many I/O calls, like flushing on every write?
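One rough way to see why flush-on-every-write would matter is to count how often the underlying sink gets hit with and without a buffer in front of it. This is only a sketch: `CountingSink`, `unbuffered_calls`, and `buffered_calls` are made-up names, the sink stands in for a file descriptor, and it says nothing about what musl's stdio actually does in the benchmark.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink: each call to `write` stands in for one write(2) syscall.
struct CountingSink {
    calls: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        Ok(buf.len()) // pretend the whole buffer was written
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `lines` short records straight to the sink, with no buffering.
fn unbuffered_calls(lines: usize) -> usize {
    let mut sink = CountingSink { calls: 0 };
    for _ in 0..lines {
        sink.write_all(b"hello\n").unwrap();
    }
    sink.calls
}

/// Write the same records through a BufWriter and flush once at the end.
fn buffered_calls(lines: usize) -> usize {
    let mut w = BufWriter::with_capacity(8192, CountingSink { calls: 0 });
    for _ in 0..lines {
        w.write_all(b"hello\n").unwrap();
    }
    w.flush().unwrap();
    w.get_ref().calls
}

fn main() {
    println!("unbuffered: {} sink writes", unbuffered_calls(1000));
    println!("buffered:   {} sink writes", buffered_calls(1000));
}
```

If the real program looked like the unbuffered case under one libc and the buffered case under the other, the difference would show up directly as a syscall count in strace.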
I'm quite surprised that there's no mention of profiling the allocator that is actually causing this regression, to narrow the fault down to its source. Instead this blog post encourages a cargo-cult approach to "fixing slow code".
The author put zero effort into figuring things out.
However, musl has the additional constraint of being compatible with small/very-low-memory environments. Lack of global consistency inherently means you will end up using memory less efficiently and requesting significantly more from the system. The new malloc about to go upstream in musl is, to my knowledge, the first/only advanced hardened allocator using a slab-type design rather than the traditional dlmalloc-type split/merge, but it is also designed for extremely low overhead/waste at low to moderate usage rather than extreme performance. And in the vast majority of applications, this is perfectly reasonable. Even Firefox, for example, does very well with it.
With that said, new malloc is expected to be somewhat faster than old on lots of workloads (and considerably faster than old would be if we fixed the flaws in old that motivated it), but it's not a performance-oriented allocator. If you really want/need that you should probably link jemalloc or similar (and accept all the tradeoffs that come with that). In Rust programs without "unsafe", it may make sense to do that by default.
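For anyone who hasn't seen how a Rust program picks its allocator: it goes through the `#[global_allocator]` attribute. Below is a minimal std-only sketch that wraps the system (libc) allocator and counts calls. `Counting`, `ALLOCS`, and `allocations_during` are made-up names for illustration. A jemalloc crate would, as far as I know, be plugged into the same attribute, but that needs an external dependency and isn't shown here.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the libc allocator and counts calls.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        // `System` is whatever malloc the program is linked against (glibc, musl, ...).
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every heap allocation in the program now goes through `Counting`.
#[global_allocator]
static GLOBAL: Counting = Counting;

/// Number of allocator calls made (by any thread) while `f` runs.
fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    f();
    ALLOCS.load(Ordering::Relaxed) - before
}

fn main() {
    let n = allocations_during(|| {
        let strings: Vec<String> = (0..1000).map(|i| format!("item {}", i)).collect();
        black_box(&strings);
    });
    println!("allocator calls while building 1000 strings: {}", n);
}
```

A count like this won't tell you which allocator is faster, but it does tell you whether the workload is allocation-heavy enough for the choice of malloc to matter.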
Given your CPU graphs and the large number of cores, I expect musl's allocator simply has very poor behaviour with respect to multithreading (e.g. limited or no thread-local arenas, size classes, etc.), leading to a lot of crosstalk, extreme contention on allocations, and so on.
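A quick way to test that hypothesis without the whole Ballista stack would be a tiny allocation-churn benchmark, built once for the glibc target and once for the musl target. This is a sketch under assumptions: `alloc_churn` is a made-up helper, and the sizes and iteration counts are arbitrary and have nothing to do with the original workload.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

/// Spawn `threads` workers that each allocate and free `iters` small buffers.
/// Returns the wall-clock time and the total number of buffers allocated.
fn alloc_churn(threads: usize, iters: usize) -> (Duration, usize) {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            thread::spawn(move || {
                let mut done = 0usize;
                for i in 0..iters {
                    // Vary the size a little so several size classes are hit.
                    let size = 16 + ((i + t) % 8) * 32;
                    let mut v: Vec<u8> = Vec::with_capacity(size);
                    v.push(1);
                    black_box(&v); // keep the allocation from being optimized out
                    done += 1;
                    // `v` is dropped (freed) here, at the end of each iteration.
                }
                done
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (start.elapsed(), total)
}

fn main() {
    let iters = 1_000_000;
    for threads in [1, 2, 4, 8] {
        let (elapsed, total) = alloc_churn(threads, iters);
        println!("{} thread(s): {} allocations in {:?}", threads, total, elapsed);
    }
}
```

If the allocator scales, the elapsed time should stay roughly flat as threads are added (up to the core count). If it climbs steeply under one libc and not the other, that points at lock contention in malloc rather than at anything in the application.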
So I guess they statically link jemalloc but can optionally use libc malloc.
Just strace it (following forks) and look at which commands get exec'd.
"Why does musl make my Rust code so slow?" But he's measuring mostly the compiler performance in "cargo build". Is he writing the same amount of data to disk in the same experiments? Seems like there's a lot of opportunity for some shallow investigation to find out more.
Do you have any hobbies that would be out of character for Intel's Andy Grove? I think the world has room for a fictionalized Andy Grove talking about how to cook French pastries, train bonsai, do intermittent fasting, or prepare for a marathon.
Same exact code but just swapped out the jemalloc at the command line.
One statement in your post, which some readers pointed out was apparently added later, "Others have suggested that the performance problems in musl go deeper than that and that there are fundamental issues with threading in musl, potentially making it unsuitable for my use case," seems wrong unless they just meant that the malloc implementation is not thread-caching/thread-local-arena-based. The threads implementation in musl is the only one I'm aware of that doesn't still have significant bugs in some of the synchronization primitives or in cancellation. It's missing a few optional and somewhat obscure features like priority-ceiling mutexes, and Linux doesn't even admit a fully correct implementation in some regards like interaction of thread priorities with some synchronization primitives, but all the basic functionality is there and was written with extreme attention to correctness, and musl aims to be a very good choice in situations where this matters.