IME, allocations are one of the main things that make Rust programs slow before you get into the more arcane stuff. So looking for unnecessary allocations and/or at the performance of the allocator is one of the first things to do (right after checking that you're compiling with optimisations).
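If you want a rough way to spot unnecessary allocations without pulling in a profiler, here is a minimal sketch that wraps the system allocator and counts calls. `CountingAlloc`, `count_allocations`, `join_naive` and `join_reserved` are hypothetical names made up for illustration, not anything from your code, and the counter is process-wide, so it is only meaningful when nothing else is allocating on other threads.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts `alloc` calls.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` is left at its default, which goes through `alloc` above,
    // so buffer growth is counted as well.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result plus the number of allocations made meanwhile.
fn count_allocations<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    let out = f();
    let after = ALLOCATIONS.load(Ordering::Relaxed);
    (out, after - before)
}

/// Allocates a temporary `String` per word, and regrows the result as it goes.
fn join_naive(words: &[&str]) -> String {
    let mut s = String::new();
    for w in words {
        s = s + &w.to_string();
    }
    s
}

/// Computes the final length up front, so the result buffer is allocated once.
fn join_reserved(words: &[&str]) -> String {
    let total: usize = words.iter().map(|w| w.len()).sum();
    let mut s = String::with_capacity(total);
    for w in words {
        s.push_str(w);
    }
    s
}

fn main() {
    let words = vec!["alloc"; 100];
    let (_, naive) = count_allocations(|| join_naive(&words));
    let (_, reserved) = count_allocations(|| join_reserved(&words));
    println!("naive: {naive} allocations, reserved: {reserved} allocations");
}
```

The naive version should report far more allocations than the reserved one. A real profiler will tell you more, but a counter like this is often enough to find the hot spot.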
Given your CPU graphs and the large number of cores, I expect musl's allocator simply behaves very poorly under multithreading (e.g. limited or no thread-local arenas, size classes, etc.), leading to a lot of crosstalk and extreme contention on allocations.
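One way to check that guess is a small std-only stress test along these lines: build it for both your glibc and musl targets and compare how the timings change as the thread count grows. This is a sketch, not a proper benchmark; `alloc_churn` and `run_with_threads` are hypothetical helpers, the sizes and iteration count are arbitrary, and the measured time includes thread spawn and join.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

/// Allocates and frees `iterations` small buffers of varying size.
/// Returns the total number of bytes requested so the work is not optimised away.
fn alloc_churn(iterations: usize) -> usize {
    let mut total = 0;
    for i in 0..iterations {
        let v: Vec<u8> = vec![0u8; 16 + (i % 64) * 16];
        total += black_box(&v).len();
    }
    total
}

/// Runs `alloc_churn` on `threads` threads at once and times the whole batch.
fn run_with_threads(threads: usize, iterations: usize) -> (usize, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(move || alloc_churn(iterations)))
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, start.elapsed())
}

fn main() {
    let max = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let mut threads = 1;
    while threads <= max {
        // Every thread does the same amount of work, so with an allocator that
        // scales well the elapsed time should stay roughly flat as threads grow.
        let (_, elapsed) = run_with_threads(threads, 200_000);
        println!("{threads:>3} threads: {elapsed:?}");
        threads *= 2;
    }
}
```

If the musl build's time climbs steeply with the thread count while the glibc build stays roughly flat, that points at allocator contention rather than at your own code.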