I don't have access to the Zulip chat, but the other benchmarks are basically testing allocation in a hot loop. I'm not surprised that doesn't scale linearly, and it's certainly not representative of any real-world code I've ever written.
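To make concrete what I mean by allocating in a hot loop, here's a minimal sketch. It is not taken from the benchmarks in question: the function names and the workload are made up, and Python is used only for illustration. The first function builds a fresh list on every iteration; the second does the same arithmetic but reuses one list allocated up front.

```python
def work_allocating(n_iters: int, size: int) -> int:
    """Hypothetical hot loop that builds a new list on every pass."""
    total = 0
    for i in range(n_iters):
        # A fresh list is allocated (and later freed) each iteration.
        scratch = [i * j for j in range(size)]
        total += sum(scratch)
    return total


def work_reusing(n_iters: int, size: int) -> int:
    """Same arithmetic, but the scratch list is allocated once and overwritten."""
    scratch = [0] * size  # single list allocation, outside the hot loop
    total = 0
    for i in range(n_iters):
        for j in range(size):
            scratch[j] = i * j
        total += sum(scratch)
    return total


# Both variants compute the same result; only the allocation pattern differs.
print(work_allocating(10, 5) == work_reusing(10, 5))
```

A benchmark shaped like the first function mostly measures the allocator, which is why I wouldn't read much into how it scales.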
If you have code you wrote to solve an actual problem and it hit a wall with scaling, I'm happy to take a look.