https://papers.freebsd.org/1998/phk-malloc/
(And got better malloc(3) performance at the same time.)
No malloc has absolute detection of all corruption and even tools like Valgrind don't catch everything.
Let's not pretend that a few pointer hacks and checks have solved a problem.
In addition to all the security problems inherent in mingling metadata directly next to user-data, the linked list gives O(Nalloc) performance on free(3) and realloc(3) whereas my malloc had O(1) performance.
That may not seem like a lot, but when the 'O' is a page-in from disk, and Nalloc is from C++ code, it is nothing to sneeze at.
Not only did that make my malloc faster, but it would, not "could", but "would" detect several classes of malloc-usage errors, double-free etc, unconditionally.
Over the next 10-ish years, that practically eliminated entire classes of malloc-mistakes, making a lot of FOSS software safer, no matter which malloc you used it with.
Given the definition of the malloc(3) family API, there is no way you can detect all corruption without hardware support, but there are people working on that too, notably Robert Watsons CHERI project at Cambridge.
So yeah, nice pointer arithmetic, but how about people solved the real problem instead ?
(On big memory and multi-core systems use jemalloc, on small memory and single-core systems use phkmalloc.)
Did you talk to musl team as well?
A desktop environment and consumer-grade hardware add a lot of noise.
Ideally you'd always benchmark on completely idle dedicated servers, but those are rarely available.
If you _really_ want to avoid any contention, then build the benchmark as a Linux kernel module and call stop_machine(), which will prevent anything else including interrupt handlers from running.
In [mimalloc](https://github.com/microsoft/mimalloc) we use a similar strategy to protect the free list in secure mode. However, there are some weaknesses associated with using a plain _xor_ -- if the attacker can guess `P` that reveals the key immediately (`P^key^P == key`) or even if there is a read overflow, and the attacker can read multiple encodings, they can xor with each other, say `(P1^key)^(P2^key)` and then we have`(P1^P2)` which may reveal information about the pointers (like alignment).
What we use in _mimalloc_ instead is two keys `key1` and `key2` and encode as `((P^key2)<<<key1)+key1`. Since these operations are not associative, the above approaches do not work so well any more even if the `P` can be guesstimated. For example, for the read case we can subtract two entries to discard the `+key1` term, but that leads to `((P1^key2)<<<key1) - ((P2^key2)<<<key1)` at best. (We include the left-rotation since xor and addition are otherwise linear in the lowest bit).
Just some thoughts. Of course, this may be too much for the use-case. However, we found just little extra overhead for the extra operations (as most programs are dominated by memory access) so it may be of benefit.