I've read parts 1-5, there doesn't seem to be a part6. Thanks again for the link, it is regrettable that there is duplication of effort. His approach is quite similar to ours (including HeapSort, sorting network, unrolled in-place partition), and re-invents what he calls "double-pumped partitioning" (introduced by Bramas).
Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.
It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:
* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)
* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).
* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.
* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).
* The pivot is only a median of three. We use a much larger sample.