I experimented with the proposed data-parallel types extension to the C++ standard library (std::experimental::simd). I saw impressive performance gains when calculating APFS Fletcher checksums, without resorting to compiler intrinsics or inline assembly.
The gains were even larger once I added some simple loop unrolling: https://jtsylve.blog/post/2022/12/24/Blazingly-Fast-er-SIMD-...
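To give a rough idea of why unrolling helps here, below is a minimal sketch in plain, portable C++. It is not the code from the post and it does not use the SIMD types. It assumes the Fletcher-64 variant that works on 32-bit words modulo 2^32 - 1, and it only computes the two running sums, leaving out the final APFS-specific folding into a 64-bit checksum. The names `FletcherSums`, `fletcher_scalar`, and `fletcher_unrolled` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: the two running sums of a Fletcher checksum.
struct FletcherSums {
    std::uint64_t sum1;
    std::uint64_t sum2;
};

// Assumed modulus for Fletcher-64 over 32-bit words: 2^32 - 1.
constexpr std::uint64_t kMod = 0xFFFFFFFFu;

// Reference version: one word per iteration, two reductions per word.
FletcherSums fletcher_scalar(const std::uint32_t* words, std::size_t n) {
    assert(words != nullptr || n == 0);
    std::uint64_t s1 = 0, s2 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s1 = (s1 + words[i]) % kMod;
        s2 = (s2 + s1) % kMod;
    }
    return {s1, s2};
}

// Unrolled version: four words per iteration, two reductions per block.
// Expanding four scalar steps gives
//   sum2 += 4*sum1 + 4*w0 + 3*w1 + 2*w2 + 1*w3
//   sum1 += w0 + w1 + w2 + w3
// so the words in a block no longer depend on each other, which is the
// property a SIMD implementation can exploit.
FletcherSums fletcher_unrolled(const std::uint32_t* words, std::size_t n) {
    assert(words != nullptr || n == 0);
    std::uint64_t s1 = 0, s2 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const std::uint64_t w0 = words[i];
        const std::uint64_t w1 = words[i + 1];
        const std::uint64_t w2 = words[i + 2];
        const std::uint64_t w3 = words[i + 3];
        // Largest intermediate is below 15 * 2^32, so uint64_t cannot overflow.
        s2 = (s2 + 4 * s1 + 4 * w0 + 3 * w1 + 2 * w2 + w3) % kMod;
        s1 = (s1 + w0 + w1 + w2 + w3) % kMod;
    }
    // Handle the remaining 0 to 3 words the scalar way.
    for (; i < n; ++i) {
        s1 = (s1 + words[i]) % kMod;
        s2 = (s2 + s1) % kMod;
    }
    return {s1, s2};
}
```

Both functions return the same sums for the same input; the unrolled one just does fewer modulo operations and exposes independent work per block. A wider block (or one SIMD vector per block) follows the same pattern with larger weights.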