The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.
The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.
One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].
I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.
If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?
[1]: https://github.com/google/highway/tree/master/hwy/contrib/so... [2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...
Did you not read the article? It's using AVX intrinsics and NEON intrinsics.
When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)
[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_... [2]: https://github.com/google/highway/blob/master/hwy/contrib/do...