> but only sequences of inlined sqrt calls within an unrolled loop
Somewhat relatedly, that's also a problem with vectorized math libraries on both gcc and clang: the vectorized function can give a different result from the scalar standard-libm one. (Indeed, gcc wants at least "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only" before it will use a vector math library at all, even though enabling one already takes an explicit flag; clang is fine with just "-fno-math-errno".)
For what it's worth, I believe clang has __arithmetic_fence for exactly the thing you're using inline asm for; and the clang/llvm instruction-level constrained arithmetic I noted would be the sane way to achieve this.
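To make the __arithmetic_fence idea concrete, here's a minimal sketch. The function names sum_fenced and sum_fenced_demo are made up for illustration, and the volatile fallback for compilers without the builtin is my own assumption, not something clang or gcc prescribes. As far as I know the builtin is only available on some targets (x86 at least), hence the __has_builtin check.

```c
#include <assert.h>

/* Older or non-clang compilers may lack __has_builtin entirely. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Computes (a + b) + c with the grouping pinned down. */
double sum_fenced(double a, double b, double c) {
#if __has_builtin(__arithmetic_fence)
    /* clang: the fenced sub-expression is evaluated as written and
       is not reassociated with the surrounding arithmetic. */
    return __arithmetic_fence(a + b) + c;
#else
    /* Assumed fallback (hypothetical, for illustration): a volatile
       temporary forces a + b to be rounded and stored before c is added. */
    volatile double t = a + b;
    return t + c;
#endif
}

/* 1 + 2^-53 is a tie and rounds back to 1.0, so the left-to-right
   grouping gives 1.0; a + (b + c) would give 1 + 2^-52 instead. */
void sum_fenced_demo(void) {
    assert(sum_fenced(1.0, 0x1p-53, 0x1p-53) == 1.0);
}
```

Without -ffast-math the compiler shouldn't reassociate this anyway, so the fence only matters once value-changing optimizations are switched on.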
The code sample shown in P3375 should just be a consequence of FMA contraction on gcc/clang, I believe? I.e. -ffp-contract=off makes the results consistent for both gcc and clang. I do think it's somewhat funky that contraction is enabled by default (clang defaults to -ffp-contract=on, and gcc in its default GNU mode to -ffp-contract=fast), but oh well: the spec allows it, it's a perf boost (and a precision boost, if one isn't expecting the less precise result), and one can easily opt out.
Outside of -ffast-math and FMA contraction (-ffp-contract=on/fast), and pre-SSE x86-32 (i.e. ≥26-year-old CPUs), where doing things properly is a massive slowdown, I don't think clang and gcc should ever be doing any optimizations that change numerical values (besides NaN bit patterns).
Just optimization-fencing everything, while a nice and easy proof of concept, isn't something compiler vendors would accept as the solution to implement: that's roughly a tripling of IR instructions for each fp operation, which would probably cause a significant compile-time slowdown, besides breaking a good number of correct optimizations. (Though, again, this shouldn't even be necessary.)
And -ffast-math should be left alone, besides perhaps adding a way to disable it for a given scope or function. I can definitely imagine that, were some future architecture to add a division instruction that is faster than regular division but can have 1 ULP of error, compilers would absolutely use it for all divisions under -ffast-math, and you couldn't work around that with just optimization fences.