std::min(max, std::max(min, v));

    maxsd xmm0, xmm1
    minsd xmm0, xmm2

std::min(std::max(v, min), max);

    maxsd xmm1, xmm0
    minsd xmm2, xmm1
    movapd xmm0, xmm2
For min/max on x86, if either operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).

The reason for this NaN behavior is that minsd is specified to behave like `(a < b) ? a : b`: if either a or b is NaN the comparison is false, and the expression evaluates to b.
Possibly std::clamp has the comparisons ordered like the second case?
From /usr/include/c++/12/bits/stl_algo.h:

     * @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
     */
    template<typename _Tp>
      constexpr const _Tp&
      clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
      {
        __glibcxx_assert(!(__hi < __lo));
        return std::min(std::max(__val, __lo), __hi);
      }

The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand": this is apparent when min is greater than max, as the results differ in which boundary is chosen.
Funny how, for such simple functions, the discussion can so quickly become difficult/interesting.
Some toying around with the various implementations in C [1]:
Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?
Never did figure out what gcc's problem was.
I would try a more specific flag like -ffinite-math-only.
https://gcc.godbolt.org/z/fGaP6roe9
I see the same behavior on clang 17 as well
I deal with a lot of floating point professionally day to day, and I use fast math all the time, since the tradeoff for higher performance and the relatively small loss of accuracy are acceptable. Maybe the biggest issue I run into is lack of denorms with CUDA fast-math, and it’s pretty rare for me to care about numbers smaller than 10^-38. Heck, I’d say I can tolerate 8 or 16 bits of mantissa most of the time, and fast-math floats are way more accurate than that. And we know a lot of neural network training these days can tolerate less than 8 bits of mantissa.
Compilers are pretty skittish about changing the order of floating point operations (for good reason), and -ffast-math is the thing that lets them transform expressions to try to generate faster code.
I.e., instead of computing "n / 10", computing "n * 0.1". The issue, of course, is that things like 0.1 can't be represented exactly as a float, while 100 / 10 can be. So now you've introduced a tiny bit of error where there might not have been any.
fast-math is one of the dumbest things we have as an industry IMO.
Always specify your target.
Some linux distros also give you the option to either get a version compatible with ancient hardware or the optimized x86-64-v3 version, which seems like a good compromise.
Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.
As they used to say in the Soviet Union: "trust, but verify".
clamp(v, min(a,b), max(a,b))

classic c++

Correct me if I'm wrong.
When AVX isn’t enabled, the std::min + std::max example still uses fewer instructions. Looks like a random register allocation failure.
So when you have an algorithm like clamp that requires v to be "preserved" throughout the computation, you can't overwrite xmm0 with the first instruction; you basically need to "save" and "restore" it, which costs an extra instruction.
I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn
It feels backwards that you need to order your comparisons so as to generate optimal assembly.
This specific test (click the godbolt links) does not reproduce the issue.
Using this microbenchmark on an Intel Sapphire Rapids CPU, compiling with -march=k8 to get the older form takes ~980ns, while compiling with -march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.
A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.
I've never gone back to see whether the timeline checks out, but it's a good story.