std::min(max, std::max(min, v));

    maxsd xmm0, xmm1
    minsd xmm0, xmm2

std::min(std::max(v, min), max);

    maxsd xmm1, xmm0
    minsd xmm2, xmm1
    movapd xmm0, xmm2
For min/max on x86, if either operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).

The reason for this NaN behavior is that minsd is specified to behave like `(a < b) ? a : b`: if either a or b is NaN the comparison is false, and the expression evaluates to b.
Possibly std::clamp has the comparisons ordered like the second case?
From /usr/include/c++/12/bits/stl_algo.h:

     * @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
     */
    template<typename _Tp>
      constexpr const _Tp&
      clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
      {
        __glibcxx_assert(!(__hi < __lo));
        return std::min(std::max(__val, __lo), __hi);
      }

The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand": this is apparent when min is greater than max, as the results differ in which boundary is chosen.
Funny how, for such simple functions, the discussion can so quickly become difficult/interesting.
Some toying around with the various implementations in C [1]:
Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?
Never did figure out what gcc's problem was.
I would try a more specific flag like -ffinite-math-only.
https://gcc.godbolt.org/z/fGaP6roe9
I see the same behavior on clang 17 as well
I deal with a lot of floating point professionally day to day, and I use fast math all the time, since the tradeoff for higher performance and the relatively small loss of accuracy are acceptable. Maybe the biggest issue I run into is lack of denorms with CUDA fast-math, and it’s pretty rare for me to care about numbers smaller than 10^-38. Heck, I’d say I can tolerate 8 or 16 bits of mantissa most of the time, and fast-math floats are way more accurate than that. And we know a lot of neural network training these days can tolerate less than 8 bits of mantissa.
Compilers are pretty skittish about changing the order of floating point operations (for good reason), and -ffast-math is the thing that lets them transform expressions to try to generate faster code.
I.e., instead of computing "n / 10", computing "n * 0.1". The issue, of course, is that things like 0.1 can't be represented exactly as a float, while 100 / 10 can be. So now you've introduced a tiny bit of error where there might not have been any.
fast-math is one of the dumbest things we have as an industry IMO.
Always specify your target.
Some linux distros also give you the option to either get a version compatible with ancient hardware or the optimized x86-64-v3 version, which seems like a good compromise.
Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.
As they used to say in the Soviet Union: "trust, but verify".
clamp(v, min(a,b), max(a,b))

classic c++

Correct me if I'm wrong.
When AVX isn’t enabled, the std::min + std::max example still uses fewer instructions. Looks like a random register allocation failure.
So when you have an algorithm like clamp that requires v to be "preserved" throughout the computation, you can't overwrite xmm0 with the first instruction; you basically need to "save" and "restore" it, which costs an extra instruction.
I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn
It feels backwards that you need to order your comparisons so as to generate optimal assembly.
This specific test (click the godbolt links) does not reproduce the issue.
Using this microbenchmark on an Intel Sapphire Rapids CPU, compiling with -march=k8 to get the older form takes ~980ns, while compiling with -march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.
A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.
I've never gone back to see whether the timeline checks out, but it's a good story.