I don't see how Herbie's accuracy improvements could avoid being undone if Herbie's output is fed to a back-end that doesn't preserve the order of operations Herbie requires and depends on.
Of course, the 'real' solution is actual numerical analysis (as I'm sure you know) to keep the error properly bounded, but it's really interesting to have a sort of middle ground that just stabilizes the math locally... which might be good enough.
Other fun fact: Numerical Analysis is a thing that's really hard to imagine unless you happen to be introduced to it during an education. It's so obviously a thing once you've heard of it, but REALLY hard to come up with ex nihilo.
If I follow at a high level, it looks like Herbie is trying to rewrite expressions to minimise error without runtime performance constraints.
Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?
I guess compilers are generally focused on the latter problem, perhaps without giving the user much control over the degree of error they're willing to tolerate.
There are! See followup work by @pavpanchekha and others on "Pherbie", which finds a set of Pareto-optimal rewritings of a program so that it's possible to trade-off error and performance: https://ztatlock.net/pubs/2021-arith-pherbie/paper.pdf.
For example, Zen CPUs have negligible penalties for handling denormals, but many Intel models incur a penalty of between 100 and 200 clock cycles for an operation on denormals.
Even on the CPU models with slow denormal processing, a 100x to 1000x slowdown applies only to the operations on denormals themselves, and only when those operations belong to a stream running at the maximum CPU SIMD speed, where during the hundred-odd lost clock cycles the CPU could otherwise have done 4 or 8 operations per cycle.
No complete computation can have a significant percentage of operations on denormals unless it is written in an extremely bad way. So for a complete computation, even on the models with bad denormal handling, a speedup from flushing to zero of more than a few times would be abnormal.
The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.
When the correctness of the results is not important, like in many graphic or machine-learning applications, using flush-to-zero is OK, otherwise it is not.
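To make the gradual-underflow vs. flush-to-zero distinction concrete, here's a quick Python sketch (assuming IEEE 754 doubles, which CPython uses):

```python
import sys

tiny = sys.float_info.min   # smallest positive *normal* double, ~2.2e-308
sub = tiny / 4              # a subnormal: nonzero only thanks to gradual underflow

print(sub > 0)              # True under IEEE 754 gradual underflow
print(tiny + sub > tiny)    # True: subnormals keep nearby values distinguishable

# Under flush-to-zero, sub would silently become 0.0, so a guard like
#   if x != y: z = 1.0 / (x - y)
# can suddenly divide by zero, because x - y flushes to 0 even when x != y.
```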
Outside of a vanishingly small number of edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with only a handful of clock cycles of penalty. I think Intel hardware should be called out, not programmers who just want the 35-year-old floating point standard to be fast like it is on other chips.
Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.
You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.
That's why finer-grained flags are needed — yes, FMAs and SIMD are essential for _both_ performance and improved accuracy, but `-ffast-math` bundles so many disparate things together that it's impossible to understand what your code does.
> And most math will work fine with fast-math, if you are careful how you write it.
The most hair-pulling part about `-ffast-math` is that it will actively _disable_ your "careful code." You can't check for NaNs. You can't check for residuals. It'll rearrange those things on your behalf because it's faster that way.
I'm not an expert on this, but for my own code I've been meaning to better understand the discussion here [1], which suggests that there ARE ways of getting FMAs, without the sloppiness of fast-math.
[1] https://stackoverflow.com/questions/15933100/how-to-use-fuse...
a * d - b * c
If a == b and c == d (and all are finite), then this should give 0 (which is true for strict IEEE 754 math), but if you replace it with an fma then you can get either a positive or negative value, depending on the order in which it was contracted. Issues like this pop up in complex multiplication, or applying the quadratic formula.
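To see that concretely, here's a small Python sketch; `fma_emulated` is a hypothetical helper that emulates a correctly rounded fma via exact rational arithmetic (`math.fma` only exists from Python 3.13, so `fractions` serves as a portable stand-in):

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # Correctly rounded a*b + c: do the arithmetic exactly with rationals,
    # then round only once when converting back to float.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = b = 1.0 + 2.0**-52   # a == b
c = d = a                # c == d

plain = a * d - b * c                  # both products rounded identically
fused = fma_emulated(a, d, -(b * c))   # one product exact, one rounded

print(plain)   # 0.0, as strict IEEE 754 evaluation guarantees
print(fused)   # 2**-104: the rounding error of b*c leaks through
```

Contracting the other product instead would flip the sign, which is exactly the hazard described above.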
- `fma(a, b, c)` is exact but may be slow: it uses intrinsics if available and falls back to a slow software emulation when they're not
- `muladd(a, b, c)` uses the fastest possibly inexact implementation of `a*b + c` available, which is FMA intrinsics if available or just doing separate `*` and `+` operations if they're not
That gives the user control over what they need—precision or speed. If you're writing code that needs the extra precision, use the `fma` function but if you just want to compute `a*b + c` as fast as possible, with or without extra precision, then use `muladd`.
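A rough Python analogue of that fma/muladd split might look like this (hypothetical helper names; `math.fma` was only added in Python 3.13, hence the feature check):

```python
import math

def fma_exact(a, b, c):
    # Correctly rounded a*b + c, or fail loudly if we can't provide it.
    if not hasattr(math, "fma"):   # math.fma appeared in Python 3.13
        raise NotImplementedError("no exact fma available on this build")
    return math.fma(a, b, c)

def muladd(a, b, c):
    # Fastest available a*b + c; may or may not be fused.
    return math.fma(a, b, c) if hasattr(math, "fma") else a * b + c

print(muladd(2.0, 3.0, 1.0))   # 7.0 either way
```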
There are ways, indeed, but they are pretty slow, it’s prioritizing accuracy over performance. And they’re still pretty tricky too. The most practical alternative for float FMA might be to use doubles, and for double precision FMA might be to bump to a 128 bit representation.
Here’s a paper on what it takes to do FMA emulation: https://www.lri.fr/~melquion/doc/08-tc.pdf
How can you use any numerical methods (like error analysis) if you don't have a solid foundation with strict rules to build the analysis on top of?
I would really like for someone to take fast math seriously and provide well-scoped, granular options to programmers. The Julia `@fastmath` macro gets close, but it is too broad. I want to control the flags individually.
There's also the question of how that interacts with IPO/inlining...
So one can (and we do at work) have `@optmath`, which is a specific set of flags (just a value we defined at compile time; the compiler recognizes it as a UDA) that we want, as opposed to letting the compiler bulldoze everything.
#pragma GCC optimize("fast-math")
Is this a sensible approach? What are others experiences around this? I've never bothered with this kind of optimisation and I now vaguely feel like I'm missing out.
I tend to use calculations for deterministic purposes rather than pure accuracy. 1+1=2.1, where the answer is stable but approximate, is still better and more useful than 1+1=2.0 where the answer is unstable, e.g. because one of those results is sometimes 0.9999999 and the lost precision triggers some edge case.
As always, the devil is in the details: you typically can't check exact equality, as e.g. reassociating arithmetic can give slightly different (but not necessarily worse) results. So the challenge is coming up with an appropriate measure for deciding whether something is actually wrong.
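A tiny Python example of why exact equality fails under reassociation, with a tolerance-based check instead:

```python
import math

# Reassociating a sum (which optimizations like -ffast-math allow)
# changes the low-order bits:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False: 0.6000000000000001 vs 0.6

# So a regression check needs a tolerance, not exact equality:
print(math.isclose(left, right, rel_tol=1e-12))   # True
```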
sin(single(t)) == bad
single(sin(t)) == good

with the time range-reduced first: t = mod(t + ts, 1 / f)
Since I'm just sending a static frequency, the range of t never needs to go beyond one period. However, using a double here is far from the critical path for performance, and it all runs fast enough anyway.

How reliable is it to keep the extreme orders of magnitude, in the expectation that the respective quantities would cancel them properly, yielding a meaningful value (rounding-wise)?
For example, calculating some resulting value function, expressed as
v(x)=f(x)/g(x),
where both f(x) and g(x) are oscillating with a number of roots in a given interval of x.
However if you do f(x) - g(x), the absolute error is on the order of 2e190: if f(x) - g(x) is small, then now the relative error can be huge (this is known as catastrophic cancellation).
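A minimal Python illustration of catastrophic cancellation (the subtraction itself is exact; the earlier rounding error just becomes the entire result):

```python
f = 1e16 + 1.0   # the + 1.0 is lost to rounding: f stores exactly 1e16
g = 1e16

# Each input carries a relative error of only ~1e-16, but after the
# nearly total cancellation that tiny input error is 100% of the answer:
print(f - g)     # 0.0, though the exact difference is 1.0
```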
One example talking about this here: http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
https://discourse.julialang.org/t/array-ordering-and-naive-s...
def re_add(a, b, c, d, e):
    # Float addition is not associative: in the reassociated sum, c is
    # absorbed into the huge a and silently lost before b cancels a.
    return (a + b + c + d + e) == (2 * (c + a + b + d + e))

print(re_add(1e17, -1e17, 3, 2, 1))  # prints True: 6 == 2 * 3

In my experience much scientific Fortran code, at least, is OK with something like -ffast-math, if only because it's likely to have been used with ifort at some stage, and even with non-754-compliant hardware if it's old enough. Obviously you should check, though, and perhaps confine such optimizations to where they're needed.
BLIS turned on -funsafe-math-optimizations (if I recall correctly) to provide extra vectorization, and still passed its extensive test suite. (The GEMM implementation is possibly the ultimate loop nest restructuring.)
I got a four times speedup on <cmath> functions with no loss in accuracy.
I'd suggest -ffp-contract=fast is a good idea for 99% of code. It's only going to break things where very specific effort has gone into the numerical analysis, and the authors of such things are likely sufficiently fp-savvy to tell you not to do it.
Complex multiplication and division follow Fortran rules. Range reduction is done as part of complex division, but there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.
The default is -fno-cx-fortran-rules.
[1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-Option...