I don't see how Herbie's accuracy improvements could avoid being undone if Herbie's output is fed to a back-end that doesn't preserve the order of operations Herbie requires and depends on.
Of course, the 'real' solution is actual numerical analysis (as I'm sure you know) to keep the error properly bounded, but it's really interesting to have a sort of middle ground that just stabilizes the math locally... which might be good enough.
Other fun fact: Numerical Analysis is a thing that's really hard to imagine unless you happen to be introduced to it during an education. It's so obviously a thing once you've heard of it, but REALLY hard to come up with ex nihilo.
If I follow at a high level, it looks like Herbie is trying to rewrite expressions to minimise error without runtime performance constraints.
Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?
I guess compilers are generally focused on the latter problem, perhaps without giving the user much control over the degree of error they're willing to tolerate.
There are! See followup work by @pavpanchekha and others on "Pherbie", which finds a set of Pareto-optimal rewritings of a program so that it's possible to trade-off error and performance: https://ztatlock.net/pubs/2021-arith-pherbie/paper.pdf.
For example, Zen CPUs have negligible penalties for handling denormals, but many Intel models incur a penalty of between 100 and 200 clock cycles for an operation on denormals.
Even on the CPU models with slow denormal processing, a 100x to 1000x slowdown applies only to the operations on denormals themselves, and only when those operations belong to a stream running at the maximum CPU SIMD speed, where during the hundred-odd lost clock cycles the CPU could otherwise have done 4 or 8 operations per cycle.
No complete computation can have a significant percentage of operations on denormals unless it is written in an extremely bad way. So for a complete computation, even on the models with bad denormal handling, a speedup from flushing to zero of more than a few times would be abnormal.
The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.
When the correctness of the results is not important, like in many graphic or machine-learning applications, using flush-to-zero is OK, otherwise it is not.
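To make the gradual-underflow vs. flush-to-zero distinction concrete, here's a quick Python sketch (assuming IEEE 754 doubles, which CPython uses):

```python
import sys

tiny = sys.float_info.min   # smallest positive *normal* double, ~2.2e-308
sub = tiny / 4              # a subnormal: nonzero only thanks to gradual underflow

print(sub > 0)              # True under IEEE 754 gradual underflow
print(tiny + sub > tiny)    # True: subnormals keep nearby values distinguishable

# Under flush-to-zero, sub would silently become 0.0, so a guard like
#   if x != y: z = 1.0 / (x - y)
# can suddenly divide by zero, because x - y flushes to 0 even when x != y.
```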
Outside of a vanishingly small number of edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with only a handful of clock cycles of penalty. I think Intel hardware should be called out, not programmers who just want the 35-year-old floating point standard to be fast like it is on other chips.
Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.
You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.
That's why finer-grained flags are needed — yes, FMAs and SIMD are essential for _both_ performance and improved accuracy, but `-ffast-math` bundles so many disparate things together that it's impossible to understand what your code does.
> And most math will work fine with fast-math, if you are careful how you write it.
The most hair-pulling part about `-ffast-math` is that it will actively _disable_ your "careful code." You can't check for NaNs. You can't check for residuals. It'll rearrange those things on your behalf because it's faster that way.
I'm not an expert on this, but for my own code I've been meaning to better understand the discussion here [1], which suggests that there ARE ways of getting FMAs, without the sloppiness of fast-math.
[1] https://stackoverflow.com/questions/15933100/how-to-use-fuse...
a * d - b * c
If a == b and c == d (and all are finite), then this should give 0 (which is true for strict IEEE 754 math), but if you replace it with an fma then you can get either a positive or negative value, depending on the order in which it was contracted. Issues like this pop up in complex multiplication, or applying the quadratic formula.
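To see that concretely, here's a small Python sketch; `fma_emulated` is a hypothetical helper that emulates a correctly rounded fma via exact rational arithmetic (`math.fma` only exists from Python 3.13, so `fractions` serves as a portable stand-in):

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # Correctly rounded a*b + c: do the arithmetic exactly with rationals,
    # then round only once when converting back to float.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = b = 1.0 + 2.0**-52   # a == b
c = d = a                # c == d

plain = a * d - b * c                  # both products rounded identically
fused = fma_emulated(a, d, -(b * c))   # one product exact, one rounded

print(plain)   # 0.0, as strict IEEE 754 evaluation guarantees
print(fused)   # 2**-104: the rounding error of b*c leaks through
```

Contracting the other product instead would flip the sign, which is exactly the hazard described above.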
- `fma(a, b, c)` is exact but may be slow: it uses intrinsics if available and falls back to a slow software emulation when they're not
- `muladd(a, b, c)` uses the fastest possibly inexact implementation of `a*b + c` available, which is FMA intrinsics if available or just doing separate `*` and `+` operations if they're not
That gives the user control over what they need—precision or speed. If you're writing code that needs the extra precision, use the `fma` function but if you just want to compute `a*b + c` as fast as possible, with or without extra precision, then use `muladd`.
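A rough Python analogue of that fma/muladd split might look like this (hypothetical helper names; `math.fma` was only added in Python 3.13, hence the feature check):

```python
import math

def fma_exact(a, b, c):
    # Correctly rounded a*b + c, or fail loudly if we can't provide it.
    if not hasattr(math, "fma"):   # math.fma appeared in Python 3.13
        raise NotImplementedError("no exact fma available on this build")
    return math.fma(a, b, c)

def muladd(a, b, c):
    # Fastest available a*b + c; may or may not be fused.
    return math.fma(a, b, c) if hasattr(math, "fma") else a * b + c

print(muladd(2.0, 3.0, 1.0))   # 7.0 either way
```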
There are ways, indeed, but they are pretty slow, it’s prioritizing accuracy over performance. And they’re still pretty tricky too. The most practical alternative for float FMA might be to use doubles, and for double precision FMA might be to bump to a 128 bit representation.
Here’s a paper on what it takes to do FMA emulation: https://www.lri.fr/~melquion/doc/08-tc.pdf
How can you use any numerical methods (like error analysis) if you don't have a solid foundation with strict rules to build the analysis on top of?
I would really like for someone to take fast math seriously and provide well-scoped, granular options to programmers. The Julia `@fastmath` macro gets close, but it is too broad. I want to control the flags individually.
There's also the question of how that interacts with IPO/inlining...
So one can (and we do at work) have `@optmath`, which is a specific set of flags (just a value we defined at compile time; the compiler recognizes it as a UDA) that we want, as opposed to letting the compiler bulldoze everything.
#pragma GCC optimize("fast-math")
Is this a sensible approach? What are others experiences around this? I've never bothered with this kind of optimisation and I now vaguely feel like I'm missing out.
I tend to use calculations for deterministic purposes rather than pure accuracy. 1+1=2.1, where the answer is stable but approximate, is still better and more useful than 1+1=2.0 where the answer is unstable, e.g. because one of those results is sometimes 0.9999999 and the lost precision triggers some edge case.
As always, the devil is in the details: you typically can't check exact equality, as e.g. reassociating arithmetic can give slightly different (but not necessarily worse) results. So the challenge is coming up with an appropriate measure for deciding whether something is actually wrong.
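A tiny Python example of why exact equality fails under reassociation, with a tolerance-based check instead:

```python
import math

# Reassociating a sum (which optimizations like -ffast-math allow)
# changes the low-order bits:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False: 0.6000000000000001 vs 0.6

# So a regression check needs a tolerance, not exact equality:
print(math.isclose(left, right, rel_tol=1e-12))   # True
```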
sin(single(t)) == bad
single(sin(t)) == good

with the time range-reduced first: t = mod(t + ts, 1 / f)
Since I'm just sending a static frequency, the range of t never needs to go beyond one period. However, using a double here is far from the critical path for performance, and it all runs fast enough anyway.

How reliable is it to keep the extreme orders of magnitude, in the expectation that the respective quantities would cancel them properly, yielding a meaningful value (rounding-wise)?
For example, calculating some resulting value function, expressed as
v(x)=f(x)/g(x),
where both f(x) and g(x) are oscillating with a number of roots in a given interval of x.
However if you do f(x) - g(x), the absolute error is on the order of 2e190: if f(x) - g(x) is small, then now the relative error can be huge (this is known as catastrophic cancellation).
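A minimal Python illustration of catastrophic cancellation (the subtraction itself is exact; the earlier rounding error just becomes the entire result):

```python
f = 1e16 + 1.0   # the + 1.0 is lost to rounding: f stores exactly 1e16
g = 1e16

# Each input carries a relative error of only ~1e-16, but after the
# nearly total cancellation that tiny input error is 100% of the answer:
print(f - g)     # 0.0, though the exact difference is 1.0
```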
One example talking about this here: http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
https://discourse.julialang.org/t/array-ordering-and-naive-s...
def re_add(a, b, c, d, e):
    # Float addition is not associative: in the reassociated sum, c is
    # absorbed into the huge a and silently lost before b cancels a.
    return (a + b + c + d + e) == (2 * (c + a + b + d + e))

print(re_add(1e17, -1e17, 3, 2, 1))  # prints True: 6 == 2 * 3

In my experience much scientific Fortran code, at least, is OK with something like -ffast-math, if only because it's likely to have been used with ifort at some stage, and even with non-754-compliant hardware if it's old enough. Obviously you should check, though, and perhaps confine such optimizations to where they're needed.
BLIS turned on -funsafe-math-optimizations (if I recall correctly) to provide extra vectorization, and still passed its extensive test suite. (The GEMM implementation is possibly the ultimate loop nest restructuring.)
I got a four times speedup on <cmath> functions with no loss in accuracy.
I'd suggest -ffp-contract=fast is a good idea for 99% of code. It's only going to break things where very specific effort has gone into the numerical analysis, and the authors of such things are likely sufficiently fp-savvy to tell you not to do it.
Complex multiplication and division follow Fortran rules. Range reduction is done as part of complex division, but there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.
The default is -fno-cx-fortran-rules.
[1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-Option...