I work on 3D/4D math in F#. As part of the testing strategy for algorithms, I've set up a custom agent with an F# script that instruments Roslyn to find FP and FP-in-loop hotspots across the codebase.
The agent then reasons through the implementation and writes core expressions into an FPCore file next to the existing tests, running several passes and refining the preconditions based on realistic caller input. Herbie's proposed improvements are logged as output FPCore transformations. The agent then reasons through solutions (this step is required: Herbie doesn't know the algorithm's design intent, see e.g. this for a good case study: https://pavpanchekha.com/blog/herbie-rust.html), and once convinced of a gap, creates additional unit tests and property tests (FsCheck/QuickCheck) to prove the impact. Then every once in a while I review a batch to see what's next.
Generally there are multiple types of issues that can be flagged:
a) Expression-level imprecision over realistic input ranges: this is Herbie's core strength. Usually this catches "just copied the textbook formula" instances of naive math: cancellation, Inf/NaN propagation, etc. The fixes are consistent: fma for accumulation, biggest-factor scaling to prevent overflow to Inf, use of hypot, etc.
b) Ill-conditioned algorithms. Sometimes the textbooks lie to you, and the algorithms themselves are unfit for purpose, especially in boundary regions. If there are multiple expressions below 60% precision with only a 1 to 2% improvement across seeds, it's a good sign the algorithm is bad: there's no form that performs adequately on the target inputs.
c) Round-off and accumulation errors. This is more a consequence of agent reasoning, and often shows up after an apparently clean "100% -> 100%" pass. Via failing tests, the agent is able to identify parts of an algorithm that can benefit from upgrading the context to e.g. double-word arithmetic for additional precision.
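As a minimal illustration of a category-(a) fix (a Python sketch of the general pattern, not the F# pipeline described above): the textbook form of sqrt(x+1) - 1 cancels catastrophically for small x, and an algebraically equivalent rewrite avoids the subtraction entirely.

```python
import math

def naive(x):
    # Textbook form: for small x, sqrt(x + 1) is ~1, so the
    # subtraction cancels away all significant digits
    return math.sqrt(x + 1.0) - 1.0

def rewritten(x):
    # Algebraically equal form (multiply by the conjugate);
    # no cancellation, accurate for small x
    return x / (math.sqrt(x + 1.0) + 1.0)

x = 1e-16
print(naive(x))      # 0.0 -- all digits lost
print(rewritten(x))  # 5e-17, the true value is ~x/2
```

The rewrite is exactly the kind of conjugate trick Herbie proposes for cancellation-prone expressions.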
I don't know if there is an equivalent in Roslyn, but in Julia you can have the agent inspect the LLVM output to surface problems in hot loops.
Also nice to see an article that's not about AI or politics
For x in [−1.79e308, 1.79e308]:
Initial Program: 100.0% accurate, 1.0× speedup
def code(x):
return math.sqrt((x + 1.0))
Alternative 1: 67.5% accurate, 5.6× speedup

def code(x):
return 1.0

This is one of the problems that alternative formats such as the Posit aim to solve. It's quite interesting: I've got an implementation in Rust here if you want to play with it: https://github.com/andrepd/posit-rust
IEEE floats have a few warts like any other 1980s standard, but they're a fantastic design.
We usually recommend looking for 90%+ accuracy or carefully examining the accuracy plot
(* Mathematica notation, assume x > 0 *)
If[x < 10^(-10), 1 + x/2,        (* order x^2 error *)
  If[x > 10^10, Sqrt[x],         (* order 1/Sqrt[x] error *)
    (* else *) Sqrt[x + 1]]]
(* I guess the If statements take too much time. *)
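A rough Python transcription of the branching idea above (same thresholds as the Mathematica sketch; the thresholds themselves are the commenter's, not tuned values):

```python
import math

def sqrt1p(x):
    # Branch on magnitude, as in the Mathematica sketch
    if x < 1e-10:
        return 1.0 + x / 2.0   # series expansion: error of order x^2
    if x > 1e10:
        return math.sqrt(x)    # drop the +1: error of order 1/sqrt(x)
    return math.sqrt(x + 1.0)  # middle range: direct formula is fine

print(sqrt1p(3.0))    # 2.0
print(sqrt1p(1e20))   # ~1e10
```

Whether the branches cost more than they save depends on the target; on modern hardware a predictable branch is usually cheaper than a sqrt.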
Like looking at this example,
https://herbie.uwplse.org/demo/b070b371a661191752fe37ce0321c...
It is claimed that for the function f(x) = sqrt(x+1) - 1,
accuracy is increased from 8.5% to 98% for alternative 5, which has f(x) = 0.5x.
OK, so for x = 99 the right answer is sqrt(100) - 1 = 9.
But 0.5 * 99 = 49.5, which doesn't seem too accurate to me.
Maybe the thing to optimize the expression for is minimizing the maximum error, not the average error. I think that's what I would care about
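The tension is easy to see numerically; a small Python sketch comparing the relative error of the 0.5x form at a tiny input and at x = 99 (how Herbie actually weights error across its sampled inputs is a separate question):

```python
import math

def exact(x):
    return math.sqrt(x + 1.0) - 1.0

def approx(x):
    return 0.5 * x

def rel_err(x):
    t = exact(x)
    return abs(approx(x) - t) / abs(t)

# For tiny x the linearization is excellent...
print(rel_err(1e-8))   # tiny
# ...but at x = 99 it is off by a factor of 5.5
print(rel_err(99.0))   # 4.5
```

An average over log-distributed samples is dominated by the (huge) region of tiny x where 0.5x shines, while the max error is set by the region around x = 99 where it's garbage.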
What would be cool is if you could somehow have this kind of analysis done automatically for your whole program, where it finds the needle-in-the-haystack expression that can be improved, assuming you gave expected ranges for your variables
Also, I’m not sure I understand the speedup. Is it latency or throughput?
What does uwplse mean? I mixed up the letters and misread it as ulp-wise, which works for the project, haha.
I'm not at UW any more, I'm now at Utah, but some of the Herbie team is at UW and they provide the infrastructure
In the limit, an alternative with 10x better accuracy when x>10^150 and 10x worse in 1<x<10^150 would rank higher :) but more generally, not all inputs are equally important.
Furthermore, floats have underflow to 0 and overflow to infinity, which screw all this up because it can lead to infinite relative error.
Because of this you have some of the funny cases reported elsewhere in this thread :p
I'm not sure what would be a better approach though. Weigh the scores with a normal distribution around 0? Around 1? Exponents around 0?
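The overflow-to-infinity failure mode mentioned above is easy to reproduce; a minimal Python sketch of biggest-factor scaling (with the standard library's hypot, which does the scaling for you):

```python
import math

x, y = 3e200, 4e200

# Naive 2-norm: x*x is ~9e400, far past the double max of ~1.8e308
naive = math.sqrt(x * x + y * y)            # inf

# Scale by the biggest factor so intermediates stay in range
m = max(x, y)
scaled = m * math.sqrt((x / m) ** 2 + (y / m) ** 2)   # ~5e200

print(naive)             # inf
print(scaled)            # ~5e200
print(math.hypot(x, y))  # ~5e200
```

Relative error against inf is undefined, which is exactly why these cases distort any averaged-error score.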
It's true that averages can be misleading but we encourage users to think about it instead as a percentage of inputs. In practice the error distribution is very bimodal, the two modes being "basically fine" (a few ulps of error) and "garbage" (usually 0 instead of some actual value)
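The two modes can be seen directly on the sqrt(x+1) - 1 example from earlier in the thread; a hedged Python sketch, using math.ulp (Python 3.9+) as a crude ULP-distance measure:

```python
import math

def ulps_between(a, b):
    # Crude ULP distance for nearby same-sign doubles
    return abs(a - b) / math.ulp(max(abs(a), abs(b)))

def naive(x):
    return math.sqrt(x + 1.0) - 1.0        # cancels for small x

def accurate(x):
    return x / (math.sqrt(x + 1.0) + 1.0)  # conjugate rewrite

# "Basically fine" mode: moderate x, the forms agree to a few ulps
print(ulps_between(naive(2.0), accurate(2.0)))
# "Garbage" mode: tiny x, naive collapses to exactly 0.0
print(naive(1e-18), accurate(1e-18))
```

There's almost nothing in between: as x shrinks past the point where x + 1.0 rounds to 1.0, the naive form jumps from a few ulps of error straight to returning 0.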