That applies to x86 for sure. But the designers of ARM and RISC-V had the foresight to standardise the implementation of rsqrt to make it deterministic ... on each respective platform. But on either, the precision is only 7 bits.
Recent AMD and Intel x86-64 processors use 11 input bits, and the results are similar enough that only 22 results over the whole input range are different. Source: <https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-e...>
Or you can just take a square root and divide; these are (partially-)pipelined operations on modern CPUs, with _much_ shorter latency than they had when "fast inverse square root" was a thing.
It still has a niche, but that niche is very, very narrow today.
Apart from functions like exp and the trig functions, DIY approximations are usually not worth it these days. And FP division getting faster has made even those functions faster, since their implementations can now use rational approximations rather than pure polynomials.
In this case, the Q_rsqrt actually seems to provide a 2-4x speedup compared to the reproducible naive method.
Another, admittedly niche, scenario is that numerics code can use lower-precision types to emulate higher-precision types (e.g., represent a number with a pair of doubles).
Anyway, as with other BOINC projects, they sent each work unit (simulation run) to at least two (or was it three?) different computers and compared results, to ensure correctness. And they found that they got quite a lot of work units which disagreed and had to be sent to more computers for validation.
After some digging and eliminating factors like overclocked CPUs, they found that usually, all Intel machines would agree and all AMDs would agree, but Intel and AMD would disagree. Like, a run that would hit the detector wall after 30 revolutions on Intels could go on for many thousands of revolutions on AMDs.
Further digging led to the discrepancy in lower bits of transcendental operations in the FPUs[2]. After switching to a software library for these operations, at the cost of a few % in speed, they got Intels and AMDs to agree.
So yeah, when you do a large number of iterated operations like this, even a single LSB of difference can lead to issues.
As an aside, the LHC@Home was initially run almost like a hobby project by a few researchers connected to the LHC, without much official support. However, the data the project produced was AFAIK highly beneficial to the machine commissioning, and it later became a more official part of the High Luminosity upgrade.
[1]: https://cds.cern.ch/record/1463291/files/CERN-ATS-2012-159.p...
Classic examples include most under-constrained randomized algorithms, like training a neural network. Rejection sampling is required to accurately produce some sorts of randomness, and without LSB determinism it yields bifurcations in the initialization. The complicated loss landscape then virtually guarantees you'll converge to a different minimum. Even with a deterministic seed, algorithms guaranteed to converge, a principled way to ensure that concurrent computations yield bitwise identical results regardless of execution order, and most of your operations being bitwise identical, a few stray LSB issues in your inverse square roots or transcendentals will still all but ensure that the final result isn't even approximately the same.
As to why that latter thing matters, it varies, but at a minimum reproducibility makes lots of federated processes cheaper (and not just federated in the "folding at home" sense, but generally whenever some people are performing computations and other people are taking actions based on them -- being able to explain credit scores or parole denials or whatever, or validating that several people you trust produce the same compiled binary). Bitwise reproducibility would be better, but even being approximately right would probably be good enough, and that isn't tenable without bitwise-identical building blocks.
OK, but doesn't Intel internally use 80 bits of precision for float64 computations? If that's the case, you can't even guarantee that float64 multiplication and addition behave the same on, say, x86 vs ARM.
Also,
-funsafe-math-optimizations
Fun, safe math optimizations should be turned on by default! ;)
I could not write a better explanation than Daniel Lemire does on his blog:
https://lemire.me/blog/2023/04/06/are-your-memory-bound-benc...
I could not explain it better than Emery Berger does in this talk: https://youtu.be/r-TLSBdHe1A?t=10m41s Key point: “Layout changes can shift performance by +/-40%”
Not sure it matters so much on small in-cache programs, but worth thinking about when profiling larger programs.
> It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean. However, it happens much more easily with a log normal.
Shouldn't that read "it is not _impossible_"? Otherwise it is quite wrong. It's a Gaussian probability distribution; you certainly can be several standard deviations away from the mean, admittedly with smaller and smaller probability.
https://twitter.com/danieldekok/status/1484898130441166853?s...
It's quite a bit cheaper, because it doesn't need expensive elementary functions like exp or erf.
(We did add it to Thinc: https://thinc.ai/docs/api-layers#dish)
#include <math.h>
double mysqrt(double d) { return sqrt(d); }
with and without -fno-math-errno: https://godbolt.org/z/bvrz9r8ce
One can see that with -fno-math-errno the function can be entirely inlined. But if errno is enabled, it has to first check whether the input is negative, and in that case call the libc sqrt() function which sets errno.
As for why it's not the default, I guess it's historical. The errno approach was common back in the days before IEEE 754, with its exception model, provided another way.
E.g. for glibc: https://man7.org/linux/man-pages/man7/math_error.7.html
Musl libc, being newer, does away with that and never sets errno in libm functions: https://wiki.musl-libc.org/mathematical-library.html
#include <stdint.h>

float my_rsqrt( float number )
{
	const float threehalfs = 1.5F;
	float x2 = number * 0.5F;
	union {
		float y;
		uint32_t i; // must be a 32-bit type; `long` is 64 bits on LP64 targets
	} u;

	u.y = number;
	u.i = 0x5f3759df - ( u.i >> 1 );                 // what the fuck?
	u.y = u.y * ( threehalfs - ( x2 * u.y * u.y ) ); // 1st iteration
	return u.y;
}
Were unions not supported by the compilers back then?

I believe the benchmark program outputs the wrong units? It should be picoseconds (ps) instead of femtoseconds (fs).