NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!
NaN is just a value that represents an invalid floating point value. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.
"This sounds terrible" you might think.
But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.
What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is better than having a bad result go unrecognized.
NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
NaN is your friend!
You return (value, found), or (value, error), or Result<T>.
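In Go terms the suggestion looks something like this (hypothetical `find` helper):

```go
package main

import "fmt"

// find returns the index of target in xs plus a "found" flag, instead of
// overloading the index with a sentinel value like -1.
func find(xs []float64, target float64) (int, bool) {
	for i, x := range xs {
		if x == target {
			return i, true
		}
	}
	return 0, false
}

func main() {
	xs := []float64{1.5, 2.5, 3.5}
	if i, ok := find(xs, 2.5); ok {
		fmt.Println("found at", i) // found at 1
	}
	if _, ok := find(xs, 9.0); !ok {
		fmt.Println("not found") // the caller is forced to handle this case
	}
}
```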
>NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
You return an error and handle the error. You want to know whether the sensor is wonky or returning bad data.
Also, technically you should use a signalling NaN for "this is an error" and a quiet NaN for "this is just an impossible math result", which makes it even more error prone. Just return a fucking error if it is a function.
Sure, NaN is useful inside expressions, but the handling should happen then and there, and if a function can fail it should return the error explicitly; otherwise you have different error handling for different kinds of functions.
> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
I'd like to see how much code actually uses that as a feature, rather than just setting it to 0.0 (or initializing it right away).
And this is great for environments that can support it, but as the levels get lower and lower, such safety nets become prohibitively expensive.
Take data formats, for example. Say we have a small device that records ieee754 binary float32 readings. A simple format might be something like this:
record = reading* terminator;
reading = float(32, ~) | invalid;
invalid = float(32, snan);
terminator = uint(32, 0xffffffff);
We use a signaling NaN to record an error in the sensor reading, and we use the encoding 0xffffffff (which is a quiet NaN) to mark the end of the record. If we wanted the validity signaling to be out-of-band, we'd need to encode it as such; perhaps as a "validity" bit preceding each record:
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(1, ~);
terminator = uint(1, 1) & uint(32, 0xffffffff);
Now the format is more complicated, and we also have alignment problems due to each record entry being 33 bits. We could use a byte instead and lose a little to bloat:
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(8, ~);
terminator = uint(8, 1) & uint(32, 0xffffffff);
But we're still unaligned (40 bits per record), which will slow down ingestion. We could fix that by using a 32-bit validity "bit":
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(32, ~);
terminator = uint(32, 1) & uint(32, 0xffffffff);
But now we've doubled the size of the data format. Or perhaps we keep it as a separate bit array, padded to a 32-bit boundary to deal with alignment issues:
record = bind(count,count_field) & pad(32, validity{count.value}, padding*) & reading{count.value};
count_field = uint(32,bind(value,~));
reading = float(32, ~);
validity = uint(1, ~);
padding = uint(1, 0);
But now we've lost the ability to do ad-hoc appends (we have to precede each record with a length), and the format is becoming a lot more complicated.

None in an Optional/Maybe, an Error in a Result, Null in a nullable type, an exception.
Really though, I'm a fan, I just think that we need better means for checking them in legacy languages and we need to entirely do away with "fast math" optimizations.
Which languages do not have a function to test for NaN?
> and the lure of people who want to turn on "fast math" optimizations which do things like assume NaNs aren't possible and then dead-code-eliminate everything that's guarded by an "x != x" test.
This is not unique to NaNs. There are plenty of potential floating point problems if you enable those flags.
Well, allegedly in D at least https://www.reddit.com/r/rust/comments/a1w75c/the_bug_i_did_...
It's not often that I get to correct Mr D himself, but 1.0/0.0 is...
root -1 is i, and we get into complex numbers. If I ever end up needing to deal with a square root of a number that might be -1 then I'll do it the old-fashioned way: test for that and sort it. What is the problem here? root -1 is well defined!
Equally (lol) 1.0/0.0 and 1/0 can be tricky to handle, but not beyond the wit of a person. In those cases it is syntax and a few other things. Is 1.0 === 1.00 etc? Define your rules, work in the space that you have defined, and all will be fine.
It is a complete nonsense to require "real" world maths to be correct in your program. Your program is your little world. If it needs to deal with math like concepts then that is down to you how it works. In your programming language you get some methods and functions and they often have familiar names but they do not work like the "real thing" unless you get them to.
numpy and the like exist for a very good reason: A general purpose programming language will cock up very soon.
Do basic and very simple arithmetic in your chosen language if it is well formed, switch to sophisticated addons as soon as is practicable.
An exception would be better. Then you immediately get at the first problem instead of having to track down the lifetime of the observed problem to find the first problem.
It depends on the context. Sometimes NaNs are expected and to be ignored. Sometimes they signal a problem.
This is definitely something I like about D, but I'd much prefer a compiler error. double x; double y = x+1.5; is less than optimal.
> What do you return for an index into the array?
An option/maybe type would solve this much better.
> Yes, I know, it can be clumsy to trace it back to its source
An exception would be much better, alerting you to the exact spot where the problem occurred.
NaN's are already an option type, although implemented in hardware. The checking comes for free.
> An exception would be much better
You can configure the FPU to cause an Invalid Operation Exception, but I personally don't find that attractive.
So technically Python is correct when it decided that 0.0/0.0 should raise an exception instead of just quietly returning NaN. Raising an exception is a standards-conforming option.
https://stackoverflow.com/questions/18118408/what-is-the-dif...
Loud and clear on the other point. I work with ROS a lot and the message definitions do not support NaN or null. So we have these magic numbers all over the place for things like “the robot does not yet know where it is.” (therefore: on the moon.)
That's not true! maxNum(nan, x) = x.
So that entirely depends on how max() is implemented. A naive implementation of max() might just as easily instead return NaN for that. Or if max(NaN, x) is x, then it may give NaN for max(x, NaN).
Note that the fact that comparisons always return false also means that sorting an array of floating point values that contain NaNs can very easily break a comparison-based sorting algorithm!! (I've seen std::sort() in C++, for example, crash because NaNs break the strict weak ordering [1] requirement.)
https://754r.ucbtest.org/background/minNum_maxNum_Removal_De...
With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy blurry values and that any operation always has some error / is "random", etc. You won't find it surprising why 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0 will always give 3.0, but 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.
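Those three examples are easy to check; a quick Go sketch (using variables on purpose, since Go evaluates untyped constant expressions exactly at compile time):

```go
package main

import "fmt"

func main() {
	a, b := 0.1, 0.2
	fmt.Println(a+b == 0.3) // false: neither 0.1 nor 0.2 is exactly representable

	c, d := 1.0, 2.0
	fmt.Println(c+d == 3.0) // true: small integers are exact in binary64

	x, y := 1e23, 2e23
	fmt.Println(x+y == 3e23) // false: at this magnitude the spacing between floats is in the millions
}
```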
Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the “that’s all for now” section at the end.
The page layout example ("example 7") illustrates the kind of issue that led Knuth to avoid floating-point arithmetic in TeX (except where it doesn't matter) and do everything with scaled integers (fixed-point arithmetic). (It was even worse back then, before IEEE 754.)
I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.
The common cause of floating point problems is usually treating them as a mathematical ideal. The quirks appear at the extremes, when you try to do un-physical things with them. You can't measure exactly 0 V with a voltmeter, or use an instrument for measuring the distance to stars and then add a length obtained from a micrometer without entirely losing the latter's contribution.
You could argue that it's "safe" to distrust floating-point entirely, but I find it more comforting to be able to take at least some things as solid and reason about them, to refine my mental model of when errors can happen and not happen, etc.
Edit: See also the floating point isn’t “bad” or random section that the author just added to the post (https://twitter.com/b0rk/status/1613986022534135809).
Some CPU have the instruction FMA(a,b,c) = ab + c and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.
However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:
p = uv + wu
Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.
With FMA, the compiler might optimize this code to:
p = FMA(u, v, wu)
That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!
So even for a simple formula like this, testing whether two vectors are orthogonal would not necessarily work by testing whether the result is exactly zero. One recommended workaround in this case is to test whether the dot product has an absolute value smaller than a small threshold.
Yes! Different people can make different performance-vs-correctness trade-offs, but I also think reproducible-by-default would be better.
Fortunately, specifying a proper standard (e.g. -std=c99 or -std=c++11) implies -ffp-contract=off. I guess specifying such a standard is probably a good idea independently when we care about reproducibility.
Edit: Thinking about it, in the days of 80-bit x87 FPUs, strictly following the standard (specifically, always rounding to 64 bits after every operation) may have been prohibitively expensive. This may explain gcc's GNU mode defaulting to -ffp-contract=fast.
A central primitive in 2D computational geometry is the orientation problem; in this case deciding whether a point lies to the left or right of a line. In real arithmetic, the classic way to solve it is to set up the line equation (so the value is zero for points on the line), then evaluate that for the given point and test the sign.
The problem is of course that for points very near the line, roundoff error can give the wrong answer, it is in fact an example of cancellation. The problem has an exact answer, and can be solved with rational numbers, or in a related technique detecting when you're in the danger zone and upping the floating point precision just in those cases. (This technique is the basis of Jonathan Shewchuk's thesis).
However, in work I'm doing, I want to take a different approach. If the y coordinate of the point matches the y coordinate of one of the endpoints of the line, then you can tell orientation exactly by comparing the x coordinates. In other cases, either you're far enough away that you know you won't get the wrong answer due to roundoff, or you can subdivide the line at that y coordinate. Then you get an orientation result that is not necessarily exactly correct wrt the original line, but you can count on it being consistent, which is what you really care about.
So the ironic thing is that if you had a lint that said, "exact floating point equality is dangerous, you should use a within-epsilon test instead," it would break the reasoning outlined above, and you could no longer count on the orientations being consistent.
As I said, though, this is a very special case. Almost always, it is better to use a fuzzy test over exact equality, and I can also list times I've been bitten by that (especially in fastmath conditions, which are hard to avoid when you're doing GPU programming).
Back in university I took part in a programming competition. I don't remember the exact details of the problem, but it was meant to be solved with dynamic programming, with dp[n][n] as the answer, n < 1000. But, wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, so the answer was just the sum of the first n terms of the harmonic series. Unluckily for us, the intended solution had worse precision, and ours failed.
The task was to output the answer to within 10^-6 precision, which their solution didn't achieve. Funnily enough, a number of other teams went the "correct" route and passed (as they were doing the additions in the same order as the original solution).
There is a simple workaround for this:
https://en.wikipedia.org/wiki/Kahan_summation
It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.
Y += (X-Y) * alpha * dt
When dt is small and alpha is too, the right hand side can be too small to affect the 24bit mantissa of the left.
I prefer a 16/32bit fixed point version that guarantees convergence to any 16bit steady state. This happened in a power conversion system where dt=1/40000 and I needed a filter in the tens of Hz.
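A rough Go illustration of the stall (the constants are made up to match the scenario):

```go
package main

import "fmt"

func main() {
	var y float32 = 100.0 // filter state
	x := float32(101.0)   // target the filter should converge toward
	alpha := float32(1e-3)
	dt := float32(1.0 / 40000.0)

	delta := (x - y) * alpha * dt // about 2.5e-8
	fmt.Println(y+delta == y)     // true: delta is below half an ulp of 100.0
	// ulp(100.0) in float32 is about 7.6e-6, so the update is lost
	// and Y can never reach X at this rate.
}
```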
OTOH, it’s easy to implement, so I have a couple of functions to do it easily, and I got quite a lot of use out of them. It’s probably overkill sometimes, but sometimes it’s useful.
I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.
It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).
There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.
Thankfully the solution to that problem came when x86 (32 bit) mostly disappeared.
Note that this was back in January of 2011, possibly earlier. Back then gcc was still the default compiler on macos, and I think Apple was still 32-bit. (I don't remember when they switched to 64-bit, but I have a 32-bit only game that I bought near the end of that year.)
The situation is quite different these days, I don't think any compiler still uses the old registers by default.
Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations:
1. Numerical error is a fact of life, you can only delay it or move it to another part of your computation, but you cannot get rid of it.
2. You cannot avoid working with very small or very large things because your users are going to try, and floating point or not, you'd better have a plan ready.
3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).
But sure, don't use floats for ID numbers, that's always a problem. In fact, don't use bigints either, nor any other arithmetic type for something you won't be doing arithmetic on.
I completely agree with you even though I go out of my way to avoid FP, and even though, due to what I usually work on, I can often get away with avoiding FP (often fixed point works -- for me).
IEEE-754 is a marvelous standard. It's a short, easy to understand standard attached to an absolutely mind boggling number of special cases or explanation as to why certain decisions in the simple standard were actually incredibly important (and often really smart and non-obvious). It's the product of some very smart people who had, through their careers, made FP implementations and discovered why various decisions turned out to have been bad ones.
I'm glad it's in hardware, and not just because FP used to be quite slow and different on every machine. I'm glad it's in hardware because chip designers (unlike most software developers) are anal about getting things right, and implementing FP properly is hard -- harder than using it!
You can do exact real arithmetic. But this is only done by people who prove theorems with computers - or by the Android calculator! https://en.wikipedia.org/wiki/Computable_analysis
Other alternatives (also niche) are exact rational arithmetic, computer algebra, arbitrary precision arithmetic.
Fixed point sometimes gets used instead of floats because some operations on it lose no precision, but most operations still do.
Same for arbitrary-precision calculations like big rationals. That just gives you as much precision as your computer can fit in memory. You will still run out of precision, just later rather than sooner.
One half : )
One thing that I suspect trips people a lot is decimal string/literal <-> (binary) float conversions instead of the floating point math itself. This includes the classic 0.1+0.2 thing, and many of the problems in the article.
I think these days using floating point hex strings/literals more would help a lot. There are also decimal floating point numbers that people largely ignore despite being standard for over 15 years
> Floating point is amazingly useful!
Another thing about floats is that they are for the most part actually very predictable. In particular, all basic operations should produce bit-exact, correctly rounded results. Also, because they follow a language-independent standard, you can generally get the same behavior in different languages and platforms. This makes learning floats properly worthwhile, because the knowledge is so widely applicable.
As long as you are not using a compiler that uses x87's extended-precision floats for intermediate calculations and silently rounds whenever it moves a value to memory (that used to be a common issue), and as long as you are not doing dumb stuff with compiler math flags.
Also, if you have any code anywhere in your program that relies on correct subnormal handling, then you need to be absolutely sure no code is compiled with `-ffast-math`, including in any dynamically loaded code in your entire program, or your math will break: https://simonbyrne.github.io/notes/fastmath/#flushing_subnor...
And of course if you are doing anything complicated with floating point number, there are entire fields of study about creating numerically stable algorithms, and determining the precision of algorithms with floating point numbers.
> There's a reason why it's implemented in hardware in all modern computers
Yah, legacy.
The reason we used it originally is that computers were small and slow. Now that they're big and fast we could do without it, except that there is already so much hardware and software out there that it will never happen.
I think fixed-point math is underrated though.
Also: The JSON example is nasty. Should IDs then always be strings?
Specs, vs. their implementations, vs. backwards compat. JSON just defines a number type, and neither the grammar nor the spec places limits on it (though the spec does call out exactly this problem). So the JSON is to-spec valid. But implementations have limits as to what they'll decode: JS's is that it decodes to number (a double) by default, and thus, loses precision.
(I feel like this issue is pretty well known, but I suppose it probably bites everyone once.)
JS does have the BigInt type, nowadays. Unfortunately, while the JSON.parse API includes a "reviver" parameter, the way it ends up working means that it can't actually take advantage of BigInt.
> Should IDs then always be strings?
That's a decent-ish solution; as it side-steps the interop issues. String, to me, is not unreasonable for an ID, as you're not going to be doing math on it.
Backward error: the error between the given question, and the question whose right answer is the given answer.
Easier to parse like this: a small forward error means that you give an answer close to the right one.
A small backward error means that the answer you give is the right answer for a nearby question.
The way I approach this[1] is to choose the sign of the ± to be the same as -b to compute the first root x1. Then the second root is c/(a * x1). There are a few other checks in the code for things overflowing, but that basically gives you very accurate answers across the range.
This is explained a bit in a good Math Exchange post[2]. Jim Blinn's "How to solve a quadratic equation" also makes good reading.
Wait til you get to cubics and quartics.
[1]: https://docs.rs/kurbo/latest/src/kurbo/common.rs.html#116-16...
[2]: https://math.stackexchange.com/questions/866331/numerically-...
It's worse than that: You can use the same language, compiler, library, machine, and still get different results if your OS is different.
I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.
OS's make a "default" choice. I think Linux defaults to slower, but more accurate, and BSD defaults to faster, but less accurate.
There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.
> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:
> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl
IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64 bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and this would be considered "very inaccurate" by FP standards).
A thousand thanks for not saying "addition is not commutative".
(Addition is commutative in floating point. It merely is not associative).
Can anyone justify this? Do JS developers prefer not having exact integers, or is this something that everyone just kinda deals with?
If you're very careful, a double can be an integer type. (A 53-bit one, I think?) (I don't love this line of thinking. It has a lot of sharp edges. But JS programmers effectively do this all the time, often without thinking too hard about it.)
(And even before BigInt, there's an odd u32-esque "type" in JS; it's not a real type — it doesn't appear in the JS type system, but rather an internal one that certain operations will be converted to internally. That's why (0x100000000 | 0) == 0 ; even though 0x100000000 (and every other number in that expression, and the right answer) is precisely representable as a f64. This doesn't matter for JSON decoding, though, … and most other things.)
Also there are typed arrays and bigints if we can throw those in, too.
What do you mean? Floating-point arithmetic is, by design, exact for small integers. The result of adding 2.0 to 3.0 is exactly 5.0. This is one of the few cases where it is perfectly legitimate to compare floats for equality.
In fact, using 64-bit doubles to represent ints you get way more ints than using plain 32-bit ints. Thus, choosing doubles to represent integers makes perfect sense (unless you worry about wasting a bit of memory and performance).
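A quick Go check of where that exactness ends (2^53 for binary64):

```go
package main

import "fmt"

func main() {
	// Every integer of magnitude up to 2^53 is exactly representable
	// in a float64, so integer arithmetic in that range is exact.
	two53 := float64(1 << 53) // 9007199254740992

	fmt.Println(two53-1 == 9007199254740991) // true: still exact below 2^53
	fmt.Println(two53+1 == two53)            // true: 2^53+1 rounds back down
}
```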
Does JS at least write ".0" at the end when converting a number to a string? Or switch to scientific notation for large numbers?
Edit: x87 has FPREM1 which can calculate a remainder (accurately one hopes), but I can’t find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language’s library and/or compiler? Is this a library/language bug rather than a Floating Point gotcha?
The division of these numbers is 2.9999998957049091386350361962468173875300478102103478639802753918, but the nearest float to that is 3. (Exactly 3.) [2]
The modulo operation can (presumably) determine that 3X > Y, so the modulo is Y - 2X, as normal.
This gives inconsistent results, if you don't know that every float is actually a range, and "3" as a float includes some numbers that are smaller than 3.
[1] https://www.wolframalpha.com/input?i=13.715999603271484375+%... [2] https://www.wolframalpha.com/input?i=2.999999895704909138635..., then https://float.exposed/0x40400000
• "13.716" means 13.7159999999999993037 (https://float.exposed/0x402b6e978d4fdf3b)
• "4.572" means 4.57200000000000006395 (https://float.exposed/0x401249ba5e353f7d)
• "13.716 / 4.572" means the nearest representable value to 13.7159999999999993037 / 4.57200000000000006395 which (https://www.wolframalpha.com/input?i=13.7159999999999993037+...) is 3.0 (https://float.exposed/0x4008000000000000)
• "13.716 % 4.572" means the nearest representable value to 13.7159999999999993037 % 4.57200000000000006395 namely to 4.5719999999999991758 (https://www.wolframalpha.com/input?i=13.7159999999999993037+...), which is 4.57199999999999917577 (https://float.exposed/0x401249ba5e353f7c) printed as 4.571999999999999.
————————
Edit: For a useful analogy (answering the GP), imagine you're working in decimal fixed-point arithmetic with two decimal digits (like dollars and cents), and someone asks you for 10.01/3.34 and 10.01%3.34. Well,
• 10.01 / 3.34 is well over 2.99 (it's over 2.997 in fact) so you'd be justified in answering 3.00 (the nearest representable value).
• 10.01 % 3.34 is 3.33 (which you can represent exactly), so you'd answer 3.33 to that one.
(For an even bigger difference: try 19.99 and 6.67 to get 3.00 as quotient, but 6.65 as remainder.)
It is a problem that appears whenever you compose an inexact function, like the conversion from decimal to binary, with a function that is not continuous, like the remainder a.k.a. modulo function.
In decimal, 13.716 is exactly 3 times 4.572, so any kind of remainder must be zero, but after conversion from decimal to binary that relationship is no longer true, and because the remainder is not a continuous function its value may be wildly different from the correct value.
When you compute with approximate numbers, like the floating-point numbers, as long as you compose only continuous functions, the error in the final result remains bounded and smaller errors in inputs lead to a diminished error in the output.
However, it is enough to insert one discontinuous function in the computation chain for losing any guarantee about the magnitude of the error in the final result.
The conclusion is that whenever computing with approximate numbers (which may also use other representations, not only floating-point) you have to be exceedingly cautious when using any function that is not continuous.
package main

import "fmt"

func main() {
	var i, iterations, meters float64
	for iterations = 100_000_000; i < iterations; i++ {
		meters += 0.01
	}
	// Expected: 1000.000000 km
	fmt.Printf("Expected: %f km\n", 0.01*iterations/1000)
	// Got: 1000.000001 km
	fmt.Printf("Got: %f km\n", meters/1000)
}

Soon enough I had two consecutive releases in hand, which produced different results, and which had identical numerical code. The only code I had changed that ran during the numerical calculations was code that ran between iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.
How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.
The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.
Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those value into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.
Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining) the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.
It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup. I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.
Floating point itself is an implementation on top of that.
With a slide rule it's basically truncated integers all the way: you're very limited in significant figures, and you float the point in your head or some other way and apply it afterward. You're constantly aware of any step where your significant figures drop below 3, because a drop from 3 to 2 is drastic. A drop to 1 can really slap you in the face.
The number of decimal places is also naturally an integer, which is related to scientific notation. Whether a result is positive or negative is a simple flag.
On a proper computer, integer addition, subtraction, and multiplication are exact. It's division that is not: the quotient is truncated when written to memory at the bitness in use at the time, storing only the whole-number portion.
Integer arithmetic can usually be made as exact as you want. It helps to consciously minimize divisions and, if necessary, to offset or scale the operands beforehand so that the full range of quotients never comes close enough to zero for truncation to affect your immediate result, within the number of significant figures your ultimate result needs to be completely reliable.
The number of significant figures available on even an 8-bit computer just blows away the slide rule, but only if you take full advantage of it.
What sometimes happens is that zero-crossing functions, when the quotients are not offset, fluctuate: for large values they naturally take advantage of the full bitness of the hardware (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.
If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might sometimes come out worse than a slide rule's.
Edit: IOW, when the final reportable result is a floating point number where you need an accurate idea of how many figures to the right of the decimal place are truly valid, it may be possible to use all-integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.
I thought the JSON spec was that a number is of any precision and that it’s up to the parser to say, “best I can do is 754.”
That is, I think you can deserialize very long decimal numbers to higher accuracy in some languages.
They learn by experience that they also slowly accumulate errors and end up with a transformation matrix A that's no longer orthogonal and will skew the image.
Or people try to solve large linear systems of floating point numbers with a naive Gaussian elimination approach and end up with noise. Same with naive iterative eigenvector calculations.
I wonder if people sometimes make a one-element integer array this way so they can have an integer to work with.
The PATRIOT missile error (it wasn't a disaster) was more due to the handling of timestamps than just floating point deviation. There were several concurrent failures that allowed the SCUD to hit its target. IIRC the clock drift was significant and was magnified by being converted to floating point and, importantly, truncated into a 24-bit register. Moreover, they weren't "slightly off". The clock drift alone put the missile considerably off target.
While I don't claim that floating point didn't have a hand in this error, it's likely that correct handling of timestamps would not have introduced the problem in the first place. Unlike the other examples given, this one is a better example of knowing your system and problem domain rather than simply forgetting to calculate a delta or being unaware of the limitations of IEEE 754. "Good enough for government work" struck again here.
> 1. it has a funny name
Reasoning accepted!