NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!
NaN is just a value that represents an invalid floating point value. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.
"This sounds terrible" you might think.
But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.
What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is better than having a bad result go unrecognized.
NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
NaN is your friend!
You return (value, found), or (value, error), or Result<T>.
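In Go terms the suggestion looks something like this (hypothetical `find` helper):

```go
package main

import "fmt"

// find returns the index of target in xs plus a "found" flag, instead of
// overloading the index with a sentinel value like -1.
func find(xs []float64, target float64) (int, bool) {
	for i, x := range xs {
		if x == target {
			return i, true
		}
	}
	return 0, false
}

func main() {
	xs := []float64{1.5, 2.5, 3.5}
	if i, ok := find(xs, 2.5); ok {
		fmt.Println("found at", i) // found at 1
	}
	if _, ok := find(xs, 9.0); !ok {
		fmt.Println("not found") // the caller is forced to handle this case
	}
}
```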
>NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
You return an error and handle the error. You want to know whether the sensor is wonky or returning bad data.
Also, technically you should use a signalling NaN for "this is an error" and a quiet NaN for "this is just an impossible math result", which makes it even more error prone. Just return a fucking error if it is a function.
Sure, NaN is useful inside expressions, but the handling should happen then and there, and if a function can fail it should return the error explicitly; otherwise you have different error handling for different kinds of functions.
> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
I'd like to see how much code actually uses that as a feature, rather than just setting it to 0.0 (or initializing it right away).
And this is great for environments that can support it, but as the levels get lower and lower, such safety nets become prohibitively expensive.
Take data formats, for example. Say we have a small device that records ieee754 binary float32 readings. A simple format might be something like this:
record = reading* terminator;
reading = float(32, ~) | invalid;
invalid = float(32, snan);
terminator = uint(32, 0xffffffff);
We use a signaling NaN to record an error in the sensor reading, and we use the encoding 0xffffffff (which is a quiet NaN) to mark the end of the record. If we wanted the validity signaling to be out-of-band, we'd need to encode it as such; perhaps as a "validity" bit preceding each record:
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(1, ~);
terminator = uint(1, 1) & uint(32, 0xffffffff);
Now the format is more complicated, and we also have alignment problems due to each record entry being 33 bits. We could use a byte instead and lose a little to bloat:
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(8, ~);
terminator = uint(8, 1) & uint(32, 0xffffffff);
But we're still unaligned (40 bits per record), which will slow down ingestion. We could fix that by using a 32-bit validity "bit":
record = reading* terminator;
reading = valid_bit & float(32, ~);
valid_bit = uint(32, ~);
terminator = uint(32, 1) & uint(32, 0xffffffff);
But now we've doubled the size of the data format. Or perhaps we keep it as a separate bit array, padded to a 32-bit boundary to deal with alignment issues:
record = bind(count,count_field) & pad(32, validity{count.value}, padding*) & reading{count.value};
count_field = uint(32,bind(value,~));
reading = float(32, ~);
validity = uint(1, ~);
padding = uint(1, 0);
But now we've lost the ability to do ad-hoc appends (we have to precede each record with a length), and the format is becoming a lot more complicated.

None in an Optional/Maybe, an Error in a Result, Null in a nullable type, an exception.
Really though, I'm a fan, I just think that we need better means for checking them in legacy languages and we need to entirely do away with "fast math" optimizations.
Which languages do not have a function to test for NaN?
> and the lure of people who want to turn on "fast math" optimizations which do things like assume NaNs aren't possible and then dead-code-eliminate everything that's guarded by an "x != x" test.
This is not unique to NaNs. There are plenty of potential floating point problems if you enable those flags.
Well, allegedly in D at least https://www.reddit.com/r/rust/comments/a1w75c/the_bug_i_did_...
It's not often that I get to correct Mr D himself, but 1.0/0.0 is...
root -1 is i, and we get into complex numbers. If I ever end up needing to deal with a square root of a number that might be -1 then I'll do it the old-fashioned way: test for that and sort it. What is the problem here? root -1 is well defined!
Equally (lol) 1.0/0.0 and 1/0 can be tricky to handle, but not beyond the wit of a person. In those cases it is syntax and a few other things. Is 1.0 === 1.00 etc? Define your rules, work in the space that you have defined, and all will be fine.
It is a complete nonsense to require "real" world maths to be correct in your program. Your program is your little world. If it needs to deal with math like concepts then that is down to you how it works. In your programming language you get some methods and functions and they often have familiar names but they do not work like the "real thing" unless you get them to.
numpy and the like exist for a very good reason: A general purpose programming language will cock up very soon.
Do basic and very simple arithmetic in your chosen language if it is well formed, switch to sophisticated addons as soon as is practicable.
An exception would be better. Then you immediately get at the first problem instead of having to track down the lifetime of the observed problem to find the first problem.
It depends on the context. Sometimes NaNs are expected and to be ignored. Sometimes they signal a problem.
This is definitely something I like about D, but I'd much prefer a compiler error. double x; double y = x+1.5; is less than optimal.
> What do you return for an index into the array?
An option/maybe type would solve this much better.
> Yes, I know, it can be clumsy to trace it back to its source
An exception would be much better, alerting you to the exact spot where the problem occurred.
NaN's are already an option type, although implemented in hardware. The checking comes for free.
> An exception would be much better
You can configure the FPU to cause an Invalid Operation Exception, but I personally don't find that attractive.
So technically Python is correct when it decided that 0.0/0.0 should raise an exception instead of just quietly returning NaN. Raising an exception is a standards-conforming option.
https://stackoverflow.com/questions/18118408/what-is-the-dif...
Loud and clear on the other point. I work with ROS a lot and the message definitions do not support NaN or null. So we have these magic numbers all over the place for things like “the robot does not yet know where it is.” (therefore: on the moon.)
That's not true! maxNum(nan, x) = x.
So that entirely depends on how max() is implemented. A naive implementation of max() might just as easily instead return NaN for that. Or if max(NaN, x) is x, then it may give NaN for max(x, NaN).
Note that the fact that comparisons always return false also means that sorting an array of floating point values that contain NaNs can very easily break a comparison-based sorting algorithm!! (I've seen std::sort() in C++, for example, crash because NaNs break the strict weak ordering [1] requirement.)
https://754r.ucbtest.org/background/minNum_maxNum_Removal_De...
With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy blurry values and that any operation always has some error / is "random", etc. You won't find it surprising why 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0 will always give 3.0, but 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.
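Those three examples are easy to check; a quick Go sketch (using variables on purpose, since Go evaluates untyped constant expressions exactly at compile time):

```go
package main

import "fmt"

func main() {
	a, b := 0.1, 0.2
	fmt.Println(a+b == 0.3) // false: neither 0.1 nor 0.2 is exactly representable

	c, d := 1.0, 2.0
	fmt.Println(c+d == 3.0) // true: small integers are exact in binary64

	x, y := 1e23, 2e23
	fmt.Println(x+y == 3e23) // false: at this magnitude the spacing between floats is in the millions
}
```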
Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the “that’s all for now” section at the end.
The page layout example ("example 7") illustrates the kind of issue that led Knuth to avoid floating-point arithmetic in TeX (except where it doesn't matter) and do everything with scaled integers (fixed-point arithmetic). (It was even worse back then, before IEEE 754.)
I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.
The common cause of floating point problems is usually treating them as a mathematical ideal. The quirks appear at the extremes, when you try to do un-physical things with them. You can't measure exactly 0 V with a voltmeter, or use an instrument for measuring the distance to stars and then add a length obtained from a micrometer without entirely losing the latter's contribution.
You could argue that it's "safe" to distrust floating-point entirely, but I find it more comforting to be able to take at least some things as solid and reason about them, to refine my mental model of when errors can happen and not happen, etc.
Edit: See also the floating point isn’t “bad” or random section that the author just added to the post (https://twitter.com/b0rk/status/1613986022534135809).
Some CPU have the instruction FMA(a,b,c) = ab + c and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.
However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:
p = uv + wu
Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.
With FMA, the compiler might optimize this code to:
p = FMA(u, v, wu)
That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!
So even for a simple formula like this, testing whether two vectors are orthogonal would not necessarily work by testing whether the result is exactly zero. One recommended workaround in this case is to test whether the dot product has an absolute value smaller than a small threshold.
Yes! Different people can make different performance-vs-correctness trade-offs, but I also think reproducible-by-default would be better.
Fortunately, specifying a proper standard (e.g. -std=c99 or -std=c++11) implies -ffp-contract=off. I guess specifying such a standard is probably a good idea independently when we care about reproducibility.
Edit: Thinking about it, in the days of 80-bit x87 FPUs, strictly following the standard (specifically, always rounding to 64 bits after every operation) may have been prohibitively expensive. This may explain gcc's GNU mode defaulting to -ffp-contract=fast.
A central primitive in 2D computational geometry is the orientation problem; in this case deciding whether a point lies to the left or right of a line. In real arithmetic, the classic way to solve it is to set up the line equation (so the value is zero for points on the line), then evaluate that for the given point and test the sign.
The problem is of course that for points very near the line, roundoff error can give the wrong answer, it is in fact an example of cancellation. The problem has an exact answer, and can be solved with rational numbers, or in a related technique detecting when you're in the danger zone and upping the floating point precision just in those cases. (This technique is the basis of Jonathan Shewchuk's thesis).
However, in work I'm doing, I want to take a different approach. If the y coordinate of the point matches the y coordinate of one of the endpoints of the line, then you can tell orientation exactly by comparing the x coordinates. In other cases, either you're far enough away that you know you won't get the wrong answer due to roundoff, or you can subdivide the line at that y coordinate. Then you get an orientation result that is not necessarily exactly correct wrt the original line, but you can count on it being consistent, which is what you really care about.
So the ironic thing is that if you had a lint that said, "exact floating point equality is dangerous, you should use a within-epsilon test instead," it would break the reasoning outlined above, and you could no longer count on the orientations being consistent.
As I said, though, this is a very special case. Almost always, it is better to use a fuzzy test over exact equality, and I can also list times I've been bitten by that (especially in fastmath conditions, which are hard to avoid when you're doing GPU programming).
Back in university I took part in a programming competition. I don't remember the exact details of the problem, but it was meant to be solved with dynamic programming, with dp[n][n] as the answer, n < 1000. But, wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, so the answer was just the sum of the first n terms of the harmonic series. Unluckily for us, the intended solution had worse precision, and ours failed.
The task was to output the answer to within 10^-6 precision, which their solution didn't achieve. Funnily enough, a number of other teams went the "correct" route and passed (as they were doing the additions in the same order as the original solution).
There is a simple workaround for this:
https://en.wikipedia.org/wiki/Kahan_summation
It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.
Y += (X-Y) * alpha * dt
When dt is small and alpha is too, the right hand side can be too small to affect the 24bit mantissa of the left.
I prefer a 16/32bit fixed point version that guarantees convergence to any 16bit steady state. This happened in a power conversion system where dt=1/40000 and I needed a filter in the tens of Hz.
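A rough Go illustration of the stall (the constants are made up to match the scenario):

```go
package main

import "fmt"

func main() {
	var y float32 = 100.0 // filter state
	x := float32(101.0)   // target the filter should converge toward
	alpha := float32(1e-3)
	dt := float32(1.0 / 40000.0)

	delta := (x - y) * alpha * dt // about 2.5e-8
	fmt.Println(y+delta == y)     // true: delta is below half an ulp of 100.0
	// ulp(100.0) in float32 is about 7.6e-6, so the update is lost
	// and Y can never reach X at this rate.
}
```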
OTOH, it’s easy to implement, so I have a couple of functions to do it easily, and I got quite a lot of use out of them. It’s probably overkill sometimes, but sometimes it’s useful.
I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.
It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).
There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.
Thankfully the solution to that problem came when x86 (32 bit) mostly disappeared.
Note that this was back in January of 2011, possibly earlier. Back then gcc was still the default compiler on macos, and I think Apple was still 32-bit. (I don't remember when they switched to 64-bit, but I have a 32-bit only game that I bought near the end of that year.)
The situation is quite different these days, I don't think any compiler still uses the old registers by default.
Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations:
1. Numerical error is a fact of life, you can only delay it or move it to another part of your computation, but you cannot get rid of it.
2. You cannot avoid working with very small or very large things because your users are going to try, and floating point or not, you'd better have a plan ready.
3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).
But sure, don't use floats for ID numbers, that's always a problem. In fact, don't use bigints either, nor any other arithmetic type for something you won't be doing arithmetic on.
I completely agree with you even though I go out of my way to avoid FP, and even though, due to what I usually work on, I can often get away with avoiding FP (often fixed point works -- for me).
IEEE-754 is a marvelous standard. It's a short, easy to understand standard attached to an absolutely mind boggling number of special cases or explanation as to why certain decisions in the simple standard were actually incredibly important (and often really smart and non-obvious). It's the product of some very smart people who had, through their careers, made FP implementations and discovered why various decisions turned out to have been bad ones.
I'm glad it's in hardware, and not just because FP used to be quite slow and different on every machine. I'm glad it's in hardware because chip designers (unlike most software developers) are anal about getting things right, and implementing FP properly is hard -- harder than using it!
You can do exact real arithmetic. But this is only done by people who prove theorems with computers - or by the Android calculator! https://en.wikipedia.org/wiki/Computable_analysis
Other alternatives (also niche) are exact rational arithmetic, computer algebra, arbitrary precision arithmetic.
Fixed point sometimes gets used instead of floats because some operations on it lose no precision, but most operations still do.
Same for arbitrary-precision calculations like big rationals. That just gives you as much precision as your computer can fit in memory. You will still run out of precision, just later rather than sooner.
One half : )
One thing that I suspect trips people a lot is decimal string/literal <-> (binary) float conversions instead of the floating point math itself. This includes the classic 0.1+0.2 thing, and many of the problems in the article.
I think these days using floating point hex strings/literals more would help a lot. There are also decimal floating point numbers that people largely ignore despite being standard for over 15 years
> Floating point is amazingly useful!
Another thing about floats is that they are for the most part actually very predictable. In particular, all basic operations should produce bit-exact, correctly rounded results. Also, because they follow a language-independent standard, you can generally get the same behavior in different languages and platforms. This makes learning floats properly worthwhile, because the knowledge is so widely applicable.
As long as you are not using a compiler that uses x87's extended-precision floats for intermediate calculations and silently rounds whenever it moves a value to memory (that used to be a common issue), and as long as you are not doing dumb stuff with compiler math flags.
Also, if you have any code anywhere in your program that relies on correct subnormal handling, then you need to be absolutely sure no code is compiled with `-ffast-math`, including in any dynamically loaded code in your entire program, or your math will break: https://simonbyrne.github.io/notes/fastmath/#flushing_subnor...
And of course if you are doing anything complicated with floating point number, there are entire fields of study about creating numerically stable algorithms, and determining the precision of algorithms with floating point numbers.
> There's a reason why it's implemented in hardware in all modern computers
Yah, legacy.
The reason we used it originally is that computers were small and slow. Now that they're big and fast we could do without it, except that there is already so much hardware and software out there that it will never happen.
I think fixed-point math is underrated though.
Also: The JSON example is nasty. Should IDs then always be strings?
Specs, vs. their implementations, vs. backwards compat. JSON just defines a number type, and neither the grammar nor the spec places limits on it (though the spec does call out exactly this problem). So the JSON is to-spec valid. But implementations have limits as to what they'll decode: JS's is that it decodes to number (a double) by default, and thus, loses precision.
(I feel like this issue is pretty well known, but I suppose it probably bites everyone once.)
JS does have the BigInt type, nowadays. Unfortunately, while the JSON.parse API includes a "reviver" parameter, the way it ends up working means that it can't actually take advantage of BigInt.
> Should IDs then always be strings?
That's a decent-ish solution; as it side-steps the interop issues. String, to me, is not unreasonable for an ID, as you're not going to be doing math on it.
Backward error: the error between the given question, and the question whose right answer is the given answer.
Easier to parse like this: a small forward error means that you give an answer close to the right one.
A small backward error means that the answer you give is the right answer for a nearby question.
The way I approach this[1] is to choose the sign of the ± to be the same as -b to compute the first root x1. Then the second root is c/(a * x1). There are a few other checks in the code for things overflowing, but that basically gives you very accurate answers across the range.
This is explained a bit in a good Math Exchange post[2]. Jim Blinn's "How to solve a quadratic equation" also makes good reading.
Wait til you get to cubics and quartics.
[1]: https://docs.rs/kurbo/latest/src/kurbo/common.rs.html#116-16...
[2]: https://math.stackexchange.com/questions/866331/numerically-...
It's worse than that: You can use the same language, compiler, library, machine, and still get different results if your OS is different.
I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.
OS's make a "default" choice. I think Linux defaults to slower, but more accurate, and BSD defaults to faster, but less accurate.
There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.
> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:
> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl
IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64 bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and this would be considered "very inaccurate" by FP standards).
A thousand thanks for not saying "addition is not commutative".
(Addition is commutative in floating point. It merely is not associative).
Can anyone justify this? Do JS developers prefer not having exact integers, or is this something that everyone just kinda deals with?
If you're very careful, a double can be an integer type. (A 53-bit one, I think?) (I don't love this line of thinking. It has a lot of sharp edges. But JS programmers effectively do this all the time, often without thinking too hard about it.)
(And even before BigInt, there's an odd u32-esque "type" in JS; it's not a real type — it doesn't appear in the JS type system, but rather an internal one that certain operations will be converted to internally. That's why (0x100000000 | 0) == 0 ; even though 0x100000000 (and every other number in that expression, and the right answer) is precisely representable as a f64. This doesn't matter for JSON decoding, though, … and most other things.)
Also there are typed arrays and bigints if we can throw those in, too.
What do you mean? Floating-point arithmetic is, by design, exact for small integers. The result of adding 2.0 to 3.0 is exactly 5.0. This is one of the few cases where it is perfectly legitimate to compare floats for equality.
In fact, using 64-bit doubles to represent ints you get way more ints than using plain 32-bit ints. Thus, choosing doubles to represent integers makes perfect sense (unless you worry about wasting a bit of memory and performance).
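A quick Go check of where that exactness ends (2^53 for binary64):

```go
package main

import "fmt"

func main() {
	// Every integer of magnitude up to 2^53 is exactly representable
	// in a float64, so integer arithmetic in that range is exact.
	two53 := float64(1 << 53) // 9007199254740992

	fmt.Println(two53-1 == 9007199254740991) // true: still exact below 2^53
	fmt.Println(two53+1 == two53)            // true: 2^53+1 rounds back down
}
```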
Does JS at least write ".0" at the end when converting a number to a string? Or switch to scientific notation for large numbers?
Edit: x87 has FPREM1 which can calculate a remainder (accurately one hopes), but I can’t find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language’s library and/or compiler? Is this a library/language bug rather than a Floating Point gotcha?
The division of these numbers is 2.9999998957049091386350361962468173875300478102103478639802753918, but the nearest float to that is 3. (Exactly 3.) [2]
The modulo operation can (presumably) determine that 3X > Y, so the modulo is Y - 2X, as normal.
This gives inconsistent results, if you don't know that every float is actually a range, and "3" as a float includes some numbers that are smaller than 3.
[1] https://www.wolframalpha.com/input?i=13.715999603271484375+%... [2] https://www.wolframalpha.com/input?i=2.999999895704909138635..., then https://float.exposed/0x40400000
• "13.716" means 13.7159999999999993037 (https://float.exposed/0x402b6e978d4fdf3b)
• "4.572" means 4.57200000000000006395 (https://float.exposed/0x401249ba5e353f7d)
• "13.716 / 4.572" means the nearest representable value to 13.7159999999999993037 / 4.57200000000000006395 which (https://www.wolframalpha.com/input?i=13.7159999999999993037+...) is 3.0 (https://float.exposed/0x4008000000000000)
• "13.716 % 4.572" means the nearest representable value to 13.7159999999999993037 % 4.57200000000000006395 namely to 4.5719999999999991758 (https://www.wolframalpha.com/input?i=13.7159999999999993037+...), which is 4.57199999999999917577 (https://float.exposed/0x401249ba5e353f7c) printed as 4.571999999999999.
————————
Edit: For a useful analogy (answering the GP), imagine you're working in decimal fixed-point arithmetic with two decimal digits (like dollars and cents), and someone asks you for 10.01/3.34 and 10.01%3.34. Well,
• 10.01 / 3.34 is well over 2.99 (it's over 2.997 in fact) so you'd be justified in answering 3.00 (the nearest representable value).
• 10.01 % 3.34 is 3.33 (which you can represent exactly), so you'd answer 3.33 to that one.
(For an even bigger difference: try 19.99 and 6.67 to get 3.00 as quotient, but 6.65 as remainder.)
It is a problem that appears whenever you compose an inexact function, like the conversion from decimal to binary, with a function that is not continuous, like the remainder a.k.a. modulo function.
In decimal, 13.716 is exactly 3 times 4.572, so any kind of remainder must be zero, but after conversion from decimal to binary that relationship is no longer true, and because the remainder is not a continuous function its value may be wildly different from the correct value.
When you compute with approximate numbers, like the floating-point numbers, as long as you compose only continuous functions, the error in the final result remains bounded and smaller errors in inputs lead to a diminished error in the output.
However, it is enough to insert one discontinuous function in the computation chain for losing any guarantee about the magnitude of the error in the final result.
The conclusion is that whenever computing with approximate numbers (which may also use other representations, not only floating-point) you have to be exceedingly cautious when using any function that is not continuous.
package main

import "fmt"

func main() {
	var i, iterations, meters float64
	for iterations = 100_000_000; i < iterations; i++ {
		meters += 0.01
	}
	// Expected: 1000.000000 km
	fmt.Printf("Expected: %f km\n", 0.01*iterations/1000)
	// Got: 1000.000001 km
	fmt.Printf("Got: %f km\n", meters/1000)
}

Soon enough I had two consecutive releases in hand, which produced different results, and which had identical numerical code. The only code I had changed that ran during the numerical calculations was code that ran between iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.
How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.
The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.
Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those value into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.
Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining) the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.
It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup. I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.
Floating point itself is an implementation on top of that.
With a slide rule it's basically truncated integers all the way: you're very limited in significant figures, and you float the point in your head or some other way and apply it afterward. You're constantly aware of any step where your significant figures drop below 3, because a drop from 3 to 2 is drastic. A drop to 1 can really slap you in the face.
The number of decimal places is also naturally an integer, which is related to scientific notation. Whether a result is positive or negative is a simple flag.
On a proper computer, integer addition, subtraction, and multiplication are exact. It's division that is not: the quotient is truncated when written to memory at the bitness in use at the time, storing only the whole-number portion.
Integer arithmetic can usually be made as exact as you want. It helps to consciously minimize divisions and, if necessary, to offset or scale the operands beforehand so that the full range of quotients never comes close enough to zero for truncation to affect your immediate result, within the number of significant figures your ultimate result needs to be completely reliable.
The number of significant figures available on even an 8-bit computer just blows away the slide rule, but only if you take full advantage of it.
What sometimes happens is that zero-crossing functions, when the quotients are not offset, fluctuate: for large values they naturally take advantage of the full bitness of the hardware (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.
If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might sometimes come out worse than a slide rule's.
Edit: IOW, when the final reportable result is a floating point number where you need an accurate idea of how many figures to the right of the decimal place are truly valid, it may be possible to use all-integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.
I thought the JSON spec was that a number is of any precision and that it’s up to the parser to say, “best I can do is 754.”
That is, I think you can deserialize very long decimal numbers to higher accuracy in some languages.
They learn by experience that they also slowly accumulate errors and end up with a transformation matrix A that's no longer orthogonal and will skew the image.
Or people try to solve large linear systems of floating point numbers with a naive Gaussian elimination approach and end up with noise. Same with naive iterative eigenvector calculations.
I wonder if people sometimes make a one-element integer array this way so they can have an integer to work with.
The PATRIOT missile error (it wasn't a disaster) was more due to the handling of timestamps than just floating point deviation. There were several concurrent failures that allowed the SCUD to hit its target. IIRC the clock drift was significant and was magnified by being converted to floating point and, importantly, truncated into a 24-bit register. Moreover, they weren't "slightly off". The clock drift alone put the missile considerably off target.
While I don't claim that floating point didn't have a hand in this error, it's likely that correct handling of timestamps would not have introduced the problem in the first place. Unlike the other examples given, this one is a better example of knowing your system and problem domain rather than simply forgetting to calculate a delta or being unaware of the limitations of IEEE 754. "Good enough for government work" struck again here.
> 1. it has a funny name
Reasoning accepted!