The Fundamental Axiom of Floating Point Arithmetic (opens in new tab)

(johnbcoughlin.com)

172 pointsjohnbcoughlin5y ago60 comments

60 comments

53 comments · 13 top-level

fanf25y ago· 15 in thread

This “fundamental axiom” is not true:

”This means that for any two floating point numbers, say x and y, any operation involving them will give a floating point result which is within a factor of 1+ϵ of the true result.”

It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors. The article does mention this towards the end but you need to wade through a lot of equations before you get told this very important caveat.

The more tricky way to accumulate errors is to add lots of small numbers to a big number. Just summing a list of floats can be fraught with peril! Kahan’s summation algorithm is enlightening: https://en.wikipedia.org/wiki/Kahan_summation_algorithm

CharlesHorsey5y ago

"It fails for x-y when x and y are close"

I don't think that's quite right. Each individual operation is accurate within epsilon, the trouble with cancellation is that the error from operations prior to the subtraction is much larger than the result of the subtraction.

Example, if I do 1 + 2^30 - 2^30 in single precision I get zero. But each individual operation is correct. 1 + 2^30 = 2^30, so the error is 1 < 2^30 * epsilon. Then 2^30 - 2^30 = 0, which is exactly correct.

fanf25y ago

Yes. The big errors appear when you remember x is actually (1+ε)x, as explained towards the end of the article. If you assume the inputs are exact then you get an accurate result, but if you are doing error analysis that does not help you.

the_svd_doctor5y ago

This is the right answer. Substraction is accurate, as much as it can be given the its inputs. It’s everything (combined) leading up to it that isn’t.

Every year I struggle to teach this simple concept in class :-) :-(

hpcjoe5y ago

This is due to the non-associative nature of FP. (a+b) + c != a + (b+c).

I disagree that the "axiom" as stated is fundamental. My main argument with it is that I don't see an easy way to go from the axiom to usable theorems about FP.

Happy to be wrong on this, but I am missing this at this time.

1 more reply

augustt5y ago

Yep the term for this is catastrophic cancellation, which doesn’t invalidate the axiom.

imurray5y ago

> Kahan’s summation algorithm

Radford Neal (2015) has an approach to computing sums exactly (to the correct rounded floating point number) in less than twice the time to do it naively. In some situations it can do work in parallel and saturate memory bandwidth, so takes roughly the same time: http://www.cs.toronto.edu/~radford/xsum.abstract.html

"Kahan’s (1965) method for reducing summation error (but without producing the exact result) was also tested on all systems. On modern 64-bit processors, computing the exact sum with the large superaccumulator method was faster than Kahan’s method for summations of more than about 1000 terms. Kahan’s method was significantly faster than the small superaccumulator method only for summations of less than about 100 terms."

vortico5y ago

I think you're misinterpretating the fundamental axiom because it's still true for `x-y`. What you're thinking is that the fundamental axiom applies to the composition of operations `closest_float(x) + closest_float(y)`, but the axiom only applies to each individual operation.

The axiom can be stated more rigorously "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."

zozbot2345y ago

> "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."

It should be noted that this is not exactly true for some IEEE functions, due to the table-maker's dilemma: computing the "closest float approximation" to an arbitrary function can be computationally infeasible if the true result is close to being halfway in-between two consecutive floats. (This applies to the default rounding mode, but can be extended by analogy to other rounding modes.)

tgb5y ago

They're correct that the "In mathematical terms" version of it is wrong (even for addition, if one is negative). It's missing fl()'s around the other a's and b's. Your statement is greatly superior than the poor man's version presented in the article.

lifthrasiir5y ago

Kahan summation is unfortunately slower than naive summation, which can be problematic. Pairwise summation [1], which roundoff error is proportional to log n, can be used as a drop-in replacement of naive summation as it performs more or less equally.

[1] https://en.wikipedia.org/wiki/Pairwise_summation

the_svd_doctor5y ago

That's not true. The axiom is correct. Say you're using double (i.e., you have 16 decimal digits of accuracy, roughly) and you are doing ((1.0 + 1e-20) - 1.0).

That becomes (1.0 - 1.0) = 0.0 because 1e-20 was dropped during the addition. The substraction was perfectly accurate given its inputs (1.0 - 1.0 is exactly 0.0).

The issue is that float(1.0 + 1e-20) = 1.0. But that's also very accurate ! (All the 16 decimal digits of 1.0 are correct). But yeah the 20th (decimal) digit got dropped.

You see how each individual computation is maximally accurate. So the axiom is correct. What it boils down to is that you are manipulating (adding/subtracting) numbers of very different magnitudes. So cancellations, which would be negligible if all numbers where of similar magnitude, becomes catastrophic. Hence the name catastrophic cancellations.

stephencanon5y ago

The key is the framing; as stated, x and y are _exact_ floating-point values, which makes the statement true[1]. In fact, it's embarrassingly trivial, since it is just a weakening of the definition of rounding used by IEEE 754. In particular, if cancellation occurs in x - y, then the result is exact (Sterbenz' lemma).

[1] Under the assumption that no underflow or overflow occurs (which is the usual framing for this sort of naive error analysis). It's easy enough to correct for underflow by adding an absolute error term as well.

dreamcompiler5y ago

In other words, while we all know that addition and subtraction over real numbers is associative, addition and subtraction over floating point numbers is not.

commandlinefan5y ago

> Subtraction is the easiest way to get big floating point errors.

That's why most modern floating point representations allow subnormal numbers: without them, subtracting two near, normal numbers will result in zero even if the numbers aren't equal.

dragontamer5y ago

> It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors.

And note: addition might be subtraction, while subtraction might be addition.

Consider x=1.0 and y=-0.99999.

x + y = 0.00001

You did addition, but you got cancellation error. The only solution is to sort the numbers from largest to smallest dynamically during runtime.

rustybolt5y ago· 6 in thread

I was recently thinking about writing something about fixed point vs floating point.

The numbers that you can represent with floating point are exponentially distributed. Floating point is well-behaved when you multiply things, and it maintains a good relative error. Addition works best when you add similar numbers. Overflow is almost impossible, but it's easy to get loss of significance.

The numbers that you can represent with fixed point are uniformly distributed. Fixed point is well-behaved when you add things, and it maintains a good absolute error. Loss of significance doesn't happen, but it's easy to get an overflow.

In both systems it's possible to accumulate error, but under normal circumstances the approximation error should average out. Fixed point makes it easier to reason about the error, floating point is a bit harder, but usually works well in practice (with some exceptions -- for example, if you add a huge array of small values you will run into trouble due to the loss of significance).

edflsafoiewq5y ago

The big problem with fixed point is reciprocals are really bad. If you have a fixed point format that goes from 0 to 1000, then 0.0125% of the numbers are below 1/8, whereas 99.2% of the numbers are above 8.

Obviously for the reciprocal to be accurate, there needs to be the same amount of numbers above 1 as below 1, like there is for floats.

rustybolt5y ago

Good point, indeed. Fixed point works best for multiplication with numbers that are approximately 1 in magnitude.

bjackman5y ago

Fixed point also has the advantage that you can do it quickly without needing a floating point unit in your CPU. That means you can do it on microcontrollers, and you can do it in the kernel without needing to save extra user context on kernel entry (this is only relevant on architectures that have separate floating point registers).

munchbunny5y ago

I'm not sure that fixed point actually makes it easier to reason about the error, it's just that fixed point isn't the default these days, so you have to choose to use fixed point. That means, when you do use it, it likely suits the problem you need to solve.

nestorD5y ago

I have meet a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems (probably because its use is not widespread).

The big problem with fixed-point, making them a bad choice for most scientific computation (which is where the design decision for the current dominating system came from) is their lack of resolution.

You can represent only a comparatively small interval of numbers and a few division|multiplication will quickly lead to large errors in the output (even if you do not hit underflows/overflows).

mamcx5y ago

> I have meet a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems

>a bad choice for most scientific computation

Is weird this. Scientific computation is at most a niche.

Financial and real-world decimal calculations is the norm.

That is why a lot of people could be server better by the use of decimals or fixed point.

2 more replies

tux35y ago· 6 in thread

I'm far from a floating point expert, but the fact that the article does everything in terms of epsilons, without mentioning ULPs a single time confuses me.

In fact, that fundamental axiom of FP arithmetic is NOT in terms of some absolute value epsilonMachine. As I understand it, it says that there exists a relative epsilon value (which depends on your input numbers), and that the error of any FP operation on those numbers will be bounded by that epsilon.

What the article doesn't explain, is that floating point numbers and operations are guaranteed to be accurate to the nearest half-ULP, not to the nearest absolute epsilon. And what exactly is the absolute value of an ULP? It depends!

The difference is so fundamental, I'm not sure you can have a useful mental model of floats if you don't start by addressing this.

akdor11545y ago

The epsilon model is perfectly adequate, and does not assume epsilon is constant on a given machine (the article glosses over this in the first paragraph where epsilon is introduced, but it is spelled out later).

In fact the epsilon model is key to being able to do a mathematically formulated error analysis at all.

kd5bjo5y ago

Would you mind expanding the `ULP` acronym you’re using: if it’s so fundamental to understanding this, I’d like to know what it is.

tux35y ago

My apology, ULP stands for "Unit in the Last Place".

The simple explanation is that an ULP is defined as the difference between two consecutive floating point numbers, which means it isn't a fixed value.

Another "intuitive" explanation, is that if you increment the binary representation of a float, the difference between the two numbers is 1 ULP.

(I'd explain more, but I have to run to a meeting! More info on Wikipedia: https://en.wikipedia.org/wiki/Unit_in_the_last_place)

1 more reply

DougBTX5y ago

Instead of calling it "ULPs" they refer to a "next floating point number" which I'm pretty sure is the same idea:

> Unlike the number line, for every representable floating point number x, there is a _next_ floating point number, and it is approximately x + ϵmachine x. So for an arbitrary real number x0, it falls between two floating point values x and x+ϵx (leaving off the subscript for conciseness). When we represent x0 in floating point, we will get one of these two values.

fractionalhare5y ago

Machine epsilon is fine for doing error analysis. For example, I used machine epsilon instead of ULP for an article I recently wrote which walks through implementation and (numerical) analysis of various algorithms for e^x: https://www.pseudorandom.com/implementing-exp

So you don't need ULP (and if you did, you could convert to ULP from machine epsilon). But you are right that machine epsilon varies according to where you are in the interval of floating point numbers. Some authors don't mention that, but yes, since floating point values are not evenly distributed the machine epsilon for any particular x will vary. This is good to know, because even if error analysis demonstrates that your approximation function is bounded by machine epsilon, it's still going to exhibit less accuracy as you move away from the origin.

carlmr5y ago

But multiplying by (1+e) is relative. I guess it's much more convenient mathematically than dealing with the absolute error of the calculation.

dvt5y ago· 4 in thread

Great article, and a wonderful explanation as to why one should avoid (at all costs) doing financial calculations using IEEE-754 floating point math.

> ... where each \epsilon is very small, on the other of \epsilon_machine.

Not sure if author is reading replies here, but I think "other" should be "order."

KMag5y ago

That depends on the type of the financial calculations you're doing. Certainly accounting and billing shouldn't be done in binary floating point formats. Financial modelling and forecasting calculations aren't exact calculations, and the intermediate library functions for binary floating point may suffer less rounding error than practical alternatives. Also, computational efficiency comes into play once you're at the scale of spending millions of dollars per year on hardware and electricity for financial modelling.

Also, since IEEE-754-2008, there are decimal IEEE-754 floating point numbers, which are useful for some types of financial calculations.

cesaref5y ago

... except that every financial trading system i've ever seen is full of floats (and i've seen a lot).

bigiain5y ago

Many years back I worked on a project involved in heavily government regulated gambling (horse race betting). The rules were every calculation including all intermediate steps had to be done to at least 1/10,000th of a cent precision, and rounding was only allowed once at the very final step.

(Which, since my part of that project was a Javascript/jQueryMobile mobile webapp, I managed to push all that to the backend team saying "I'm doing calculations using Javascript, I don't even have proper integers to work with here - you'll need to send me all financial figures as strings, and do all the calculations at your end according to the rules. I can accurately display strings, I will not look a judge in the eye and try to claim I can accurately calculate _anything_ monetary in Javascript.")

jhj5y ago

IEEE 754 also provides for decimal radix floating point, just that there is no real hardware support for it beyond some IBM and Fujitsu chips, and what you can put together on FPGAs.

ema5y ago· 3 in thread

I wonder if it makes sense to treat floating points not as single points on the number line but as a range. This way when you subtract two similar numbers the result is a range that is wide in relation to the distance from 0, explicitly encoding the bigger relative rounding error.

atq21195y ago

That exists and is called interval arithmetic.

ritter2a5y ago

There are even more improvements (in some regard) to this, e.g. Affine Arithmetic[0, 1]. Here, error terms from different operations are accumulated symbolically so that in the case where they would cancel out (as could be useful for numerical algorithms), tighter bounds can be obtained. There are several research groups working on improving such analyses.

[0] - https://en.wikipedia.org/wiki/Affine_arithmetic

[1] - http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicas...

edit: formatting

nestorD5y ago

And the problem with it is that it can lead to [-inf;+inf] intervals on sane computation such as Newton's method (refinements have been introduced but nothing that fully solves the problem).

qppo5y ago· 2 in thread

> All floating point arithmetic operations are exact up to a relative error of [machine epsilon].

What about sub/de normal operands?

FullyFunctional5y ago

Exactly. This axiom doesn't hold for denormals which probably don't matter for finance, but really matter for calculations that regress numbers for many iterations.

OT: while obviously all fixed precision floating point must compromise on the representation of real numbers, Posits give you more precision for the same bits, behave far more sanely (eg. doesn't round to zero or Inf), and error bounds are much easier to understand.

EDIT: capitalize Posits

nestorD5y ago

> Posits give you more precision for the same bits

Independent papers doing some validation on Posit computations are starting to get out and... it is not better, just a different trade-off. Overall it seems to be worth it for 16 bits precision (where floating points are very fragile) but highly debatable for 32 and 64 bits.

Note that many papers claiming good properties for Posits also introduce what they call a Quire which is... an exact dot product algorithm encapsulated in an object (something that is also easily implemented with traditional floating-points). Comparing a naive sum in IEEE floating point with an exact summation in Posit is an apple to orange comparison (the algorithms are different) when you want to evaluate the precision of a numeric type.

twelve405y ago· 2 in thread

> At Square, we used a long amount_cents, and we got along with our lives

how does this help if you need to divide or calculate sales tax? aren't you back in the accumulating error trap?

kd5bjo5y ago

Accounting practice generally allows / requires rounding to some quantum value at each operation, like $0.001– The issue with using floating point for money isn’t that it has errors, it‘s that those errors don’t match the ones you get when using a paper ledger.

kazinator5y ago

Firstly, if you treat floating-point without the proper care, that will still be the case if you use an integer for all currency amounts (as a number of cents) but reach for floating-point for percentage calculations and such.

Secondly, you can implement pencil-and-paper rounding rules using binary floating point! It's done by serializing to decimal text, working with the digits (perhaps as integers) and then converting back to floating point.

> those errors don’t match the ones you get when using a paper ledger.

If you have errors in your paper ledger, you're not much of an accountant.

1 more reply

tgb5y ago· 2 in thread

I think this is a poor take on floating point numbers. Read the classic article it references instead [1]. The IEEE guarantee is not that the result is accurate to a relative epsilon but instead is the closest representable number to the true result. This is a clear guarantee but requires the reader to actually know how floating point numbers are stored (which this article fails to say!). It is also rather subtle and it needs to be explored in more depth to be of any use.

[1] https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf

kd5bjo5y ago

> The IEEE guarantee is not that the result is accurate to a relative epsilon but instead is the closest representable number to the true result.

The latter implies the former within the magnitude range that allows normalized representation, which is the vast majority of calculations. As an introductory article for a technical audience, it does a good job of describing the abilities and limitations of floating point numbers. Going into more detail about the representation runs the risk of sacrificing clarity for precision.

tgb5y ago

I disagree and think that the article manages to fall into an "uncanny valley" of not-actually-understanding. Why else would it need the "A Cautionary Tale" section at the end? That's the first time he mentions that the previous sections only hold for positive numbers and shows that the statements made before are bad.

And personally I prefer an "exact" guarantee that the result is as close-as-possible rather than merely "within some relative epsilon". Compare his (incorrect) "in mathematical terms" description with the actual statement that would be that, if re(x) gives the real number represented by a floating point number x, then:

fl(re(x) + re(y)) = x + y

where addition on the left is of reals and on the right is the floating point addition operator. This is simpler and more accurate than his inequality! And it works for sub-normals etc.

2 more replies

kanobo5y ago

The equations are well integrated in the article and matches the font of the article very well. I thought it was some special font+math library but just turned out to be Georgia and MathJax with tasteful design choices.

falcrist5y ago

As someone who spends most of my time at work writing firmware for embedded systems, I learned VERY quickly never to use floating point numbers when I could use fixed point numbers. Better to use an integer number of thousandths of a thing than try to use floating point numbers and have error. This mindset is clearly common, because it seems like the vast majority of microcontrollers have no floating point hardware anyway.

At some point I was listening to a talk where the guy called floating point numbers _"The incredibly misleading numbers that never really work the way that you want them to."_

That's probably the best description of floating point numbers I've ever heard.

lopmotr5y ago

My personal fundamental axiom of floating point numbers is that they're like a physical measurement of a continuous quantity. Examples:

1 + 2^30 - 2^30 = 0 is like adding a drop of water to the ocean then removing an ocean's worth of water. You're not going to be able to meter that accurately enough to leave the one drop behind.

Round-trip conversions from decimal to FP and back result in a slightly different value. Why do you expect special treatment for a number system in any particular base? Nature's physical quantities aren't in any base. And why do you care about the 17th significant figure? You can't measure that with common scientific equipment.

Cumulative errors. They're like tolerance stacking. Engineers define dimensions from a common datum instead of in a chain from each other because their physical measurements accumulate error like floating point does.

They don't allow extreme values. You're never going to need them because you can't find a 10^400 m or 10^-400 m long object in nature either.

Dollars and cents don't work right. Money isn't a continuous quantity.

nikolay5y ago

I started to teach my 12-year son programming and when we started to talk about floats, his reaction that computers can do simple math with fractions properly was pretty astounding. I should've started him on Lisp instead.

kazinator5y ago

> The moral of the story is, never use a floating point number to represent money.

That is simply not true. Floating-point arithmetic has sufficient precision to represent multiples of $0.01 with vanishingly good accuracy even for astronomical sums of money that your ledger will never see. If it didn't, it wouldn't be able to "numerically solve differential equations, or linear systems, or land on the moon." Moreover, programs like Microsoft Excel wouldn't be able to use floating-point.

You have to make sure that all calculations that produce a currency amount are actually rounded to the closest $0.01 multiple (using cent-based dollar accounting as an example).

The main rookie mistake is to leave an amount represented as, say, $17.4537359..., and then just show it as $17.45 on paper or in the user interface.

Rounding to the closest $0.01 will thwart cumulative error. So for instance, suppose we start with 0, and then repeatedly add 0.01. The naive way would be:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
   }

this will suffer from cumulative errors. Eventually the errors will add up to more than half a penny: 0.005 and we are toast: N iterations of our loop will have resulted in a value of x that is now closer to N+1 cents or N-1 cents than it is to N cents.

However, if we do this:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
      x = round_to_nearest_penny(x);
   }

then we are fine: our code will nicely stride from one good penny approximation to another. We only start to run aground when x becomes astronomical, representing a number of pennies that is an integer in the 14 or 15 digit range or something like that. After that we start getting into a range where consecutive pennies cannot be represented any more at all, let alone good approximations.

If we are just dealing with pocket change, like a few billion dollars here and there, we are good.

So that is to say, we can't even take a column of figures that are individually the best approximations of number of pennies and simply add them together; we have to regularly clamp to the a penny as we carry on the summation. It doesn't necessarily have to be after each addition, but we can't let the error stray too far from the abstract penny we are trying to represent. Any value we record into a ledger should be crisply clamped to the floating point system's best approximation of that dollar and cent amount.

Using (binary!) floating-point, we can even implement specific pencil-and-paper rounding methods for fractional results. This is because we can render the floating point value into text, out to several more than the required number of digits of precision, and then apply a textual technique to analyze the digits.

Using double-precision IEEE floating point, I can easily write a routine that will round 0.125 to 0.12 but 0.135 to 0.14. I.e. banker's rounding to the closest even penny.

Now these numbers are actually:

   Abstract                      Actual
   0.125                         0.125
   0.12                          0.1199999999999999956
   0.135                         0.1350000000000000089
   0.14                          0.1400000000000000133

So e.g. my code will be actually taking 1350000000000000089 to 1400000000000000133. First, I will get the value as text rounded to four digits after the decimal: "0.1350". Then I process the four digits as a string, or perhaps make an integer out of them. I can work out that 1350 % 100 is 50, indicating that the number is "on the fence". Then get the parity of the hundreds digit: h = (x / 100) % 10 == 3. Aha, odd, so we have to go to 4: ((x / 1000) * 1000) + h * 400.

j / k navigate · click thread line to collapse

60 comments

53 comments · 13 top-level

fanf25y ago· 15 in thread

This “fundamental axiom” is not true:

”This means that for any two floating point numbers, say x and y, any operation involving them will give a floating point result which is within a factor of 1+ϵ of the true result.”

CharlesHorsey5y ago

"It fails for x-y when x and y are close"

fanf25y ago

the_svd_doctor5y ago

This is the right answer. Substraction is accurate, as much as it can be given the its inputs. It’s everything (combined) leading up to it that isn’t.

Every year I struggle to teach this simple concept in class :-) :-(

hpcjoe5y ago

This is due to the non-associative nature of FP. (a+b) + c != a + (b+c).

I disagree that the "axiom" as stated is fundamental. My main argument with it is that I don't see an easy way to go from the axiom to usable theorems about FP.

Happy to be wrong on this, but I am missing this at this time.

1 more reply

augustt5y ago

Yep the term for this is catastrophic cancellation, which doesn’t invalidate the axiom.

imurray5y ago

> Kahan’s summation algorithm

vortico5y ago

zozbot2345y ago

> "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."

tgb5y ago

lifthrasiir5y ago

[1] https://en.wikipedia.org/wiki/Pairwise_summation

the_svd_doctor5y ago

That's not true. The axiom is correct. Say you're using double (i.e., you have 16 decimal digits of accuracy, roughly) and you are doing ((1.0 + 1e-20) - 1.0).

That becomes (1.0 - 1.0) = 0.0 because 1e-20 was dropped during the addition. The substraction was perfectly accurate given its inputs (1.0 - 1.0 is exactly 0.0).

The issue is that float(1.0 + 1e-20) = 1.0. But that's also very accurate ! (All the 16 decimal digits of 1.0 are correct). But yeah the 20th (decimal) digit got dropped.

stephencanon5y ago

dreamcompiler5y ago

In other words, while we all know that addition and subtraction over real numbers is associative, addition and subtraction over floating point numbers is not.

commandlinefan5y ago

> Subtraction is the easiest way to get big floating point errors.

That's why most modern floating point representations allow subnormal numbers: without them, subtracting two near, normal numbers will result in zero even if the numbers aren't equal.

dragontamer5y ago

> It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors.

And note: addition might be subtraction, while subtraction might be addition.

Consider x=1.0 and y=-0.99999.

x + y = 0.00001

You did addition, but you got cancellation error. The only solution is to sort the numbers from largest to smallest dynamically during runtime.

rustybolt5y ago· 6 in thread

I was recently thinking about writing something about fixed point vs floating point.

edflsafoiewq5y ago

Obviously for the reciprocal to be accurate, there needs to be the same amount of numbers above 1 as below 1, like there is for floats.

rustybolt5y ago

Good point, indeed. Fixed point works best for multiplication with numbers that are approximately 1 in magnitude.

bjackman5y ago

munchbunny5y ago

nestorD5y ago

I have meet a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems (probably because its use is not widespread).

You can represent only a comparatively small interval of numbers and a few division|multiplication will quickly lead to large errors in the output (even if you do not hit underflows/overflows).

mamcx5y ago

> I have meet a surprisingly large number of people who think that fixed point arithmetic is the solution to all of their problems

>a bad choice for most scientific computation

Is weird this. Scientific computation is at most a niche.

Financial and real-world decimal calculations is the norm.

That is why a lot of people could be server better by the use of decimals or fixed point.

2 more replies

tux35y ago· 6 in thread

I'm far from a floating point expert, but the fact that the article does everything in terms of epsilons, without mentioning ULPs a single time confuses me.

The difference is so fundamental, I'm not sure you can have a useful mental model of floats if you don't start by addressing this.

akdor11545y ago

In fact the epsilon model is key to being able to do a mathematically formulated error analysis at all.

kd5bjo5y ago

Would you mind expanding the `ULP` acronym you’re using: if it’s so fundamental to understanding this, I’d like to know what it is.

tux35y ago

My apology, ULP stands for "Unit in the Last Place".

The simple explanation is that an ULP is defined as the difference between two consecutive floating point numbers, which means it isn't a fixed value.

Another "intuitive" explanation, is that if you increment the binary representation of a float, the difference between the two numbers is 1 ULP.

(I'd explain more, but I have to run to a meeting! More info on Wikipedia: https://en.wikipedia.org/wiki/Unit_in_the_last_place)

1 more reply

DougBTX5y ago

Instead of calling it "ULPs" they refer to a "next floating point number" which I'm pretty sure is the same idea:

fractionalhare5y ago

carlmr5y ago

But multiplying by (1+e) is relative. I guess it's much more convenient mathematically than dealing with the absolute error of the calculation.

dvt5y ago· 4 in thread

Great article, and a wonderful explanation as to why one should avoid (at all costs) doing financial calculations using IEEE-754 floating point math.

> ... where each \epsilon is very small, on the other of \epsilon_machine.

Not sure if author is reading replies here, but I think "other" should be "order."

KMag5y ago

Also, since IEEE-754-2008, there are decimal IEEE-754 floating point numbers, which are useful for some types of financial calculations.

cesaref5y ago

... except that every financial trading system i've ever seen is full of floats (and i've seen a lot).

bigiain5y ago

jhj5y ago

IEEE 754 also provides for decimal radix floating point, just that there is no real hardware support for it beyond some IBM and Fujitsu chips, and what you can put together on FPGAs.

ema5y ago· 3 in thread

atq21195y ago

That exists and is called interval arithmetic.

ritter2a5y ago

[0] - https://en.wikipedia.org/wiki/Affine_arithmetic

[1] - http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicas...

edit: formatting

nestorD5y ago

And the problem with it is that it can lead to [-inf;+inf] intervals on sane computation such as Newton's method (refinements have been introduced but nothing that fully solves the problem).

qppo5y ago· 2 in thread

> All floating point arithmetic operations are exact up to a relative error of [machine epsilon].

What about sub/de normal operands?

FullyFunctional5y ago

Exactly. This axiom doesn't hold for denormals which probably don't matter for finance, but really matter for calculations that regress numbers for many iterations.

EDIT: capitalize Posits

nestorD5y ago

> Posits give you more precision for the same bits

twelve405y ago· 2 in thread

> At Square, we used a long amount_cents, and we got along with our lives

how does this help if you need to divide or calculate sales tax? aren't you back in the accumulating error trap?

kd5bjo5y ago

kazinator5y ago

> those errors don’t match the ones you get when using a paper ledger.

If you have errors in your paper ledger, you're not much of an accountant.

1 more reply

tgb5y ago· 2 in thread

[1] https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf

kd5bjo5y ago

> The IEEE guarantee is not that the result is accurate to a relative epsilon but instead is the closest representable number to the true result.

tgb5y ago

fl(re(x) + re(y)) = x + y

where addition on the left is of reals and on the right is the floating point addition operator. This is simpler and more accurate than his inequality! And it works for sub-normals etc.

2 more replies

kanobo5y ago

falcrist5y ago

At some point I was listening to a talk where the guy called floating point numbers _"The incredibly misleading numbers that never really work the way that you want them to."_

That's probably the best description of floating point numbers I've ever heard.

lopmotr5y ago

My personal fundamental axiom of floating point numbers is that they're like a physical measurement of a continuous quantity. Examples:

1 + 2^30 - 2^30 = 0 is like adding a drop of water to the ocean then removing an ocean's worth of water. You're not going to be able to meter that accurately enough to leave the one drop behind.

They don't allow extreme values. You're never going to need them because you can't find a 10^400 m or 10^-400 m long object in nature either.

Dollars and cents don't work right. Money isn't a continuous quantity.

nikolay5y ago

kazinator5y ago

> The moral of the story is, never use a floating point number to represent money.

You have to make sure that all calculations that produce a currency amount are actually rounded to the closest $0.01 multiple (using cent-based dollar accounting as an example).

The main rookie mistake is to leave an amount represented as, say, $17.4537359..., and then just show it as $17.45 on paper or in the user interface.

Rounding to the closest $0.01 will thwart cumulative error. So for instance, suppose we start with 0, and then repeatedly add 0.01. The naive way would be:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
   }

However, if we do this:

   for (x = 0.0; x < WHATEVER; x += 0.01)
   {
      x = round_to_nearest_penny(x);
   }

If we are just dealing with pocket change, like a few billion dollars here and there, we are good.

Using double-precision IEEE floating point, I can easily write a routine that will round 0.125 to 0.12 but 0.135 to 0.14. I.e. banker's rounding to the closest even penny.

Now these numbers are actually:

   Abstract                      Actual
   0.125                         0.125
   0.12                          0.1199999999999999956
   0.135                         0.1350000000000000089
   0.14                          0.1400000000000000133

j / k navigate · click thread line to collapse