”This means that for any two floating point numbers, say x and y, any operation involving them will give a floating point result which is within a factor of 1+ϵ of the true result.”
It fails for x-y when x and y are close. Subtraction is the easiest way to get big floating point errors. The article does mention this towards the end but you need to wade through a lot of equations before you get told this very important caveat.
The more tricky way to accumulate errors is to add lots of small numbers to a big number. Just summing a list of floats can be fraught with peril! Kahan’s summation algorithm is enlightening: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
I don't think that's quite right. Each individual operation is accurate within epsilon, the trouble with cancellation is that the error from operations prior to the subtraction is much larger than the result of the subtraction.
Example, if I do 1 + 2^30 - 2^30 in single precision I get zero. But each individual operation is correct. 1 + 2^30 = 2^30, so the error is 1 < 2^30 * epsilon. Then 2^30 - 2^30 = 0, which is exactly correct.
Every year I struggle to teach this simple concept in class :-) :-(
I disagree that the "axiom" as stated is fundamental. My main argument with it is that I don't see an easy way to go from the axiom to usable theorems about FP.
Happy to be wrong on this, but I am missing this at this time.
Radford Neal (2015) has an approach to computing sums exactly (to the correct rounded floating point number) in less than twice the time to do it naively. In some situations it can do work in parallel and saturate memory bandwidth, so takes roughly the same time: http://www.cs.toronto.edu/~radford/xsum.abstract.html
"Kahan’s (1965) method for reducing summation error (but without producing the exact result) was also tested on all systems. On modern 64-bit processors, computing the exact sum with the large superaccumulator method was faster than Kahan’s method for summations of more than about 1000 terms. Kahan’s method was significantly faster than the small superaccumulator method only for summations of less than about 100 terms."
The axiom can be stated more rigorously "If f is a (possibly multivariate) function on the set of floats to the reals, the IEEE-compliant approximation of f must give the closest float for all float values of its arguments."
It should be noted that this is not exactly true for some IEEE functions, due to the table-maker's dilemma: computing the "closest float approximation" to an arbitrary function can be computationally infeasible if the true result is close to being halfway in-between two consecutive floats. (This applies to the default rounding mode, but can be extended by analogy to other rounding modes.)
That becomes (1.0 - 1.0) = 0.0 because 1e-20 was dropped during the addition. The substraction was perfectly accurate given its inputs (1.0 - 1.0 is exactly 0.0).
The issue is that float(1.0 + 1e-20) = 1.0. But that's also very accurate ! (All the 16 decimal digits of 1.0 are correct). But yeah the 20th (decimal) digit got dropped.
You see how each individual computation is maximally accurate. So the axiom is correct. What it boils down to is that you are manipulating (adding/subtracting) numbers of very different magnitudes. So cancellations, which would be negligible if all numbers where of similar magnitude, becomes catastrophic. Hence the name catastrophic cancellations.
[1] Under the assumption that no underflow or overflow occurs (which is the usual framing for this sort of naive error analysis). It's easy enough to correct for underflow by adding an absolute error term as well.
That's why most modern floating point representations allow subnormal numbers: without them, subtracting two near, normal numbers will result in zero even if the numbers aren't equal.
And note: addition might be subtraction, while subtraction might be addition.
Consider x=1.0 and y=-0.99999.
x + y = 0.00001
You did addition, but you got cancellation error. The only solution is to sort the numbers from largest to smallest dynamically during runtime.
The numbers that you can represent with floating point are exponentially distributed. Floating point is well-behaved when you multiply things, and it maintains a good relative error. Addition works best when you add similar numbers. Overflow is almost impossible, but it's easy to get loss of significance.
The numbers that you can represent with fixed point are uniformly distributed. Fixed point is well-behaved when you add things, and it maintains a good absolute error. Loss of significance doesn't happen, but it's easy to get an overflow.
In both systems it's possible to accumulate error, but under normal circumstances the approximation error should average out. Fixed point makes it easier to reason about the error, floating point is a bit harder, but usually works well in practice (with some exceptions -- for example, if you add a huge array of small values you will run into trouble due to the loss of significance).
Obviously for the reciprocal to be accurate, there needs to be the same amount of numbers above 1 as below 1, like there is for floats.
The big problem with fixed-point, making them a bad choice for most scientific computation (which is where the design decision for the current dominating system came from) is their lack of resolution.
You can represent only a comparatively small interval of numbers and a few division|multiplication will quickly lead to large errors in the output (even if you do not hit underflows/overflows).
>a bad choice for most scientific computation
Is weird this. Scientific computation is at most a niche.
Financial and real-world decimal calculations is the norm.
That is why a lot of people could be server better by the use of decimals or fixed point.
In fact, that fundamental axiom of FP arithmetic is NOT in terms of some absolute value epsilonMachine. As I understand it, it says that there exists a relative epsilon value (which depends on your input numbers), and that the error of any FP operation on those numbers will be bounded by that epsilon.
What the article doesn't explain, is that floating point numbers and operations are guaranteed to be accurate to the nearest half-ULP, not to the nearest absolute epsilon. And what exactly is the absolute value of an ULP? It depends!
The difference is so fundamental, I'm not sure you can have a useful mental model of floats if you don't start by addressing this.
In fact the epsilon model is key to being able to do a mathematically formulated error analysis at all.
The simple explanation is that an ULP is defined as the difference between two consecutive floating point numbers, which means it isn't a fixed value.
Another "intuitive" explanation, is that if you increment the binary representation of a float, the difference between the two numbers is 1 ULP.
(I'd explain more, but I have to run to a meeting! More info on Wikipedia: https://en.wikipedia.org/wiki/Unit_in_the_last_place)
> Unlike the number line, for every representable floating point number x, there is a _next_ floating point number, and it is approximately x + ϵmachine x. So for an arbitrary real number x0, it falls between two floating point values x and x+ϵx (leaving off the subscript for conciseness). When we represent x0 in floating point, we will get one of these two values.
So you don't need ULP (and if you did, you could convert to ULP from machine epsilon). But you are right that machine epsilon varies according to where you are in the interval of floating point numbers. Some authors don't mention that, but yes, since floating point values are not evenly distributed the machine epsilon for any particular x will vary. This is good to know, because even if error analysis demonstrates that your approximation function is bounded by machine epsilon, it's still going to exhibit less accuracy as you move away from the origin.
> ... where each \epsilon is very small, on the other of \epsilon_machine.
Not sure if author is reading replies here, but I think "other" should be "order."
Also, since IEEE-754-2008, there are decimal IEEE-754 floating point numbers, which are useful for some types of financial calculations.
(Which, since my part of that project was a Javascript/jQueryMobile mobile webapp, I managed to push all that to the backend team saying "I'm doing calculations using Javascript, I don't even have proper integers to work with here - you'll need to send me all financial figures as strings, and do all the calculations at your end according to the rules. I can accurately display strings, I will not look a judge in the eye and try to claim I can accurately calculate _anything_ monetary in Javascript.")
--
[0] - https://en.wikipedia.org/wiki/Affine_arithmetic
[1] - http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicas...
edit: formatting
What about sub/de normal operands?
OT: while obviously all fixed precision floating point must compromise on the representation of real numbers, Posits give you more precision for the same bits, behave far more sanely (eg. doesn't round to zero or Inf), and error bounds are much easier to understand.
EDIT: capitalize Posits
Independent papers doing some validation on Posit computations are starting to get out and... it is not better, just a different trade-off. Overall it seems to be worth it for 16 bits precision (where floating points are very fragile) but highly debatable for 32 and 64 bits.
Note that many papers claiming good properties for Posits also introduce what they call a Quire which is... an exact dot product algorithm encapsulated in an object (something that is also easily implemented with traditional floating-points). Comparing a naive sum in IEEE floating point with an exact summation in Posit is an apple to orange comparison (the algorithms are different) when you want to evaluate the precision of a numeric type.
how does this help if you need to divide or calculate sales tax? aren't you back in the accumulating error trap?
Secondly, you can implement pencil-and-paper rounding rules using binary floating point! It's done by serializing to decimal text, working with the digits (perhaps as integers) and then converting back to floating point.
> those errors don’t match the ones you get when using a paper ledger.
If you have errors in your paper ledger, you're not much of an accountant.
[1] https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf
The latter implies the former within the magnitude range that allows normalized representation, which is the vast majority of calculations. As an introductory article for a technical audience, it does a good job of describing the abilities and limitations of floating point numbers. Going into more detail about the representation runs the risk of sacrificing clarity for precision.
And personally I prefer an "exact" guarantee that the result is as close-as-possible rather than merely "within some relative epsilon". Compare his (incorrect) "in mathematical terms" description with the actual statement that would be that, if re(x) gives the real number represented by a floating point number x, then:
fl(re(x) + re(y)) = x + y
where addition on the left is of reals and on the right is the floating point addition operator. This is simpler and more accurate than his inequality! And it works for sub-normals etc.
At some point I was listening to a talk where the guy called floating point numbers _"The incredibly misleading numbers that never really work the way that you want them to."_
That's probably the best description of floating point numbers I've ever heard.
1 + 2^30 - 2^30 = 0 is like adding a drop of water to the ocean then removing an ocean's worth of water. You're not going to be able to meter that accurately enough to leave the one drop behind.
Round-trip conversions from decimal to FP and back result in a slightly different value. Why do you expect special treatment for a number system in any particular base? Nature's physical quantities aren't in any base. And why do you care about the 17th significant figure? You can't measure that with common scientific equipment.
Cumulative errors. They're like tolerance stacking. Engineers define dimensions from a common datum instead of in a chain from each other because their physical measurements accumulate error like floating point does.
They don't allow extreme values. You're never going to need them because you can't find a 10^400 m or 10^-400 m long object in nature either.
Dollars and cents don't work right. Money isn't a continuous quantity.
That is simply not true. Floating-point arithmetic has sufficient precision to represent multiples of $0.01 with vanishingly good accuracy even for astronomical sums of money that your ledger will never see. If it didn't, it wouldn't be able to "numerically solve differential equations, or linear systems, or land on the moon." Moreover, programs like Microsoft Excel wouldn't be able to use floating-point.
You have to make sure that all calculations that produce a currency amount are actually rounded to the closest $0.01 multiple (using cent-based dollar accounting as an example).
The main rookie mistake is to leave an amount represented as, say, $17.4537359..., and then just show it as $17.45 on paper or in the user interface.
Rounding to the closest $0.01 will thwart cumulative error. So for instance, suppose we start with 0, and then repeatedly add 0.01. The naive way would be:
for (x = 0.0; x < WHATEVER; x += 0.01)
{
}
this will suffer from cumulative errors. Eventually the errors will add up to more than half a penny: 0.005 and we are toast: N iterations of our loop will have resulted in a value of x that is now closer to N+1 cents or N-1 cents than it is to N cents.However, if we do this:
for (x = 0.0; x < WHATEVER; x += 0.01)
{
x = round_to_nearest_penny(x);
}
then we are fine: our code will nicely stride from one good penny approximation to another. We only start to run aground when x becomes astronomical, representing a number of pennies that is an integer in the 14 or 15 digit range or something like that. After that we start getting into a range where consecutive pennies cannot be represented any more at all, let alone good approximations.If we are just dealing with pocket change, like a few billion dollars here and there, we are good.
So that is to say, we can't even take a column of figures that are individually the best approximations of number of pennies and simply add them together; we have to regularly clamp to the a penny as we carry on the summation. It doesn't necessarily have to be after each addition, but we can't let the error stray too far from the abstract penny we are trying to represent. Any value we record into a ledger should be crisply clamped to the floating point system's best approximation of that dollar and cent amount.
Using (binary!) floating-point, we can even implement specific pencil-and-paper rounding methods for fractional results. This is because we can render the floating point value into text, out to several more than the required number of digits of precision, and then apply a textual technique to analyze the digits.
Using double-precision IEEE floating point, I can easily write a routine that will round 0.125 to 0.12 but 0.135 to 0.14. I.e. banker's rounding to the closest even penny.
Now these numbers are actually:
Abstract Actual
0.125 0.125
0.12 0.1199999999999999956
0.135 0.1350000000000000089
0.14 0.1400000000000000133
So e.g. my code will be actually taking 1350000000000000089 to 1400000000000000133. First, I will get the value as text rounded to four digits after the decimal: "0.1350". Then I process the four digits as a string, or perhaps make an integer out of them. I can work out that 1350 % 100 is 50, indicating that the number is "on the fence". Then get the parity of the hundreds digit: h = (x / 100) % 10 == 3. Aha, odd, so we have to go to 4: ((x / 1000) * 1000) + h * 400.