Careful there! Floating point numbers do not form a proper field, not even a semigroup. Because of the uneven distribution of representable values, the field axioms don't hold (e.g. both associativity and distributivity can be violated), and great care has to be taken to ensure the numerical stability of computations.
You only get physically significant errors when using iterated algorithms where the errors accumulate, or when doing what amounts to equality comparisons, which is almost always a bug in the first place.
Note that 64-bit numbers aren't "twice" as precise.[1] They're four billion times more precise. Going to 128 bits is absurd beyond belief: 128-bit numbers would let you model the entire visible universe at subatomic resolution. You do not need 128-bit numbers for anything made on Earth, by humans, ever. If you think you do, you've made a mistake. It's as simple as that.
[1] Floating point numbers and integers are obviously different, but the concept is the same. A 64-bit double is "just" 536 million times more precise than a 32-bit float, but that is still an awful lot of precision for anything made of matter...
No they're not. That's the entire point. 32-bit IEEE floats get you 6 to 9 significant digits, whereas 64-bit IEEE floats get you 15 to 17 significant digits. Loss of significance and catastrophic cancellation are real problems in numerical analysis.
Physics in particular doesn't often have closed form solutions, so you're forced to use iterative approximations. Same goes for large matrix operations, which is why you actually have to be very careful with the algorithm you choose and the order of operations.
If you're still not convinced, feel free to try it yourself:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <tuple>

// solve ax^2 + bx + c = 0 via the textbook quadratic formula
template<typename T>
auto solve_quadratic(T const& a, T const& b, T const& c) -> std::tuple<T, T> {
    auto t = std::sqrt(b*b - T(4)*a*c);
    return {(-b + t) / (T(2)*a), (-b - t) / (T(2)*a)};
}

int main() {
    std::cout << "Expected x1=0.000000075 x2=-200.000000075\n";
    std::cout << "Single precision:\n";
    {
        float a=1.0f, b=200.0f, c=-0.000015f;
        auto [x1, x2] = solve_quadratic<float>(a, b, c);
        std::cout << "x1=" << x1 << ", x2=" << x2 << "\n";
        std::cout << "ax1^2+bx1+c=" << (a*x1*x1 + b*x1 + c) << "\n";
        std::cout << "ax2^2+bx2+c=" << (a*x2*x2 + b*x2 + c) << "\n";
    }

    std::cout << "\nExpected x1=1.000000028975958 x2=1.000000000000000\n";
    std::cout << "Double precision:\n";
    {
        double a=94906265.625, b=-189812534, c=94906268.375;
        auto [x1, x2] = solve_quadratic<double>(a, b, c);
        std::cout << std::setprecision(16);
        std::cout << "x1=" << x1 << ", x2=" << x2 << "\n";
        std::cout << "ax1^2+bx1+c=" << (a*x1*x1 + b*x1 + c) << "\n";
        std::cout << "ax2^2+bx2+c=" << (a*x2*x2 + b*x2 + c) << "\n";
    }
}
Even 64-bit IEEE floats don't help if you use the standard quadratic formula.
Note that you can improve the 32-bit case above significantly by reformulating the solution to

template<typename T>
auto solve_quadratic(T const& a, T const& b, T const& c) -> std::tuple<T, T> {
    auto sign_b = b < T(0) ? T(-1) : T(1);
    auto t = -b - sign_b*std::sqrt(b*b - T(4)*a*c);
    return {(T(2)*c) / t, t / (T(2)*a)};
}
This simple change will yield the precise result (within fp32 precision) for x1 and x2 with a=1, b=200, c=-0.000015 (i.e. the error of ax²+bx+c will be below what the format's 9 maximum significant digits can resolve). However, this won't help with the second (64-bit) example, which will still be wrong after the 8th digit, i.e. well within the supposed 15 to 17 significant digits of a 64-bit IEEE float.
Also note that in both cases all numbers involved have fewer significant digits than the respective FP format supports. Just to give you a little insight into why available bits ≠ precision in floating point maths.