Careful there! Floating point numbers do not form a proper field, not even a semigroup. Because of the uneven distribution of representable values, the field axioms don't hold (e.g. both associativity and distributivity can be violated), and great care has to be taken to ensure the numerical stability of computations.
You only get physically significant errors when using iterated algorithms where the errors accumulate, or when doing what amounts to equality comparisons, which is almost always a bug in the first place.
Note that 64-bit numbers aren't "twice" as precise.[1] They're four billion times more precise. Going to 128 bits is absurd beyond belief: 128-bit numbers would let you model the entire visible universe at subatomic resolution. You do not need 128-bit numbers for anything made on Earth, by humans, ever. If you think you do, you've made a mistake. It's as simple as that.
[1] Floating point numbers and integers are obviously different, but the concept is the same. A 64-bit double is "just" 536 million times more precise than a 32-bit float, but that is still an awful lot of precision for anything made of matter...
No they're not. That's the entire point. 32-bit IEEE floats get you 6 to 9 significant digits, whereas 64-bit IEEE floats get you 15 to 17 significant digits. Loss of significance and catastrophic cancellation are real problems in numerical analysis.
Physics in particular doesn't often have closed form solutions, so you're forced to use iterative approximations. Same goes for large matrix operations, which is why you actually have to be very careful with the algorithm you choose and the order of operations.
If you're still not convinced, feel free to try it yourself:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <tuple>

// solve ax^2 + bx + c = 0 via the textbook quadratic formula
template<typename T>
auto solve_quadratic(T const& a, T const& b, T const& c) -> std::tuple<T, T> {
    auto t = std::sqrt(b*b - T(4)*a*c);
    return {(-b + t) / (T(2)*a), (-b - t) / (T(2)*a)};
}

int main() {
    std::cout << "Expected x1=0.000000075 x2=-200.000000075\n";
    std::cout << "Single precision:\n";
    {
        float a=1.0f, b=200.0f, c=-0.000015f;
        auto [x1, x2] = solve_quadratic<float>(a, b, c);
        std::cout << "x1=" << x1 << ", x2=" << x2 << "\n";
        std::cout << "ax1^2+bx1+c=" << (a*x1*x1 + b*x1 + c) << "\n";
        std::cout << "ax2^2+bx2+c=" << (a*x2*x2 + b*x2 + c) << "\n";
    }

    std::cout << "\nExpected x1=1.000000028975958 x2=1.000000000000000\n";
    std::cout << "Double precision:\n";
    {
        double a=94906265.625, b=-189812534, c=94906268.375;
        auto [x1, x2] = solve_quadratic<double>(a, b, c);
        std::cout << std::setprecision(16);
        std::cout << "x1=" << x1 << ", x2=" << x2 << "\n";
        std::cout << "ax1^2+bx1+c=" << (a*x1*x1 + b*x1 + c) << "\n";
        std::cout << "ax2^2+bx2+c=" << (a*x2*x2 + b*x2 + c) << "\n";
    }
}
Even 64-bit IEEE floats don't help if you use the standard quadratic formula.
Note that you can improve the 32-bit case above significantly by reformulating the solution to

template<typename T>
auto solve_quadratic(T const& a, T const& b, T const& c) -> std::tuple<T, T> {
    auto sign_b = b < T(0) ? T(-1) : T(1);
    auto t = -b - sign_b*std::sqrt(b*b - T(4)*a*c);
    return {(T(2)*c) / t, t / (T(2)*a)};
}
This simple change will yield the precise result (within fp32 precision) for x1 and x2 with a=1, b=200, c=-0.000015 (i.e. the error of ax²+bx+c will be below what the format's 9 maximum significant digits can resolve). However, this won't help with the second (64-bit) example, which will still be wrong after the 8th digit, i.e. well within the supposed 15 to 17 significant digits of a 64-bit IEEE float.
Also note that in both cases all numbers involved have fewer significant digits than the respective FP format supports. Just to give you a little insight into why available bits ≠ precision in floating point maths.