The simplest and highest-impact tests are all the edge cases - if an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells you a lot about what the outputs should be.
It can be difficult to determine sensible numerical limit for error in the algorithms. The simplest example is a dot product - summing floats is not associative, so doing it in parallel is not bitwise the same as serial. For dot in particular it's relatively easy to come up with an error bound, but for anything more complicated it takes a particular expertise that is not always available. This has been a work in progress, and sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.
Solvers are tested using analytical solutions and by inverting them, e.g. if we're solving Ax = y, for x, then Ax should come out "close" to the original y (see error tolerance discussion above).
One of the most surprising things to me is that the suite has identified many bugs in vendor math libraries (OpenBLAS, MKL, cuSparse, rocSparse, etc.) - a major component of what we do is wrap up these vendor libraries in a common interface so our users don't have to do any work when they switch supercomputers, so in practice we test them all pretty thoroughly as well. Maybe I can let OpenBLAS off the hook due to the wide variety of systems they support, but I expected the other vendors would do a better job since they're better-resourced.
For this reason we find regression tests to be useful as well.
Probably worth mentioning that in general the tolerance should be relative error, not absolute, for floating point math. Absolute error tolerance should only be used when there’s a maximum limit on the magnitude of inputs, or the problem has been analyzed and understood.
I know that doesn’t stop people from just throwing in 1e-6 all over the place, just like the article did. (Hey I do it too!) But if the problem hasn’t been analyzed then an absolute error tolerance is just a bug waiting to happen. It might seem to work at first, but then catch and confuse someone as soon as the tests use bigger numbers. Or maybe worse, fail to catch a bug when they start using smaller numbers.
Of course, as you say, absolute error is also crap in general: it's overly restrictive for large inputs and overly permissive for small ones.
I'm not a numerics person, but I do end up needing to decide on something sensible for error bounds on computations sometimes. How does one do this properly? Interval arithmetic or something?
Presently, I am working on cuSPARSE and I'm very keen to improve its testing and correctness. I would appreciate anything more you can share about bugs you've seen in cuSPARSE. Feel free to email me, eedwards at nvidia.com
0-length vectors, too.
Then, do the same with Nx1, 1xN, and 1x1 matrices.
Many python numerical libraries change how they internally represent certain data types and/or change internal algorithms from version to version. This means that running the exact same code on two different versions may give you slightly different results.
In your side project this may not be a big deal but if you are working at, say, a hedge fund or a company that does long and complicated data processing pipelines then version upgrades with no unit testing can be "very bad".
In practice, these days compilers always use the former because they're widely available and are predictable. But if you tell a C compiler to not assume that SSE is available, it will use the x87 instructions.
If you use the x87 instructions, whether an intermediate float value is stored in memory or in a register changes the outcome, because it's stored in memory truncated to 64 bits, but in the (x87) register it's the full 80 bits.
If you rely on exact, reproducible floating-point results, you have more to worry about than just software versions.
There is a huge amount of data analysis which uses numbers representing money as inputs where it's fine to use floating point, and financial entities routinely do so. For a transaction, interest payments, and that sort of thing, then yes, one needs a specific sort of fixed-point representation, and operations which round in whatever the legally-correct fashion is for the use case.
Reductively, if you've turned money into numbers, and will turn the resulting numbers back into money, don't use floating point. If you're running a big calculation with lots of inputs to decide what stock to buy next, that's probably going to have floating-point monetary values in it, and there's nothing wrong with that. Hedge funds do a lot of the latter.
4 billion is something computers can handle, and it will give you pretty much all the edge cases of floating point representation.
if you do need 1 mm accuracy, including in the Pacific Ocean around 180 E/W then neither float32 nor scaled 32-bit ints are enough.
So for e.g. if you’re writing solvers for ODE systems, then you tend to test it on things like simple harmonic oscillators.
If you’re very lucky, some algorithms have bounded errors. For e.g. the fast multipole method for evaluating long range forces does, so you can check this for very simple systems.
Check expected behavior of...
Nan, +0, -0, +1, -1, +Inf, -Inf, +Eps, -Eps, +1+Eps, +1-Eps, -1+Eps, -1-Eps
...on each coordinate when supplying a sensible constant for the other coordinates.
Check the outer product of the above list taken across all coordinates.
Check that you get the same answer when you call the function twice in each of these outer product cases.
This approach works for any floating point function. You will be surprised how many mistakes these tests will catch, inclusive of later maintenance on the code under test.
>>> from math import nextafter
>>> nextafter(0, 1)
5e-324
>>> nextafter(0, -1)
-5e-324
>>> nextafter(-1, -10)
-1.0000000000000002
My numeric code became much more robust once I started doing this, including using it to find test cases, for example, by computing all distinct rationals (up to a given denominator) in my range of interest, then using nextafter() to generate all values within 5 or so steps from that rational.For example, one algorithm I had failed with values like 0.21000000000000002 and 0.39999999999999997 , which are one step away from 0.21 and 0.4, respectively.
Sure, there's your usual suspects like subtractive cancellation, but iterative methods in particular can vary quite a lot in terms of stability.
Sometimes it's due to step sizes, sometimes problems only occur within certain value ranges, other times it's peculiarities with specific hardware or even compilers. There's really no good general way of finding/avoiding these pitfalls that would apply to most numerical problems.