Unit Testing Numerical Routines (opens in new tab)

(buchanan.one)

75 pointsboscillator1y ago42 comments

42 comments

30 comments · 9 top-level

phdelightful1y ago· 6 in thread

I have worked on a performance-portable math library. We implement BLAS, sparse matrix operations, a variety of solvers, some ODE stuff, and various utilities for a variety of serial and parallel execution modes on x86, ARM, and the three major GPU vendors.

The simplest and highest-impact tests are all the edge cases - if an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells you a lot about what the outputs should be.

It can be difficult to determine sensible numerical limit for error in the algorithms. The simplest example is a dot product - summing floats is not associative, so doing it in parallel is not bitwise the same as serial. For dot in particular it's relatively easy to come up with an error bound, but for anything more complicated it takes a particular expertise that is not always available. This has been a work in progress, and sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.

Solvers are tested using analytical solutions and by inverting them, e.g. if we're solving Ax = y, for x, then Ax should come out "close" to the original y (see error tolerance discussion above).

One of the most surprising things to me is that the suite has identified many bugs in vendor math libraries (OpenBLAS, MKL, cuSparse, rocSparse, etc.) - a major component of what we do is wrap up these vendor libraries in a common interface so our users don't have to do any work when they switch supercomputers, so in practice we test them all pretty thoroughly as well. Maybe I can let OpenBLAS off the hook due to the wide variety of systems they support, but I expected the other vendors would do a better job since they're better-resourced.

For this reason we find regression tests to be useful as well.

dahart1y ago

> sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.

Probably worth mentioning that in general the tolerance should be relative error, not absolute, for floating point math. Absolute error tolerance should only be used when there’s a maximum limit on the magnitude of inputs, or the problem has been analyzed and understood.

I know that doesn’t stop people from just throwing in 1e-6 all over the place, just like the article did. (Hey I do it too!) But if the problem hasn’t been analyzed then an absolute error tolerance is just a bug waiting to happen. It might seem to work at first, but then catch and confuse someone as soon as the tests use bigger numbers. Or maybe worse, fail to catch a bug when they start using smaller numbers.

tomsmeding1y ago

But then relative error is also not a panacea. If I compute 1 + 1e9, then producing 1e9 - 1 instead would fall within a relative error bound of 1e-6 easily. More generally, relative error works only if your computation scales "multiplicatively" from zero; if there's any additive component, it's suspect.

Of course, as you say, absolute error is also crap in general: it's overly restrictive for large inputs and overly permissive for small ones.

I'm not a numerics person, but I do end up needing to decide on something sensible for error bounds on computations sometimes. How does one do this properly? Interval arithmetic or something?

2 more replies

essexedwards1y ago

I've also been surprised many times by issues in numerical libraries. In addition to matrices with simple entries, I've found plenty of bugs just testing small matrices, with dimensions in {0,1,2,3,4}. Many libraries/routines fall over when the matrix is small, especially when one dimension is 0 or 1.

Presently, I am working on cuSPARSE and I'm very keen to improve its testing and correctness. I would appreciate anything more you can share about bugs you've seen in cuSPARSE. Feel free to email me, eedwards at nvidia.com

AlotOfReading1y ago

This is one of the reasons I argue that it's almost always better to prioritize speed and stability than accuracy specifically. No one actually knows what their thresholds are (including library authors), but the sky isn't falling despite that. Instabilities and nondeterminism will blow up a test suite pretty quickly though.

2 more replies

RhysU1y ago

Nx0, 0xN, and 0x0 matrices are great edge cases.

0-length vectors, too.

Then, do the same with Nx1, 1xN, and 1x1 matrices.

boscillatorOP1y ago

100% agree that picking numeric tolerances can be tricky. It is especially tricky when writing a generic math library like you are. If your doing something more applied it can help to take limits from your domain. For the example in the blog, if you're using GPS to determine you're position on earth, you probably know how precise physics allows that answer to be, and you only need to test to that tolerance (or an order of magnitude more strict, to give some wiggle room.)

alexpotato1y ago· 5 in thread

Mentioning this b/c it was something that surprised me when I first heard it:

Many python numerical libraries change how they internally represent certain data types and/or change internal algorithms from version to version. This means that running the exact same code on two different versions may give you slightly different results.

In your side project this may not be a big deal but if you are working at, say, a hedge fund or a company that does long and complicated data processing pipelines then version upgrades with no unit testing can be "very bad".

tomsmeding1y ago

This happens not only in math libraries but even in compilers. An x64 processor has two ways (at least) to perform simple 64-bit floating-point operations: with SSE registers (using only one of the slots in the vecror), or with the older x87 instructions. The former work with 64-bit IEEE intermediates (`double`), the latter works with 80-bit extended precision intermediates (`long double`).

In practice, these days compilers always use the former because they're widely available and are predictable. But if you tell a C compiler to not assume that SSE is available, it will use the x87 instructions.

If you use the x87 instructions, whether an intermediate float value is stored in memory or in a register changes the outcome, because it's stored in memory truncated to 64 bits, but in the (x87) register it's the full 80 bits.

If you rely on exact, reproducible floating-point results, you have more to worry about than just software versions.

owlbite1y ago

Unless the underlying code has been designed specifically for it (at the cost of some lost performance), it is a somewhat unreasonable expectation for high performance numerical code to always return the same result across version changes. Even running the same code on different machines, or the same machine with differently aligned data is likely to return different (but equally mathematically valid) results.

atoav1y ago

If you are working in a hedge fund you probably (hopefully?) know it isn't a good idea to represent monetary values using floats. And your numerics library should be checked for correctness on each commit, just in case.

samatman1y ago

This is a useful bit of conventional wisdom, but it helps to know when it is and isn't applicable.

There is a huge amount of data analysis which uses numbers representing money as inputs where it's fine to use floating point, and financial entities routinely do so. For a transaction, interest payments, and that sort of thing, then yes, one needs a specific sort of fixed-point representation, and operations which round in whatever the legally-correct fashion is for the use case.

Reductively, if you've turned money into numbers, and will turn the resulting numbers back into money, don't use floating point. If you're running a big calculation with lots of inputs to decide what stock to buy next, that's probably going to have floating-point monetary values in it, and there's nothing wrong with that. Hedge funds do a lot of the latter.

wesselbindt1y ago

Of course, but regardless, version bumps do sometimes affect the outcomes of pipelines. Unit tests are a good thing to have.

RcouF1uZ4gsC1y ago· 4 in thread

Another trick is to use 32 bit float and just test every single number.

4 billion is something computers can handle, and it will give you pretty much all the edge cases of floating point representation.

boscillatorOP1y ago

But test against what? This only makes sense for the property based testing, but then random sampling should give the same result up to a statistical margin of error (unless there's some specific pathological case). I suppose this is good for extremely critical routines, kind of like the modified condition/decision coverage safe critical flight software has to undergo.

spenczar51y ago

If you have an inverse function, use that to test all round tripa?

owlbite1y ago

That quickly becomes unfeasible for things that are not single input (and still requires being able to identify a correct result).

eesmith1y ago

ecef_to_lla requires two values, so even if one is willing to take the reduced geospatial resolution of float32, that's still 64-bits of parameter space.

if you do need 1 mm accuracy, including in the Pacific Ocean around 180 E/W then neither float32 nor scaled 32-bit ints are enough.

henshao1y ago· 2 in thread

For coordinate transforms specifically, I also like to run through the whole chain of available, implemented transforms and back again - asserting I have the same coordinate at the end. One method might be wrong, but they "can't" all be wrong.

boscillatorOP1y ago

Yah, I wrote a `f(f^-1(x)) == x` test but ended up cutting for brevity. Also, running through every possible transform seems like a pain to debug if the error is somewhere in the middle. Perhaps it would be better to test every pair using the `GENERATE` clause in catch. That way you could narrow it down to two different algorithms.

sevensor1y ago

This works great with property based tests especially. You can identify the parts of the domain where things blow up.

physicsguy1y ago· 1 in thread

All good stuff, I’d add though that for many numerical routines you end up testing on simple well known systems that have well defined analytic answers.

So for e.g. if you’re writing solvers for ODE systems, then you tend to test it on things like simple harmonic oscillators.

If you’re very lucky, some algorithms have bounded errors. For e.g. the fast multipole method for evaluating long range forces does, so you can check this for very simple systems.

LeanderK1y ago

yeah that's how I do it. You start simple (known solutions for trivial points), then easy cases with analytic solutions and then just some random stuff where you test that the solutions is reached without errors and makes sense (correct sign etc.), here called the property based tests.

amelius1y ago· 1 in thread

If you have a numerical routine that converts from one representation to another, you can test it by calling the routine implementing the inverse conversion.

pclmulqdq1y ago

Assuming you don't have an identical bug in the inverse routine.

RhysU1y ago· 1 in thread

Start simpler.

Check expected behavior of...

Nan, +0, -0, +1, -1, +Inf, -Inf, +Eps, -Eps, +1+Eps, +1-Eps, -1+Eps, -1-Eps

...on each coordinate when supplying a sensible constant for the other coordinates.

Check the outer product of the above list taken across all coordinates.

Check that you get the same answer when you call the function twice in each of these outer product cases.

This approach works for any floating point function. You will be surprised how many mistakes these tests will catch, inclusive of later maintenance on the code under test.

eesmith1y ago

To be more specific for those wanting to follow your advice, use nextafter to compute +Eps, +1+Eps, etc. rather than assume it's a fixed value like DBL_EPSILON.

  >>> from math import nextafter
  >>> nextafter(0, 1)
  5e-324
  >>> nextafter(0, -1)
  -5e-324
  >>> nextafter(-1, -10)
  -1.0000000000000002

My numeric code became much more robust once I started doing this, including using it to find test cases, for example, by computing all distinct rationals (up to a given denominator) in my range of interest, then using nextafter() to generate all values within 5 or so steps from that rational.

For example, one algorithm I had failed with values like 0.21000000000000002 and 0.39999999999999997 , which are one step away from 0.21 and 0.4, respectively.

Arech1y ago· 1 in thread

Honestly, I was expecting to see suggestions for testing for numerical stability. This might be a super annoying issue if the data is just right to hit certain peculiarities of floating point numbers representation.

qayxc1y ago

Unfortunately numerical stability is a very complex topic and depends entirely on the use-case.

Sure, there's your usual suspects like subtractive cancellation, but iterative methods in particular can vary quite a lot in terms of stability.

Sometimes it's due to step sizes, sometimes problems only occur within certain value ranges, other times it's peculiarities with specific hardware or even compilers. There's really no good general way of finding/avoiding these pitfalls that would apply to most numerical problems.

quinnirill1y ago

Makes me wonder if there should be an open source test suite for numerical routines, i.e. just a big table (in JSON or something) of inputs, transformations and expected outputs, so new projects could get easily started with a fairly robust (if not a bit generic) test suite they could filter to suite their needs as they grow the feature set.

j / k navigate · click thread line to collapse

42 comments

30 comments · 9 top-level

phdelightful1y ago· 6 in thread

The simplest and highest-impact tests are all the edge cases - if an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells you a lot about what the outputs should be.

Solvers are tested using analytical solutions and by inverting them, e.g. if we're solving Ax = y, for x, then Ax should come out "close" to the original y (see error tolerance discussion above).

For this reason we find regression tests to be useful as well.

dahart1y ago

> sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.

tomsmeding1y ago

Of course, as you say, absolute error is also crap in general: it's overly restrictive for large inputs and overly permissive for small ones.

I'm not a numerics person, but I do end up needing to decide on something sensible for error bounds on computations sometimes. How does one do this properly? Interval arithmetic or something?

2 more replies

essexedwards1y ago

AlotOfReading1y ago

2 more replies

RhysU1y ago

Nx0, 0xN, and 0x0 matrices are great edge cases.

0-length vectors, too.

Then, do the same with Nx1, 1xN, and 1x1 matrices.

boscillatorOP1y ago

alexpotato1y ago· 5 in thread

Mentioning this b/c it was something that surprised me when I first heard it:

tomsmeding1y ago

If you rely on exact, reproducible floating-point results, you have more to worry about than just software versions.

owlbite1y ago

atoav1y ago

samatman1y ago

This is a useful bit of conventional wisdom, but it helps to know when it is and isn't applicable.

wesselbindt1y ago

Of course, but regardless, version bumps do sometimes affect the outcomes of pipelines. Unit tests are a good thing to have.

RcouF1uZ4gsC1y ago· 4 in thread

Another trick is to use 32 bit float and just test every single number.

4 billion is something computers can handle, and it will give you pretty much all the edge cases of floating point representation.

boscillatorOP1y ago

spenczar51y ago

If you have an inverse function, use that to test all round tripa?

owlbite1y ago

That quickly becomes unfeasible for things that are not single input (and still requires being able to identify a correct result).

eesmith1y ago

ecef_to_lla requires two values, so even if one is willing to take the reduced geospatial resolution of float32, that's still 64-bits of parameter space.

if you do need 1 mm accuracy, including in the Pacific Ocean around 180 E/W then neither float32 nor scaled 32-bit ints are enough.

henshao1y ago· 2 in thread

boscillatorOP1y ago

sevensor1y ago

This works great with property based tests especially. You can identify the parts of the domain where things blow up.

physicsguy1y ago· 1 in thread

All good stuff, I’d add though that for many numerical routines you end up testing on simple well known systems that have well defined analytic answers.

So for e.g. if you’re writing solvers for ODE systems, then you tend to test it on things like simple harmonic oscillators.

If you’re very lucky, some algorithms have bounded errors. For e.g. the fast multipole method for evaluating long range forces does, so you can check this for very simple systems.

LeanderK1y ago

amelius1y ago· 1 in thread

If you have a numerical routine that converts from one representation to another, you can test it by calling the routine implementing the inverse conversion.

pclmulqdq1y ago

Assuming you don't have an identical bug in the inverse routine.

RhysU1y ago· 1 in thread

Start simpler.

Check expected behavior of...

Nan, +0, -0, +1, -1, +Inf, -Inf, +Eps, -Eps, +1+Eps, +1-Eps, -1+Eps, -1-Eps

...on each coordinate when supplying a sensible constant for the other coordinates.

Check the outer product of the above list taken across all coordinates.

Check that you get the same answer when you call the function twice in each of these outer product cases.

This approach works for any floating point function. You will be surprised how many mistakes these tests will catch, inclusive of later maintenance on the code under test.

eesmith1y ago

To be more specific for those wanting to follow your advice, use nextafter to compute +Eps, +1+Eps, etc. rather than assume it's a fixed value like DBL_EPSILON.

  >>> from math import nextafter
  >>> nextafter(0, 1)
  5e-324
  >>> nextafter(0, -1)
  -5e-324
  >>> nextafter(-1, -10)
  -1.0000000000000002

For example, one algorithm I had failed with values like 0.21000000000000002 and 0.39999999999999997 , which are one step away from 0.21 and 0.4, respectively.

Arech1y ago· 1 in thread

qayxc1y ago

Unfortunately numerical stability is a very complex topic and depends entirely on the use-case.

Sure, there's your usual suspects like subtractive cancellation, but iterative methods in particular can vary quite a lot in terms of stability.

quinnirill1y ago

j / k navigate · click thread line to collapse