Where do our flaky tests come from? (opens in new tab)

(testing.googleblog.com)

85 pointsmonkeyshelli9y ago38 comments

38 comments

25 comments · 6 top-level

kazinator9y ago· 7 in thread

I worked on a system that performed a flaky test at run time and made it a condition that it had to succeed in order to run. Oh boy.

Basically, it generated a sample of numbers from the RNG (not PRNG: real random source). Then it ran statistical tests of randomness to validate, "yes, we have good random-numbers for security and whatnot".

Of course, these tests can fail; there is a nonzero probability that some pattern emerges which looks non-random to the tests; that's a consequence of them being really random.

How lame.

JoachimSchipper9y ago

For a true random number generator, that's simply a matter of tuning the probability of random failure to be acceptable; e.g. the old FIPS 140-2 randomness tests were calibrated to spuriously fail with probability of (IIRC) about one in a million.

There are problems with testing a random source, but calibration can solve that one. (The bigger issue is that the "true" random going into your (PRNG) postprocessor should be whatever comes out of your best entropy-generating device, which typically isn't uniformly random; and that the data coming out of your postprocessor will look random to a black-box test even if the input contains basically no entropy.)

beobab9y ago

And of course, there's the classic "dilbert" random number generator: http://dilbert.com/strip/2001-10-25

setr9y ago

Wouldn't a statistical test presumably have some minimum number of trials to hit some confidence value

I mean like the half the point of stats is to verify a result isn't due to chance, at least to an acceptable degree

Wouldn't gaurantee the test won't misdiagnose, but enough trials should push that chance into oblivion

psyc9y ago

This is a good example of something that is virtually guaranteed, but I will still never, ever make it a condition for exiting unceremoniously, or worse, the condition for terminating a loop.

flukus9y ago

Can that be fixed with a larger sample size at startup? Of course the validation should be done long before startup.

rcthompson9y ago

If the RNG is truly random, then it must have a non-zero probability of failing any test for randomness that uses a finite sample of the RNG's output. For instance, the RNG could theoretically return a sequence of all the same number. You can't "fix" this.

1 more reply

runeks9y ago

I don't think testing for randomness makes sense. Randomness is something we assume. There's always a non-zero probability that randomness will appear non-random.

In fact, this whole comment is a random string of characters. I just got really lucky, and there's essentially no way to disprove that. Throwing a dice a trillion times and getting all 6's is not an invalid outcome, just very improbable.

Randomness is about unpredictability. How can you assert that something is or is not predictable? It's always possible you just haven't observed it for long enough. An initial, randomly-appearing sequence may turn out to start repeating itself after some point in time; and an initial, self-repeating sequence can always be statistically cancelled out by later data.

Randomness is not something we can have; it's something we don't have (the ability to predict).

3 more replies

devman9y ago· 3 in thread

Strange article.

Does one really need to do data analysis to see that it is simply natural for large tests to be more flaky? All else being equal large tests perform more operations. If on average the probability of any single operation to fail is constant (fair assumption given the large number of tests) than the test that performs more operations has higher probability to fail (and no it is not linear as shown in the article, it is asymptotic approaching probability of one). The "analysis" of tool types are strange too. Webdriver tests are more flaky because of inherent nature of UI tests operating on less stable interface (compared to e.g. typical unit test which usually deals with a pretty stable contract).

I used to spend quite some time on test stability issues in my previous team and wrote a couple posts on the topic and how to successfully go from unstable to stable test suite:

https://blogs.vmware.com/management/2015/09/automated-tests-...

[Edit: formatting, spelling]

hugs9y ago

It's not strange at all. I appreciate the article's analysis. As the founder of the Selenium project, I frequently hear people blame Selenium for their problems. Sure, Selenium isn't perfect, but it seems like it is so much easier for people to blame their tools instead of questioning other things. But I understand why people do it. For a typical software developer, without an obvious cause for flakiness, the apparent randomness of a flaky test tends to make it easier (and plausibly justifiable) to "shoot the messenger".

vosper9y ago

OT, but thank you for Selenium!

1 more reply

stygiansonic9y ago

The reasoning behind why large and Webdriver tests ought to be more flakey is sound. However, I don't see why having empirical data and doing the analysis is "strange".

Doing studies and collecting data should be how any assumption, seemingly reasonable or not, is proven right or wrong.

kyoob9y ago· 3 in thread

It is heartening to know that the great and mighty Google spends a real amount of time and effort dealing with flaky Selenium tests, just like the rest of us!

paulddraper9y ago

And it prompts the question: Is this really a challenge so difficult that even the Gods struggle?

flukus9y ago

I'm surprised that google can't manage it, they have proper test engineers. But automated testing is a discipline that very few learn to do well. Even most unit test you see out in the wild are absolutely woeful. I've got a long list of horror stories about testing if you'd like?

I used to believe that devs should write the tests, but now I'm not so sure, all to often they just produce more technical debt and fail at improving quality. The industry considers knowing the api of a test framework as an experience automated tester.

billsmithaustin9y ago

Yes, it really is. Automated UI tests are notoriously fragile.

1 more reply

skybrian9y ago· 3 in thread

It would be interesting to know how this correlates with how long the test normally takes to run, versus its timeout. (Any timeout is a race condition in disguise - and yet we need timeouts.)

discodave9y ago

I agree, picking binary size as a variable seems like not a great choice to me.

More interesting things would be:

  Number of network calls.
  Network IO
  Disk IO
  Runtime
  Number of syscalls

I say this because all of those things seem much more likely to be the source of failures to me. We all know about network unreliability, CAP theorem and the Byzantine generals problem, so why doesn't it apply to testing too?

kazinator9y ago

Umm, no; only timeouts which are related to ensuring some execution order are race conditions.

E.g. B must not be done until A finishes. A takes too long; we time out on it, and do B anyway. (Alternatively: A provides no reliable notification of being done, so we delay for N seconds and assume A is done.)

If having done B is incorrect if A still happens after the timeout, then we have a problem. It's a race between A and B, where A has a big "head start", that's all.

Famous example (maybe forgotten now): Microsoft's ten second COM DLL unload delay. The last reference count on a COM DLL is dropped by a function that is in the DLL itself. So the refcount is zero, but the thread which dropped the refcount to zero must still execute instructions in that DLL until it returns out of that code. Oops! Anyone who calls the DLL's DLLCanUnloadNow function will see a true result: we can unload it now! But actually we cannot. Solution: oh, ten seconds should be enough for that last thread to vacate. Long page faults under thrashing? Who cares; the system is probably dying in that situation anyway.

(That should also be considered a data point in the GC versus refcounting debates. A garbage collector can tell that a thread's instruction pointer is still referencing a chunk of dynamic code.)

skybrian9y ago

I suppose it depends on how you define "race condition", but what I mean is that there's literally a race between the executing code and the timeout handler. Either one can win. The output is non-deterministic; it depends on scheduling.

Or to put it another way, adding a timeout makes any function's output non-deterministic, no matter how "pure" it is. It's only deterministic if you wait as long as necessary for it to finish.

ams61109y ago· 3 in thread

Clearly we should be doing more testing of our tests.

yorwba9y ago

You can regression test your regression tests by running them against previous buggy versions of your program. If the tests do not detect the bug they were supposed to prevent, you have to fix them.

joshschreuder9y ago

Isn't this a common bug fixing method? Write a test that fails due to the bug's behaviour, then update the code to make the test pass (ie. fix the bug)

1 more reply

goatlover9y ago

Perhaps there should be a testing test suite. Call it react-test-tester. Maybe implement ML in Rust to do the test testing.

ISL9y ago

These plots would be much more clear if:

a) They were log/log, especially the first one.

b) There were units on the horizontal axis.

j / k navigate · click thread line to collapse

38 comments

25 comments · 6 top-level

kazinator9y ago· 7 in thread

I worked on a system that performed a flaky test at run time and made it a condition that it had to succeed in order to run. Oh boy.

Of course, these tests can fail; there is a nonzero probability that some pattern emerges which looks non-random to the tests; that's a consequence of them being really random.

How lame.

JoachimSchipper9y ago

beobab9y ago

And of course, there's the classic "dilbert" random number generator: http://dilbert.com/strip/2001-10-25

setr9y ago

Wouldn't a statistical test presumably have some minimum number of trials to hit some confidence value

I mean like the half the point of stats is to verify a result isn't due to chance, at least to an acceptable degree

Wouldn't gaurantee the test won't misdiagnose, but enough trials should push that chance into oblivion

psyc9y ago

This is a good example of something that is virtually guaranteed, but I will still never, ever make it a condition for exiting unceremoniously, or worse, the condition for terminating a loop.

flukus9y ago

Can that be fixed with a larger sample size at startup? Of course the validation should be done long before startup.

rcthompson9y ago

1 more reply

runeks9y ago

I don't think testing for randomness makes sense. Randomness is something we assume. There's always a non-zero probability that randomness will appear non-random.

Randomness is not something we can have; it's something we don't have (the ability to predict).

3 more replies

devman9y ago· 3 in thread

Strange article.

I used to spend quite some time on test stability issues in my previous team and wrote a couple posts on the topic and how to successfully go from unstable to stable test suite:

https://blogs.vmware.com/management/2015/09/automated-tests-...

[Edit: formatting, spelling]

hugs9y ago

vosper9y ago

OT, but thank you for Selenium!

1 more reply

stygiansonic9y ago

The reasoning behind why large and Webdriver tests ought to be more flakey is sound. However, I don't see why having empirical data and doing the analysis is "strange".

Doing studies and collecting data should be how any assumption, seemingly reasonable or not, is proven right or wrong.

kyoob9y ago· 3 in thread

It is heartening to know that the great and mighty Google spends a real amount of time and effort dealing with flaky Selenium tests, just like the rest of us!

paulddraper9y ago

And it prompts the question: Is this really a challenge so difficult that even the Gods struggle?

flukus9y ago

billsmithaustin9y ago

Yes, it really is. Automated UI tests are notoriously fragile.

1 more reply

skybrian9y ago· 3 in thread

It would be interesting to know how this correlates with how long the test normally takes to run, versus its timeout. (Any timeout is a race condition in disguise - and yet we need timeouts.)

discodave9y ago

I agree, picking binary size as a variable seems like not a great choice to me.

More interesting things would be:

  Number of network calls.
  Network IO
  Disk IO
  Runtime
  Number of syscalls

kazinator9y ago

Umm, no; only timeouts which are related to ensuring some execution order are race conditions.

If having done B is incorrect if A still happens after the timeout, then we have a problem. It's a race between A and B, where A has a big "head start", that's all.

(That should also be considered a data point in the GC versus refcounting debates. A garbage collector can tell that a thread's instruction pointer is still referencing a chunk of dynamic code.)

skybrian9y ago

Or to put it another way, adding a timeout makes any function's output non-deterministic, no matter how "pure" it is. It's only deterministic if you wait as long as necessary for it to finish.

ams61109y ago· 3 in thread

Clearly we should be doing more testing of our tests.

yorwba9y ago

You can regression test your regression tests by running them against previous buggy versions of your program. If the tests do not detect the bug they were supposed to prevent, you have to fix them.

joshschreuder9y ago

Isn't this a common bug fixing method? Write a test that fails due to the bug's behaviour, then update the code to make the test pass (ie. fix the bug)

1 more reply

goatlover9y ago

Perhaps there should be a testing test suite. Call it react-test-tester. Maybe implement ML in Rust to do the test testing.

ISL9y ago

These plots would be much more clear if:

a) They were log/log, especially the first one.

b) There were units on the horizontal axis.

j / k navigate · click thread line to collapse