Basically, it generated a sample of numbers from the RNG (not PRNG: real random source). Then it ran statistical tests of randomness to validate, "yes, we have good random-numbers for security and whatnot".
Of course, these tests can fail; there is a nonzero probability that some pattern emerges which looks non-random to the tests; that's a consequence of them being really random.
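To make that concrete, here is a minimal sketch of one such test, the frequency (monobit) test in the style of NIST SP 800-22. The function name and the 0.01 cutoff are my own choices for illustration, not anything from the tool described above. A truly random source still lands below the cutoff about 1% of the time, which is exactly the "tests can fail" point.

```python
import math

def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for 'ones and zeros are balanced'.

    bits is a sequence of 0/1 values. A small p-value means the sample
    looks biased; a genuinely random source still produces small p-values
    occasionally, by chance.
    """
    n = len(bits)
    # Map 1 -> +1 and 0 -> -1, then sum; a balanced sample sums near 0.
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * n))

ALPHA = 0.01  # assumed significance level; a fair source fails ~1% of runs

balanced = [0, 1] * 50   # perfectly balanced sample
all_ones = [1] * 100     # obviously biased sample
print(monobit_p_value(balanced) >= ALPHA)  # passes the test
print(monobit_p_value(all_ones) >= ALPHA)  # fails the test
```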
How lame.
There are problems with testing a random source, but calibration can solve that one. (The bigger issue is that the "true" random going into your (PRNG) postprocessor should be whatever comes out of your best entropy-generating device, which typically isn't uniformly random; and that the data coming out of your postprocessor will look random to a black-box test even if the input contains basically no entropy.)
I mean, half the point of stats is to verify that a result isn't due to chance, at least to an acceptable degree.
That wouldn't guarantee the test won't misdiagnose, but enough trials should push that chance into oblivion.
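A rough sketch of the arithmetic behind "enough trials", assuming each trial runs on a fresh, independent sample and a good source fails any single trial with probability alpha (both are assumptions, and the function name is made up for illustration):

```python
def spurious_rejection_probability(alpha: float, trials: int) -> float:
    """Chance that a genuinely random source fails `trials` independent
    tests in a row, when each test fails by chance with probability alpha."""
    return alpha ** trials

# With a 1% per-test false-alarm rate, the chance of a good source
# failing every retry shrinks geometrically.
for k in (1, 2, 3, 5):
    print(k, spurious_rejection_probability(0.01, k))
```

A genuinely broken source would keep failing across retries, so retrying separates bad luck from a bad generator, though it never brings the chance to exactly zero.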
In fact, this whole comment is a random string of characters. I just got really lucky, and there's essentially no way to disprove that. Throwing a die a trillion times and getting all 6s is not an invalid outcome, just very improbable.
Randomness is about unpredictability. How can you assert that something is or is not predictable? It's always possible you just haven't observed it for long enough. An initial, randomly-appearing sequence may turn out to start repeating itself after some point in time; and an initial, self-repeating sequence can always be statistically cancelled out by later data.
Randomness is not something we can have; it's something we don't have (the ability to predict).
Does one really need to do data analysis to see that it is simply natural for large tests to be more flaky? All else being equal, large tests perform more operations. If on average the probability of any single operation failing is constant (a fair assumption given the large number of tests), then the test that performs more operations has a higher probability of failing (and no, it is not linear as shown in the article; it asymptotically approaches a probability of one). The "analysis" of tool types is strange too. Webdriver tests are more flaky because of the inherent nature of UI tests, which operate on a less stable interface (compared to e.g. a typical unit test, which usually deals with a pretty stable contract).
I used to spend quite some time on test stability issues in my previous team and wrote a couple posts on the topic and how to successfully go from unstable to stable test suite:
https://blogs.vmware.com/management/2015/09/automated-tests-...
https://blogs.vmware.com/management/2015/09/automated-tests-...
[Edit: formatting, spelling]
Doing studies and collecting data should be how any assumption, seemingly reasonable or not, is proven right or wrong.
I used to believe that devs should write the tests, but now I'm not so sure; all too often they just produce more technical debt and fail at improving quality. The industry treats knowing the API of a test framework as being an experienced automated tester.
More interesting things would be:
Number of network calls.
Network IO
Disk IO
Runtime
Number of syscalls
I say this because all of those things seem much more likely to be the source of failures to me. We all know about network unreliability, the CAP theorem and the Byzantine generals problem, so why doesn't it apply to testing too?
E.g. B must not be done until A finishes. A takes too long; we time out on it, and do B anyway. (Alternatively: A provides no reliable notification of being done, so we delay for N seconds and assume A is done.)
If having done B is incorrect if A still happens after the timeout, then we have a problem. It's a race between A and B, where A has a big "head start", that's all.
Famous example (maybe forgotten now): Microsoft's ten-second COM DLL unload delay. The last reference count on a COM DLL is dropped by a function that is in the DLL itself. So the refcount is zero, but the thread which dropped the refcount to zero must still execute instructions in that DLL until it returns out of that code. Oops! Anyone who calls the DLL's DllCanUnloadNow function will see a true result: we can unload it now! But actually we cannot. Solution: oh, ten seconds should be enough for that last thread to vacate. Long page faults under thrashing? Who cares; the system is probably dying in that situation anyway.
(That should also be considered a data point in the GC versus refcounting debates. A garbage collector can tell that a thread's instruction pointer is still referencing a chunk of dynamic code.)
Or to put it another way, adding a timeout makes any function's output non-deterministic, no matter how "pure" it is. It's only deterministic if you wait as long as necessary for it to finish.
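A small sketch of that point, with a deliberately trivial "pure" function. The helper names and the `delay` parameter (standing in for variable run time) are invented for illustration.

```python
import concurrent.futures
import time

def pure_square(x, delay):
    """Always returns x * x; `delay` only simulates how long it takes."""
    time.sleep(delay)
    return x * x

def call_with_timeout(fn, timeout, *args):
    """Return fn(*args), or None if it has not finished within `timeout` seconds."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The worker keeps running; leaving the `with` block still
            # waits for it, but its result is discarded.
            return None

# Same function, same argument x; the answer depends only on timing.
print(call_with_timeout(pure_square, 1.0, 3, 0.0))   # fast run
print(call_with_timeout(pure_square, 0.05, 3, 0.5))  # slow run hits the timeout
```

Note that the timed-out work still completes in the background here, which is the "A still happens after the timeout" hazard in miniature.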
a) They were log/log, especially the first one.
b) There were units on the horizontal axis.