I think people can get frustrated at CI when it fails, so the author is explaining that catching failures is the whole purpose of CI and why a failure is actually a good thing.
I would personally frame it slightly differently than the author. Non-flaky CI errors: your code failed CI. Flaky CI errors: CI failed. To be clear, that's more precise but would never catch on, because people would simplify "your code failed CI" to "CI failed" over time. Still, I don't think that stops it from being an interesting framing.
I always feel obliged to point out that we can have 100% coverage without making a single assertion (beware Goodhart's law)
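To make that concrete, here's a minimal pytest sketch (the `classify` function is made up for illustration): every line executes, so coverage reports 100%, yet nothing is actually verified.

```python
# classify() is a hypothetical function under "test".
def classify(n: int) -> str:
    if n < 0:
        return "negative"
    return "non-negative"

def test_classify():
    # Exercises both branches -> 100% line coverage,
    # but with zero assertions even swapped return values would pass.
    classify(-1)
    classify(1)
```

`coverage run -m pytest && coverage report` happily shows 100% either way, which is exactly the Goodhart's law trap.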
Or you have a concurrency issue in your production code?
If you just rerun and don't go to find out what exactly caused CI to fail, you end up at the author's conclusion:
> (but it could also just have been flaky again).
Solid writeup. Definitely keeping in my personal notes.
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. That is the real purpose of CI. Test automation is necessary, but only to keep things sane while you continually throw in fractionally complete work.
Green meaning "to the best of our knowledge, everything is good with the software" is well understood.
Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).
And whilst flaky tests are the most problematic for a CI system, that's because they often work (in my experience, most tests are flaky because they model situations that don't usually happen in production), so the builds are often still viable for deployment, with a caveat. If anything, tests that are known to be problematic should be marked orange.
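One way to approximate an "orange" lane, sketched with pytest: tag known-problematic tests with a custom marker and run them as a separate, non-blocking CI job. The marker name `flaky_known` is made up for illustration.

```python
# Register the marker in pytest.ini:
#   [pytest]
#   markers =
#       flaky_known: tests with known intermittent failures (non-blocking)
import pytest

@pytest.mark.flaky_known
def test_reconnect_after_timeout():
    # a test with a known intermittent failure mode
    ...
```

Then gate merges on `pytest -m "not flaky_known"` and run `pytest -m flaky_known` in a job that reports but never blocks. That gets you orange without lying with green.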
"beyond idiotic" -> "misleading | poor UX"
(I agree it's a terrible choice, but civility matters, and strengthens your case.)
An issue I have with a lot of unit tests is that they are too strongly coupled to the implementation, meaning any change to the implementation forces you to change the tests as well.
IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.
If it turns out that a single change to an implementation requires you to change and update 20 tests, those are bad tests.
What I want as a dev is to immediately think "I must have broken something" when a test fails, not "I need to go fix 20 tests".
For example, let's say you have a method which sorts data.
A bad test will check "did you call this `swap` function 5 times". A good test will say "I gave the method this unsorted data set; is the data set sorted?". Heck, a good test can even say something like "was this large data set sorted in under x time". That's trickier to do well, but still a better test than "did you call swap the right number of times", or, even worse, "did you invoke this exact sequence of swap calls".
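A sketch of the contrast, assuming a hypothetical `sorter` module exposing `sort_data()` built on a `swap()` helper; the call count is just whatever the current algorithm happens to produce:

```python
import time
from unittest import mock

import sorter  # hypothetical module under test

def test_sort_bad():
    # Brittle: spies on an internal helper. Switching to a different
    # (still correct) algorithm breaks this test immediately.
    with mock.patch.object(sorter, "swap", wraps=sorter.swap) as spy:
        sorter.sort_data([3, 1, 2])
    assert spy.call_count == 5

def test_sort_good():
    # Tests the contract: any correct implementation passes.
    data = [3, 1, 2, 2, 0]
    assert sorter.sort_data(data) == sorted(data)

def test_sort_good_time_bound():
    # The "large data set in under x time" variant; the 1-second budget
    # is illustrative, and wall-clock bounds are inherently noisy on CI.
    data = list(range(100_000, 0, -1))
    start = time.perf_counter()
    result = sorter.sort_data(data)
    elapsed = time.perf_counter() - start
    assert result == sorted(data)
    assert elapsed < 1.0
```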
Taken to extreme this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing. Which is... a strategy. I don't know if it is a good or bad strategy, but I can see it being viable for some projects.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the specialties that I have (unwillingly!) developed at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they really some boogeyman, some unreliable thing that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
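A minimal sketch of that replay idea, using pytest's real `pytest_collection_modifyitems` hook; the file format and env var are made up, not a standard plugin:

```python
# conftest.py -- replay a test order recorded by CI.
# Assumes CI wrote the executed node IDs, one per line, to a file.
import os

def pytest_collection_modifyitems(config, items):
    order_file = os.environ.get("REPLAY_ORDER_FILE")
    if not order_file or not os.path.exists(order_file):
        return
    with open(order_file) as f:
        recorded = [line.strip() for line in f if line.strip()]
    rank = {nodeid: i for i, nodeid in enumerate(recorded)}
    # Run tests in the recorded order; anything not in the recording
    # keeps its relative position at the end (sort is stable).
    items.sort(key=lambda item: rank.get(item.nodeid, len(rank)))
```

Then `REPLAY_ORDER_FILE=ci_order.txt pytest` runs the suite in the same order the failing CI run did.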
But once you get from "it's flaky" (and fails "seemingly at random") to "it fails 100% of the time on my laptop when run this way", it becomes much easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL result order is not deterministic unless you ORDER BY), issues with database IDs (e.g., the test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones: those are probably the biggest categories of "flakes".
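The first category in miniature, sketched with sqlite3 (the table and data are made up; sqlite may happen to return insertion order, which is exactly why this only fails sometimes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])

# Flaky: assumes an ordering the query never asked for.
rows = conn.execute("SELECT name FROM users").fetchall()
# assert rows == [("ada",), ("bob",)]   # may pass today, fail tomorrow

# Deterministic: ask for the order explicitly.
rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
assert rows == [("ada",), ("bob",)]
```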
While I know what people mean when they say "flake", as a word it usually just means "failure mode I don't understand".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded, it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id`, unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
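A sketch of what distinct ID types buy you, using Python's `typing.NewType` (FooId/BarId mirror the hypothetical names above; at runtime these are plain ints, but a static checker like mypy or pyright treats them as distinct):

```python
from typing import NewType

FooId = NewType("FooId", int)
BarId = NewType("BarId", int)

def load_foo(foo_id: FooId) -> str:
    return f"foo #{foo_id}"

foo_id = FooId(3)
bar_id = BarId(3)

# The ID-confusion bug: passes by coincidence, exactly like the
# `assert Foo.id == Bar.id` case, because both are ints at runtime.
assert foo_id == bar_id

load_foo(foo_id)   # fine
load_foo(bar_id)   # runs, but mypy/pyright reject it: BarId is not FooId
```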
We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.
Interesting. In my experience, it is always either a concurrency issue in the program under test, or property-based tests (PBTs) finding some extreme edge case that was never visited before.
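For anyone unfamiliar with PBTs, a minimal sketch using the `hypothesis` library; the `mean()` function and its empty-list bug are made up for illustration:

```python
from hypothesis import given, strategies as st

def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)   # crashes on [] -- the unvisited edge case

@given(st.lists(st.integers(min_value=-10**6, max_value=10**6)))
def test_mean_is_bounded(xs):
    m = mean(xs)               # hypothesis finds xs == [] almost immediately
    assert min(xs) <= m <= max(xs)
```

Unlike a true flake, hypothesis records the falsifying example in its local example database, so the failure replays deterministically on the next run.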