I think people can get frustrated at CI when it fails, so the author is explaining that catching failures is the whole purpose of CI and why a failure is actually a good thing.
I would personally frame it slightly differently than the author. Non-flaky CI errors: your code failed CI. Flaky CI errors: CI failed. To be clear, that's more precise but would never catch on, because people would simplify "your code failed CI" to "CI failed" over time. Still, I don't think that stops it from being an interesting framing.
I always feel obliged to point out that we can have 100% coverage without making a single assertion (beware Goodhart's law)
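To make that concrete, here's a minimal pytest sketch (the `classify` function is made up for illustration): every line executes, so coverage reports 100%, yet nothing is actually verified.

```python
# classify() is a hypothetical function under "test".
def classify(n: int) -> str:
    if n < 0:
        return "negative"
    return "non-negative"

def test_classify():
    # Exercises both branches -> 100% line coverage,
    # but with zero assertions even swapped return values would pass.
    classify(-1)
    classify(1)
```

`coverage run -m pytest && coverage report` happily shows 100% either way, which is exactly the Goodhart's law trap.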
Or you have a concurrency issue in your production code?
If you just rerun and don't go to find out what exactly caused CI to fail, you end up at the author's conclusion:
> (but it could also just have been flaky again).
Solid writeup. Definitely keeping in my personal notes.
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. That is the real purpose of CI. Test automation is necessary, but only to keep things sane while you continually throw in fractionally complete work.
Green meaning "to the best of our knowledge, everything is good with the software" is well understood.
Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).
And whilst flaky tests are the most problematic for a CI system, that's because they often work (in my experience, most tests are flaky because they model situations that don't usually happen in production), so the builds are often still viable for deployment, with a caveat. If anything, tests that are known to be problematic should be marked orange.
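One way to approximate an "orange" lane, sketched with pytest: tag known-problematic tests with a custom marker and run them as a separate, non-blocking CI job. The marker name `flaky_known` is made up for illustration.

```python
# Register the marker in pytest.ini:
#   [pytest]
#   markers =
#       flaky_known: tests with known intermittent failures (non-blocking)
import pytest

@pytest.mark.flaky_known
def test_reconnect_after_timeout():
    # a test with a known intermittent failure mode
    ...
```

Then gate merges on `pytest -m "not flaky_known"` and run `pytest -m flaky_known` in a job that reports but never blocks. That gets you orange without lying with green.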
"beyond idiotic" -> "misleading | poor UX"
(I agree it's a terrible choice, but civility matters, and strengthens your case.)
An issue I have with a lot of unit tests is that they are too strongly coupled to the implementation, meaning any change to the implementation forces you to change the tests as well.
IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.
If it turns out that a single change to an implementation requires you to change and update 20 tests, those are bad tests.
What I want as a dev is to immediately think "I must have broken something" when a test fails, not "I need to go fix 20 tests".
For example, let's say you have a method which sorts data.
A bad test will check "did you call this `swap` function 5 times". A good test will say "I gave the method this unsorted data set; is the data set sorted?". Heck, a good test can even say something like "was this large data set sorted in under x time". That's trickier to do well, but still a better test than "did you call swap the right number of times", or, even worse, "did you invoke this exact sequence of swap calls".
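A sketch of the contrast, assuming a hypothetical `sorter` module exposing `sort_data()` built on a `swap()` helper; the call count is just whatever the current algorithm happens to produce:

```python
import time
from unittest import mock

import sorter  # hypothetical module under test

def test_sort_bad():
    # Brittle: spies on an internal helper. Switching to a different
    # (still correct) algorithm breaks this test immediately.
    with mock.patch.object(sorter, "swap", wraps=sorter.swap) as spy:
        sorter.sort_data([3, 1, 2])
    assert spy.call_count == 5

def test_sort_good():
    # Tests the contract: any correct implementation passes.
    data = [3, 1, 2, 2, 0]
    assert sorter.sort_data(data) == sorted(data)

def test_sort_good_time_bound():
    # The "large data set in under x time" variant; the 1-second budget
    # is illustrative, and wall-clock bounds are inherently noisy on CI.
    data = list(range(100_000, 0, -1))
    start = time.perf_counter()
    result = sorter.sort_data(data)
    elapsed = time.perf_counter() - start
    assert result == sorted(data)
    assert elapsed < 1.0
```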
Taken to extreme this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing. Which is... a strategy. I don't know if it is a good or bad strategy, but I can see it being viable for some projects.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the specialties that I have (unwillingly!) developed at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they really some boogeyman, some unreliable thing that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
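A minimal sketch of that replay idea, using pytest's real `pytest_collection_modifyitems` hook; the file format and env var are made up, not a standard plugin:

```python
# conftest.py -- replay a test order recorded by CI.
# Assumes CI wrote the executed node IDs, one per line, to a file.
import os

def pytest_collection_modifyitems(config, items):
    order_file = os.environ.get("REPLAY_ORDER_FILE")
    if not order_file or not os.path.exists(order_file):
        return
    with open(order_file) as f:
        recorded = [line.strip() for line in f if line.strip()]
    rank = {nodeid: i for i, nodeid in enumerate(recorded)}
    # Run tests in the recorded order; anything not in the recording
    # keeps its relative position at the end (sort is stable).
    items.sort(key=lambda item: rank.get(item.nodeid, len(rank)))
```

Then `REPLAY_ORDER_FILE=ci_order.txt pytest` runs the suite in the same order the failing CI run did.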
But once you get from "it's flaky" (and fails "seemingly at random") to "it fails 100% of the time on my laptop when run this way", it becomes much easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL result order is not deterministic unless you ORDER BY), issues with database IDs (e.g., the test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones: those are probably the biggest categories of "flakes".
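The first category in miniature, sketched with sqlite3 (the table and data are made up; sqlite may happen to return insertion order, which is exactly why this only fails sometimes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])

# Flaky: assumes an ordering the query never asked for.
rows = conn.execute("SELECT name FROM users").fetchall()
# assert rows == [("ada",), ("bob",)]   # may pass today, fail tomorrow

# Deterministic: ask for the order explicitly.
rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
assert rows == [("ada",), ("bob",)]
```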
While I know what people mean when they say "flake", as a word it usually just means "failure mode I don't understand".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded, it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id`, unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
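A sketch of what distinct ID types buy you, using Python's `typing.NewType` (FooId/BarId mirror the hypothetical names above; at runtime these are plain ints, but a static checker like mypy or pyright treats them as distinct):

```python
from typing import NewType

FooId = NewType("FooId", int)
BarId = NewType("BarId", int)

def load_foo(foo_id: FooId) -> str:
    return f"foo #{foo_id}"

foo_id = FooId(3)
bar_id = BarId(3)

# The ID-confusion bug: passes by coincidence, exactly like the
# `assert Foo.id == Bar.id` case, because both are ints at runtime.
assert foo_id == bar_id

load_foo(foo_id)   # fine
load_foo(bar_id)   # runs, but mypy/pyright reject it: BarId is not FooId
```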
We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.
Interesting. In my experience, it is always either a concurrency issue in the program under test, or property-based tests (PBTs) finding some extreme edge case that was never visited before.
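For anyone unfamiliar with PBTs, a minimal sketch using the `hypothesis` library; the `mean()` function and its empty-list bug are made up for illustration:

```python
from hypothesis import given, strategies as st

def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)   # crashes on [] -- the unvisited edge case

@given(st.lists(st.integers(min_value=-10**6, max_value=10**6)))
def test_mean_is_bounded(xs):
    m = mean(xs)               # hypothesis finds xs == [] almost immediately
    assert min(xs) <= m <= max(xs)
```

Unlike a true flake, hypothesis records the falsifying example in its local example database, so the failure replays deterministically on the next run.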