Unfortunately I didn’t really get the point of the article after being bombarded with stats, except that the authors have an AI tool to sell.
I thought the repetition of these statistics was a little tired, but overall that's an impressive solution. Also totally get that the hardest part is log ingestion and indexing.
I guess they’re missing whatever Google has to make their monorepo scale
In my experience, multiple small repos don’t even have better CI reliability than a mono repo as less is invested because it affects fewer people. 10 person repos regularly have flaky tests that never get addressed because “we’ll deal with it later”. The tolerance for flakiness goes up when you can attribute it to a close teammate you know is heads down on something critical instead of it feeling like a random test you don’t even care about.
Not the problems, but the part where broken CI causes everything to stop.
Fractured repos have their own downsides, but the chance of literally everyone sitting and waiting is greatly reduced
In my experience the only repos that never get stuck are ones with no checkin gates.
There doesn't seem to be any upside to having it only for flaky tests, because the workflow is really agnostic to the context.
Even so, at what point do we consider the LLM-ification of all of tech a hazard? I've seen Claude go and lazily fix a test by loosening invariants. AI writes your code, AI writes your tests. Where is your human judgment?
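To make "loosening invariants" concrete, here's a hypothetical sketch (function and values invented for illustration) of a strict test and the kind of "fix" I'm describing:

```python
# Hypothetical example: a strict test vs. a "lazily fixed" version.
# apply_discount and the values are made up for illustration.

def apply_discount(price: float, rate: float) -> float:
    """Apply a percentage discount to a price."""
    return round(price * (1 - rate), 2)

# Original test: pins down the exact contract.
def test_discount_strict():
    assert apply_discount(100.0, 0.15) == 85.0

# The "fixed" test: the invariant is loosened until it can no
# longer catch the bug it was supposed to guard against.
def test_discount_loosened():
    result = apply_discount(100.0, 0.15)
    assert result is not None   # always true
    assert result <= 100.0      # passes even if the discount math is wrong

test_discount_strict()
test_discount_loosened()
```

The loosened version still goes green in CI, which is exactly why it slips past a rubber-stamp review.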
Someone is going to lose money or get hurt by this level of automation. If the humans on your team cannot keep track of the code being committed, then I would prefer not to use your product.
He does pull a sneaky on you from time to time, even nowadays, in v4.6, doesn't he?
To me it's analogous to the current situation at the strait of Hormuz - it's an enormous crisis but since almost everyone has a buffer of oil stockpiles, we can pretend it's not there.
I’d challenge you to identify where in my post I said I wouldn’t use software that employs automation?
It is pretty clear I am not talking about running CI for automated and predictable signals or cron jobs. I am talking about using AI to write code and also fix tests.
It is exceedingly clear in practice that the volume of code produced by LLMs is too much for the humans using these tools to read and understand. We are collectively throwing decades of best practices out of the window in service of “velocity.” Even the FAANG shops I know of who previously had good engineering cultures seem to be endorsing the cult of AI-generated everything with rubber-stamp approval.
Please no AI slop, write your own bloody blog posts.
Jesus, this is why Bazel was invented.
1. A test pass rate of 99.98% is not good - the only acceptable rate is 100%.
2. Tests should not be quarantined or disabled. Every flaky test deserves attention.
In the rare case that one is flaky, it's addressed. During the days when there is a flaky test, of course you don't have 100% pass rate, but on those days it's a top priority to fix.
But importantly: this is library and thick client code. It should be deterministic. There are no DB locks, docker containers, network timeouts or similar involved. I imagine that in tiered application tests you always run the risk of various layers not cooperating. Even worse if you involve any automation/ui in the mix.
Obviously there are systems it depends on (Source control, package servers) which can fail, failing the build. But that's not a _test_ failure.
If the build fails, it should be because a CI machine or a service the build depends on failed, not because an individual test randomly failed due to a race condition, timeout, test run order issue or similar.
I always assumed the purpose was leadership wanting an indicator that implied that someone had at least looked at every failing test.