When they don't merge cleanly, it's time for human intervention, and the integration step leaves a record of which branches failed to merge.
Finally, when you do need to debug individual agents:
- Because mngr is, at the low level, just managed tmux sessions (local and remote), it's very easy to just attach to those sessions (`mngr connect`). It works even if the agent has been stopped, because mngr remembers enough about an agent to resurrect it.
- `mngr message` also allows you to batch-message a bunch of agents. So if you do need to resume a lot of agents, you can experiment on one agent, figure out a good prompt, and then batch-message every other agent.
In this testing scenario, most agents don't actually require human intervention, and we've found that just connecting to a few individual agents to resolve problems is smooth and easy enough.
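A minimal sketch of that batch-resume workflow, assuming a `mngr message <agent> <prompt>` invocation (the real command syntax and flags may differ; this just shows the fan-out pattern):

```python
import subprocess

def batch_message(agents: list[str], prompt: str, dry_run: bool = True) -> list[list[str]]:
    """Send the same prompt to every agent via `mngr message` (hypothetical syntax)."""
    cmds = [["mngr", "message", agent, prompt] for agent in agents]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds

# Experiment on one agent first, then fan the refined prompt out to the rest.
cmds = batch_message(["agent-01", "agent-02", "agent-03"], "re-run the failing suite")
print(len(cmds), "messages queued")
```

The `dry_run` flag makes it easy to inspect what would be sent before committing to messaging the whole fleet.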
Bloggers: Here's how we use 3,000 parallel agents to write, test, and ship a new feature to production every 17 minutes in an 8M-LOC codebase (all agent-generated!).
... I'm doing something wrong, or other people are doing something wrong?
I think this is the difference. These toy examples of using parallel agents are *not* running against large codebases, allowing them to iterate more effectively. Once you are in real codebases (>1M LoC), these systems break down.
But our reaction to it has been to say "ok, well the best practice in software engineering is to make small, well-isolated components anyway, so what if we did that?"
We've been trying to really break things apart into smaller pieces (and that's even evident in mngr, where much of the code is split out into separate plugins), and have been having a ton of success with it.
I realize that that might not be an option for more brownfield / existing / legacy projects, but when making something new, I've really been enjoying this way of building things.
I understand that the natural instinct is to correct the output when you see your agent doing something wrong.
That is not productive.
The instinct should be to tweak the agent to do it right.
At this point I am almost not writing any code in an enterprise code base.
I'm extremely doubtful of this. It doesn't save time to tell it "you have an error on line 19", because that's (often) just as much work as fixing the error. Likewise, saying "be careful and don't make mistakes" is not going to achieve anything. So how can you possibly tweak the agent to "do it right" reliably without human intervention? That's not even a solved problem for working with _humans_ who don't have the context window limitations, let alone an LLM that deletes everything past 30k tokens.
I could give you some pointers, but will only type them out if there is a point
Ah, yes; must always remember to add "And don't make any mistakes" into the prompt /s
Improving the agent means improving the code base such that the agent can effectively work on it.
It cannot come as a surprise that an agent is better at working on a well-documented code base with a clear architecture.
On the other hand, if you expect an agent to add the right amount of ketchup to your undocumented spaghetti code, then you will continue to have a bad time.
I believe we can use these types of tools to make software more understandable, and mngr is an example of how to do that.
In our case study, we're using AI to increase our test coverage, and I would argue we are making the software more understandable: instead of just having hundreds of tests, we now have a document that describes how the software is supposed to work, with the tests linked to that document and checked to ensure that they conform.
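A toy sketch of that doc-to-test linkage check (the section names, test names, and data layout here are invented for illustration, not taken from our actual tooling):

```python
# Hypothetical conformance check: every test must cite a section of the
# behavior document, and every cited section must actually exist.
DOC_SECTIONS = {"connect", "message", "merge"}  # parsed from the doc in practice

def orphaned_tests(test_links: dict[str, str]) -> list[str]:
    """Names of tests that cite a section missing from the document."""
    return [name for name, section in test_links.items()
            if section not in DOC_SECTIONS]

links = {
    "test_connect_resumes_stopped_agent": "connect",
    "test_batch_send": "message",
    "test_old_resume_flow": "resume",  # stale: section no longer in the doc
}
print(orphaned_tests(links))
```

Running a check like this in CI is what keeps the document and the tests from drifting apart over time.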
That means that anyone--not just the author of the software--is now able to read through the high level tutorial description of how the commands work in order to understand what the program should do!
And as for the tests themselves, we've been able to make nice testing infrastructure--like the transcripts and recordings that were highlighted in the post--to make it even easier for us to verify the behavior of the software.
We also have an incredibly detailed style guide and set of tests and guidelines to ensure that the entire code base is consistent and high quality. You can drop into any of the code and pretty quickly understand what is happening. And if not, Claude will do an excellent job of describing how any given component works and how it relates to the others.
Finally, mngr itself is designed to be fully transparent when it is running--you can literally attach to the coding agent you are running and see exactly what is happening, and the program makes extensive log outputs for everything it does (feel free to open a PR if you'd like to see more!)
It's not perfect formal verification, but it does feel like we're making meaningful progress on making it easier to understand software--not harder.
And it is great! Really! Reading your post, I was wondering whether I could do the same thing to write tests in an automated way in the project I'm working on. It would be awesome!
Though on the other hand, we live in a corporate, capitalistic, and often inhumane economic system. If this kind of automation worked and delivered consistent output in the form of working software for two or three years, how long would it take the C-level suits to figure out that it is far better to have two or three Product Owners and maybe one Designer write a description of the entire program and just feed it into one of those automation pipelines? If the tech giants price a product like that reasonably and it actually works, how long until it causes the entire industry to collapse, leaving you able to produce software only by paying those tech giants? And there might be only five of them in the entire world, because nobody else will have enough GPUs. How soon until they come to an agreement and split the world into areas of monopoly:
- if your company is in Asia, you can buy your application from either Google or Alibaba.
In a world where everything is done on a computer via software, such concentration of power would be bad for everyone.
Of course, I doubt it will come to that, simply because it would be very hard to achieve at our level of technology, and some human involvement will remain necessary. But maybe I'm kidding myself, and I will lose my job entirely in a few years, along with tens of thousands of other software engineers.
I don't have a simple, perfect solution. We're just trying to make it possible for individuals and smaller companies to have access to the same kinds of tooling that the largest companies already have access to, and hopefully equalize the playing field at least a little bit...
If anyone has better ideas, I'd love to hear them!
I think we'll be fine.
This feels more like Y2K panic than grounded in truth. Senior software engineers guide these systems effectively today without creating a mess. I'm sure in some years agents will fill the role of maintainability engineer too. We are not special or irreplaceable.
It's not like we won't be spending an incredible amount of energy to overcome issues with understandability and maintenance. Sheer economic force will absolutely will this problem solved. It must be solved, because trillions of dollars urgently want it solved. That's evolutionary pressure if I've ever seen it.
Also, we ceremoniously ascribe too much value to the software we create. With the exception of a few places, almost all of it gets replaced before our careers are over. At the end of the day, business automation is value creation. It's not sacred. It has a finite life, and then it too dies.
The software artifact just needs to facilitate economic/interest flux long enough to be useful, then it can be replaced with something better or more relevant.
Thinking about that always reminds me of Foundation, "The Merchant Princes." Mallow travels to the edge of the Empire to see how things are on one of those worlds. He learns that there is a caste of tech-priests, and those people have absolutely no idea how the devices actually work.
He said:
> The machines work from generation to generation automatically, and the caretakers are a hereditary caste who would be helpless if a single D-tube in all that vast structure burned out
It was a sign of the severe decline of the entire Empire. People had no idea how the devices worked, and they would not be able to reproduce them, or even repair one if it broke.
It was a recurring premise of civilisational decline in the series: no proper maintenance, and people losing interest in, and knowledge of, how things are done and how they work.
I just wonder whether the same thing isn't starting to happen now with our own civilisation.
And evolution? Evolution means mass extinction of species, and that's normal. I'm not sure about you, but I would rather avoid any mass extinction where humanity is concerned.
How will we know what good software looks like if we no longer write assembler?
> Imagine looking at someone driving a car for 20 years. Will it be enough for you to drive a car yourself?
You don't have to drive stick to be able to drive.
Whatever the economically important functions are, the miracle of capitalism will find a way to staff it and solve it.
People fill all the gaps. No problem goes uninvestigated, no opportunity goes ignored.
At the end of the day we're delivering value. We'll be judged on value creation, and that'll map itself to whatever the tools of the day happen to be.
The agent orchestration library (mngr) is open source, so we aren't selling anything. There is literally no way for us to make money on it.
We shipped it this way instead of trying to monetize because we believe open agents must win over closed / verticalized platforms in order for humans to live freely in our AI future. We have plenty of money and runway as a company, and this feels much more important to work on.
what the hell?
each agent run against a real codebase probably spends 20-50k tokens just on context: repo structure, relevant files, recent changes. multiply that by 100 agents running every hour across 10-20 repos, and you're already hitting millions of tokens a day before any actual work happens. add in re-runs for failures or retries, and the cost curve gets steep quickly.
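the arithmetic above, made concrete (the retry rate is an assumption, and the other constants just take the figures from the comment):

```python
# Back-of-envelope context overhead, using the numbers in the comment above.
CONTEXT_TOKENS = 35_000      # midpoint of the 20-50k per-run range
AGENTS = 100                 # agents per hourly wave
WAVES_PER_DAY = 24
RETRY_RATE = 0.15            # assumption: fraction of runs re-executed

runs_per_day = AGENTS * WAVES_PER_DAY * (1 + RETRY_RATE)
tokens_per_day = runs_per_day * CONTEXT_TOKENS
print(f"{tokens_per_day / 1e6:.1f}M context tokens/day before any actual work")
```

even with conservative inputs, context overhead alone lands in the tens of millions of tokens per day.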
the harder problem is observability. with one agent you can read logs and understand what went wrong. with 100 agents you need aggregation, pattern detection, alerting on the common failure modes. if 3 agents fail silently but identically, was that a real issue or just rate limiting? if 40 agents all timeout at the same step, was it a dependency problem or infrastructure saturation? at scale you're debugging distributions, not individual runs.
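one cheap starting point for that aggregation is bucketing failures by a normalized signature, so superficially different copies of the same error cluster together (a sketch; a real pipeline would also fold in timestamps, step names, and repo ids):

```python
import re
from collections import Counter

def failure_signature(log_line: str) -> str:
    """Collapse a raw error line into a coarse signature so identical failures cluster."""
    return re.sub(r"\d+", "N", log_line).strip()  # strip volatile numbers (ids, durations)

def summarize(failures: list[str]) -> Counter:
    return Counter(failure_signature(line) for line in failures)

logs = [
    "429 rate limited after 12s",
    "429 rate limited after 7s",
    "429 rate limited after 31s",
    "timeout at step 4",
]
print(summarize(logs).most_common())
```

the three rate-limit errors land in one bucket of size 3, which is exactly the "was that rate limiting or a real issue?" question made answerable at a glance.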
also helps to be ruthless about concurrency. the async pattern isn't "run as many as possible at once"—it's "run exactly as many as the API and your budget can support without making the failure modes harder to diagnose." for claude api work that's usually smaller than people expect.
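the bounded-concurrency pattern is easy to express with a semaphore; `MAX_CONCURRENT` here is an arbitrary placeholder for whatever your API tier and budget actually support:

```python
import asyncio

MAX_CONCURRENT = 8  # placeholder: derive this from your real rate limits and budget

async def run_agent(task_id: int, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT agent calls in flight
        await asyncio.sleep(0.01)  # stand-in for the real agent/API call
        return f"task-{task_id} done"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(run_agent(i, sem) for i in range(100)))

results = asyncio.run(main())
print(len(results), "finished")
```

all 100 tasks are queued, but only 8 ever run at once, which keeps failure modes (timeouts, rate limits) legible instead of compounding.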
Are people just not going to open source anything anymore since licenses don't matter? Might as well just keep the code secret, right?
I'm also not sure that the current precedent on the matter is _quite_ as strong as you're thinking. The high-profile case you're most likely thinking of was brought by Stephen Thaler, who was seeking not just to claim copyright on AI-generated content but to list the AI as the sole author. (IIUC, he planned to still own the copyright on the theory that it was a work for hire.)
Analytics can be run on it, they can run it through their own models, synthetic training data can be derived from it, it can be used to build profiles on you/your business, they could harvest trade/literal secrets from it, they could store derivatives of your data to one day sell to competitors/compete themselves, they can use it to gauge just how dependent you've made yourself/business on their LLMs and price accordingly, etc.
> Our use of content. We may use Content to provide, maintain, develop, and improve our Services, comply with applicable law, enforce our terms and policies, and keep our Services safe. If you're using ChatGPT through Apple's integrations, see this Help Center article (opens in a new window) for how we handle your Content.
> Opt out. If you do not want us to use your Content to train our models, you can opt out by following the instructions in this article . Please note that in some cases this may limit the ability of our Services to better address your specific use case.
https://openai.com/policies/row-terms-of-use/ https://openai.com/policies/how-your-data-is-used-to-improve...
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler’s claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
Or in other words: They ruled you can't register copyright with an AI listed as the author on the application. They made no comment on whether a human can be listed as the author if an AI did the work.