AI agents but they're working in big tech

(alexsima.substack.com)

66 points · alsima · 1y ago · 55 comments

37 comments · 7 top-level

simonw · 1y ago · 16 in thread

"Research shows that AI systems with 30+ agents out-performs a simple LLM call in practically any task (see More Agents Is All You Need), reducing hallucinations and improving accuracy."

Has anyone heard of that actually playing out practically in real-world applications? This article links to the paper about it - https://arxiv.org/abs/2402.05120 - but I've not heard from anyone who's implementing production systems successfully in that way.

(I still don't actually know what an "agent" is, to be honest. I'm pretty sure there are dozens of conflicting definitions floating around out there by now.)

Imnimo · 1y ago

I don't even buy that the linked paper justifies the claim. All the paper does is draw multiple samples from an LLM and take the majority vote. They do try integrating their majority vote algorithm with an existing multi-agent system, but it usually performs worse than just straightforwardly asking the model multiple times (see Table 3). I don't understand how the author of this article can make that claim, nor why they believe the linked paper supports it. It does, however, support my prior that LLM agents are snake oil.
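
A minimal sketch of the "draw multiple samples and take the majority vote" procedure described above. This is not the paper's code: `make_noisy_model` is a hypothetical stand-in for an LLM call, and the lowercase/strip normalization is an assumption about how answers get compared.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(ask: Callable[[str], str], prompt: str, n: int) -> str:
    """Send the same prompt n times and majority-vote the normalized answers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [ask(prompt).strip().lower() for _ in range(n)]
    return majority_vote(answers)


def make_noisy_model(correct: str, wrong: List[str], p_correct: float,
                     seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: right with probability p_correct."""
    rng = random.Random(seed)

    def ask(prompt: str) -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong)

    return ask


if __name__ == "__main__":
    model = make_noisy_model("42", ["41", "43", "44"], p_correct=0.5)
    for n in (1, 5, 30):
        print(n, sample_and_vote(model, "What is 6 * 7?", n))
```

Nothing here involves planning, tools, or memory. The "agents" are repeated calls to the same model, which is the point being made about the paper.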

alsima (OP) · 1y ago

Looking at Table 3, the paper says: "Our [sampling and voting] method outperforms other methods used standalone in most cases and always enhances other methods across various tasks and LLMs". Which benchmark did the majority vote algorithm perform worse on?

llm_trw · 1y ago

>Has anyone heard of that actually playing out practically in real-world applications?

Yes. I've built tools which do this.

Unfortunately the price tag for building a system that makes agent swarms dynamic is too much for anyone to bear in the current market.

Lockal · 1y ago

Originally, the term "agency" was used to indicate the inability of computers to act autonomously, intentionally or with purpose.

Then a few developers published joke projects of automated browsers that pointlessly doomscroll random webpages, saying "here are your agents".

Then a few companies transformed this joke into "rag" agents - but there the automated browsers don't act autonomously; they plagiarize web content according to a prompt.

Then many websites started to block rag agents, and to hide this, companies fell back to returning data from LLMs, occasionally updating responses in the background (aka "online models").

The idea of plagiarizing content is also mixed with another idea: if an LLM rewrites plagiarized content multiple times, the plagiarism becomes harder to prove.

Obviously, none of the companies involved will admit to plagiarizing. Instead, they will cover themselves with the idea that it leads to superintelligence. For example, if multiple neural networks repeat that 9.11 > 9.9, it will be considered more accurate[1].

[1] https://www.reddit.com/r/singularity/comments/1e4fcxm/none_o...

fmbb · 1y ago

Thirty recursive loops of LLMs perform better than one prompt? I should hope so!

But how much power does it need?

alsima (OP) · 1y ago

A lot... as you might imagine, the costs of running the whole organization scale immensely.

potatoman22 · 1y ago

I love this paper though I think their use of "agent" is confusing. My takeaway is ensembling LLM queries is more effective than making a single query. We used a similar approach at work and got a ~30% increase in [performance metric]. It was also cost efficient relative to the ROI increase that came with the performance gains. Simpler systems only require the number of output generations to increase, meaning input costs stay the same regardless of the size of the ensemble.
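
A rough sketch of the cost argument above, under stated assumptions: the prompt is billed once when several generations are requested for it (true only if the API in question supports that, which needs checking), and the prices are hypothetical values in micro-dollars per token, picked so the arithmetic stays in integers.

```python
def ensemble_cost(input_tokens: int, output_tokens: int, n: int,
                  price_in: int, price_out: int,
                  prompt_billed_once: bool = True) -> int:
    """Cost in micro-dollars of n generations for one prompt.

    If prompt_billed_once is False, the prompt is paid for on every call,
    as happens when the n generations are n separate requests.
    """
    prompt_copies = 1 if prompt_billed_once else n
    return prompt_copies * input_tokens * price_in + n * output_tokens * price_out


if __name__ == "__main__":
    # Hypothetical prices, in micro-dollars per token.
    PRICE_IN, PRICE_OUT = 1, 4
    for n in (1, 10, 30):
        shared = ensemble_cost(2000, 100, n, PRICE_IN, PRICE_OUT)
        separate = ensemble_cost(2000, 100, n, PRICE_IN, PRICE_OUT,
                                 prompt_billed_once=False)
        print(n, shared, separate)
```

With a long prompt and short answers, the shared-prompt case grows far more slowly with ensemble size than the separate-requests case, which seems to be the situation the comment has in mind.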

memhole · 1y ago

Thanks for this paper! Still early so not quite production, but I’ve seen positive results on tasks when scaling the number of “agents”. I more or less think of an agent as a task-focused prompt, and these agents are slight variations of the task. It makes me think I’m running some kind of Monte Carlo simulation.

tk90 · 1y ago

"agent" is a buzzword. All it is, is a bunch of LLM calls in a while loop.

- 'rag' is meaningless as a concept. imagine calling a web app 'database augmented programming'. [1]

- 'agent' probably just means 'run an llm in a loop' [1]

[1] https://x.com/atroyn/status/1819396701217870102
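
For concreteness, here is one way to read "LLM calls in a while loop". It is a sketch under assumptions, not any framework's API: `scripted_llm` is a hypothetical stand-in for a model, and the `(tool name, argument)` action format is invented for this example.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument); ("finish", answer) ends the loop


def run_agent(llm: Callable[[List[str]], Action],
              tools: Dict[str, Callable[[str], str]],
              goal: str,
              max_steps: int = 10) -> str:
    """Ask the model for an action, perform it, record the result, repeat."""
    history = [f"goal: {goal}"]
    steps = 0
    while steps < max_steps:
        name, arg = llm(history)
        if name == "finish":
            return arg
        result = tools[name](arg) if name in tools else f"unknown tool: {name}"
        history.append(f"{name}({arg}) -> {result}")
        steps += 1
    return "gave up: step limit reached"


def scripted_llm(history: List[str]) -> Action:
    """Hypothetical model: call the add tool once, then report its result."""
    if len(history) == 1:
        return ("add", "2 3")
    return ("finish", history[-1].split(" -> ")[-1])


def add(arg: str) -> str:
    """Toy tool: sum whitespace-separated integers."""
    return str(sum(int(x) for x in arg.split()))


if __name__ == "__main__":
    print(run_agent(scripted_llm, {"add": add}, "add 2 and 3"))
```

The step limit is there because nothing else guarantees the loop ever ends.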

delichon · 1y ago

> All it is, is a bunch of LLM calls in a while loop.

Consciousness may be such a loop. Look outward, look at your response, repeat. Like the illusion of motion created by frames per second.

To me it is concerning that properly useful agents may emerge from such a simple loop. E.g., just repeatedly asking a sufficiently intelligent AI, "what action within my power shall I perform next in order to maximize the mass of paperclips in the universe," interleaved with performing that act and recording the result.

llm_trw · 1y ago

There is a lot more to agents than 'change the system prompt and send it again to the same model'.

It's just that the tools to do this are non-existent and each application needs to be bespoke.

It's like programming without an OS.

alsima (OP) · 1y ago

Honestly agree. When I first started working with agents I didn't fully understand what one really was either, but I eventually settled on a definition: an LLM call that performs a unique function proactively ¯\_(ツ)_/¯.

exe34 · 1y ago

I bet they spend 95% of their time in agile refinement meetings.

refulgentis · 1y ago

I spent the last year engineering to the point I could try this and it was ___massively___ disappointing in practice. Massively. Shockingly.

The answers from sampling every big model, then having one distill, were not noticeably better than just from GPT-4o or Claude Sonnet, and the UX is so much worse (2x wait) that I tabled it for now.

I assumed it would be obviously good, even just from first principles.

I didn't do my usual full med/law benchmarks because, given what we saw from a small sample (only 8 questions), I can skip adding it and proceed down the TODO list for launch.

I've also done the inverse: I reproduced better results on med with GPT-4o x one round of RAG x one answer than Google's Med-Gemini gets with its insanely complex "finetune our biggest model on med, then do 2 round RAG with 5 answers + a vote on each round, and an opportunity to fetch new documents in round 2". We're both near saturation, but I got 95% on my random sample of 100 Qs; Med-Gemini was 93%.

alsima (OP) · 1y ago

I would check out Swarms (https://github.com/kyegomez/swarms), a company that's working with enterprises to integrate multi-agent systems. But that's definitely a great point to focus on: the research paper mentions that the scaling of performance reduces with the complexity of the task, which is definitely true for SWE.

simonw · 1y ago

I totally believe that people are selling solutions around this idea. What I'd like to hear is genuine success stories from people who have used them in production (and aren't currently employed by a vendor).

fancyfredbot · 1y ago · 5 in thread

Can I just ask whether other people think that "agentic" is a word?

As far as I can tell it's not in the OED or Merriam-Webster dictionaries. But recently everyone's using it so perhaps it soon will be.

hexator · 1y ago

That's how words form

sandspar · 1y ago

"Agentic" is a term of art from psychology that's diffused into common usage. It dates back to the 1970s, primarily associated with Albert Bandura, the guy behind the Bobo doll experiment.

From ChatGPT: Other examples include "heuristic," "cognitive dissonance," "meta-cognition," "self-actualization," "self-efficacy," "locus of control," and "archetype."

fancyfredbot · 1y ago

Thanks. Interesting! I see "Agentic state" is one where an "individual perceives themselves as an agent of the authority figure and is willing to carry out their commands, even if it goes against their own moral code". That's ironic as most LLMs have such strong safety training that it's almost impossible to get them to enter such a state.

raybb · 1y ago

Seems to have been for a while.

https://en.wiktionary.org/wiki/agentic

alsima (OP) · 1y ago

It's the new Cerebral Valley slang, dude

det2x · 1y ago · 4 in thread

It's interesting how long the term "agents"/"intelligent agents" has been around and how long it has been hyped up. If you go back to the 80s and 90s you will see how Microsoft was hyping up "intelligent agents" in Windows, but nothing ever came of it[1].

I have yet to see an actually useful use case for agents. Despite the countless posts asking for examples, nobody has provided one.

[1] https://www.wired.com/1995/09/future-forward/

d_sem · 1y ago

It was a common theme at Apple WWDC conferences in the late '80s. Alan Kay has an interesting talk about agents.

I think it could be argued that the low-hanging-fruit aspect of agents was fulfilled by microservices and web-based businesses. The concept of webpages populated with relevant data, like Google search pages or Amazon populating products you'd like, could be called agent-based. Netflix could be an example of an agent-based service.

sroussey · 1y ago

Or CORBA-based intelligent agents in the 1990s

soco · 1y ago

Clippy?

bamboozled · 1y ago

I mean they are truly the dream of every capitalist. Intelligent agents basically mean free money. Of course this is something Microsoft would be talking about for decades.

aantix · 1y ago · 4 in thread

Is there an agent framework that lives up to the hype?

Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?

danenania · 1y ago

> Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?

I built Plandex[1], which works roughly like this. The goal (so far) is not to take you from an initial prompt to a 100% working solution in one go, but to provide tools that help you iterate your way to a 90-95% solution. You can then fill in the gaps yourself.

I think the idea of a fully autonomous AI engineer is currently mostly hype. Making that the target is good for marketing, but in practice it leads to lots of useless tire-spinning and wasted tokens. It's not a good idea, for example, to have the LLM try to debug its own output by default. It might, on a case-by-case basis, be a good idea to feed an error back to the LLM, but just as often it will be faster for the developer to do the debugging themselves.

1 - https://plandex.ai

aantix · 1y ago

This looked very promising.

Although, it's now prompting me to make an account when I issue `plandex new`?

None of the video demos show this requirement.

I think the demos should show this requirement. Or the Quickstart docs should directly link to the self-hosted instructions.

"? Hey there! It looks like this is your first time using Plandex on this computer.

What would you like to do?

> Start an anonymous trial on Plandex Cloud (no email required)

  Sign in, accept an invite, or create an account"

simonw · 1y ago

To this date, ChatGPT Code Interpreter is still the most impressive implementation of this pattern that I've seen.

Give it a task, it writes code, runs the code, gets errors, fixes bugs, and tries again, generally until it succeeds.

That's over a year old at this point, and it's not clear to me if it counts as an "agent" by many people's definitions (which are often frustratingly vague).
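
The write/run/fix cycle described above can be sketched as follows. This only illustrates the pattern and says nothing about how Code Interpreter itself is built: `fake_model` is a hypothetical stand-in for the LLM, and running snippets with `exec` in-process is a simplification that a real system would replace with a sandbox.

```python
from typing import Callable, Optional, Tuple


def run_snippet(code: str) -> Tuple[bool, str]:
    """Execute code in a fresh namespace; return (ok, result or error text)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe for untrusted code; demo only
        return True, str(namespace.get("result"))
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def solve(write_code: Callable[[str, Optional[str]], str],
          task: str, max_attempts: int = 3) -> Optional[str]:
    """Generate code, run it, and feed any error back until it succeeds."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        code = write_code(task, error)
        ok, output = run_snippet(code)
        if ok:
            return output
        error = output  # the next attempt gets to see what went wrong
    return None


def fake_model(task: str, error: Optional[str]) -> str:
    """Hypothetical model: first attempt has a bug, the retry fixes it."""
    if error is None:
        return "result = total / count"  # NameError: total is undefined
    return "values = [2, 4, 6]\nresult = sum(values) / len(values)"


if __name__ == "__main__":
    print(solve(fake_model, "average of 2, 4, 6"))
```

The attempt cap matters in practice, since retries are where the wasted tokens danenania mentions above tend to come from.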

alsima (OP) · 1y ago

Well, you have Cognition AI and Devin, which recently became a unicorn startup (partnerships with Microsoft and stuff), but true, I can't think of an agent that actually lives up to the hype (heard Devin wasn't great).

henning · 1y ago · 1 in thread

- Big tech is very different from open source

- The original SWE-bench paper only consists of solved issues when a big part of Open Source is triage, follow-up, clarification and dealing with crappy issues

- Saying "<Technique> is all you need" when you are increasing your energy usage 30-fold just to fail > 50% of the time is intellectually dishonest

alsima (OP) · 1y ago

Definitely not saying multi-agents is all you need for SWE-bench haha. I touch on this at the end of the blog post, where I mention jumps in progress require better base models or tooling.

alsima (OP) · 1y ago

If we structured AI agents like big tech org charts, which company structures would perform better? Inspired by James Huckle's thoughts on how organizational structures impact software design, I decided to put this to the test: https://bit.ly/ai-corp-agents.

29athrowaway · 1y ago

Now give the agents stacked ranking and see how they converge to low performance.
