Has anyone heard of that actually playing out practically in real-world applications? This article links to the paper about it - https://arxiv.org/abs/2402.05120 - but I've not heard from anyone who's implementing production systems successfully in that way.
(I still don't actually know what an "agent" is, to be honest. I'm pretty sure there are dozens of conflicting definitions floating around out there by now.)
Yes. I've built tools which do this.
Unfortunately the price tag for building a system that makes agent swarms dynamic is too much for anyone to bear in the current market.
Then a few developers published joke projects of automated browsers that pointlessly doomscroll random webpages, saying "here are your agents".
Then a few companies turned this joke into "rag" agents, but there the automated browsers don't act autonomously; they plagiarize web content according to the prompt.
Then many websites started to block rag agents, and to hide this, companies fell back to returning data from LLMs, occasionally updating responses in the background (aka "online models").
The idea of plagiarizing content is also mixed with another idea: if an LLM rewrites plagiarized content multiple times, the plagiarism becomes harder to prove.
Obviously, none of the companies involved will admit to plagiarizing. Instead, they will cover themselves with the idea that it leads to superintelligence. For example, if multiple neural networks repeat that 9.11 > 9.9, it will be considered more accurate[1].
[1] https://www.reddit.com/r/singularity/comments/1e4fcxm/none_o...
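For what it's worth, the voting step itself is tiny. A minimal sketch, assuming plain majority voting over sampled answers; the function name and the sample answers are made up for illustration and are not taken from the paper:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of samples that gave it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical samples: most "agents" share the same mistaken belief.
samples = ["9.11 > 9.9", "9.11 > 9.9", "9.9 > 9.11", "9.11 > 9.9"]
winner, share = majority_vote(samples)
print(winner, share)
```

Voting can only average out errors that are independent between samples. When most samples make the same mistake, as in the toy data here, the vote returns that mistake with high agreement.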
But how much power does it need?
- 'rag' is meaningless as a concept. imagine calling a web app 'database augmented programming'. [1]
- 'agent' probably just means 'run an llm in a loop' [1]
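Taken literally, "run an llm in a loop" could look something like the sketch below. Everything here is a stand-in: `choose_action` plays the role of the LLM call and `perform` the role of the tool or environment, and both are replaced by toy functions so the example runs without any model or API.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]

def run_agent(goal: str,
              choose_action: Callable[[str, History], str],
              perform: Callable[[str], str],
              max_steps: int = 5) -> History:
    """Ask for the next action, perform it, record the result, repeat."""
    history: History = []
    for _ in range(max_steps):
        action = choose_action(goal, history)  # stand-in for an LLM call
        if action == "stop":
            break
        result = perform(action)               # stand-in for a tool call
        history.append((action, result))
    return history

# Toy stand-ins (hypothetical): three steps, then stop.
def fake_llm(goal: str, history: History) -> str:
    return "next step" if len(history) < 3 else "stop"

def fake_env(action: str) -> str:
    return "ok: " + action

log = run_agent("tidy the repo", fake_llm, fake_env)
print(len(log))
```

The `max_steps` cap is the only thing that stops the loop if the model never says "stop".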
Consciousness may be such a loop. Look outward, look at your response, repeat. Like the illusion of motion created by frames per second.
To me it is concerning that proper useful agents may emerge from such a simple loop. E.g., just repeatedly asking a sufficiently intelligent AI, "what action within my power shall I perform next in order to maximize the mass of paperclips in the universe," interleaved with performing that act and recording the result.
It's just that the tools to do this are non-existent and each application needs to be bespoke.
It's like programming without an OS.
The answers from sampling every big model, then having one distill, were not noticeably better than just from GPT-4o or Claude Sonnet, and the UX is so much worse (2x wait) that I tabled it for now.
I assumed it would be obviously good, even just from first principles.
I didn't do my usual full med/law benchmarks because given what we saw from a small sample, only 8 questions, I can skip adding it and proceed down the TODO list for launch.
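The sample-then-distill setup described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the model names, the prompt layout, and the toy distiller are all hypothetical, and real model calls are replaced by plain functions.

```python
from typing import Callable, Dict

def sample_and_distill(question: str,
                       models: Dict[str, Callable[[str], str]],
                       distiller: Callable[[str], str]) -> str:
    """Ask every model the question, then have one model merge the drafts."""
    drafts = {name: ask(question) for name, ask in models.items()}
    prompt = "Question: " + question + "\n" + "\n".join(
        f"[{name}] {text}" for name, text in drafts.items())
    return distiller(prompt)

# Toy stand-ins (hypothetical) so the sketch runs offline.
models = {
    "model_a": lambda q: "A says 4",
    "model_b": lambda q: "B says 4",
}

def toy_distiller(prompt: str) -> str:
    # A real distiller would be another model call; this one just keeps the last draft.
    return prompt.splitlines()[-1]

answer = sample_and_distill("What is 2 + 2?", models, toy_distiller)
print(answer)
```

The distill call cannot start until the slowest sample has returned, so even with the samples run in parallel the user waits for two model calls in sequence.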
I've also done the inverse: reproduced better results on med with GPT-4o x one round RAG x one answer than Google's insanely complex Med Gemini setup of "finetune our biggest model on med, then do 2 round RAG with 5 answers + a vote on each round, and an opportunity to fetch new documents in round 2". We're both near saturation, but I got 95% on my random sample of 100 Qs; Med Gemini was 93%.
As far as I can tell it's not in the OED or Merriam-Webster dictionaries. But recently everyone's using it so perhaps it soon will be.
From ChatGPT: Other examples include "heuristic," "cognitive dissonance," "meta-cognition," "self-actualization," "self-efficacy," "locus of control," and "archetype."
I have yet to see an actually useful use case for agents. Despite the countless posts asking for examples, nobody has provided one.
I think it could be argued that the low-hanging-fruit aspect of agents was fulfilled by microservices and web-based businesses. The concept of webpages populated with relevant data, like Google search pages or Amazon populating products you'd like, could be called agent based. Netflix could be an example of an agent-based service.
Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?
I built Plandex[1], which works roughly like this. The goal (so far) is not to take you from an initial prompt to a 100% working solution in one go, but to provide tools that help you iterate your way to a 90-95% solution. You can then fill in the gaps yourself.
I think the idea of a fully autonomous AI engineer is currently mostly hype. Making that the target is good for marketing, but in practice it leads to lots of useless tire-spinning and wasted tokens. It's not a good idea, for example, to have the LLM try to debug its own output by default. It might, on a case-by-case basis, be a good idea to feed an error back to the LLM, but just as often it will be faster for the developer to do the debugging themselves.
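To make the "case-by-case" point concrete, here is a minimal sketch of opt-in error feedback. It is not how Plandex works; `fix` stands in for an LLM call, `should_retry` stands in for the developer's decision, and both are hypothetical helpers invented for this example.

```python
def run_with_optional_retry(source, fix, should_retry, max_attempts=3):
    """Run `source`; on failure, hand the error to `fix` only if `should_retry` agrees."""
    for attempt in range(1, max_attempts + 1):
        try:
            namespace = {}
            # exec keeps the sketch short; real generated code would need a sandbox.
            exec(source, namespace)
            return namespace, attempt
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
            if attempt == max_attempts or not should_retry(error):
                raise  # give the error back to the developer
            source = fix(source, error)  # stand-in for an LLM call

# Toy stand-ins (hypothetical): only NameErrors are sent back for a fix.
def fake_fix(source, error):
    return source.replace("undefined_name", "2")

def retry_on_name_error(error):
    return error.startswith("NameError")

namespace, attempts = run_with_optional_retry(
    "total = 1 + undefined_name", fake_fix, retry_on_name_error)
print(namespace["total"], attempts)
```

The attempt cap and the `should_retry` gate are the two places where wasted tokens get bounded; anything the gate rejects is raised immediately instead of being retried.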
Although, it's now prompting me to make an account when I issue `plandex new`?
None of the video demos show this requirement.
I think the demos should show this requirement. Or the Quickstart docs should directly link to the self-hosted instructions.
"? Hey there! It looks like this is your first time using Plandex on this computer.
What would you like to do?
> Start an anonymous trial on Plandex Cloud (no email required)
Sign in, accept an invite, or create an account"
Give it a task, it writes code, runs the code, gets errors, fixes bugs, tries again generally until it succeeds.
That's over a year old at this point, and it's not clear to me if it counts as an "agent" by many people's definitions (which are often frustratingly vague).
- The original SWE-bench paper only consists of solved issues when a big part of Open Source is triage, follow-up, clarification and dealing with crappy issues
- Saying "<Technique> is all you need" when you are increasing your energy usage 30-fold just to fail > 50% of the time is intellectually dishonest