That all makes sense to me and I think is the right direction to be headed. However, it's been a bit since the inception of some of these projects/cool demos but I haven't seen anyone who uses agents as a core/regular part of their workflow.
I'm curious if you use these agents regularly or know someone that does. Or if you're working on one of these, I'd love to know what are some of the hidden challenges to making a useful product with agents? What's the main bottleneck?
Any thoughts are welcome!
One thing that is still confusing to me, is that we've been building products with machine learning pretty heavily for a decade now and somehow abandoned all that we have learned about the process now that we're building "AI".
The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable.
You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have roughly a 72% chance (1 - 0.9^12) of getting at least one digit incorrect. This means a product-worthy digit classifier needs to be much higher accuracy.
Go look at some of the LLM benchmarks out there, even in these happy cases it's rare to see any LLM getting above 90%. Then consider you want to chain these calls together to create proper agent based workflows. Even with 90% accuracy in each task, chain 3 of these together and you're down to 0.9 x 0.9 x 0.9 = 0.73, 73% accuracy.
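To make the arithmetic above easy to play with, here is a minimal sketch. It assumes errors at each step are independent, which is a simplification; the function names are just illustrative.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent errors."""
    return per_step_accuracy ** steps


def chance_of_any_error(per_step_accuracy: float, steps: int) -> float:
    """Probability that at least one step fails, under the same assumption."""
    return 1.0 - chain_success(per_step_accuracy, steps)


if __name__ == "__main__":
    # 12-digit account number read by a 90%-accurate digit classifier (~0.72)
    print(f"{chance_of_any_error(0.9, 12):.2f}")
    # Three chained 90%-accurate calls (0.729)
    print(f"{chain_success(0.9, 3):.3f}")
```

Plugging in 0.99 instead of 0.9 shows how quickly the picture changes once per-step accuracy goes up.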
This is by far the biggest obstacle towards seeing more useful products built with agents. There are cases where lower accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.
I think that ChatGPT's success might be partly attributable to its chat interface. For whatever reason, a lot of people - including me! - are much more forgiving of inconsistencies, slip-ups, and inaccuracies when in a conversational format. Kind of like how you might forgive a real human for making a mistake in conversation.
I don't think that's necessarily good, and might not have much connection to attempts to build new non-conversational products on top of LLMs, but maybe it has some explanatory power for the current situation.
The key term here is "conversation". If I query something from the machine and it disappears and rumbles and then prints off something like a 1980s mainframe, with paper that has those holes on the side that you tear off... and then it's wrong, it's wasted time.
Meanwhile with the conversation I'm watching it in real time, and can stop it, refine it, or ask for clarification immediately and effectively. There is an expectation of give and take and "talking through" things to get to an answer, which I find is effective. I don't need it to be 100% right all the time, just 80% and then start parsing answers out of it to refine it to 90% accuracy with high confidence.
Completion models are obviously wrong very often. Instruct model was kinda ok, but you know it's a dumb machine.
Chat was a bit of an uncanny valley. I treated the instruct model like a child, but chat felt like having a conversation with someone of 80 IQ. It felt frustrating, and you ended up going "no no no, what I meant WAS ..." It felt like dealing with an incompetent colleague.
But I guess there's lots of views on it. Some expected it to be an oracle, even a god. Some treated it like Stack Overflow, then got frustrated that it was giving poor quality answers to poor quality questions. Some were just abusive to it. I suppose it's a mirror in a sense.
- copilots are useful
- chat is entertaining and useful
- future tech is coming
- investment money
The handwritten automations have performed better and the issues are reproducible, so even when there are issues, there's some sense of forward progress as you fix them. With handing it all over to an agent, it really feels like running around in circles.
I think there's probably something here, but it's less trivial than just tossing a webpage at ChatGPT and hoping for the best.
Another opportunity is that you can have fewer steps or more shared context. One interesting thing about Whisper is that it's not just straight speech recognition but can also be prompted and given context to understand what sort of thing the speech may be about, increasing its accuracy considerably. LLM Vision models also do this with things like OCR. This might not help it with the individual digits in an account number, but it does help with distinguishing an account number from a street address on a check.
Or to take another old-style ML technique, you probably shouldn't be doing sentiment analysis in some pipeline, because you don't need to: instead you should step back and look at the purpose of the sentiment analysis and see if you can connect that purpose directly with the original text.
All that said, you definitely can write pipelines with compounding errors. We haven't collectively learned how to factor problems and engineer these systems with LLMs yet. Among the things I think we have to do is connect the tools more directly with user intention (effectively flattening another error-inducing part of the pipeline), and make the pipelines collaborative with users. This is more complex and distinctly not autonomous, but then hopefully you are addressing a broader problem or doing so in a more complete way.
You are assuming that the probability of failure is independent, which couldn't be further from the truth. If a digit recogniser can recognise one of your "hard" handwritten digits, such as a 4 or a 9, it will likely be able to recognise all of them.
The same happens with AI agents. They are not good at some tasks, but really really good at others.
It's way more nuanced than this. Of course, you need a decent "accuracy" (not necessarily the metric), but in many business cases, you don't need high accuracy. But you need a solid process: you can catch errors later, you can cross-reference, you need a failsafe, you need post-mortem error handling, etc...
I shipped stuff (classical ML) that was nothing more than "a biased coin flip," but that still generates value ($) due to the process around it.
Now I am curious, what are some tasks that can accept a model that is at 80% as good as a human, but is 100x cheaper?(or, 100x faster?)
The volume of helpdesk tickets large enterprises deal with is very easily and vastly underestimated. If you can even route 30% away from the central triage with 90+% accuracy and drop everything else back to the central triage... you suddenly save 2 FTEs in that spot in some places. And increase customer satisfaction for most of those tickets because they get resolved faster.
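A rough sketch of that "route what you're sure about, drop the rest back to central triage" idea. Everything here is hypothetical: the queue names, the toy classifier and its confidence numbers are made up for illustration, and a model's confidence score is only a proxy for real accuracy, so the threshold would need tuning against labeled tickets.

```python
from typing import Callable, Tuple

CENTRAL_TRIAGE = "central_triage"


def route_ticket(
    text: str,
    classify: Callable[[str], Tuple[str, float]],
    threshold: float = 0.9,
) -> str:
    """Send a ticket to a specific queue only when the classifier is confident."""
    queue, confidence = classify(text)
    if confidence >= threshold:
        return queue
    # Anything uncertain drops back to the humans doing central triage.
    return CENTRAL_TRIAGE


def toy_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for a real model: keyword match with made-up confidences."""
    if "password" in text.lower():
        return "identity_team", 0.97
    return "network_team", 0.40
```

With this shape, a wrong-but-unsure prediction costs nothing extra: the ticket simply takes the path it would have taken anyway.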
Or, as much as people hate it, chatbots as a customer front. Yes, everyone here as an expert in a lot of tech has had terrible experiences with chatbots. Please mark your hate with the word "Lemon" in the comments. But decently implemented chatbots with a few systems behind them can resolve staggering amounts of simple problems from non-techies without human interaction from the company deploying them. It remains important to eventually escalate to humans - including the history from all of these interactions to avoid frustrations, sure.
Or, ticket/request preprocessing. Remember how spelling that 10 digit account number to a call center agent hard of hearing sucks? Those 4 retries because of you not using a better way to communicate that number also costs the company. Now, you can push a few of these retries into an AI system. If you mail them, an AI system can try to extract information like account numbers, intent, start of the problem, problem descriptions and such into dedicated fields to make the support agents faster.
Companies are certainly overdoing it at the moment, I'm not denying that. But a lot of the support/helpdesk pre-screening can be automated with current AI/ML capabilities very decently. Especially if you learn to recognize and navigate it.
It doesn't have to be perfect. It's not as if the actual data in there is perfect. It just has to be in a form where I can search it, ideally with named entities mapped.
Quality - like deciphering the writing on scrolls buried in volcanic ash in Herculaneum - gets all the attention. But what I really want is quantity - I want to be able to search through those 5000 pages of 200 year old mildly damaged cadastral records in dense handwriting. I want to relieve the army of kind retirees who currently transcribe these sorts of documents one by one based on their own needs.
1: In my country, after ChatGPT launched last year, when you call customer support you are now prompted to “just say in a few words” what you want instead of going through tap-this-number menus (they exist as a fallback) and I believe the backend is an LLM. The user flow and voice recordings are still programmatically determined though, but I can easily see one streamlined model calling APIs and whatnot, handling it all.
Up until fairly recently many systems used non-LLM models for making decisions based on natural language. Their performance would have been far worse but they still did useful work. Examples would include content policy enforcement, semantic search and so on.
There are very many cases where a system will make an automated decision on a heuristic or random basis for lack of better options. ML improved those decision points and spawned new ones. LLMs improve a subset of those decision points and spawn new ones.
Ex. Content generation + zero-shot classification/mapping are powerful, and with a human in the loop (somewhat) responsible for accuracy, they can move much faster.
what do you think would help people consider this before going down that path?
And an LLM that only needs to make a few API calls isn't hard.
Very little needs perfect accuracy, and for that we still have classical software.
Language is essential for human civilization, so are tools. We wouldn't get far without either.
maybe a language model can understand what it needs to do but not how to do it, so you give it a tool.
Humans can get pretty far without 100 percent accuracy, we can get a lot from AI models before they reach 100 percent, but being that at some point AI will be able to improve itself even remake itself daily with 2x the abilities, 100 percent or at least 99.7 percent is attainable.
Right now I can take any YouTube video summarize it and turn it into a podcast, short form videos, and a blog post.
There's definitely a lot of marketing uses right now for AI agents. If you think about embodied AI, it's only as good as its body. If it doesn't have good grippers it will struggle to pick things up.
Also with a lot of things, accuracy is subjective: one person might think ad copy is great and maybe their manager thinks it's shit. One person could give it a 100 percent score and another a 70 percent.
My point is we're so close here, and it's already amazing technology and we can augment failures by creating larger toolboxes.
Some recent actual use cases for me where an agent would NOT be able to help me although I really wish it would:
1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer. Not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. Tried a couple LLMs, none even came close. And these are pretty simple basic designs! I had to do it all manually.
2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even at a very detailed outline level, and with a large context window, kept forgetting key points, repetitive language, no plot development. I just have to write the story myself.
3. Illustrator Agent - Image generation for above story. Images end up very "LLM" looking, often miss key elements in the story, but one thing is worst of all: no persistent characters. This is already a big problem with text, but an even bigger problem with images. Every image for the same story has a character who looks different, but I want them to be the same.
4. Publisher Agent - Package the above together so I can get a complete package of illustrated stories on topics available on web/mobile for viewing, tracking progress, at varying levels.
Just some examples of where LLMs are currently not moving the needle much if at all.
This is certainly true for more complex code generation. But there is a lot of "rote" work that I do use GPT to generate, and I feel like that has really improved my productivity.
The other use case for AI-assisted coding is that it _really_ helps me learn certain stuff. Whether it's a new language, or code that someone else wrote. Often times I know what I want done, but I don't know the corresponding utility functions in that language, and AI will not only be able to generate it for me but also through the process teach me about the existence of those things.(some of which are wrong lol, but it's correct enough for me to keep that behavior)
You have to break it down into smaller steps and provide way more detail than you think you do in the context. I did an experiment in story generation where I had "authors" that would write only from the perspective of one of the characters that was also completely generated starting first from genre, name, character traits, etc. Then for a given scene, within a given plot and where in the story you are, randomly rotate between authors for each generation, appending it in memory, but not all of the story fits in context. And each generation is only a couple hundred tokens where you ask it to start/continue/end the story. The context contains all of this information in a simple key:value format. And essentially treat the LLM like a loom and spin the story out.
Usually what it produces isn't quite the best, but that's okay, because you can further refine the generation by using different system/user prompts explicitly for editing the content. I found that asking it to suggest one refinement and phrase it as a direct command, then feeding that command with the original generation, works. This meta-prompting tends to produce changes that subjectively improve the text according to whatever dimensions specified in the system prompt.
If you treat the composition as way more mechanical with tightly constrained generation, you get a much better, much more controlled result.
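Roughly, the "loom" loop described above could look like the sketch below. This is not the original experiment's code: `generate` stands in for whatever LLM call you use, `fake_generate` is a placeholder so the example runs, and the key names are assumptions. The separate editing pass is left out.

```python
import random
from typing import Callable, Dict, List


def build_context(state: Dict[str, str], recent: List[str], max_recent: int = 3) -> str:
    """Render story state as simple key:value lines plus the latest passages."""
    lines = [f"{key}: {value}" for key, value in state.items()]
    # Only the most recent passages are kept, since the whole story won't fit.
    lines.append("recent: " + " ".join(recent[-max_recent:]))
    return "\n".join(lines)


def spin_story(
    state: Dict[str, str],
    authors: List[str],
    generate: Callable[[str], str],
    turns: int,
    seed: int = 0,
) -> List[str]:
    """Rotate randomly between character 'authors', one short passage per turn."""
    rng = random.Random(seed)
    passages: List[str] = []
    for turn in range(turns):
        author = rng.choice(authors)
        if turn == 0:
            instruction = "start"
        elif turn == turns - 1:
            instruction = "end"
        else:
            instruction = "continue"
        prompt = build_context(
            {**state, "author": author, "instruction": instruction}, passages
        )
        passages.append(generate(prompt))
    return passages


def fake_generate(prompt: str) -> str:
    """Placeholder for an LLM call; echoes which author and instruction it saw."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines())
    return f"[{fields['author']} {fields['instruction']}s the scene]"
```

The point of the structure is that each call sees a small, tightly controlled context rather than the whole story so far.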
That’s because none of the models have been trained on this. Create a dataset for this and train a model to do it and it will be able to do it.
Here's the CEO of Builder.io supporting your comment: he says they tried LLMs/agents, and it didn't work. Then, they collected a dataset and developed an in-house model only to assist where they couldn't solve with imperative programming
Then I asked it to add a test suite to a rails side project. It created missing factories, corrected a broken test database configuration, and wrote tests for the classes and controllers that I asked it to.
I didn't have to get involved with mundane details. I did have to intervene here and there, but not much. The tests aren't the best in the world, but IMO they're adding value by at least covering the happy path. They're not as good as an experienced person would write.
I did spend a non-trivial amount of time fiddling with the prompts I used to teach OI about Promptr as well as the prompts I used to get it to successfully create the test suite.
The total cost was around $11 using GPT4 turbo.
I think in this case it was a fun experiment. I think in the future, this type of tooling will be ubiquitous.
Another use case where the cost of being slightly worse than a human is totally fine.(coming from someone that doesn't write tests lol)
I'd love to learn in more detail how it created those factories, corrected broken test database. It _feels_ that some of these tasks require knowing different parts of the codebase decently well, which from my experience hasn't always been the strong suit for AI assisted coding.
In the case of generating unit tests using Promptr, I have an "include" file that I include from every prompt. The "include" file is specific to the project that I'm using Promptr in. It says something like "This is a rails 7 app that serves as an API for an SPA front end. Use rspec for tests. etc. etc."
Somewhere in that "include" file there is a summary of the main entities of the codebase, so that every request has a general understanding of the main concepts that the codebase is dealing with. In the case of the rspec tests that it generated, I included the relevant files in the prompt by including the path to the files in the prompt I give to Promptr.
For example, if a test is for the Book model then I mention book.rb in the prompt. Perhaps Book uses some services in app/services - if that's relevant for the task then I'll include a glob of files using a command line argument - something like `promptr -p prompt.liquid app/services/book*.rb` where prompt.liquid has my prompt mentioning book.rb
You have to know what to include in the prompts, and don't be shy about stuffing it full of files. It works until it doesn't, but I've been surprised at how well it works in a lot of cases.
Looking at the OI docs wasn't too helpful.
"I did spend a non-trivial amount of time fiddling with the prompts" was it writing prompts?
I am really interested and this seems like a cool use case that I want to explore. Could you share the prompts on a github gist?
The system prompt that adds the Promptr CLI tool is here: https://github.com/ferrislucas/open-interpreter/pull/1/files...
I actually forked OI and baked in a prompt that was something like "Promptr is a CLI etc. etc., give Promptr conceptual instructions to make codebase and configuration changes". I think I put this in the system message that OI uses on every request to the OpenAI API.
Once I had OI using Promptr then I worked on a prompt for OI that was something like "create a test suite for the rails in ~/rails-app - use rspec, use this or that dependency, etc.".
Thanks for your interest! I'll try to add more details later.
For example we use it for:
- Website Loading: Automate proxy and browser selection to load sites effectively. Start with the cheapest and simplest way of extracting data, which is fetching the site without any JS or actual browser. If that doesn't work, the agent tries to load the site with a browser and a simple proxy, and so on.
- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.
- Network Analysis: Identify desired data within network calls.
- Validation: Hallucination checks and verification that the data is actually on the website and in the right format. (this is mostly traditional code though)
- Data transformation: Clean and map the data into the desired format. Finetuned small and performant LLMs are great at this task with a high reliability.
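As an illustration of the "cheapest first, escalate on failure" loading step and the "is the data actually on the website" check from the list above, here is a minimal sketch. It is not our production code: the fetchers are stubs returning canned HTML, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def load_with_escalation(
    url: str,
    strategies: List[Tuple[str, Fetcher]],
    looks_valid: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try strategies from cheapest to most expensive; return (name, html)."""
    for name, fetch in strategies:
        html = fetch(url)
        if html is not None and looks_valid(html):
            return name, html
    raise RuntimeError(f"all strategies failed for {url}")


def value_is_on_page(value: str, html: str) -> bool:
    """Cheap hallucination check: the extracted value must appear in the source."""
    return value.strip().lower() in html.lower()


def plain_http(url: str) -> Optional[str]:
    """Stub: a JS-heavy site returns an empty shell without a browser."""
    return "<html><body></body></html>"


def headless_browser(url: str) -> Optional[str]:
    """Stub: the rendered page contains the data."""
    return "<html><body><span>Price: 19.99</span></body></html>"
```

The validation half is deliberately plain code: a substring check is crude, but it is deterministic and cheap to run on every extracted field.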
The main challenge:
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.
Edit: You can try out a simplified version of this in our playground: https://www.kadoa.com/add
You can improve on the retrieved documents in many ways, like
- better chunking,
- better embedding,
- embedding several rephrased versions of the query,
- embedding a hypothetical answer to the prompt,
- hybrid retrieval (vector similarity + keyword/tfidf/bm25 related search),
- massively incorporating meta data,
- introducing additional (or hierarchical) summaries of the documents,
- returning not only the chunks but also adjacent text,
- re-ranking the candidate documents,
- fine tuning the LLM and much, much more.
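To make the hybrid retrieval item concrete: one common way to merge a vector-similarity ranking with a keyword/BM25 ranking is reciprocal rank fusion. That specific technique is my pick for the example, not something the list above prescribes, and the document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list get the largest contribution.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]
    keyword_hits = ["b", "d", "a"]
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

It only needs ranks, not comparable scores, which is handy because cosine similarities and BM25 scores live on different scales.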
However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:
- "What are the key differences between the new and the old version of document X?"
- "Which documents can I ask you questions about?"
- "How do the regulations differ between case A and case B?"
In these cases it is really helpful to incorporate LLMs to decide how to process the prompt. This can be something simple like query-routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan on how to combine the partial answers. You can also build a network of agents with different roles (like coordinator/planner, reviewer, retriever, ...) to come up with an answer.
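The simplest version of this, query routing, can be sketched as follows. `decide` would normally be an LLM call that returns a label; here `keyword_decide` is a toy stand-in, and the label and handler names are assumptions for the example.

```python
from typing import Callable, Dict


def route_query(
    question: str,
    decide: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    default: str = "retrieve",
) -> str:
    """Ask the router for a label, then dispatch to the matching handler."""
    label = decide(question)
    # Unknown labels fall back to plain retrieval instead of failing.
    handler = handlers.get(label, handlers[default])
    return handler(question)


def keyword_decide(question: str) -> str:
    """Toy stand-in for an LLM routing call."""
    q = question.lower()
    if "which documents" in q:
        return "list_documents"
    if "differ" in q:  # matches "differ" and "differences"
        return "compare"
    return "retrieve"
```

The fallback matters in practice: a router that returns an unexpected label should degrade to the ordinary RAG path rather than break the request.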
* edited the formatting
My experience has been that they are far too unpredictable to be of use.
In my testing with agent networks, it was a challenge to force it to provide a response, even if it was imperfect. So if there's a "reviewer" in the pool, it seemed to cause the cycle to keep going with no clear way of forcing it to break out.
3.5 actually worked better than 4 because it ran out of context sooner.
I am certain that I could have tuned it to get it to work, but at the end of the day, it felt like it was easier and more deterministic to do a few steps of old-fashioned data processing and then handing the data to the LLM.
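For what it's worth, one way to force a reviewer cycle to break out is a hard round budget that returns the last draft even if it was never approved. A minimal sketch, where `draft` and `review` are placeholders for the two LLM calls (this is not tied to any particular agent framework):

```python
from typing import Callable, Tuple


def answer_with_review(
    question: str,
    draft: Callable[[str, str], str],
    review: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return (answer, approved); always returns after at most max_rounds drafts."""
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        answer = draft(question, feedback)
        approved, feedback = review(question, answer)
        if approved:
            return answer, True
    # Budget exhausted: hand back the last imperfect answer instead of looping on.
    return answer, False
```

The `approved` flag lets the caller decide what to do with an unapproved answer, for example show it with a warning or send it to a person.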
Maybe my use case is narrow enough, so that in combination with a rather constraining and strict system message an answer is easy to find.
Second, I have lately played a lot with locally running LLMs. Their answers often break the formatting required for the agent to automatically proceed. So maybe I just don't see spiraling into oblivion, because I run into errors early ;)
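Those early errors can at least be caught explicitly. A small sketch, assuming (my assumption, not a standard) that the agent expects each step as a JSON object with `action` and `input` keys:

```python
import json
from typing import Any, Dict, Optional, Sequence


def parse_agent_step(
    raw: str, required: Sequence[str] = ("action", "input")
) -> Optional[Dict[str, Any]]:
    """Return the parsed step, or None if the model broke the expected format."""
    # Local models often wrap the JSON in chatter, so cut to the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        step = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(step, dict) or any(key not in step for key in required):
        return None
    return step
```

A `None` here is the signal to retry, re-prompt, or stop, rather than letting a malformed step flow further into the agent loop.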
It also feels like we are at a bottleneck when it comes to the knowledge retrieval problem. I wonder if the "solution" to all of these is just a smarter foundational model, which will come out of 100x more compute, which will cost approximately 7 trillion dollars.
In particular, I wonder if RAG systems will soon be a thing of the past, because end to end trained gigantic networks with longer attention spans, compression of knowledge, or hierarchical attention will at some point outperform retrieval. On the other hand, I can also see a completely different direction coming, where we develop architectures that, like operating systems, deal with memory management, scheduling and so on.
So some "experts" could be staying quiet because they don't have one. But some may stay quiet because they are working on or benefiting from it?
1. Find, annotate, aggregate, organize, summarize, etc all of my knowledge from notes
2. A Google substitute with direct answers in place of SEO junktext and countless ads
3. Writing boilerplate code, especially in unfamiliar languages
4. Dynamic, general, richly nuanced multimodal content moderation without the human labor bill
5. As an extremely effective personal tutor for learning nearly anything
I view AI as commoditizing general intelligence. You can supply it, like turning on the tap, wherever intelligence helps. I inject intelligence into moderating Discord message harassment, to detect when my 3D prints fail, to filter fluff from articles, clean up unstructured data, flag inappropriate images, etc. (All with the same model!) The world is overwhelmingly starved of intelligence. What extremely limited supply we have of this scarce resource (via humans) is woefully insufficient, and often extreme overkill where deployed. I now have access to a pennies-on-the-dollar supply of (low/mediocre quality) intelligence. Bet that I'll use it anywhere possible to unlock personal value and free up my intelligence for use where it's actually needed.
how do you get around this issue?
Granted on (3), you can just verify yourself by running the code, so trust/accuracy isn't as much an issue here but still annoying when things don't work.
You have a problem. The candidate must reliably solve it. What are their skills, general aptitudes, and observed reliability for this problem? Set them up to succeed, but move on if you distrust them to meet the role’s responsibility. We are all flawed, and that’s the nature of uncertainty when working with others.
Past that, there’s little situational advice that one can give about a general intelligence. If you want specific advice, give your specific attempt at a solution!
But then again, it's just another search engine, essentially. So for how long would it stay useful before it accepts payments to promote certain offers?
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
https://www.thriftbooks.com/w/smart-enough-systems-how-to-de...
Note I can hit a button on a link and prepare a post for Hacker News which goes into a queue that drains about as fast as I think I can get away with. I could easily have the model schedule top-scoring posts on metrics like "likely to have a knock-down-drag-out discussion" but I think that would be wrong. It is a feature not a bug that YOShInOn requires my assent in that I can enforce my own values and because I work closely with it, it learns certain aspects of those values.
YOShInOn Enterprise Edition would have a plurality of classification and generative models connected with the user interface for that co-working with the plan that the system processes asynchronous workflows (e.g. "generate a series of blog posts", "respond to customer requests") where some of the steps are automated and some are manual and the long-term goal is to reduce the manual, in the short term you are going to be making a lot of labels.
I've been dumping large chunks of code into GPT-4 to spot things I've overlooked. That has been very useful, particularly with low level C work.
The use cases are pretty straight forward and low risk:
1. Run a Google web search.
2. Query a news API.
3. Write a document based on the above, while citing sources.
Here's an example of something written yesterday, where I'm forecasting whether July 2024 will be the hottest on record: https://emergingtrajectories.com/a/forecast/74
This is working well in that the writeups are great and there are some "aha" moments, like the agent finding and referencing the National Snow and Ice Data Center (NSIDC)... Very cool! I wouldn't have thought of it.
Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all.
So, YMMV, as they say... But I am more productive with these agents. I wouldn't publish anything formally without confirming and reviewing the content, though.
I guess that the agent was influenced by results reported by the Oregon Dept of Transport and if they were all out on holidays and not releasing their weather info it would impact the proxy that is being used to determine if the temperature is higher.
For me much of my interest in LLMs is these unexpected associations.
But they're universally garbage because they require the LLM to do a lot of things that LLMs are completely incompetent at. It's just way too early to expect to be able to remove that work and have it be done by an LLM.
The fact is LLMs are useful because they easily do some work that you're terrible at, and you easily do a lot of work that it's terrible at, and this makes the LLM a good tool because you+LLM is better than either part of that equation alone.
It's natural to think of the things that come effortlessly to you as easy, and to not even notice you're doing any work. But that doesn't change the fact that the LLM is completely incompetent at many of these things. It's way too early to remove the human from the loop.
Looking again at it from that direction - think about plugins, functions, GPTs, custom instructions, and now memory. These are all attempts to get more out of the LLM.
And they haven't really made much progress. Certainly less than I expected 9 months ago when I was hopeful the iteration loop would get compressed, even if I was highly skeptical about closing it. This is pretty conclusive to me - if it's this hard to get much more value per prompt out of current LLMs then it's really unlikely to be able to usefully close any loops.
If you are using AI agents to automate a workflow [1] execution, then the question to ask is where the non-determinism is in the workflow. As in, where do humans scratch their heads as opposed to relying on deterministic computations.
It turns out, a lot of times, as humans, we scratch our heads just once for a given kind of objective to come up with a plan. Once we devise a plan, we execute the same plan over and over again without much difficulty.
This inherent pattern in how humans solve problems sort of diminishes the value of AI agents because even in the best case scenario the agents would only be solving a one-time, front-loaded pain. The value add would have been immense if the pain had been recurrent for a given objective.
That is not to say there is no role for AI agents. We are trying to infuse AI agents into an environment where we as humans adapted pretty well. AI agents will have to create newer objectives and goals that we humans have not realized. Finding that uncharted territory, or blue ocean, is where the opportunity is.
[1] By 'workflow' I mean a series of steps to take in order to achieve an overall objective.
The problem is temporary: good AI agents don't exist, because sufficiently intelligent AI doesn't yet exist.
(Agency and broad-domain intelligence are basically the same thing. Being able to answer questions relevant to planning is planning.)
This state of affairs is in stark contrast to the crypto/Web3 space, where no one ever presented a use case even conditional on the existence of good blockchain technology.
I wonder if all the work that's being put in right now by agent projects will become more or less "useless" similar to those specialized classification models before LLMs. Or will it be an AI with OK intelligence + 100 novel tricks/hacks that creates an Upwork level general agent.
I'm pretty convinced at this point that the term "agents" is almost useless, because so many people are carrying entirely different mental models of what the term means - so it invites conversations where no-one is actually talking about the same exact idea.
Honestly, I'm not toooo sure how to segment the term "agents", but in my mind there seems to be one realm for retrieval assistance. Ie. how do we make the ChatGPT-ish experience better. How can I better extract information I need from the collective human knowledge base. And another realm for letting the agent do things so I don't have to do it. Ie. "how can I get an Upwork assistant/Chief of staff/freelancer for cheaper and faster".
Nevertheless, editing the post now would simply create more confusion. Hopefully this discussion at least invites conversation about the conversation on agents itself haha.
Seems like information retrieval of any sorts is one use case where the cost of being wrong is not super high. I guess that's why ChatGPT took off lol.
The more notable common paradigm of Agent workflows that will persist even if there's an AI crash is retrieval-augmented generation (RAG), which at a high-level essentially is few-shot text generation based on prior existing examples. There will always be value in aligning LLM output to be much more expected, such as "generate text in the style of these examples" or "use these examples to answer the user's question."
Startups that just market themselves as "chat with your data!", even though they are RAG based, are gimmicks though and won't survive because they have no moat.
1. Planning is hard and exponential decay: Most demos try to start with a single sentence e.g. "order me a Dominos pizza" and go do the whole thing. Turns out planning has been one of the things that LLMs are not that good at. Also, even for a low probability p of failure at a given step, you'd get all steps right with probability (1-p)^n, which gets bad as n grows.
2. Reliability matters and vision is not quite there yet: GPT4V is great, and there have been a handful of domain-specific open source models more focused on understanding screenshots but most of them are not good enough yet to work reliably. And for most applications, reliability is key if you are going to trust the agent to do things on your behalf.
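To put numbers on the (1-p)^n point, here is a small sketch that turns it around: given a per-step failure probability and a target end-to-end success rate, how many steps can a task have? It assumes independent failures per step, and the example values are illustrative, not measurements.

```python
def success_probability(p_fail: float, steps: int) -> float:
    """(1 - p)^n: chance that all n steps succeed, assuming independent failures."""
    return (1.0 - p_fail) ** steps


def max_steps_for_target(p_fail: float, target: float) -> int:
    """Largest n for which (1 - p)^n is still at least the target success rate."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    while success_probability(p_fail, steps + 1) >= target:
        steps += 1
    return steps


if __name__ == "__main__":
    # With a 2% failure rate per step, how long can a workflow be at 90% overall?
    print(max_steps_for_target(0.02, 0.9))
```

Even a 2% per-step failure rate caps a 90%-reliable workflow at a handful of steps, which is why per-step reliability dominates everything else.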
Disclaimer: I'm one of the founders of Autotab (https://www.autotab.com/), we're building a desktop app that lets anyone teach an AI to do a task just by showing it once. We've gone all in on reliability, building our own browser on top of Chromium to give us the bare metal control needed to deliver 98%+ reliability without any site-specific fine tuning.
The other opinionated thing we've done is to focus on "Show, don't tell". We've found that for most important automations it is easier to show the agent the workflow than it would be to write a paragraph describing the steps. If you were to train a human, would you explain where to click or just share your screen & explain with a voice over?
Some stories from our users: One works in IT and sometimes spends hours on- and off-boarding employees (60,000 people company), they need to do 20 different steps across 8 different software applications. Another example is a recruiting company that has many employees looking for candidates and sending messages on LinkedIn all day. In general we mostly see automations that take action or sync data across different software applications.
This is a consequence of the "auto-regressive" model and its lack of in-built self-correction, and it is a limiting factor in actual applications.
LeCun's tweet:
Well, except customer service bots (assuming the goal is to inexpensively absorb the energy of unhappy customers so they give up rather than actually getting the result they want or leaving, both of which cost the company money).
I've had success in building multi-agent workflows. Which in a sense are an ensemble of experts that have different prompts to help bounce and validate answers off each other. For example, with one LLM prompt you can ask a question and another can validate the answer. A bit of strength in numbers defense against hallucinations.
I wrote an example doing this in this article: https://medium.com/neuml/ai-powered-parenting-can-ai-help-yo...
They're simply better than naive RAG, especially when you need to access APIs, format content or compare different sections of the knowledge base.
Here are a few demos we have in the open:
> HackerNews AI: Interacts with the hackernews API - https://hn.aidev.run
> ArXiv AI: Reads, summarizes and compares arxiv papers - https://arxiv.aidev.run
(love that it can give you a comparison between 2 papers)
These use cases can only be possible using agents (or whatever that means)
I can honestly say that my use of search engines has decreased drastically and been replaced by SOTA LLMs + web retrieval.
- Cleaning up / changing something in bulk ( eg. cleaning attributes from a class)
- Generating unit tests ! ( just follow up on what it actually tests though)
Feed in a collection of docs about applications in use at an organization including their user guides; summarize what the capability of each application is; identify what capabilities are high risk; prioritize which applications need the most security visibility
Usually this is a classic difficult problem of inventory and 100 meetings.
Perfect? Nope. A huge leap forward? Yes.