Ask HN: What are some actual use cases of AI Agents right now? | Better HN

149 comments

113 comments · 39 top-level

PheonixPharts2y ago· 28 in thread

> I'd love to know what are some of the hidden challenges to making a useful product with agents?

One thing that is still confusing to me, is that we've been building products with machine learning pretty heavily for a decade now and somehow abandoned all that we have learned about the process now that we're building "AI".

The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable.

You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect. This means a product worthy digit classifier needs to be much higher accuracy.

Go look at some of the LLM benchmarks out there: even in these happy cases it's rare to see any LLM getting above 90%. Then consider that you want to chain these calls together to create proper agent-based workflows. Even with 90% accuracy in each task, chain 3 of these together and you're down to 0.9 x 0.9 x 0.9 = 0.73, 73% accuracy.

This is by far the biggest obstacle towards seeing more useful products built with agents. There are cases where lower accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.
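The compounding arithmetic in this comment can be checked in a few lines. This is a minimal sketch that assumes every digit or step succeeds independently with the same probability; the helper name `chain_success` is made up for illustration. Note that the "70%" quoted for the 12-digit case is a rounded figure.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps

# 12-digit account number read by a 90%-accurate digit classifier
p_all_digits = chain_success(0.9, 12)
print(f"all 12 digits correct:    {p_all_digits:.2f}")      # 0.28
print(f"at least one digit wrong: {1 - p_all_digits:.2f}")  # 0.72

# three chained agent steps, each 90% accurate
print(f"3-step chain succeeds:    {chain_success(0.9, 3):.2f}")  # 0.73
```

The same function shows how quickly things degrade as chains get longer, which is the reason per-step accuracy has to be so much higher than the end-to-end target.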

spenczar52y ago

> The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable.

I think that ChatGPT's success might be partly attributable to its chat interface. For whatever reason, a lot of people - including me! - are much more forgiving of inconsistencies, slip-ups, and inaccuracies when in a conversational format. Kind of like how you might forgive a real human for making a mistake in conversation.

I don't think that's necessarily good, and might not have much connection to attempts to build new non-conversational products on top of LLMs, but maybe it has some explanatory power for the current situation.

dougb52y ago

I don't know if I'm more forgiving of inaccuracies in a conversational interface, but I'm way less likely to notice them in the first place. Especially since the current crop of RLHF'd models are so eager to please that they say nearly everything with high confidence.

red-iron-pine2y ago

> I think that ChatGPT's success might be partly attributable to its chat interface. For whatever reason, a lot of people - including me! - are much more forgiving of inconsistencies, slip-ups, and inaccuracies when in a conversational format. Kind of like how you might forgive a real human for making a mistake in conversation.

The key term here is "conversation". If I query something from the machine and it disappears and rumbles and then prints off something like a 1980s mainframe, with paper that has those holes on the side that you tear off... and then it's wrong, it's wasted time.

Meanwhile with the conversation I'm watching it in real time, and can stop it, refine it, or ask for clarification immediately and effectively. There is an expectation of give and take and "talking through" things to get to an answer, which I find is effective. I don't need it to be 100% right all the time, just 80% and then start parsing answers out of it to refine it to 90% accuracy with high confidence.

muzani2y ago

Personally having been a big fan of GPT-3, I was quite against ChatGPT because of this.

Completion models are obviously wrong very often. Instruct model was kinda ok, but you know it's a dumb machine.

Chat was a bit of an uncanny valley. I treated the instruct model like a child, but chat felt like having a conversation with someone of 80 IQ. It felt frustrating, and you ended up going "no no no, what I meant WAS ..." It felt like dealing with an incompetent colleague.

But I guess there's lots of views on it. Some expected it to be an oracle, even a god. Some treated it like Stack Overflow, then got frustrated that it was giving poor quality answers to poor quality questions. Some were just abusive to it. I suppose it's a mirror in a sense.

emodendroket2y ago

Though I wonder how much of that is just that the format doesn’t encourage you to look closely enough at what you’re getting to see if it is right.

startupsfail2y ago

There are several reasons to forget:

  - copilots are useful
  - chat is entertaining and useful
  - future tech is coming
  - investment money

rozap2y ago

This has been a perfect description of my experience doing this. I had written some code to go through reasonably complex web onboarding flows and it basically played out exactly like you predicted in your comment. In addition, I've been working with some vendors that have been trying to do the same thing and they're finding that it works out just like you describe.

The handwritten automations have performed better and the issues are reproducible, so even when there are issues, there's some sense of forward progress as you fix them. With handing it all over to an agent, it really feels like running around in circles.

I think there's probably something here, but it's less trivial than just tossing a webpage at chatGPT and hoping for the best.

ianbicking2y ago

One interesting thing about LLMs is that they can actually recover (and without error loops). You can have a step that doesn't work right, and a later step can use its common-sense knowledge to ignore some of the missing results, conflicting information, etc. One of the problems with developing with LLMs is that the machine will often cover up bugs! You think it's giving sub-par results but actually you've given it conflicting or incomplete instructions.

Another opportunity is that you can have fewer steps or more shared context. One interesting thing about Whisper is that it's not just straight speech recognition but can also be prompted and given context to understand what sort of thing the speech may be about, increasing its accuracy considerably. LLM vision models also do this with things like OCR. This might not help it with the individual digits in an account number, but it does help with distinguishing an account number from a street address on a check.

Or to take another old-style ML technique, you probably shouldn't be doing sentiment analysis in some pipeline, because you don't need to: instead you should step back and look at the purpose of the sentiment analysis and see if you can connect that purpose directly with the original text.

All that said, you definitely can write pipelines with compounding errors. We haven't collectively learned how to factor problems and engineer these systems with LLMs yet. Among the things I think we have to do is connect the tools more directly with user intention (effectively flattening another error-inducing part of the pipeline), and make the pipelines collaborative with users. This is more complex and distinctly not autonomous, but then hopefully you are addressing a broader problem or doing so in a more complete way.

> You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect.

You are assuming that the probability of failure is independent, which couldn't be further from the truth. If a digit recogniser can recognise one of your "hard" handwritten digits, such as a 4 or a 9, it will likely be able to recognise all of them.

The same happens with AI agents. They are not good at some tasks, but really really food at others.
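To put rough numbers on the independence objection, here is a small sketch. The two-group split of writers and both accuracy figures are invented purely so that the average per-digit accuracy still comes out at 90%; nothing here is measured data.

```python
# Hypothetical mixture: 80% of writers are "easy" (99% per-digit accuracy),
# 20% are "hard" (54% per-digit accuracy). Invented numbers, chosen so the
# average per-digit accuracy is still 90%.
writers = [(0.8, 0.99), (0.2, 0.54)]  # (share of writers, per-digit accuracy)
digits = 12

avg_digit_accuracy = sum(share * acc for share, acc in writers)

# Independent-errors estimate: every digit is a fresh 90% coin flip.
independent = avg_digit_accuracy ** digits

# Correlated estimate: errors cluster on the hard writers, so compute the
# all-digits-correct probability per group, then average over the groups.
correlated = sum(share * acc ** digits for share, acc in writers)

print(f"average per-digit accuracy:         {avg_digit_accuracy:.2f}")
print(f"all 12 correct, independent errors: {independent:.2f}")
print(f"all 12 correct, correlated errors:  {correlated:.2f}")
```

Under these made-up assumptions the same 90% per-digit figure gives a very different end-to-end success rate, because most of the mistakes land on the same few inputs.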

allanwind2y ago

The "food" typo is just too good to ignore in this context.

And the US Post Office and other postal services have been using this tech to sort letters for several decades now (although postal codes with both letters and numbers like Canada's are harder). It was viewed as the "killer app" for ML in the 1990s.

krallistic2y ago

> The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable. You can do handwritten digit recognition with 90% accuracy?

It's way more nuanced than this. Of course, you need a decent "accuracy" (not necessarily the metric), but in many business cases, you don't need high accuracy. What you need is a solid process: you can catch errors later, you can cross-reference, you need fail-safes, you need post-mortem error handling, etc...

I shipped stuff (classical ML) that was nothing more than "a biased coin flip," but that still generates value ($) due to the process around it.

chenxi9649OP2y ago

Yea that's a good point.

Now I am curious, what are some tasks that can accept a model that is 80% as good as a human, but is 100x cheaper (or 100x faster)?

tetha2y ago

Similar to the sibling comment, helpdesk ticket routing.

The volume of helpdesk tickets large enterprises deal with is very easily and vastly underestimated. If you can even route 30% away from the central triage with 90+% accuracy and drop everything else back to the central triage... you suddenly save 2 FTEs in that spot in some places. And you increase customer satisfaction for most of those tickets because they get resolved faster.

Or, as much as people hate it, chatbots as a customer front. Yes, everyone here as an expert in a lot of tech has had terrible experiences with chatbots. Please mark your hate with the word "Lemon" in the comments. But decently implemented chatbots with a few systems behind them can resolve staggering amounts of simple problems from non-techies without human interaction from the company deploying them. It remains important to eventually escalate to humans - including the history from all of these interactions to avoid frustrations, sure.

Or, ticket/request preprocessing. Remember how much it sucks to spell out that 10-digit account number to a hard-of-hearing call center agent? Those 4 retries, caused by not having a better way to communicate that number, also cost the company. Now, you can push a few of these retries into an AI system. If you mail them, an AI system can try to extract information like account numbers, intent, start of the problem, problem descriptions and such into dedicated fields to make the support agents faster.

Companies are certainly overdoing it at the moment, I'm not denying that. But a lot of the support/helpdesk pre-screening can be automated with current AI/ML capabilities very decently. Especially if you learn to recognize and navigate it.
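The "route the confident ones, drop everything else back to central triage" idea can be sketched roughly like this. The queue names, the 0.9 threshold and the keyword-based `toy_classifier` are all placeholders; a real setup would use a trained model and would need to check on held-out tickets that the chosen threshold actually corresponds to 90+% routing accuracy.

```python
from typing import Callable

CENTRAL_TRIAGE = "central-triage"

def route_ticket(
    text: str,
    classify: Callable[[str], tuple[str, float]],
    threshold: float = 0.9,
) -> str:
    """Route to the predicted queue only when the classifier is confident;
    everything else falls back to human triage."""
    queue, confidence = classify(text)
    return queue if confidence >= threshold else CENTRAL_TRIAGE

def toy_classifier(text: str) -> tuple[str, float]:
    """Stand-in for a real model: keyword match with made-up confidences."""
    lowered = text.lower()
    if "password" in lowered:
        return "identity-team", 0.97
    if "invoice" in lowered:
        return "billing-team", 0.93
    return "unknown", 0.30

tickets = [
    "I forgot my password and cannot log in",
    "Invoice 2024-03 shows the wrong amount",
    "Something is weird with my laptop",
]
for ticket in tickets:
    print(route_ticket(ticket, toy_classifier), "<-", ticket)
```

The point of the design is that the automated path only ever handles the easy share; a wrong guess below the threshold costs nothing because a human sees the ticket anyway.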

vintermann2y ago

Well, an old one is OCR, especially handwritten OCR. I'm doing genealogy. There is SO MUCH old handwritten material that is never transcribed, and which requires special expertise to read (old and exotic handwriting styles) and interpret (place names, writing conventions, abbreviations).

It doesn't have to be perfect. It's not as if the actual data in there is perfect. It just has to be in a form where I can search it, ideally with named entities mapped.

Quality - like deciphering the writing on scrolls buried in volcanic ash in Herculaneum - gets all the attention. But what I really want is quantity - I want to be able to search through those 5000 pages of 200 year old mildly damaged cadastral records in dense handwriting. I want to relieve the army of kind retirees who currently transcribe these sorts of documents one by one based on their own needs.

Aerbil3132y ago

A ton of tasks. Call centers to start with (they already do[1]), with human fallback.

1: In my country, after ChatGPT launched last year, when you call customer support you are now prompted to “just say in a few words” what you want instead of going through tap-this-number menus (they exist as a fallback) and I believe the backend is an LLM. The user flow and voice recordings are still programmatically determined though, but I can easily see one streamlined model calling APIs and whatnot, handling it all.

geoduck142y ago

Scan a menu, look for the different entrees, identify the most probable ingredients, determine health content. Then: allow people to search for food based allergies, food aversions, calories. Generate pictures of what the food might look like, display the pics next to the food to make it more likely a user will buy that food.

I'm not sure this argument is in any way specific to LLMs, and the space for their application is still enormous. Search results, ad targeting, recommendation systems, anomaly detection, content flagging, and so on, are all systems using machine learning with a high false positive rate.

Up until fairly recently many systems used non-LLM models for making decisions based on natural language. Their performance would have been far worse but they still did useful work. Examples would include content policy enforcement, semantic search and so on.

There are very many cases where a system will make an automated decision on a heuristic or random basis for lack of better options. ML improved those decision points and spawned new ones. LLMs improve a subset of those decision points and spawn new ones.

The last widely used AI tool was facial recognition, a technology widely used in fields such as company clock-ins, access control, surveillance, and more, and it is so trusted that facial recognition is often the sole method for clocking in. These facial recognition systems can maintain an extremely high accuracy rate for every entry and exit of thousands of people in a database every day. Now when will LLMs achieve such accuracy?

They have… they write language and they are good at it. The problem is language is not reality, proper language does not mean truth or fact. The models predict what is most likely to come next, not what is most likely to reflect reality.

jptoor2y ago

You're 100% right - but I do think there are more lower accuracy cases than I initially expected, *especially* if you assume a human-in-the-loop. Still 10x better than status quo.

Ex. Content generation + zero-shot classification/mapping are powerful, and with a human in the loop (somewhat) responsible for accuracy, they can move much faster.

greenie_beans2y ago

> There are cases where lower accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.

what do you think would help people consider this before going down that path?

Qwero2y ago

I already use a few AI tools even without perfect accuracy.

And an LLM that only needs to make a few API calls isn't hard.

Very few things need perfect accuracy, and for those we still have classical software.

skywhopper2y ago

You use them successfully because your human mind can filter out the junk. It would only take one inaccurate API call that charges your credit card $10k or sells your car for 10 cents to cause a lot of damage to your life.

chenxi9649OP2y ago

curious to know, which tools do you use and how do you use em?

The first thing any ML practitioner realizes is that accuracy is about the single worst performance metric you can use for most real-world tasks, lol

altdataseller2y ago

Can you explain this? Why is that?
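One common reading of that claim (an assumption about what the commenter meant) is class imbalance: when the interesting class is rare, a model that never predicts it still scores very high accuracy. A minimal sketch with a made-up fraud dataset:

```python
# Made-up dataset: 1000 transactions, 10 of them fraudulent (label 1).
labels = [1] * 10 + [0] * 990

# A "model" that never flags anything.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.2%}")  # 99.00%
print(f"recall:   {recall:.2%}")    # 0.00%
```

Metrics such as precision and recall, or a cost-weighted score, say more about whether the model is useful for the task than the headline accuracy does.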

gremlinsinc2y ago

> You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect. This means a product worthy digit classifier needs to be much higher accuracy.

Language is essential for human civilization, so are tools. We wouldn't get far without either.

maybe a language model can understand what it needs to do but not how to do it, so you give it a tool.

Humans can get pretty far without 100 percent accuracy, and we can get a lot from AI models before they reach 100 percent. But given that at some point AI will be able to improve itself, even remake itself daily with 2x the abilities, 100 percent or at least 99.7 percent is attainable.

Right now I can take any YouTube video, summarize it, and turn it into a podcast, short-form videos, and a blog post.

There are definitely a lot of marketing uses right now for AI agents. If you think about embodied AI, it's only as good as its body: if it doesn't have good grippers, it will struggle to pick things up.

Also, with a lot of things accuracy is subjective: one person might think ad copy is great and maybe their manager thinks it's shit. One person could give it a 100 percent score and another a 70 percent.

My point is we're so close here, and it's already amazing technology and we can augment failures by creating larger toolboxes.

alexawarrior32y ago· 9 in thread

None of these I've seen actually works in practice. Having used LLMs for software development the past year or so, even the latest GPT-4/Gemini doesn't produce anything I can drop in and have it work. I've got to go back and forth with the LLM to get anything useful and even then have to substantially modify it. I really hope there are some big advancements soon and this doesn't just collapse into another AI winter, but I can easily see this happening.

Some recent actual use cases for me where an agent would NOT be able to help me although I really wish it would:

1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer. Not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. Tried a couple LLMs, none even came close. And these are pretty simple basic designs! I had to do it all manually.

2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even at a very detailed outline level, and with a large context window, kept forgetting key points, repetitive language, no plot development. I just have to write the story myself.

3. Illustrator Agent - Image generation for above story. Images end up very "LLM" looking, often miss key elements in the story, but one thing is worst of all: no persistent characters. This is already a big problem with text, but an even bigger problem with images. Every image for the same story has a character who looks different, but I want them to be the same.

4. Publisher Agent - Package the things above together so I can get a complete package of illustrated stories on topics, available on web/mobile for viewing and tracking progress, at varying levels.

Just some examples of where LLMs are currently not moving the needle much if at all.

chenxi9649OP2y ago

>even the latest GPT-4/Gemini doesn't produce anything I can drop in and have it work

This is certainly true for more complex code generation. But there is a lot of "rote" work that I do use GPT to generate, and I feel like that has really improved my productivity.

The other use case for AI-assisted coding is that it _really_ helps me learn certain stuff, whether it's a new language or code that someone else wrote. Oftentimes I know what I want done, but I don't know the corresponding utility functions in that language, and AI will not only be able to generate it for me but also, through the process, teach me about the existence of those things. (Some of which are wrong lol, but it's correct enough for me to keep that behavior.)

okwhateverdude2y ago

> 2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even at a very detailed outline level, and with a large context window, kept forgetting key points, repetitive language, no plot development. I just have to write the story myself.

You have to break it down into smaller steps and provide way more detail than you think you do in the context. I did an experiment in story generation where I had "authors" that would write only from the perspective of one of the characters that was also completely generated starting first from genre, name, character traits, etc. Then for a given scene, within a given plot and where in the story you are, randomly rotate between authors for each generation, appending it in memory, but not all of the story fits in context. And each generation is only a couple hundred tokens where you ask it to start/continue/end the story. The context contains all of this information in a simple key:value format. And essentially treat the LLM like a loom and spin the story out.

Usually what it produces isn't quite the best, but that's okay, because you can further refine the generation by using different system/user prompts explicitly for editing the content. I found that asking it to suggest one refinement and phrase it as a direct command, then feeding that command with the original generation, works. This meta-prompting tends to produce changes that subjectively improve the text according to whatever dimensions specified in the system prompt.

If you treat the composition as way more mechanical with tightly constrained generation, you get a much better, much more controlled result.
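The "loom" loop described in this comment could look roughly like the sketch below. Everything named here is an assumption: `generate` stands for whatever model call you use, the key:value layout is only one way to write the context, and trimming the story to its last `memory_chars` characters is a crude stand-in for proper memory handling.

```python
import random
from typing import Callable

def spin_story(
    generate: Callable[[str], str],
    authors: list[str],
    plot: str,
    scene: str,
    steps: int = 4,
    memory_chars: int = 600,
    seed: int = 0,
) -> str:
    """Build a scene in small chunks, rotating between character 'authors'."""
    rng = random.Random(seed)
    story = ""
    for i in range(steps):
        phase = "start" if i == 0 else "end" if i == steps - 1 else "continue"
        author = rng.choice(authors)
        # Simple key:value context; only the tail of the story is kept,
        # since the whole story may not fit in the context window.
        context = (
            f"plot: {plot}\nscene: {scene}\nauthor: {author}\n"
            f"task: {phase} the story in a couple hundred tokens\n"
            f"story_so_far: {story[-memory_chars:]}"
        )
        draft = generate(context)
        # Editing pass: ask for ONE refinement phrased as a command,
        # then feed that command back together with the draft.
        command = generate(
            f"task: suggest one refinement as a direct command\ntext: {draft}"
        )
        story += generate(f"task: {command}\ntext: {draft}") + "\n"
    return story

def fake_llm(prompt: str) -> str:
    """Placeholder so the sketch runs; swap in a real model call."""
    return f"[{len(prompt)} chars of prompt answered]"

print(spin_story(fake_llm, ["Ada", "Bram"], plot="heist", scene="the vault"))
```

Each step costs three model calls (draft, suggestion, revision), which is the price of the tighter control.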

> 1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer. Not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. Tried a couple LLMs, none even came close. And these are pretty simple basic designs! I had to do it all manually.

That’s because none of the models have been trained on this. Create a dataset for this and train a model to do it and it will be able to do it.

carlossouza2y ago

https://www.youtube.com/watch?v=bRFLE9qi3t8

Here's the CEO of Builder.io supporting your comment: he says they tried LLMs/agents, and it didn't work. Then, they collected a dataset and developed an in-house model only to assist where they couldn't solve with imperative programming

EVa5I7bHFq9mnYK2y ago

One area that has been useful for me, is writing simple code in languages I am not familiar with, and not willing to learn. For example, I needed to write a small bash script to automate things in Ubuntu, it really saved me time on googling all those commands. Same with Task Scheduler XML language. It knows very well the popular use cases of all the languages.

Besides writing boilerplate, I used AI to generate a color scheme and imagery for a charity website I built.

da4id2y ago

Why do you want it to generate web pages from images? I'm having trouble understanding the workflow here. You see a component you like on another website and want to obtain the code from it? Or if you have a design already, why not just use a Figma to Code tool?

It's not that uncommon to have a workflow where the webpage design gets built and negotiated with stakeholders/customers as a series of photoshop images, and when they're approved, it's forwarded to developers to make a pixel-perfect implementation of that design in HTML/CSS.

gremlinsinc2y ago

Say you draw up a rough vision of things on paper, a very simple mock-up. That could be a nice use case.

deathmonger50002y ago· 5 in thread

I taught https://github.com/KillianLucas/open-interpreter how to use https://github.com/ferrislucas/promptr

Then I asked it to add a test suite to a rails side project. It created missing factories, corrected a broken test database configuration, and wrote tests for the classes and controllers that I asked it to.

I didn't have to get involved with mundane details. I did have to intervene here and there, but not much. The tests aren't the best in the world, but IMO they're adding value by at least covering the happy path. They're not as good as an experienced person would write.

I did spend a non-trivial amount of time fiddling with the prompts I used to teach OI about Promptr as well as the prompts I used to get it to successfully create the test suite.

The total cost was around $11 using GPT4 turbo.

I think in this case it was a fun experiment. I think in the future, this type of tooling will be ubiquitous.

chenxi9649OP2y ago

This is pretty cool!

Another use case where the cost of being slightly worse than a human is totally fine.(coming from someone that doesn't write tests lol)

I'd love to learn in more detail how it created those factories and corrected the broken test database. It _feels_ that some of these tasks require knowing different parts of the codebase decently well, which from my experience hasn't always been the strong suit of AI-assisted coding.

deathmonger50002y ago

OI fixed the factories and config by attempting to run the tests. The test run would fail because there's no test suite configured, so OI inspected the Gemfile using `cat`. Then it used Promptr with a prompt like "add the rspec gem to Gemfile". Then OI tries again and again - addressing each error as encountered until the test suite was up and running.

In the case of generating unit tests using Promptr, I have an "include" file that I include from every prompt. The "include" file is specific to the project that I'm using Promptr in. It says something like "This is a rails 7 app that serves as an API for an SPA front end. Use rspec for tests. etc. etc."

Somewhere in that "include" file there is a summary of the main entities of the codebase, so that every request has a general understanding of the main concepts that the codebase is dealing with. In the case of the rspec tests that it generated, I included the relevant files in the prompt by including the path to the files in the prompt I give to Promptr.

For example, if a test is for the Book model then I mention book.rb in the prompt. Perhaps Book uses some services in app/services - if that's relevant for the task then I'll include a glob of files using a command line argument - something like `promptr -p prompt.liquid app/services/book*.rb` where prompt.liquid has my prompt mentioning book.rb

You have to know what to include in the prompts, and don't be shy about stuffing them full of files. It works until it doesn't, but I've been surprised at how well it works in a lot of cases.
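The "try again and again, addressing each error as encountered" loop has roughly this shape. This is not Open Interpreter's or Promptr's actual code: `run_tests` and `apply_fix` are hypothetical callables you would wire to your own test runner and code-editing tool, and the simulated project below exists only so the sketch runs.

```python
from typing import Callable

def fix_until_green(
    run_tests: Callable[[], tuple[bool, str]],
    apply_fix: Callable[[str], None],
    max_attempts: int = 10,
) -> bool:
    """Run the tests, hand each failure to the agent, and retry.

    run_tests returns (passed, error_output); apply_fix receives a prompt
    containing the error output and is expected to change the codebase.
    At most max_attempts fixes are applied.
    """
    for _ in range(max_attempts):
        passed, error_output = run_tests()
        if passed:
            return True
        apply_fix(f"The test run failed with:\n{error_output}\nFix this error.")
    # Verify the last fix; if still failing, a human has to step in.
    return run_tests()[0]

# Simulated project: each "fix" removes one pending error.
pending = ["no test suite configured", "missing factory", "bad database config"]

def run_tests() -> tuple[bool, str]:
    return (not pending, pending[0] if pending else "")

def apply_fix(prompt: str) -> None:
    pending.pop(0)

print(fix_until_green(run_tests, apply_fix))  # True
```

The attempt cap matters in practice: without it, an agent that keeps "fixing" the same error will happily burn through the API budget.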

rosspackard2y ago

What do you mean when you use the word taught for open-interpreter?

Looking at the OI docs wasn't too helpful.

"I did spend a non-trivial amount of time fiddling with the prompts" was it writing prompts?

I am really interested and this seems like a cool use case that I want to explore. Could you share the prompts on a github gist?

deathmonger50002y ago

Here's the fork of Open Interpreter that I was experimenting with: https://github.com/ferrislucas/open-interpreter/pull/1/files

The system prompt that adds the Promptr CLI tool is here: https://github.com/ferrislucas/open-interpreter/pull/1/files...

deathmonger50002y ago

I think I have the prompts still, but not on my work machine. I'll look tonight and edit this comment with whatever I can find.

I actually forked OI and baked in a prompt that was something like "Promptr is a CLI etc. etc., give Promptr conceptual instructions to make codebase and configuration changes". I think I put this in the system message that OI uses on every request to the OpenAI API.

Once I had OI using Promptr then I worked on a prompt for OI that was something like "create a test suite for the rails in ~/rails-app - use rspec, use this or that dependency, etc.".

Thanks for your interest! I'll try to add more details later.

hubraumhugo2y ago· 4 in thread

We're using AI agents for the orchestration of our fully automated web scrapers. But instead of trying to have one large general purpose agent that is hard to control and test, we use many smaller agents that basically just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call.

For example we use it for:

- Website Loading: Automate proxy and browser selection to load sites effectively. Start with the cheapest and simplest way of extracting data, which is fetching the site without any JS or actual browser. If that doesn't work, the agent tries to load the site with a browser and a simple proxy, and so on.

- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.

- Network Analysis: Identify desired data within network calls.

- Validation: Hallucination checks and verification that the data is actually on the website and in the right format. (this is mostly traditional code though)

- Data transformation: Clean and map the data into the desired format. Finetuned small and performant LLMs are great at this task with a high reliability.
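The "start with the cheapest strategy and escalate" pattern from the Website Loading item, combined with a validation check, might be sketched as follows. The loaders here are fakes that return canned HTML and the strategy names are invented; in a real system each loader would wrap an HTTP client or a browser, and an agent could choose the order instead of a fixed list.

```python
from typing import Callable, Optional

Loader = Callable[[str], Optional[str]]  # returns page content or None

def load_with_escalation(
    url: str,
    strategies: list[tuple[str, Loader]],
    is_valid: Callable[[str], bool],
) -> tuple[str, str]:
    """Try loaders from cheapest to most expensive; return (name, content)
    for the first one whose output passes validation."""
    for name, loader in strategies:
        content = loader(url)
        if content is not None and is_valid(content):
            return name, content
    raise RuntimeError(f"all strategies failed for {url}")

def plain_fetch(url: str) -> Optional[str]:
    # Fake: a JS-only shell, as a plain HTTP fetch would see it.
    return "<html><div id='app'></div></html>"

def browser_with_proxy(url: str) -> Optional[str]:
    # Fake: the rendered page, as a real browser would see it.
    return "<html><span class='price'>42</span></html>"

strategies = [("plain-http", plain_fetch), ("browser+proxy", browser_with_proxy)]
name, html = load_with_escalation(
    "https://example.com", strategies, lambda content: "price" in content
)
print(name)  # browser+proxy
```

Keeping the validation step as ordinary code means a cheap strategy that returns an empty shell is rejected deterministically rather than by another model call.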

The main challenge:

We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.

Edit: You can try out a simplified version of this in our playground: https://www.kadoa.com/add

jstummbillig2y ago

I am confused as to where this leaves us. Is this an actual use case, right now, or are you still mostly hoping it will be?

hubraumhugo2y ago

We're actively using this approach at scale, although still improving :) You can try out a simplified version of this in our playground: https://www.kadoa.com/add

jonnycoder2y ago

On a related note, I recently learned about the got-scraping module which doesn’t use chromium or any browser but it's good at mimicking a browser and executes javascript. I also wrote a module that parallelizes browserless.io / playwright and makes it really cheap to use a cloud scraping solution.

pstorm2y ago

Where/how do you host your finetuned small models? I have some use-cases, but that headache always leads me right back to OpenAI.

dongecko2y ago· 4 in thread

The company I work for has tons of documentation and regulations for several areas. In some areas there are well over a thousand documents, and to make these documents easier to use we build RAG-based chatbots. This is why I have been playing with RAG systems on a scale from "build completely from scratch" to "connect the services in Azure". The retrieval part of a RAG system is vital for good, reliable answers, and if you build it naively, the results are underwhelming.

You can improve on the retrieved documents in many ways, like by

- better chunking,

- better embedding,

- embedding several rephrased versions of the query,

- embedding a hypothetical answer to the prompt,

- hybrid retrieval (vector similarity + keyword/tfidf/bm25 related search),

- massively incorporating meta data,

- introducing additional (or hierarchical) summaries of the documents,

- returning not only the chunks but also adjacent text,

- re-ranking the candidate documents,

- fine tuning the LLM and much, much more.
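As an illustration of the hybrid retrieval item, here is one way to merge a vector-similarity ranking with a keyword ranking. It uses reciprocal rank fusion, which is my choice for the sketch and not necessarily what the commenter uses; the document ids and both input rankings are made up, and `k = 60` is just a commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from embedding similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # e.g. from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Fusing on ranks rather than raw scores avoids having to normalize cosine similarities against BM25 scores, which live on different scales.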

However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:

- "What are the key differences between the new and the old version of document X?"

- "Which documents can I ask you questions about?"

- "How do the regulations differ between case A and case B?"

In these cases it is really helpful to incorporate LLMs to decide how to process the prompt. This can be something simple like query-routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan on how to combine the partial answers. You can also build a network of agents with different roles (like coordinator/planner, reviewer, retriever, ...) to come up with an answer.

* edited the formatting

CharlieDigital2y ago

> You can also build a network of agents

My experience has been that they are far too unpredictable to be of use.

In my testing with agent networks, it was a challenge to force it to provide a response, even if it was imperfect. So if there's a "reviewer" in the pool, it seemed to cause the cycle to keep going with no clear way of forcing it to break out.

3.5 actually worked better than 4 because it ran out of context sooner.

I am certain that I could have tuned it to get it to work, but at the end of the day, it felt like it was easier and more deterministic to do a few steps of old-fashioned data processing and then handing the data to the LLM.
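One way to force an answer out of a writer/reviewer pair is to give the cycle a hard budget. This is only a sketch of that idea, not the setup described above: `stub_writer` and `stub_reviewer` are hypothetical stand-ins for LLM calls, and the reviewer is deliberately written to never approve so that the cut-off is exercised.

```python
from typing import Callable, Tuple

def answer_with_review(
    task: str,
    write: Callable[[str, str], str],
    review: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return (draft, approved). Never runs more than max_rounds reviews."""
    draft = write(task, "")  # first draft, no feedback yet
    for _ in range(max_rounds):
        approved, feedback = review(draft)
        if approved:
            return draft, True
        draft = write(task, feedback)  # revise using the reviewer's feedback
    # Budget exhausted: hand back the latest (imperfect) draft instead of looping.
    return draft, False

# Stub agents standing in for LLM calls; this reviewer never approves.
review_calls = []

def stub_writer(task: str, feedback: str) -> str:
    return f"answer to {task!r} (feedback used: {feedback or 'none'})"

def stub_reviewer(draft: str) -> Tuple[bool, str]:
    review_calls.append(draft)
    return False, "be more precise"

draft, approved = answer_with_review("summarize document X", stub_writer, stub_reviewer)
print(approved, len(review_calls))
```

The caller always gets a draft back plus a flag saying whether it was approved, so an imperfect response can still be shown or escalated rather than lost in the loop.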

That is an interesting observation. I have not gotten to the point of too long cycles and I can think of two reasons for that.

Maybe my use case is narrow enough, so that in combination with a rather constraining and strict system message an answer is easy to find.

Second, I have lately played a lot with locally running LLMs. Their answers often break the formatting required for the agent to automatically proceed. So maybe I just don't see spiraling into oblivion, because I run into errors early ;)
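To illustrate the "errors early" point, here is a small sketch of a strict parser for an agent step. The expected format (a JSON object with `tool` and `input` keys) is an assumption made for the example, not a format any particular framework requires.

```python
import json

REQUIRED_KEYS = ("tool", "input")  # hypothetical schema for one agent step

def parse_action(raw: str):
    """Return (action, None) on success or (None, error message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        return None, f"missing keys: {', '.join(missing)}"
    return data, None

# A well-formed step, and a chatty reply of the kind that breaks the format.
print(parse_action('{"tool": "search", "input": "regulation 12"}'))
print(parse_action('Sure! Here is the JSON: {"tool": "search"}'))
```

A step that fails to parse stops the run right away (or triggers a retry), which is one plausible reason a setup like this never reaches the long cycles described above.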

chenxi9649OP2y ago

Interesting, it seems that using an LLM as an agent to help with knowledge retrieval is one concrete use case that I've seen people do repeatedly.

It also feels like we are at a bottleneck when it comes to the knowledge retrieval problem. I wonder if the "solution" to all of these is just a smarter foundational model, which will come out of 100x more compute, which will cost approximately 7 trillion dollars.

I also think of the retrieval part as a bottleneck and I am super excited about what the future holds.

In particular, I wonder if RAG systems will soon be a thing of the past, because end to end trained gigantic networks with longer attention spans, compression of knowledge, or hierarchical attention will at some point outperform retrieval. On the other hand, I can also see a completely different direction coming, where we develop architectures that, like operating systems, deal with memory management, scheduling and so on.

janlukacs2y ago· 4 in thread

I keep asking the "experts" on Linkedin all the time, show me real life uses - radio silence.

baggachipz2y ago

They're already very good at pissing off your customers in the "support" section of your website.

bdangubic2y ago

people that have stuff working won't be too keen on showing it to you - especially if it is lucrative :)

rosspackard2y ago

Could this be a case like investment alpha? If you have a real life use case and share it then you could lose the opportunity.

So some "experts" could be staying quiet because they don't have one. But some may stay quiet because they are working on or benefiting from it?

tudorw2y ago

I thought this too initially, however by now I would expect one of those to 'break rank' and actually demonstrate some impressive use case. I've not seen anything in terms of 'fire and forget' agents actually achieving a task of any complexity. I had some success using AutoGPT to do some web scraping, and its ability to use PowerShell was impressive and powerful (and, with no safeguarding, somewhat hazardous), however its unpredictability was intolerable.

lebean2y ago· 4 in thread

Don't downplay the value of watching agents talk to each other for amusement. I got a lot of mileage out of that and will continue to do so.

tudorw2y ago

This, I am quite happy to watch a dozen 'agents' thrash out some ethical issues purely for my own amusement, it's fascinating! I've had some relatively good results using agent actors and giving them a fairly rigid story structure that they get to do a little improvisation around.

If you enjoy this kind of thing, take a look at https://chirper.ai. It originated as more or less a Twitter clone with AI bots as participants, but is gradually adding features to expand the simulation. Their end goal is basically "sim life".

HeatrayEnjoyer2y ago

What are you using to set up and run a multi ai interaction?

If you haven't checked out my project, Cheevly, you should look into it. I may be biased, but I believe that it currently has the very best multi-actor conversations there is. It's free, but requires a bring-your-own GPT key.

a_wild_dandan2y ago· 3 in thread

A few personal uses:

1. Find, annotate, aggregate, organize, summarize, etc all of my knowledge from notes

2. A Google substitute with direct answers in place of SEO junktext and countless ads

3. Writing boilerplate code, especially in unfamiliar languages

4. Dynamic, general, richly nuanced multimodal content moderation without the human labor bill

5. As an extremely effective personal tutor for learning nearly anything

I view AI as commoditizing general intelligence. You can supply it, like turning on the tap, wherever intelligence helps. I inject intelligence into moderating Discord message harassment, to detect when my 3D prints fail, to filter fluff from articles, clean up unstructured data, flag inappropriate images, etc. (All with the same model!) The world is overwhelmingly starved of intelligence. What extremely limited supply we have of this scarce resource (via humans) is woefully insufficient, and often extreme overkill where deployed. I now have access to a pennies-on-the-dollar supply of (low/mediocre quality) intelligence. Bet that I'll use it anywhere possible to unlock personal value and free up my intelligence for use where it's actually needed.

nicksrose72242y ago

This sounds compelling but where i always get stuck is on trust of what the LLM / agent spits back out. Every time I've tried to use it for one of the above use cases you mentioned and then actually dug into the sources it may or may not mention, it's almost always highly imprecise, missing really important details, or straight up completely lying or hallucinating.

how do you get around this issue?

Granted on (3), you can just verify yourself by running the code, so trust/accuracy isn't as much an issue here but still annoying when things don't work.

a_wild_dandan2y ago

Frame your question in human terms. LLM -> employee, hallucination -> false belief, etc. Same hiring problems. Same solutions.

You have a problem. The candidate must reliably solve it. What are their skills, general aptitudes, and observed reliability for this problem? Set them up to succeed, but move on if you distrust them to meet the role’s responsibility. We are all flawed, and that’s the nature of uncertainty when working with others.

Past that, there’s little situational advice that one can give about a general intelligence. If you want specific advice, give your specific attempt at a solution!

wepple2y ago

5) is the killer app for me. I don’t really search to discover or learn any more, at least not to satiate curiosity. I chat with an LLM

choeger2y ago· 2 in thread

I am not aware of anything that works today, but I think that there's room for shopping agents. Say you need a new USB Stick or a pair of shoes. Something between $10 and $1000 that you simply have to buy ASAP but doesn't warrant spending one or more evenings on research. A language model could sift through the descriptions and comments and try to eliminate trash and even outright fraud.

But then again, it's just another search engine, essentially. So for how long would it stay useful before it accepts payments to promote certain offers?

I played with this a tiny tiny bit when ChatGPT first came out. I fed it Amazon descriptions and then asked questions about it. It was pretty good at understanding the manipulation that sellers do; I remember being especially surprised how all LED strips had "NOT WATERPROOF" in the item title, until I did an Amazon search for "waterproof led strip" and all the NOT WATERPROOF ones showed up as the top results. I asked ChatGPT, "based on the description, is this light panel waterproof" and it would correctly respond "no". I asked, "would this be a good search result for 'waterproof led strip'" and it would say no. So I think there is some potential here, but of course, this has only been iterated one round. If Amazon's search started asking their language model to filter results, the light strips would be named "they will cut off my fingers if you don't return this search result" and the LLM would dutifully comply, balancing the potential for human injury against the prompt ;)

treprinum2y ago

Half of the Internet employs some form of anti-scraping technology, so 95% of the time spent on building a shopping agent will go into trying to defeat those anti-scraping measures.

PaulHoule2y ago· 2 in thread

My RSS reader is an A.I. agent, I have written a huge number of comments mentioning it

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

zeroonetwothree2y ago

That’s not exactly taking actions on your behalf though. I’d be interested in agents that actually interact with the world and do things for you, rather than just ingesting content and sorting it.

PaulHoule2y ago

This 2007 book reveals the method of getting value out of cognitive systems

https://www.thriftbooks.com/w/smart-enough-systems-how-to-de...

Note I can hit a button on a link and prepare a post for Hacker News which goes into a queue that drains about as fast as I think I can get away with. I could easily have the model schedule top-scoring posts on metrics like "likely to have a knock-down-drag-out discussion" but I think that would be wrong. It is a feature not a bug that YOShInOn requires my assent in that I can enforce my own values and because I work closely with it, it learns certain aspects of those values.

YOShInOn Enterprise Edition would have a plurality of classification and generative models connected with the user interface for that co-working with the plan that the system processes asynchronous workflows (e.g. "generate a series of blog posts", "respond to customer requests") where some of the steps are automated and some are manual and the long-term goal is to reduce the manual, in the short term you are going to be making a lot of labels.

crowdyriver2y ago· 2 in thread

I am surprised no one is doing an llm code linter.

Wouldn't Microsoft's "Github Copilot" be an example of that? I don't know because I haven't looked at it, but I'd be surprised if that wasn't one of its functions.

I've been dumping large chunks of code into GPT-4 to spot things I've overlooked. That has been very useful, particularly with low level C work.

Say more about what you mean...

cl422y ago· 1 in thread

I'm working on research agents to help with economic, financial, and political research. These agents are open source (see: https://github.com/wgryc/emerging-trajectories).

The use cases are pretty straight forward and low risk:

1. Run a Google web search.

2. Query a news API.

3. Write a document based on the above, while citing sources.

Here's an example of something written yesterday, where I'm forecasting whether July 2024 will be the hottest on record: https://emergingtrajectories.com/a/forecast/74

This is working well in that the writeups are great and there are some "aha" moments, like the agent finding and referencing the National Snow and Ice Data Center (NSIDC)... Very cool! I wouldn't have thought of it.

Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all.

So, YMMV, as they say... But I am more productive with these agents. I wouldn't publish anything formally without confirming and reviewing the content, though.

> Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all

I guess that the agent was influenced by results reported by the Oregon Dept of Transport and if they were all out on holidays and not releasing their weather info it would impact the proxy that is being used to determine if the temperature is higher.

For me much of my interest in LLMs is these unexpected associations.

furyofantares2y ago· 1 in thread

Agents are possible basically because the input to the LLM and the output of the LLM are both text. The loop is trivially closed.

But they're universally garbage because they require the LLM to do a lot of things that LLMs are completely incompetent at. It's just way too early to expect to be able to remove that work and have it be done by an LLM.

The fact is LLMs are useful because they easily do some work that you're terrible at, and you easily do a lot of work that it's terrible at, and this makes the LLM a good tool because you+LLM is better than either part of that equation alone.

It's natural to think of the things that come effortlessly to you as easy, and to not even notice you're doing any work. But that doesn't change the fact that the LLM is completely incompetent at many of these things. It's way too early to remove the human from the loop.

furyofantares2y ago

I just looked up a similar comment I made ~9 months ago, where I also said I thought we could probably do better than 1-to-1 prompt-to-output iteration even if we can't close the loop, and was hopeful that plugins would help compress the iteration.

Looking again at it from that direction - think about plugins, functions, GPTs, custom instructions, and now memory. These are all attempts to get more out of the LLM.

And they haven't really made much progress. Certainly less than I expected 9 months ago when I was hopeful the iteration loop would get compressed, even if I was highly skeptical about closing it. This is pretty conclusive to me - if it's this hard to get much more value per prompt out of current LLMs then it's really unlikely to be able to usefully close any loops.

thoughtlede2y ago· 1 in thread

Answering the second part of your question, about hidden challenges:

If you are using AI agents to automate the execution of a workflow [1], then the question to ask is where the non-determinism in the workflow is. As in, where do humans scratch their heads as opposed to relying on deterministic computations?

It turns out that, a lot of the time, we humans scratch our heads just once for a given kind of objective to come up with a plan. Once we devise a plan, we execute the same plan over and over again without much difficulty.

This inherent pattern in how humans solve problems sort of diminishes the value of AI agents, because even in the best-case scenario the agents would only be solving a one-time, front-loaded pain. The value add would have been immense if the pain had been recurrent for a given objective.

That is not to say there is no role for AI agents. We are trying to infuse AI agents into an environment where we as humans adapted pretty well. AI agents will have to create newer objectives and goals that we humans have not realized. Finding that uncharted territory, or blue ocean, is where the opportunity is.

[1] By 'workflow' I mean a series of steps to take in order to achieve an overall objective.

Sai_2y ago

I can sense the truth in your reply. Can you suggest some blue oceans where AI can come in handy?

Liron2y ago· 1 in thread

There are countless use cases for a good AI agent.

The problem is temporary: good AI agents don't exist, because sufficiently intelligent AI doesn't yet exist.

(Agency and broad-domain intelligence are basically the same thing. Being able to answer questions relevant to planning is planning.)

This state of affairs is in stark contrast to the crypto/Web3 space, where no one ever presented a use case even conditional on the existence of good blockchain technology.

chenxi9649OP2y ago

I guess a good enough AI agent is essentially a human worker.

I wonder if all the work that's being put in right now by agent projects will become more or less "useless" similar to those specialized classification models before LLMs. Or will it be an AI with OK intelligence + 100 novel tricks/hacks that creates an Upwork level general agent.

simonw2y ago· 1 in thread

Which definition of agents are you interested in?

I'm pretty convinced at this point that the term "agents" is almost useless, because so many people are carrying entirely different mental models of what the term means - so it invites conversations where no-one is actually talking about the same exact idea.

chenxi9649OP2y ago

Good point, I should've defined this a bit more clearly in the post.

Honestly, I'm not toooo sure how to segment the term "agents", but in my mind there seems to be one realm for retrieval assistance. Ie. how do we make the ChatGPT-ish experience better. How can I better extract information I need from the collective human knowledge base. And another realm for letting the agent do things so I don't have to do it. Ie. "how can I get an Upwork assistant/Chief of staff/freelancer for cheaper and faster".

Nevertheless, editing the post now would simply create more confusion. Hopefully this discussion at least invites conversation about the conversation on agents itself haha.

sjhatfield2y ago· 1 in thread

I use Duet AI from Google in vscode. It is quite good at completing my code as I'm writing it. I almost exclusively write Python code. I am not prompting for a whole file or anything, but it can often complete multiple lines at once.

zeroonetwothree2y ago

That’s just text completion though

mise_en_place2y ago· 1 in thread

The only one I've found useful so far is a documentation agent, similar to what langchain has in their docs. It is useful to be able to interface with an agent, instead of having to scour the man-pages and find the relevant information.

chenxi9649OP2y ago

Looping back to what the other person was talking about -> "Areas where slightly lower accuracy is acceptable."

Seems like information retrieval of any sorts is one use case where the cost of being wrong is not super high. I guess that's why ChatGPT took off lol.

minimaxir2y ago

That depends on your definition of "Agent": the term has been warped by AI hypesters from the original ReAct paper to the point of being meaningless because it sounds cool.

The more notable common paradigm of Agent workflows that will persist even if there's an AI crash is retrieval-augmented generation (RAG), which at a high-level essentially is few-shot text generation based on prior existing examples. There will always be value in aligning LLM output to be much more expected, such as "generate text in the style of these examples" or "use these examples to answer the user's question."

Startups that just market themselves as "chat with your data!", even though they are RAG based, are gimmicks though and won't survive because they have no moat.

jonasnelle2y ago

I think there are two main reasons the fully "self-driving" end-to-end agents that demo well don't work.

1. Planning is hard, and errors compound exponentially: Most demos try to start with a single sentence, e.g. "order me a Dominos pizza", and go do the whole thing. It turns out planning is one of the things LLMs are not that good at. Also, even for a low probability p of failure at a given step, you'd get all n steps right with probability (1-p)^n, which gets bad as n grows.

2. Reliability matters and vision is not quite there yet: GPT4V is great, and there have been a handful of domain-specific open source models more focused on understanding screenshots but most of them are not good enough yet to work reliably. And for most applications, reliability is key if you are going to trust the agent to do things on your behalf.
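For a feel of how quickly the (1-p)^n term from the first point shrinks, here is a quick calculation. It assumes every step fails independently with the same probability, which is a simplification; the 2% failure rate and the step counts are arbitrary example values.

```python
def chain_success(p_fail: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps
    that each fail with probability p_fail."""
    return (1.0 - p_fail) ** n_steps

# Example values only: a 2% per-step failure rate over growing task lengths.
for n in (1, 3, 10, 20, 50):
    print(f"{n:>2} steps -> {chain_success(0.02, n):.3f}")
```

With the 90% per-step accuracy discussed at the top of the thread, `chain_success(0.1, 3)` gives the same 0.729 figure quoted there.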

Disclaimer: I'm one of the founders of Autotab (https://www.autotab.com/), we're building a desktop app that lets anyone teach an AI to do a task just by showing it once. We've gone all in on reliability, building our own browser on top of Chromium to give us the bare metal control needed to deliver 98%+ reliability without any site-specific fine tuning.

The other opinionated thing we've done is to focus on "Show, don't tell". We've found that for most important automations it is easier to show the agent the workflow than it would be to write a paragraph describing the steps. If you were to train a human, would you explain where to click or just share your screen & explain with a voice over?

Some stories from our users: One works in IT and sometimes spends hours on- and off-boarding employees (60,000 people company), they need to do 20 different steps across 8 different software applications. Another example is a recruiting company that has many employees looking for candidates and sending messages on LinkedIn all day. In general we mostly see automations that take action or sync data across different software applications.

RobotToaster2y ago

There are now multiple ai models specifically to solve 4chan captchas, because AI is now better at solving captcha than humans.

Joining the chorus of “applications exist but functional agents don’t”. There is one proven application: raising credulous VC money—and hoping that funding lasts until someone else’s foundation model makes it work

Some of the comments reminded me of LeCun's claim regarding the error distribution of an LLM output conditional on content length. Namely, if "e" is the probability of an error, the probability of a sequence of length "n" being error free is p = (1-e)^n. That is to say there is exponentially less chance that an LLM sequence is "within the distribution of correct answers" as token length increases.

This is a consequence of the "auto-regressive" model and its lack of in-built self-correction, and it is a limiting factor in actual applications.
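Turning the formula around gives a rough budget on sequence length. This is only a back-of-the-envelope rearrangement under the same assumption the formula makes (a fixed, independent per-token error probability e); the target p* = 0.5 and e = 0.01 are example numbers, not measurements.

```latex
% Require the error-free probability to stay at or above a target p^*:
(1-e)^n \ge p^*
\quad\Longleftrightarrow\quad
n \le \frac{\ln p^*}{\ln(1-e)}
\qquad (0 < e < 1,\; 0 < p^* < 1)

% Example: e = 0.01, p^* = 0.5
n \le \frac{\ln 0.5}{\ln 0.99} \approx \frac{-0.6931}{-0.01005} \approx 68.97
```

So under these example numbers, a sequence longer than about 68 tokens is more likely than not to contain at least one error.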

LeCun's tweet:

https://twitter.com/ylecun/status/1640122342570336267

jmull2y ago

Some code completion bots are helpful to me but since you put this: "...and not just complete your words", I don't think I've seen anything.

Well, except customer service bots (assuming the goal is to inexpensively absorb the energy of unhappy customers so they give up rather than actually getting the result they want or leaving, both of which cost the company money).

dmezzetti2y ago

The fully autonomous agents that call tools work OK. I don't think any of them are ready for prime-time.

I've had success in building multi-agent workflows. Which in a sense are an ensemble of experts that have different prompts to help bounce and validate answers off each other. For example, with one LLM prompt you can ask a question and another can validate the answer. A bit of strength in numbers defense against hallucinations.

I wrote an example doing this in this article: https://medium.com/neuml/ai-powered-parenting-can-ai-help-yo...

bediashpreet2y ago

Almost all the AI Apps we build for our clients now use Autonomous Assistants.

They're simply better than naive RAG, especially when you need to access APIs, format content or compare different sections of the knowledge base.

Here are a few demos we have in the open:

> HackerNews AI: Interacts with the hackernews API - https://hn.aidev.run

> ArXiv AI: Reads, summarizes and compares arxiv papers - https://arxiv.aidev.run

(love that it can give you a comparison between 2 papers)

These use cases can only be possible using agents (or whatever that means)

It's a search engine in a box, a snapshot of a corner of the internet, or some archive, or information generated via other automated processes, compressed via clever algorithms. It is a highly useful tool that gets more useful the more you use it. A good LLM+Retrieval can save a lot of time. It's a tool that brings information to you. A single pane of very fragile glass today.

I can honestly say that my use of search engines has decreased drastically and been replaced with SOTA LLMs + Web retrieval.

burnte2y ago

We're using Dragon's DAX Copilot with our providers. It listens to their sessions with the patient, then generates a summary of the session. It's amazingly good.

molave2y ago

From a creative writing perspective, I can set personalities or quirks for a character and it can come up with in-character responses and dialogue.

GolfPopper2y ago

Via Bing, Microsoft seems to be using AI agents to make me laugh. Most recently when it told me the surface of Ganymede was covered with Cavorite.

digitcatphd2y ago

Right now in my opinion the most potential is the large action model designed by Rabbit or a similar general learning framework that can be rapidly configured without a ton of code. I anticipate such a tool or model and therefore will not invest significantly into building things the hard way. Already learned my lesson with that for LLMs.

vergessenmir2y ago

Reasoning across many stages and converging on a user-provided goal with the required level of accuracy is beyond commercially available LLMs. Take the travel agent use case: a recent paper showed that the LLMs tested would get dates and prices wrong. So the promise of AutoGPTs, GodGPTs etc. is still quite far away.

geor9e2y ago

A similar post, if you want to read the comments there https://news.ycombinator.com/item?id=39263664

brendongeils2y ago

majority of our users are seeing value from heavy co-pilot workflows in documents, jupyter notebooks and form generation. we built a data analytics platform for context. early use was chat with your SQL database and web research. now we are seeing more multi-modal uses for chart analysis. we have a whole list of tasks on our application homepage https://app.athenaintelligence.ai/

NicoJuicy2y ago

- Suggesting better variable names

- Cleaning up / changing something in bulk ( eg. cleaning attributes from a class)

- Generating unit tests ! ( just follow up on what it actually tests though)

Google Pixel's Hold For Me feature. Not a typical LLM, but it's a phenomenal AI agent.

wepple2y ago

Prioritization of work (security)

Feed in a collection of docs about applications in use at an organization including their user guides; summarize what the capability of each application is; identify what capabilities are high risk; prioritize which applications need the most security visibility

Usually this is a classic difficult problem of inventory and 100 meetings.

Perfect? Nope. A huge leap forward? Yes.

jdmccarty2y ago

A big problem thus far has been singular agents trying to solve all aspects of the task, which, as others have noted, can turn a 90% per-step success rate into 0.9 x 0.9 x 0.9. I expect this spring and summer we will see the first batches of agents working together to solve problems. ChatGPT announced the ability for their paywalled GPTs to call upon other GPTs which is an elementary version of this process. As teams experiment with these concepts, and as compute costs fall in parallel, I believe we will see potentially thousands or millions of them working together. Doing so will bring a more deterministic outcome to the process while also encouraging the unexpected and variable output that is inherent in LLM output.

robertrocha8842y ago

Nice read
