LLM function calls don't scale; code orchestration is simpler, more effective

https://news.ycombinator.com/item?id=37626877

madrox1y ago

nitwit0051y ago

I'm sure you can find it in chatbot documentation from the 90s. It's a generic term carried over from non-AI chat. People responding to support chats were called agents.

obiefernandez1y ago· 5 in thread

My team at Shopify just open sourced Roast [1] recently. It lets us embed non-deterministic LLM jobs within orchestrated workflows. Essential when trying to automate work on codebases with millions of lines of code.

[1] https://github.com/shopify/roast

TheTaytay1y ago

Wow - Roast looks fantastic. You architected and put names and constraints on some things that I've been wrestling with for a while. I really like how you are blending the determinism and non-determinism. (One thing that is not obvious to me after reading the README a couple of times (quickly), is whether/how the LLM can orchestrate multiple tool calls if necessary and make decisions about which tools to call in which order. It seems like it does when you tell it to refactor, but I couldn't tell if this would be suitable for the task of "improve, then run tests. Repeat until done.")

drewda1y ago

Nice to see Ruby continuing to exist and deliver... even in the age of "AI"

crakhamster011y ago

This looks pretty cool! I'm curious how these sort of workflows are being used internally at Shopify. Any examples you can share?

bandoti1y ago

This is great! Reading the docs tickles my brain. Nice way to package up LLM functionality in a declarative way!

The_Blade1y ago

good stuff!

i just broke Claude Code Research Preview, and i've crashed ChatGPT 4.5 Pro Deep Research. and i have the receipts :), so i'm looking for tools that work

CSMastermind1y ago· 5 in thread

LLMs clearly struggle when presented with JSON, especially large amounts of it.

There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML for instance. But you could just use a template to define some narrative text.

ryoshu1y ago

I'm consistently surprised that people don't use XML for LLMs as the default given XML comes with built-in semantic context. Convert the XML to JSON output deterministically when you need to feed it to other pipelines.
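A minimal sketch of that deterministic XML-to-JSON step, assuming the XML is well-formed and has no attributes or mixed content. `xml_to_obj` and `xml_to_json` are hypothetical helper names, built only on Python's standard `xml.etree.ElementTree` and `json` modules:

```python
import json
import xml.etree.ElementTree as ET


def xml_to_obj(element):
    """Convert an element to plain Python data (assumes no attributes or mixed content)."""
    children = list(element)
    if not children:
        # Leaf node: keep its text, trimmed of surrounding whitespace.
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_obj(child)
        if child.tag in result:
            # Repeated tags become a list so no sibling is silently dropped.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_json(xml_text):
    """Parse XML text and serialize it as JSON with stable key order."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: xml_to_obj(root)}, sort_keys=True)


# Made-up example document, for illustration only.
example = "<employee><name>Ada</name><skill>math</skill><skill>code</skill></employee>"
print(xml_to_json(example))
```

The conversion is pure code, so the same XML always yields the same JSON, and downstream pipelines never depend on the model emitting valid JSON itself.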

iJohnDoe1y ago

Any reason for this for my own learning? Was XML more prevalent during training? Something better about XML that makes it easier for the LLM to work with?

XML seems more text heavy, more tokens. However, maybe more context helps?

CSMastermind1y ago

It's in the official OpenAI prompting guidelines: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...

But it's also evident for anyone who has used these models. It's also not unique to OpenAI, this bias is prevalent in every model I've ever tested from GPT 3 to the latest offerings from every single frontier model provider.

As to why I would guess it's because XML bakes semantic meaning into the tags it uses so it's easier for the model to understand the structure of the data. <employee>...</employee> is a lot easier to understand than { "employee": { ... }}.

I would guess that the models are largely ignoring the angle brackets and just focusing on the words, which have unique tokens and thus are easier to pair up than the curly braces that are the same throughout JSON. Just speculation on my part though.

And this only applies to the input. Earlier models struggled to reliably output JSON so they've been both fine-tuned and wrapped in specific formatters that reliably force clean JSON outputs.

nitwit0051y ago

I've seen the suggestion it's because it's been trained on a lot of HTML, but the GPT docs suggest using markdown as a default choice, which is relatively less common.

crabl1y ago

We've been using Markdown tables to return data to the LLM with some success

mehdibl1y ago· 4 in thread

The issue is not function calls but HOW the MCP here was designed and how you are using it.

Most MCPs just replicate an API, returning blobs of data.

1. This uses a lot of input context for formatting as JSON and for escaping JSON inside already-encoded JSON. 2. This contains a lot of irrelevant information that you could save on.

So the issue is the MCP tool. It should instead flatten the data as much as possible, since it goes back through JSON encoding again, and remove fields where needed.

So the MCP SaaS offerings here are mainly API gateways.

That brings this noise! And most of ALL, they are not optimizing their MCPs.
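A rough sketch of what flattening and trimming a tool result could look like before it goes back to the model. The response shape, the field names, and the helpers `flatten` and `compact_tool_result` are all made up for illustration; they are not part of any MCP SDK:

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into one level with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def compact_tool_result(payload, keep_prefixes):
    """Keep only flattened fields whose key starts with one of keep_prefixes."""
    flat = flatten(payload)
    return {k: v for k, v in flat.items() if k.startswith(tuple(keep_prefixes))}


# Hypothetical issue-tracker response; the field names are made up.
response = {
    "issue": {
        "id": 42,
        "title": "Login fails",
        "assignee": {"name": "sam", "avatar_url": "https://example.com/a.png"},
    },
    "meta": {"request_id": "abc", "took_ms": 12},
}
print(json.dumps(compact_tool_result(response, ["issue.title", "issue.assignee.name"])))
```

With one flat level there is no nested JSON to escape, and fields like avatars or request metadata never reach the context window.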

jensneuse1y ago

This is what GraphQL was designed for. Only select fields you really need. We've built an OSS Gateway that turns a collection of GraphQL queries into an MCP server to make this simple: https://wundergraph.com/mcp-gateway

jokethrowaway1y ago

MCP doesn't help but filtering is not always a good solution - sometimes you just need the agent to process a lot of data.

In that scenario running code on the data with minimum evaluation of the data (eg. a schema with explanation) is a much better approach and it will scale up to use cases of a certain complexity.

Even this system is not perfect: once your data definition and orchestration grow too big you'll face the same problems.

This should allow you to scale to pretty complex problems though, while the naive approach of just embedding API responses in the chat fails soon (I run into this issue frequently, maintaining a relatively simple system with a few tool calls).

The only proper solution is reproducing the level of granularity of human decisions in code and call this "decisional system" from an LLM (which would be then reduced to a mere language interface between human language and the internal system). Easier said than done, though.

never_inline1y ago

> 1. This uses a lot of input context for formatting as JSON and for escaping JSON inside already-encoded JSON.

Isn't it a model problem that they don't respect complex json schemas?

devoutsalsa1y ago

Just for fun, I used ChatGPT to reverse a string as my first test of using their API. I was amused at how much work it took to get the LLM to give me just the reversed string, and even then I didn't feel I could fully trust it. I learned my lesson, and now I have multiple LLMs check to see if the string has actually been reversed. Soon I'll be spinning up a data center to host the GPUs necessary to correctly count the number of Rs in strawberry.

hintymad1y ago· 4 in thread

I feel that the optimal solution is hybrid, not polarized. That is, we use a deterministic approach as much as we can, but leverage LLMs to handle the remaining complex part that is hard to spec out or describe deterministically.

jngiam1OP1y ago

Yes - in particular, I think one interesting angle is to use the LLM to generate deterministic approaches (code). And then, if the code works, save it for future use and it becomes deterministic moving forward.

hintymad1y ago

Yes, and the other way around: use the deterministic methods to generate the best possible input to LLM.

nowittyusername1y ago

I agree. You want to use as little LLM as possible in your workflows.

mort961y ago

I've been developing software for decades without LLMs, turns out you can get away with very little!

padjo1y ago· 4 in thread

Sorry I’ve been out of the industry for the last year or so, is this madness really what people are doing now?

_se1y ago

No, not most people. But some people are experimenting.

No one has found anything revolutionary yet, but there are some useful applications to be sure.

padjo1y ago

Or, we have a hammer and we’re hitting things with it to see if they’re nails.

tobyhinloopen1y ago

Some people believe that if you're not doing this now, you might be out of the industry again pretty soon.

czechdeveloper1y ago

My daily job by now is massively using AI to develop an AI agent designer, which means a lot of stuff like this.

I really did not even want this, it just happened.

codyb1y ago· 3 in thread

I'm slightly confused as to why you'd use a LLM to sort structured data in the first place?

jngiam1OP1y ago

The goal is to do more complex data processing, like build dashboards, agentically figure out which tickets are stalled, do a quarterly review of things done, etc. Sorting is a tiny task in the bigger ones, but hopefully more easily exemplifies the problem.

kikimora1y ago

I don’t understand how this can work. Given the probabilistic nature of LLMs, the more steps you have, the more chances something goes wrong. What good is a dashboard if you cannot be sure it was not partially hallucinated?
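To put rough numbers on that compounding, here is a small sketch under a simplifying assumption that is mine, not from the thread: every step succeeds independently with the same probability. Under that assumption the chance that all steps succeed is the per-step probability raised to the number of steps. The 0.95 figure is purely illustrative.

```python
def chain_success_probability(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


# Illustrative per-step success rate; real rates vary by task and model.
for steps in (1, 5, 10, 20, 50):
    print(steps, round(chain_success_probability(0.95, steps), 3))
```

At 95% per step, a ten-step chain finishes cleanly only about 60% of the time, which is why moving steps into deterministic code matters more as workflows get longer.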

risyachka1y ago

Everything you described is already solved by Metabase and a few other tools. It takes a few hours to make daily reports there and the dashboard of your dreams.

And it's not like it changes every day. KPIs etc. stay the same for months. And then you can easily update it in an hour.

So what exactly does an LLM solve here?

avereveard1y ago· 3 in thread

That's kind of the entire premise of Hugging Face's smolagents, and while it does work really well when it works, it also increases the challenges in rolling back failed actions.

I guess one could in principle wrap the entire execution block into a distributed transaction, but LLMs try to write code that is robust, which works against this pattern because it makes failures hard to understand.

jngiam1OP1y ago

Agree, the smolagent premise is good; but the hard part is handling execution, errors, etc.

For example, when the code execution fails mid-way, we really want the model to be able to pick up from where it failed (with the states of the variables at the time of failure) and be able to continue from there.

We've found that the LLM is able to generate correct code that picks up gracefully. The hard part now is building the runtime that makes that possible; we've something that works pretty well in many cases now in production at Lutra.

avereveard1y ago

I think in principle you can make the entire API exposed to the LLM idempotent, so that it becomes irrelevant to the backend whether the LLM replays the whole action or just the failed steps.
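One common way to get that property is an idempotency key per step. The sketch below is a toy in-memory backend, not any real API: `IdempotentApi`, `create_ticket`, and the key format are hypothetical names chosen for illustration.

```python
class IdempotentApi:
    """Toy backend: repeating a call with the same key returns the stored result."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first execution
        self.tickets = []   # side-effect state, to show it is not duplicated

    def create_ticket(self, idempotency_key, title):
        if idempotency_key in self._results:
            # Replay: return the original result without repeating the side effect.
            return self._results[idempotency_key]
        ticket = {"id": len(self.tickets) + 1, "title": title}
        self.tickets.append(ticket)
        self._results[idempotency_key] = ticket
        return ticket


api = IdempotentApi()
first = api.create_ticket("run-1/step-3", "Fix login")
# The generated code is re-run from the top after a failure elsewhere.
replayed = api.create_ticket("run-1/step-3", "Fix login")
print(first == replayed, len(api.tickets))
```

Deriving the key from the run and step, rather than from a random value, is what lets a full replay land on the same stored results.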

hooverd1y ago

Could you implement an actual state machine and have your agent work with that?

darkteflon1y ago· 3 in thread

What are the current best options for sandboxed execution environments? HuggingFace seems to have a tie-up with E2B, although by default smolagents runs something ephemeral in-process. I feel like there must be a good Docker container solution to this that doesn’t require signing up to yet another SaaS. Any recommendations?

ATechGuy1y ago

Are you looking for an open-source sandboxing solution? Self hosting is available for E2B. You still have to subscribe to a SaaS for ephemeral cloud compute though.

colonCapitalDee1y ago

Try gVisor

codethief1y ago

That sounds like a category error? An alternative OCI runtime is not what GP asked for.

[1] https://news.ycombinator.com/item?id=43909409

jacob0191y ago· 2 in thread

I've been building agentic systems for my ecommerce business. I evaluated smolagents. It's elegant and has a lot of appealing qualities, but adds a lot of complexity to the system. For some tasks it's perfect; dynamic reporting environments that can sort and aggregate data without a schema might be a good one. For most tasks it's just overkill. Gemini and OpenAI both offer Python interpreters as tools, which can cover a lot of the use cases for code agents.

It's true that cramming endless messages onto a stack of tool calls and interactions is not scalable; that is not a good way to use these tools. Most agentic workflows are short-lived. Complexity is managed with structure and discipline. These are well-known problems in software development, and the old lessons still apply to the new tools. Function calls can absolutely scale well in an agentic system, or they can become a mess, just like they can in any codebase.

Personally, building a system that works well is as much about managing cognitive load as the developer as it is about managing control flow and runtime performance. A simple solution that works well enough is usually superior to a clever solution with great performance. Composing function calls is the simple solution. Structured data can still be parsed and transformed the old-fashioned way. If the structure is unknown, even the cheap models are great at parsing.

Managing complexity in an agentic system can be broken down into a problem of carefully managing application state. The message stack can be manipulated as needed to feed the models the active context. It's memory management in a constrained environment.

qu0b1y ago

Great summary of the trade-offs in Agentic systems. We’ve tackled these exact challenges as we built out our conversational product discovery product for e-commerce at IsarTech [0].

I agree function composition and structured data are essential for keeping complexity in check. In our experience, well-defined structured outputs are the real scalability lever in tool calling. Typed schemas keep both cognitive load and system complexity manageable. We rely on deterministic behavior wherever possible, and reserve LLM processing for cases where schema-less data or ambiguity is involved. It's a great tool for mapping fuzzy user requests to a more structured deterministic system.

That said, taking complexity out of high-entropy input versus introducing complexity through chained tool calling is a tradeoff that needs to be struck carefully. In real-world commerce settings, you rarely get away with just one approach. Structured outputs are great until you hit ambiguous intents; then things get messy and you need fallback strategies.

[0] https://isartech.io/

jacob0191y ago

Ambiguity must be explicitly handled like uncertainty in predictive modeling, that can be challenging. I run into trouble with task complexity. At a certain point even the best models start making dumb mistakes, and it's tough to draw the line for decomposing tasks. Role playing to induce planning and reflection helps, but I feel that upper bound. I've noticed that the model performance declines when using constrained outputs. Last year I would go to all this trouble decomposing tasks in ways that seem silly given the current models. At the pace that things are moving, I expect to see models soon that can handle 10x complexity and 10mb context, I just hope I can afford to use them.

stavros1y ago· 2 in thread

I would really like to see output-aware LLM inference engines. For example, imagine if the LLM output some tokens that meant "I'm going to do a tool call now", and the inference engine (e.g. llama.cpp) changed the grammar on the fly so the next token could only be valid for the available tools.

Or, if I gave the LLM a list of my users and asked it to filter based on some criteria, the grammar would change to only output user IDs that existed in my list.

I don't know how useful this would be in practice, but at least it would make it impossible for the LLM to hallucinate for these cases.

molf1y ago

Of course it would hallucinate. It would just pick arbitrary/wrong values.

stavros1y ago

It would be wrong, but it wouldn't hallucinate non-existent IDs.

norcalkc1y ago· 2 in thread

> Allowing an execution environment to also access MCPs, tools, and user data requires careful design to where API keys are stored, and how tools are exposed.

If your tools are calling APIs on behalf of users, it's better to use OAuth flows to enable users of the app to give explicit consent to the APIs/scopes they want the tools to access. That way, tools use scoped tokens to make calls instead of API keys (or even client credentials), which are hard to manage and maintain.

vrv1y ago

Agreed, OAuth is certainly preferred for many reasons, but replace "API keys" with "OAuth access tokens" and you have the same fundamental challenge of ensuring an LLM or untrusted code never has access to the user's secrets.

iandanforth1y ago

Do you know of any examples which use MCP and oauth cleanly?

darkteflon1y ago· 2 in thread

We’ve been using smolagents, which takes this approach, and are impressed.

Slight tangent, but as a long term user of OpenAI models, I was surprised at how well Claude Sonnet 3.7 through the desktop app handles multi-hop problem solving using tools (over MCP). As long as tool descriptions are good, it’s quite capable of chaining and “lateral thinking” without any customisation of the system or user prompts.

For those of you using Sonnet over API: is this behaviour similar there out of the box? If not, does simply pasting the recently exfiltrated[1] “agentic” prompt into the API system prompt get you (most of the way) there?

3abiton1y ago

How does it compare to MCP servers?

darkteflon1y ago

Not sure if I correctly understand your question. I was saying that Sonnet 3.7 in the desktop app is good out-of-the-box at orchestrating tools exposed as MCP servers and asking whether that behaviour is also present over the Anthropic API or, if not, whether copy-pasting the exfiltrated system prompt gets you there.

iLoveOncall1y ago· 2 in thread

That's MCP for you.

MCP is literally just a wrapper around an API call, but because it has some LLM buzz sprinkled on top, people expect it to do some magic, when they wouldn't expect the same magic from the underlying API.

BeetleB1y ago

It is just a wrapper around an API call. And that's all you need for magic.

Explain how I would do this without an LLM:

https://blog.nawaz.org/posts/2025/May/gemini-figured-out-my-...

iLoveOncall1y ago

Is this a trick question? You lay out exactly how you would do this without an LLM in the prompt...

fullstackchris1y ago· 2 in thread

This is exactly what I've encountered, at least with Claude: it writes out huge artifacts (static ones retrieved from the file system or wherever) character for character. What I'm going to try this weekend is just integrating a Redis cache or SQLite into the MCP tool calls, so Claude doesn't have to write everything out character by character... no idea if it will work as expected...

Also looking into "fire and forget" tools, to see if that is even possible.

mehdibl1y ago

You don't have to use full write.

Use grep and edit lines and sequences instead of full files.

This way you can edit files with 50k LOC without issue, while Claude will blow up if you ever try to write such a file.

In that case grep is fine, but if I have a specific artifact I need to transport from one function to another, I'll need some sort of background set / get.

deadbabe1y ago· 2 in thread

I’m confused as to why no one is just having LLMs dynamically produce and expose new tools on the fly as combinations of many small tools or even write new functions from scratch, to handle cases where there isn’t an ideal tool to process some input with one efficient tool call.

keithwhor1y ago

I am building a company in this space, so can hopefully give some insight [0].

The issue right now is that both (1) function calling and (2) codegen just aren't really very good. The hype train far exceeds capabilities. Great demos like fetching some Stripe customers, generating an email or getting the weather work flawlessly. But anything more sophisticated goes off the rails very quickly. It's difficult to get models to reliably call functions with the right parameters, to set up multi-step workflows and more.

Add codegen into the mix and it's hairier. You need a deployment and testing apparatus to make sure the code actually works... and then what is it doing? Does it need secret keys to make web requests to other services? Should we rely on functions for those?

The price / performance curve is a consideration, too. Good models are slow and expensive. Which means their utility has to be higher in order to charge a customer to pay for the costs, but they also take a lot longer to respond to requests which reduces perception of value. Codegen is even slower in this case. So there's a lot of alpha in finding the right "mixture of models" that can plan and execute functions quickly and accurately.

For example, OpenAI's GPT-4.1-nano is the fastest function calling model on the market. But it routinely tries to execute the same function twice in parallel. So if you combine it with another fast model, like Gemini Flash, you can reduce error rates - e.g. 4.1-nano does planning, Flash executes. But this is non-obvious to anybody building these systems until they've tried and failed countless times.

I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said, I (and many others) am interested in making it happen!

[0] https://instant.bot

deadbabe1y ago

Well, in the meantime we could just have the LLM shoot Jira tickets at human developers to build out new tools it requires ASAP? And until it’s done have a placeholder message returned to the client? Could be a good way to keep developers working constantly. And eventually when the tech is good you replace the human devs with LLMs.

yahoozoo1y ago· 2 in thread

In the example request, they want a list of issues in their project but don’t need the ID of each issue. But, what about when you want a list of issues and DO want the ID?

vrv1y ago

If the output schema specifies an id field, the LLM can write a code snippet that references it based on the context of the subsequent request, but the LLM doesn't need to observe the underlying value unless necessary. E.g., it can pass the 'id' opaquely to another call that receives the "id" as an input. If the user specifically wants to see the "id", the code orchestration approach can have the LLM just print the content.

wyett1y ago

I had the same question.

koakuma-chan1y ago· 2 in thread

> TL;DR: Giving LLMs the full output of tool calls is costly and slow.

Is this true for all tool calls? Even if the tool returns little data?

https://github.com/pixlie/determined

From my experience it's about the speed of a very competent human. One of my favorite custom tools I've written is just access to a series of bash commands. I haven't tested with others, but Claude very quickly browses through files, reads them, and so on to do whatever it was you prompted. But even then it is all contextual: for example, I had to remove 'find' because, as one would expect, running 'find' against a huge directory set is very slow!

koakuma-chan1y ago

Well, the bottleneck there would usually be the LLM, because, e.g., a tool to inspect a filesystem directory would be very fast, and it wouldn't necessarily return a lot of data, so I am confused what this article is really trying to say.

bguberfain1y ago· 1 in thread

I think there may be another solution for this: have the LLM write valid code that calls the MCPs as functions. Think of it as a Python script where each MCP is mapped to a function. A simple example:

  def process(param1, param2):
      # Each MCP tool is exposed to the script as a plain function.
      my_data = mcp_get_data(param1)
      sorted_data = mcp_sort(my_data, by=param2)
      return sorted_data
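For anyone who wants to run the `process` example, here is a self-contained sketch. The two `mcp_*` functions are stand-ins I made up with in-memory data; a real runtime would forward those calls to MCP servers, and the store contents are purely illustrative.

```python
# Hypothetical stand-ins for MCP tools; a real runtime would forward these
# calls to MCP servers instead of using in-memory data.
FAKE_STORE = {
    "issues": [
        {"id": 3, "title": "Crash on save", "priority": 2},
        {"id": 1, "title": "Typo in docs", "priority": 5},
        {"id": 2, "title": "Slow search", "priority": 1},
    ]
}


def mcp_get_data(name):
    """Return a copy of the named collection."""
    return list(FAKE_STORE[name])


def mcp_sort(rows, by):
    """Sort rows ascending by the given field."""
    return sorted(rows, key=lambda row: row[by])


def process(param1, param2):
    # The kind of script the LLM would write: tool outputs flow between
    # calls as variables and never pass through the model's context.
    my_data = mcp_get_data(param1)
    sorted_data = mcp_sort(my_data, by=param2)
    return sorted_data


result = process("issues", "priority")
print([row["id"] for row in result])
```

The model only has to produce the few lines inside `process`; the rows themselves stay in the runtime, however many there are.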

jngiam1OP1y ago

Yes! If you want to see how this can work in practice, check out https://lutra.ai ; we've been using a similar pattern there. The challenge is making the code runtime work well for it.

arjunchint1y ago· 1 in thread

I am kind of confused why can't you just create a new MCP tool that encapsulates parsing and other required steps together in a code block?

This would be more reliable than expecting the LLM to generate working code 100% of the time?

Centigonal1y ago

You should for sure do this for common post processing tasks. However, you're usually not going to know all the types of post-processing users will want to do with tool call output at design-time.

abelanger1y ago· 1 in thread

> Most execution environments are stateful (e.g., they may rely on running Jupyter kernels for each user session). This is hard to manage and expensive if users expect to be able to come back to AI task sessions later. A stateless-but-persistent execution environment is paramount for long running (multi-day) task sessions.

It's interesting how architectural patterns built at large tech companies (for completely different use-cases than AI) have become so relevant to the AI execution space.

You see a lot of AI startups learning the hard way the value of event sourcing and (eventually) durable execution, but these patterns aren't commonly adopted on Day 1. I blame the AI frameworks.

(disclaimer - currently working on a durable execution platform)

th0ma51y ago

I see all of this as a constant negotiation of what is and isn't needed out of traditional computing. Eventually they find that what they want from any of it is determinism, unfortunately for LLMs.

brainless1y ago

This is something I have been attempting for quite a while now. One simple tool I started is a deterministic data extraction system where AI helps in finding out the data to be extracted but then the code would try and "template" it. When we have the template, the extraction on any similar string would happen deterministically.

Think of extracting parts of an email subject. LLM is great at going through unseen subject lines and telling us what can be extracted. We ask LLM what it found, where. For things like dates, times, city, country etc, we can then deterministically re-run on new strings to extract.
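A minimal sketch of the templating idea, assuming the LLM step has already returned the exact substrings it found in one example. `build_template` and `extract` are hypothetical names, the subject lines are made up, and this is not the code from any particular project:

```python
import re


def build_template(example, found):
    """Turn one annotated example into a regex with a named group per field.

    `found` maps field name -> exact substring identified in `example`
    (in the workflow described above, the LLM would supply this mapping).
    """
    pattern = re.escape(example)
    for name, value in found.items():
        # Replace the first occurrence of the literal value with a capture group.
        pattern = pattern.replace(re.escape(value), f"(?P<{name}>.+?)", 1)
    return re.compile(f"^{pattern}$")


def extract(template, text):
    """Apply the template deterministically; None means the shape did not match."""
    match = template.match(text)
    return match.groupdict() if match else None


# Hypothetical subject line and LLM answer, made up for illustration.
template = build_template(
    "Your flight to Lisbon on 2024-05-01",
    {"city": "Lisbon", "date": "2024-05-01"},
)
print(extract(template, "Your flight to Osaka on 2024-11-23"))
print(extract(template, "Invoice #123 is overdue"))
```

Strings that do not fit the template return `None`, which is the natural point to fall back to the LLM and learn a new template.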

visarga1y ago

Maybe we just need models that can reference spans by start:end range. Then they can pass arguments by reference instead of explicit quotation. We can use these spans as answers in extractive QA tasks, or as arguments for a code block, or to construct a graph from pointers, and do graph computation. If we define a "hide span" operation the LLM could dynamically open and close its context, which could lead to context size reduction. Basically - add explicit indexing to context memory, and make it powerful, the LLM can act like a CPU.

zackify1y ago

It’s because MCP return types are so basic. It’s text. Or image. Or one other type in the protocol I forget.

It’s not well thought out. I’ve been building one with the new auth spec and their official code and tooling is really lacking.

It could have been so much simpler and more straightforward by now.

Instead you have 3 different server types, and one is deprecated already (SSE). It’s almost funny.

quotemstr1y ago

In which the industry reinvents the concept of a schema-ful API surface like the kinds we've had for 30 years. Rediscovering the past shouldn't be revolutionary.


madrox1y ago· 6 in thread

I've been saying for two years that "any sufficiently advanced agent is indistinguishable from a DSL."

ianbicking1y ago

I think I understand what you're proposing, but I'm not sure.

So in concrete terms I'm imagining:

1. Create a prompt that gives the complete API specification and some general guidance about what role the agent will have.

3. Now the agent can use these modified versions of the API that expose only what's really important from its perspective.

4. But there's no reason APIs and functions have to map 1:1. You can wrap multiple APIs in one function, or break things up however makes the most sense.

5. Now the API-consuming agent is just writing library routines for other agents, and creating a custom environment for those agents.

6. This is all really starting to look like a team of programmers building a platform.

And really you could have multiple APIs being used in one function call, and any number of ways to rephrase the raw capabilities as more targeted and specific capabilities.


symbolicAGI1y ago

Evidence that the path to ASI is not extending the capabilities of LLMs, but instead distilling out and compiling self-improving algorithms externally in a symbolic application.

fooker1y ago

Can you point to evidence of widespread use of the word 'agent' in this context from two years ago?

lolinder1y ago

[0] https://hn.algolia.com/?dateEnd=1685491200&dateRange=custom&...

[1] https://news.ycombinator.com/item?id=37626877

https://news.ycombinator.com/item?id=37626877

madrox1y ago

nitwit0051y ago

I'm sure you can find it in chatbot documentation from the 90s. It's a generic term carried over from non-AI chat. People responding to support chats were called agents.

obiefernandez1y ago· 5 in thread

[1] https://github.com/shopify/roast

TheTaytay1y ago

drewda1y ago

Nice to see Ruby continuing to exist and deliver... even in the age of "AI"

crakhamster011y ago

This looks pretty cool! I'm curious how these sort of workflows are being used internally at Shopify. Any examples you can share?

bandoti1y ago

This is great! Reading the docs tickles my brain. Nice way to package up LLM functionality in a declarative way!

The_Blade1y ago

good stuff!

i just broke Claude Code Research Preview, and i've crashed ChatGPT 4.5 Pro Deep Research. and i have the receipts :), so i'm looking for tools that work

CSMastermind1y ago· 5 in thread

LLMs clearly struggle when presented with JSON, especially large amounts of it.

There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML for instance. But you could just use a template to define some narrative text.

ryoshu1y ago

iJohnDoe1y ago

Any reason for this for my own learning? Was XML more prevalent during training? Something better about XML that makes it easier for the LLM to work with?

XML seems more text heavy, more tokens. However, maybe more context helps?

CSMastermind1y ago

It's in the official OpenAI prompting guidelines: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...

And this only applies to the input. Earlier models struggled to reliably output JSON so they've been both fine-tuned and wrapped in specific formatters that reliably force clean JSON outputs.

nitwit0051y ago

I've seen the suggestion it's because it's been trained on a lot of HTML, but the GPT docs suggest using markdown as a default choice, which is relatively less common.

crabl1y ago

We've been using Markdown tables to return data to the LLM with some success

mehdibl1y ago· 4 in thread

The issue is not in function calls but HOW MCP got designed here and you are using.

Most MCP are replicating API. Returning blobs of data.

1. This is using a lot of input context in formating as JSON and escaping a Json inside already a JSON. 2. This contain a lot of irrelevant information that you can same on it.

So the issue is the MCP tool. It should instead flaten the data as possible as it's going back again thru JSON Encoding. And if needed remove some fields.

So MCP SAAS here are mainly API gateways.

That brings this noise! And most of ALL they are not optimizing MCP's.

jensneuse1y ago

jokethrowaway1y ago

MCP doesn't help but filtering is not always a good solution - sometimes you just need the agent to process a lot of data.

In that scenario running code on the data with minimum evaluation of the data (eg. a schema with explanation) is a much better approach and it will scale up to use cases of a certain complexity.

Even this system is not perfect: once your data definition and orchestration grow to big you'll face the same problems.

never_inline1y ago

> 1. This is using a lot of input context in formating as JSON and escaping a Json inside already a JSON.

Isn't it a model problem that they don't respect complex json schemas?

devoutsalsa1y ago

hintymad1y ago· 4 in thread

jngiam1OP1y ago

hintymad1y ago

Yes, and the other way around: use the deterministic methods to generate the best possible input to LLM.

nowittyusername1y ago

I agree. You want to use as little LLM as possible in your workflows.

mort961y ago

I've been developing software for decades without LLMs, turns out you can get away with very little!

padjo1y ago· 4 in thread

Sorry I’ve been out of the industry for the last year or so, is this madness really what people are doing now?

_se1y ago

No, not most people. But some people are experimenting.

No one has found anything revolutionary yet, but there are some useful applications to be sure.

padjo1y ago

Or, we have a hammer and we’re hitting things with it to see if they’re nails.

tobyhinloopen1y ago

Some people believe that if you're not doing this now, you might be out of the industry again pretty soon.

czechdeveloper1y ago

My daily job by now is massively using AI to develop AI agent designer, which means a lot of stuff like this.

I really did not even want this, it just happened.

codyb1y ago· 3 in thread

I'm slightly confused as to why you'd use a LLM to sort structured data in the first place?

jngiam1OP1y ago

kikimora1y ago

risyachka1y ago

Everything you described is already solved by Metabase and a few other tools. It takes a few hours to make daily reports there and the dashboard of your dreams.

And it's not like it changes every day. KPIs etc. stay the same for months. And then you can easily update it in an hour.

So what exactly does the LLM solve here?

avereveard1y ago· 3 in thread

That's kind of the entire premise of Hugging Face's smolagents, and while it does work really well when it works, it also increases the challenges in rolling back failed actions.

jngiam1OP1y ago

Agree, the smolagent premise is good; but the hard part is handling execution, errors, etc.

avereveard1y ago

I think in principle you can make the entire API exposed to the LLM idempotent, so that it becomes irrelevant to the backend whether the LLM replays the whole action or just the failed steps.
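
One way to picture this is idempotency keys: a sketch under the assumption that each step in the generated plan carries a stable key, so a full replay skips whatever already succeeded. The `IdempotentBackend` class, the `reserve`/`charge` actions, and the key format are hypothetical.

```python
class IdempotentBackend:
    """Caches results by idempotency key so replays do not repeat side effects."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def call(self, idempotency_key, action, *args):
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # If the action raises, nothing is cached and a retry runs it again.
        result = action(*args)
        self.executions += 1
        self._results[idempotency_key] = result
        return result


backend = IdempotentBackend()
ledger = []
attempts = {"charge": 0}


def reserve(item):
    ledger.append(f"reserved {item}")
    return "ok"


def charge(amount):
    attempts["charge"] += 1
    if attempts["charge"] == 1:
        raise RuntimeError("transient failure")  # simulated first-try failure
    ledger.append(f"charged {amount}")
    return "ok"


plan = [("order-1:reserve", reserve, "widget"), ("order-1:charge", charge, 10)]


def run(steps):
    for key, action, arg in steps:
        backend.call(key, action, arg)


try:
    run(plan)
except RuntimeError:
    run(plan)  # replay the whole plan; the completed step is served from cache
```

After the replay, the reservation has happened only once even though the whole plan ran twice.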

hooverd1y ago

Could you implement an actual state machine and have your agent work with that?
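
A minimal sketch of what that could look like, assuming the agent is only allowed to pick from the events valid in the current state. The states, events, and the `Workflow` class are invented for illustration.

```python
# Hypothetical workflow: state -> {event: next_state}
TRANSITIONS = {
    "draft": {"submit": "review"},
    "review": {"approve": "done", "reject": "draft"},
    "done": {},
}


class Workflow:
    """Explicit state machine; the agent proposes events, this class validates them."""

    def __init__(self, state="draft"):
        self.state = state

    def allowed_events(self):
        # This list is what you would show the agent as its available choices.
        return sorted(TRANSITIONS[self.state])

    def apply(self, event):
        try:
            self.state = TRANSITIONS[self.state][event]
        except KeyError:
            raise ValueError(
                f"{event!r} not allowed in state {self.state!r}"
            ) from None
        return self.state


wf = Workflow()
wf.apply("submit")  # an event chosen by the agent; invalid ones raise ValueError
```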

darkteflon1y ago· 3 in thread

ATechGuy1y ago

Are you looking for an open-source sandboxing solution? Self hosting is available for E2B. You still have to subscribe to a SaaS for ephemeral cloud compute though.

colonCapitalDee1y ago

Try gVisor

codethief1y ago

That sounds like a category error? An alternative OCI runtime is not what GP asked for.

[1] https://news.ycombinator.com/item?id=43909409

jacob0191y ago· 2 in thread

qu0b1y ago

Great summary of the trade-offs in Agentic systems. We’ve tackled these exact challenges as we built out our conversational product discovery product for e-commerce at IsarTech [0].

[0] https://isartech.io/

jacob0191y ago

stavros1y ago· 2 in thread

Or, if I gave the LLM a list of my users and asked it to filter based on some criteria, the grammar would change to only output user IDs that existed in my list.

I don't know how useful this would be in practice, but at least it would make it impossible for the LLM to hallucinate for these cases.
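
As a stand-in for a real grammar, here is a sketch that builds a JSON Schema with an `enum` of the known IDs, assuming the decoding backend accepts a schema for structured output, plus a plain check for backends that do not. The function names and sample IDs are hypothetical.

```python
import json


def user_id_schema(known_ids):
    """JSON Schema that only admits arrays of IDs from the given list."""
    return {
        "type": "array",
        "items": {"type": "string", "enum": sorted(known_ids)},
        "uniqueItems": True,
    }


def check_output(raw, known_ids):
    """Fallback validation: reject any ID that was not in the input list."""
    ids = json.loads(raw)
    unknown = [i for i in ids if i not in known_ids]
    if unknown:
        raise ValueError(f"unknown user IDs: {unknown}")
    return ids


known = {"u1", "u2", "u3"}
schema = user_id_schema(known)
selected = check_output('["u1", "u3"]', known)
```

This only rules out IDs that do not exist; it does not make the selection correct.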

molf1y ago

Of course it would hallucinate. It would just pick arbitrary/wrong values.

stavros1y ago

It would be wrong, but it wouldn't hallucinate non-existent IDs.

norcalkc1y ago· 2 in thread

> Allowing an execution environment to also access MCPs, tools, and user data requires careful design to where API keys are stored, and how tools are exposed.

vrv1y ago

iandanforth1y ago

Do you know of any examples which use MCP and oauth cleanly?

darkteflon1y ago· 2 in thread

We’ve been using smolagents, which takes this approach, and are impressed.

3abiton1y ago

How does it compare to MCP servers?

darkteflon1y ago

iLoveOncall1y ago· 2 in thread

That's MCP for you.

BeetleB1y ago

It is just a wrapper around an API call. And that's all you need for magic.

Explain how I would do this without an LLM:

https://blog.nawaz.org/posts/2025/May/gemini-figured-out-my-...

iLoveOncall1y ago

Is this a trick question? You lay out exactly how you would do this without an LLM in the prompt...

fullstackchris1y ago· 2 in thread

also looking into "fire and forget" tools, to see even if that is possible

mehdibl1y ago

You don't have to use full write.

Use grep and edit lines and line ranges instead of full files.

This way you can edit files with 50k LOC without issue, while Claude will blow up if you ever try to write such a file in full.
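
A small sketch of the idea, assuming the editing tool exposes a search step and a line-range replace step rather than a full-file write. The helpers `grep` and `replace_lines` and the sample `source` are hypothetical.

```python
import re


def grep(lines, pattern):
    """Return (line_number, text) pairs for lines matching the regex, 1-indexed."""
    rx = re.compile(pattern)
    return [(n, text) for n, text in enumerate(lines, start=1) if rx.search(text)]


def replace_lines(lines, start, end, new_lines):
    """Return a copy with lines start..end (1-indexed, inclusive) replaced."""
    if not 1 <= start <= end <= len(lines):
        raise ValueError("line range out of bounds")
    return lines[: start - 1] + list(new_lines) + lines[end:]


source = ["def add(a, b):", "    return a - b", "", "print(add(1, 2))"]
hits = grep(source, r"return")  # the model only sees the matching lines
line_no = hits[0][0]
patched = replace_lines(source, line_no, line_no, ["    return a + b"])
```

The model only has to emit the line range and the replacement text, not the whole file.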

In that case grep is fine, but if I have a specific artifact I need to transport from one function to another, I'll need some sort of background set / get.

deadbabe1y ago· 2 in thread

keithwhor1y ago

I am building a company in this space, so can hopefully give some insight [0].

I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said I (and many others) are interested in making it happen!

[0] https://instant.bot

deadbabe1y ago

yahoozoo1y ago· 2 in thread

In the example request, they want a list of issues in their project but don’t need the ID of each issue. But, what about when you want a list of issues and DO want the ID?

vrv1y ago

wyett1y ago

I had the same question.

koakuma-chan1y ago· 2 in thread

> TL;DR: Giving LLMs the full output of tool calls is costly and slow.

Is this true for all tool calls? Even if the tool returns little data?