Rather than asking an agent to internalize its algorithm, you should teach it an API and then ask it to design an algorithm which you can run that in user space. There are very few situations where I think it makes sense (for cost or accuracy) for an LLM to internalize its algorithm. It's like asking asking an engineer to step through a function in their head instead of just running it.
So in concrete terms I'm imagining:
1. Create a prompt that gives the complete API specification and some general guidance about what role the agent will have.
2. In that prompt ask it to write a function that can be concisely used by the agent, written to be consumed from the agent and with the agent's perspective. The body of that function translates the agent-oriented function definition to an API call.
3. Now the agent can use these modified versions of the API that expose only what's really important from its perspective.
4. But there's no reason APIs and functions have to map 1:1. You can wrap multiple APIs in one function, or break things up however made most sense.
5. Now the API-consuming agent is just writing library routines for other agents, and creating a custom environment for those agents.
6. This is all really starting to look like a team of programmers building a platform.
7. You could design the whole thing top-down as well, speculating then creating the functions the agents will likely want, and using whatever capabilities you have to implement those functions. The API calls are just an available set of functionality.
And really you could have multiple APIs being used in one function call, and any number of ways to rephrase the raw capabilities as more targeted and specific capabilities.
4. Now the
Also, since you're implicitly questioning OP's claim to have been saying this all along, here's a comment from September 2023 where they first said the same quote and said they'd been building agents for 3 months by that point [1]. That's close enough to 2 years in my book.
[0] https://hn.algolia.com/?dateEnd=1685491200&dateRange=custom&...
i just broke Claude Code Research Preview, and i've crashed ChatGPT 4.5 Pro Deep Research. and i have the receipts :), so i'm looking for tools that work
There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML for instance. But you could just use a template to define some narrative text.
XML seems more text heavy, more tokens. However, maybe more context helps?
But it's also evident for anyone who has used these models. It's also not unique to OpenAI, this bias is prevalent in every model I've ever tested from GPT 3 to the latest offerings from every single frontier model provider.
As to why I would guess it's because XML bakes semantic meaning into the tags it uses so it's easier for the model to understand the structure of the data. <employee>...</employee> is a lot easier to understand than { "employee": { ... }}.
I would guess that the models are largely ignoring the angular brackets and just focusing on the words which have unique tokens and thus are easier to pair up than the curly braces that are the same throughout JSON. Just speculation on my part though.
And this only applies to the input. Earlier models struggled to reliably output JSON so they've been both fine-tuned and wrapped in specific formatters that reliably force clean JSON outputs.
Most MCP are replicating API. Returning blobs of data.
1. This is using a lot of input context in formating as JSON and escaping a Json inside already a JSON. 2. This contain a lot of irrelevant information that you can same on it.
So the issue is the MCP tool. It should instead flaten the data as possible as it's going back again thru JSON Encoding. And if needed remove some fields.
So MCP SAAS here are mainly API gateways.
That brings this noise! And most of ALL they are not optimizing MCP's.
In that scenario running code on the data with minimum evaluation of the data (eg. a schema with explanation) is a much better approach and it will scale up to use cases of a certain complexity.
Even this system is not perfect: once your data definition and orchestration grow to big you'll face the same problems.
This should allow you to scale to pretty complex problems though, while the naive approach of just embedding API responses in the chat fails soon (I run into this issue frequently, maintaining a relatively simple systems with a few tool calls).
The only proper solution is reproducing the level of granularity of human decisions in code and call this "decisional system" from an LLM (which would be then reduced to a mere language interface between human language and the internal system). Easier said than done, though.
Isn't it a model problem that they don't respect complex json schemas?
No one has found anything revolutionary yet, but there are some useful applications to be sure.
I really did not even want this, it just happened.
And its not like it changes every day. KPis etc stay the same for months. And then you can easily update it in a hour.
So what exactly does llm solve here?
I guess one could in principle wrap the entire execution block into a distributed transaction, but llm try to make code that is robust, which works against this pattern as it makes hard to understand failure
For example, when the code execution fails mid-way, we really want the model to be able to pick up from where it failed (with the states of the variables at the time of failure) and be able to continue from there.
We've found that the LLM is able to generate correct code that picks up gracefully. The hard part now is building the runtime that makes that possible; we've something that works pretty well in many cases now in production at Lutra.
I agree function composition and structured data are essential for keeping complexity in check. In our experience, well-defined structured outputs are the real scalability lever in tool calling. Typed schemas keep both cognitive load and system complexity manageable. We rely on deterministic behavior wherever possible, and reserve LLM processing for cases where schema-less data or ambiguity is involved. Its a great tool for mapping fuzzy user requests to a more structured deterministic system.
That said, finding the right balance between taking complexity out of high entropy input or introducing complexity through chained tool calling is a tradeoff and balance that needs to be struck carefully. In real-world commerce settings, you rarely get away with just one approach. Structured outputs are great until you hit ambiguous intents—then things get messy and you need fallback strategies.
Or, if I gave the LLM a list of my users and asked it to filter based on some criteria, the grammar would change to only output user IDs that existed in my list.
I don't know how useful this would be in practice, but at least it would make it impossible for the LLM to hallucinate for these cases.
If your tools are calling APIs on-behalf of users, it's better to use OAuth flows to enable users of the app to give explicit consent to the APIs/scopes they want the tools to access. That way, tools use scoped tokens to make calls instead of hard to manage, maintain API keys (or even client credentials).
Slight tangent, but as a long term user of OpenAI models, I was surprised at how well Claude Sonnet 3.7 through the desktop app handles multi-hop problem solving using tools (over MCP). As long as tool descriptions are good, it’s quite capable of chaining and “lateral thinking” without any customisation of the system or user prompts.
For those of you using Sonnet over API: is this behaviour similar there out of the box? If not, does simply pasting the recently exfiltrated[1] “agentic” prompt into the API system prompt get you (most of the way) there?
MCP is literally just a wrapper around an API call, but because it has some LLM buzz sprinkled on top, people expect it to do some magic, when they wouldn't expect the same magic from the underlying API.
Explain how I would do this without an LLM:
https://blog.nawaz.org/posts/2025/May/gemini-figured-out-my-...
also looking into "fire and forget" tools, to see even if that is possible
Use grep & edit lines. and sequences instead of full files.
This way you can edit files with 50kl loc without issue while Claude will blow out if you ever try to write such file.
The issue right now is that both (1) function calling and (2) codegen just aren't really very good. The hype train far exceeds capabilities. Giving great demos like fetching some Stripe customers, generating an email or getting the weather work flawlessly. But anything more sophisticated goes off the rails very quickly. It's difficult to get models to reliably call functions with the right parameters, to set up multi-step workflows and more.
Add codegen into the mix and it's hairier. You need a deployment and testing apparatus to make sure the code actually works... and then what is it doing? Does it need secret keys to make web requests to other services? Should we rely on functions for those?
The price / performance curve is a consideration, too. Good models are slow and expensive. Which means their utility has to be higher in order to charge a customer to pay for the costs, but they also take a lot longer to respond to requests which reduces perception of value. Codegen is even slower in this case. So there's a lot of alpha in finding the right "mixture of models" that can plan and execute functions quickly and accurately.
For example, OpenAI's GPT-4.1-nano is the fastest function calling model on the market. But it routinely tries to execute the same function twice in parallel. So if you combine it with another fast model, like Gemini Flash, you can reduce error rates - e.g. 4.1-nano does planning, Flash executes. But this is non-obvious to anybody building these systems until they've tried and failed countless times.
I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said I (and many others) are interested in making it happen!
Is this true for all tool calls? Even if the tool returns little data?
def process(param1, param2):
my_data = mcp_get_data(param1)
sorted_data = mcp_sort(my_data, by=param2)
return sorted_dataThis would be more reliable than expecting the LLM to generate working code 100% of the time?
It's interesting how architectural patterns built at large tech companies (for completely different use-cases than AI) have become so relevant to the AI execution space.
You see a lot of AI startups learning the hard way that value of event sourcing and (eventually) durable execution, but these patterns aren't commonly adopted on Day 1. I blame the AI frameworks.
(disclaimer - currently working on a durable execution platform)
Think of extracting parts of an email subject. LLM is great at going through unseen subject lines and telling us what can be extracted. We ask LLM what it found, where. For things like dates, times, city, country etc, we can then deterministically re-run on new strings to extract.
It’s not well thought out. I’ve been building one with the new auth spec and their official code and tooling is really lacking.
It could have been so much simpler and straight forward by now.
Instead you have 3 different server types and one is deprecated already (SSE) it’s almost funny