What years of production-grade concurrency teaches us about building AI agents (opens in new tab)

(georgeguimaraes.com)

149 pointsellieh4mo ago51 comments

51 comments

45 comments · 16 top-level

simianwords4mo ago· 6 in thread

I don’t see the point of agent frameworks. Other than durability and checkpoints how does it help me?

Claude code already works as an agent that calls tools when necessary so it’s not clear how an abstraction helps here.

I have been really confused by langchain and related tech because they seem so bloated without offering me any advantages?

I genuinely would like to know what I’m missing.

d4rkp4ttern4mo ago

I ran into this question when thinking about the approach for a recent project. Yes CLI coding tools are good agents for interactive use, but if you are building a product then you do need an agent abstraction.

You could package Claude Code into the product (via agents-sdk or Claude -p) and have it use the API key (with metered billing) but in my case I didn’t find it ergonomic enough for my needs, so I ended up using my own agent framework Langroid for this.

https://github.com/langroid/langroid

(No it’s not based on that similarly named other framework, it’s a clean, minimal, extensible framework with good dx)

veunes4mo ago

You don't need frameworks for one-off scripts, but in prod, you're going to need RAG, proper memory, tools, and orchestration anyway. Without standards you'll just end up writing your own janky framework on top of requests. LangChain is definitely a bloated mess, but it provides structure. The beauty of Elixir is that this structure (OTP) is baked into the language, not duct-taped on the side

spoiler4mo ago

There's lots of things you could do. Imagine you're making a group chat bot (way more difficult than a 1-1 chat) where people can play social games by giving the LLM game rules. You can have an agent that only manages game state using natural language (controlled by the main LLM). You could have an agent dedicated to remembering important conversation, while not paying attention to chit-chatting

simianwords4mo ago

What more functionality do you need than the system prompt and list of tools?

znnajdla4mo ago

Even if you’re just running Claude Code as an external CLI then Elixir makes this easier because it supervises and handles external processes gracefully. Starting, checking, stopping processes needs to be done manually in most languages.

koakuma-chan4mo ago

100% agreed

mccoyb4mo ago· 4 in thread

Broadly agree with the author's points, except for this one:

> TypeScript/Node.js: Better concurrency story thanks to the event loop, but still fundamentally single-threaded. Worker threads exist but they're heavyweight OS threads, not 2KB processes. There's no preemptive scheduling: one CPU-bound operation blocks everything.

This cannot be a real protest: 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

Even if you use heavyweight OS threads, I just don't believe this matters.

Now, the other points about hot code swapping ... so true, painfully obvious to those of us who have used Elixir or Erlang.

For instance, OpenClaw: how much easier would "in-place updating" be if the language runtime was just designed with the ability in mind in the first place.

znnajdla4mo ago

> 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

But that’s exactly where multi threaded Elixir is better! You want a single thread like Node for CPU bound work, you want extreme multi threading for I/O bound work like AI agents. In Elixir you can do both: heavy CPU work without worrying about stopping the world, and heavy concurrency across millions of threads where work is I/O bound and you want to saturate your network connection. In Node you can’t do either of those things easily - it’s just a single thread.

znnajdla4mo ago

> Even if you use heavyweight OS threads, I just don't believe this matters.

It matters a lot. How many OS threads can you run on 1 machine? With Elixir you can easily run thousands without breaking a sweat. But even if you need only a few agents on one machine, OS thread management is a headache if you have any shared state whatsoever (locks, mutexes, etc.). On Unix you can't even reliably kill dependent processes[1]. All those problems just disappear with Elixir.

[1] https://matklad.github.io/2023/10/11/unix-structured-concurr...

wqaatwt4mo ago

Presumably if you can afford to pay for all those tokens the computational cost should be mostly insignificant?

Spending too much time optimizing for the 1% of extra overhead seems suboptimal..

1 more reply

kibwen4mo ago

> How many OS threads can you run on 1 machine?

Any modern Linux machine should be able to spawn thousands of simultaneous threads without breaking a sweat.

quadruple4mo ago· 4 in thread

> The BEAM's "let it crash" philosophy takes the opposite approach. Instead of anticipating every failure mode, you write the happy path and let processes crash. The supervisor detects the crash and restarts the process in a clean state. The rest of the system continues unaffected.

Do I want this? If my request fails because the tool doesn't have a DB connection, I want the model to receive information about that error. If the LLM API returns an error because the conversation is too long, I want to run compacting or other context engineering strategies, I don't want to restart the process just to run into the same thing again. Am I misunderstanding Elixir's advantage here?

asa4004mo ago

You can (and should) always handle whatever errors you actually want to and are able handle, so if that means catching a lot of them and forwarding them to a model, that's not a problem.

The benefit comes mainly from what happens when you encounter unknown errors or errors that you can't handle or errors that would get you into an invalid state. It's normal in BEAM languages to handle the errors you want to/can handle and let the runtime deal with the other transient/unknown errors to restart the process into a known good state.

The big point really is preventing state corruption, so the types of patterns the BEAM encourages will go a long way toward preventing you from accidentally ending up in some kind of unknown zombie state with your model, like for example if your model or control plane think they are connected to each other but actually aren't. That kind of thing.

Happy to clarify more if this sounds strange.

quadruple4mo ago

It doesn't sound strange, it actually sounds sane and what everyone should be doing.

At the same time, I can't imagine the last time I had a random exception I didn't think about in prod, but I guess that's the whole point of the BEAM, just don't think about it at all.

I might take a stab at Elixir, the concepts seem interesting and the syntax looks to be up my alley.

1 more reply

znnajdla4mo ago

“Let it crash” doesn’t mean keep bashing your head against the wall. Elixir makes it easy to write state machines which reason about different types of failures, but it’s more declarative (this process requires X and Y preconditions, otherwise do Z) rather than imperative (I have to try/catch failures due to X and Y, now do Z). With Elixir you can actually specify that the process doesn’t start until the DB connection is ready, if that was the cause of the failure, it won’t start again (something else can take care of the DB). When the LLM API returns an error you can put the agent in a paused “errored” state and then you can have a different process decide what to do with the error, and pass it back to the main agent when it’s done. This is all really elegant functional code in Elixir compared to try/catches and if statements.

dumpsterdiver4mo ago

Especially now that those workloads might have something to say about it… e.g. “Why did you make me this way?”

fud1014mo ago· 4 in thread

This is all well and good but Elixir with Phoenix and Liveview is super bloated and you have to have a real reason to buy into such a monster that you couldn't do with a simpler stack.

monooso4mo ago

I'm not sure I agree with the "bloated" description, but I will say that I really like Elixir, and really dislike LiveView. Which is a shame, because the latter is pretty inescapable in Elixir world these days.

ipnon4mo ago

Interesting take. I’m an Elixir fanboy because I find LiveView to be very slim. You’re just sending state diffs over WebSocket and updating the DOM with MorphDOM. The simplicity is incomparable to state of the art with JavaScript frameworks in my humble opinion.

fud1014mo ago

I'd use something like a lightweight python framework (take your pick) and pair it with htmx. You can run that on low powered hardware or a cheap VPS. I can't even dev elixir on my N100 minipc, it's too demanding. Otherwise Python and SolidJS or Preact will work perfectly for a SPA.

1 more reply

btreecat4mo ago

That seems more complicated than jQuery

manojlds4mo ago· 4 in thread

Surely they mean Erlang not Elixir

igravious4mo ago

addressed at the very top of the article

   A note on terminology: Throughout this post I refer to "the BEAM." BEAM is
   the virtual machine that runs both Erlang and Elixir code, similar to how the
   JVM runs both Java and Kotlin. Erlang (1986) created the VM and the
   concurrency model. Elixir (2012) is a modern language built on top of it with
   better ergonomics. When I say "BEAM," I mean the runtime and its properties.
   When I say "Elixir," I mean the language we write.

manojlds4mo ago

How is that addressing the title

christophilus4mo ago

Sounds to me like they mean “BEAM” rather than a specific language. But BEAM means Elixir for most newcomers.

cyberpunk4mo ago

Which is a real shame as if you actually spend some time with both you’ll probably eventually realise erlang is the nicer language.

Elixir just feels… Like it’s a load of pre-compile macros. There’s not even a debugger.

1 more reply

mackross4mo ago· 3 in thread

I’m a huge elixir fan, but imho it doesn’t solve durable execution out of the box which is a major problem that often gets swept under the rug by BEAM fanboys. Because ETS and supervision trees don’t play well with deployment via restart, you’ve got to write some level of execution state to relational database or files. You can choose persistent ETS, mnesia, etc, (which have their own tradeoffs and come with some kind of gnarley data-loss scenarios in deep documentation). But, whatever you choose, in my experience you will need to spend a fair amount of time considering how your processes are going to survive restarts. Alternatively, Oban is nice, but it’s a heavy layer that makes control flow more complex to follow. And, yes you can roll your own hot code deploy and run in persistent VMs/bare metal and be a true BEAM native, but it’s not easy out of the box and comes with its own set of foot guns. If I’m missing something, I would love for someone to explain to me how to do things better, as I find this to be a big pain point whenever I pick up elixir. I want to use the beautiful primitives, but I feel I’m always fighting durable execution in the event of a server restart. I wish a temporal.io client or something with similar guarantees was baked into the lang/frameworks.

hugobarauna4mo ago

Regarding adding persistence, people in the community are already building solutions, like: https://github.com/ChristianAlexander/durable_object

The good thing about those, IMO, is that they’re leveraging everything that’s already in BEAM/OTP, so there’s no need to reinvent the harder parts. They “only” add some extra features (like persistence of processes/GenServers between restarts) and higher-level abstraction APIs.

mackross4mo ago

Durable objects looks interesting! Thanks for the link

veunes4mo ago

Spot on. BEAM is great at surviving process crashes, but if the whole cluster goes down or you redeploy, that in-memory state evaporates. It's not magic. For agents that might hang around for days, pure Elixir isn't enough, you still need a persistence layer. The ecosystem is catching up (Oban Pro, FLAME), but in reality, we're still building hybrids: fast actors for active chats and a good old DB for history and long-running processes

bitwize4mo ago· 2 in thread

Ackshually...

Erlang didn't introduce the actor model, any more than Java introduced garbage collection. That model was developed by Hewitt et al. in the 70s, and the Scheme language was developed to investigate it (core insights: actors and lambdas boil down to essentially the same thing, you really don't need much language to support some really abstract concepts).

Erlang was a fantastic implementation of the actor model for an industrial application, and probably proved out the model's utility for large-scale "real" work more than anything else. That and it being fairly semantically close to Scheme are why I like it.

josevalim4mo ago

The team that built Erlang (Joe, Robert, Mike, and Bjorn) didn't know the actor model was actually a thing. They wanted to build reliable distributed systems and came up with the isolated processes model you find in Erlang today. Eventually (probably when Erlang was open sourced?), folks connected the dots that the actor model was the most accurate description of what was going on!

bitwize4mo ago

Spontaneous evolution of the same idea. It actor-models when it's actor-model time!

randomtoast4mo ago· 1 in thread

I’ve built fairly large OTP systems in the past, and I think the core claim is directionally right: long lived, stateful, failure prone "conversations" map very naturally to Erlang processes plus supervision trees. An agent session is basically a call session with worse latency and more nondeterminism.

That said, a lot of current agent workloads are I/O bound around external APIs. If 95% of the time is waiting on OpenAI or Anthropic, the scheduling model matters less than people think. The BEAM’s preemption and per process GC shine when you have real contention or CPU heavy work in the same runtime. Many teams quietly push embeddings, parsing, or model hosting to separate services anyway.

Hot code swapping is genuinely interesting in this context. Updating agent logic without dropping in flight sessions is non trivial on most mainstream stacks. In practice though, many startups are comfortable with draining connections behind a load balancer and calling it a day.

So my take is: if you actually need millions of concurrent, stateful, soft real time sessions with strong fault isolation, the BEAM is a very sane default. If you are mostly gluing API calls together for a few thousand users, the runtime differences are less decisive than the surrounding tooling and hiring pool.

znnajdla4mo ago

If you are just “gluing together API calls” that’s exactly where Elixir state machines and supervisors make your agentic code so much easier to write. API calls are constantly failing, timing out, retrying, and being pre/post processed, and you need to keep track of the state of your agent across multiple failure modes, recover gracefully, and coordinate between processes. In most other languages like Python or TypeScript that’s a hell of a lot of process orchestration code (Bull/Celery), try/catches, health checks, retry logic, and global state management in a a database of some sort. Compare that to Elixir processes where you get most of that for free because the language was designed exactly for that. It’s just a state machine that can crash and recover gracefully in the right order with all related processes. The BEAM is not just about running millions of processes at scale. It’s also about simplifying how you reason about a single process lifecycle, and long running agent lifecycles can get really complex.

When you are just “gluing together API calls” surrounding tooling doesn’t matter as much. I don’t care that Elixir doesn’t have such a large community as Python, I’m just gluing together API calls, I don’t have dependencies.

znnajdla4mo ago· 1 in thread

Is someone at Anthropic reading my AI chats? I literally came to this conclusion a few weeks ago after studying the right framework for building long running browser agents. Have zero experience with Elixir but as soon as I looked at their language constructs it just "clicked" immediately that this is exactly suited to AI agent frameworks. Another problem you solve immediately: distributed deployment without kubernetes hell.

amelius4mo ago

> Is someone at Anthropic reading my AI chats?

It is not forbidden by their EULA/ToS, I suppose.

zknill4mo ago

There's two parts to this article. The scheduler/preemption, and the transport over the network. The article is absolutely right that long-lived request/response over HTTP connections with SSE streamed responses suck.

The article touches very briefly on Phoenix LiveView and Websockets. I wrote about why chatbots hate page refresh[1], and it's not solved by just swapping to Websockets. By far the best mechanism is pub/sub, especially when you can get multi-user/multi-device, conversation hand-off, re-connection, history resumes, and token compaction basically for free from the transport.

1: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

veunes4mo ago

The Let it crash concept is perfect for deterministic bugs, but does it work for probabilistic errors?

If an LLM returns garbage, restarting the process (agent) with the same prompt and temperature 0 yields the same garbage. An Erlang Supervisor restarts a process in a clean state. For an agent "clean state" = lost conversation context

We don't just need Supervision Trees, we need Semantic Supervision Trees that can change strategy on restart. BEAM doesn't give this out of the box, you still code it manually

nottorp4mo ago

I don't understand. Shouldn't we just task the "AI" agents to improve themselves?

What's that about years of experience? That's obsolete thinking!

codethief4mo ago

@dang I think the original title is much better (more specific):

> Your Agent Framework Is Just a Bad Clone of Elixir: Concurrency Lessons from Telecom to AI

vinnymac4mo ago

Been using Elixir with agents for over a year. Seemed like an obvious choice to me.

Node is great, but scaling Elixir threads is more so.

walletdrainer4mo ago

Every single post on this blog is AI generated spam, all commenters seem completely oblivious.

Are you guys okay? WTF is going on with HN?

There’s one interesting detail about this blog though, you can see how the LLM-generated spam improves over the years as models get better.

koakuma-chan4mo ago

You Should Just Use Rust

j / k navigate · click thread line to collapse

51 comments

45 comments · 16 top-level

simianwords4mo ago· 6 in thread

I don’t see the point of agent frameworks. Other than durability and checkpoints how does it help me?

Claude code already works as an agent that calls tools when necessary so it’s not clear how an abstraction helps here.

I have been really confused by langchain and related tech because they seem so bloated without offering me any advantages?

I genuinely would like to know what I’m missing.

d4rkp4ttern4mo ago

https://github.com/langroid/langroid

(No it’s not based on that similarly named other framework, it’s a clean, minimal, extensible framework with good dx)

veunes4mo ago

spoiler4mo ago

simianwords4mo ago

What more functionality do you need than the system prompt and list of tools?

znnajdla4mo ago

koakuma-chan4mo ago

100% agreed

mccoyb4mo ago· 4 in thread

Broadly agree with the author's points, except for this one:

Even if you use heavyweight OS threads, I just don't believe this matters.

Now, the other points about hot code swapping ... so true, painfully obvious to those of us who have used Elixir or Erlang.

For instance, OpenClaw: how much easier would "in-place updating" be if the language runtime was just designed with the ability in mind in the first place.

znnajdla4mo ago

> 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

znnajdla4mo ago

> Even if you use heavyweight OS threads, I just don't believe this matters.

[1] https://matklad.github.io/2023/10/11/unix-structured-concurr...

wqaatwt4mo ago

Presumably if you can afford to pay for all those tokens the computational cost should be mostly insignificant?

Spending too much time optimizing for the 1% of extra overhead seems suboptimal..

1 more reply

kibwen4mo ago

> How many OS threads can you run on 1 machine?

Any modern Linux machine should be able to spawn thousands of simultaneous threads without breaking a sweat.

quadruple4mo ago· 4 in thread

asa4004mo ago

You can (and should) always handle whatever errors you actually want to and are able handle, so if that means catching a lot of them and forwarding them to a model, that's not a problem.

Happy to clarify more if this sounds strange.

quadruple4mo ago

It doesn't sound strange, it actually sounds sane and what everyone should be doing.

At the same time, I can't imagine the last time I had a random exception I didn't think about in prod, but I guess that's the whole point of the BEAM, just don't think about it at all.

I might take a stab at Elixir, the concepts seem interesting and the syntax looks to be up my alley.

1 more reply

znnajdla4mo ago

dumpsterdiver4mo ago

Especially now that those workloads might have something to say about it… e.g. “Why did you make me this way?”

fud1014mo ago· 4 in thread

This is all well and good but Elixir with Phoenix and Liveview is super bloated and you have to have a real reason to buy into such a monster that you couldn't do with a simpler stack.

monooso4mo ago

ipnon4mo ago

fud1014mo ago

1 more reply

btreecat4mo ago

That seems more complicated than jQuery

manojlds4mo ago· 4 in thread

Surely they mean Erlang not Elixir

igravious4mo ago

addressed at the very top of the article

   A note on terminology: Throughout this post I refer to "the BEAM." BEAM is
   the virtual machine that runs both Erlang and Elixir code, similar to how the
   JVM runs both Java and Kotlin. Erlang (1986) created the VM and the
   concurrency model. Elixir (2012) is a modern language built on top of it with
   better ergonomics. When I say "BEAM," I mean the runtime and its properties.
   When I say "Elixir," I mean the language we write.

manojlds4mo ago

How is that addressing the title

christophilus4mo ago

Sounds to me like they mean “BEAM” rather than a specific language. But BEAM means Elixir for most newcomers.

cyberpunk4mo ago

Which is a real shame as if you actually spend some time with both you’ll probably eventually realise erlang is the nicer language.

Elixir just feels… Like it’s a load of pre-compile macros. There’s not even a debugger.

1 more reply

mackross4mo ago· 3 in thread

hugobarauna4mo ago

Regarding adding persistence, people in the community are already building solutions, like: https://github.com/ChristianAlexander/durable_object

mackross4mo ago

Durable objects looks interesting! Thanks for the link

veunes4mo ago

bitwize4mo ago· 2 in thread

Ackshually...

josevalim4mo ago

bitwize4mo ago

Spontaneous evolution of the same idea. It actor-models when it's actor-model time!

randomtoast4mo ago· 1 in thread

znnajdla4mo ago

znnajdla4mo ago· 1 in thread

amelius4mo ago

> Is someone at Anthropic reading my AI chats?

It is not forbidden by their EULA/ToS, I suppose.

zknill4mo ago

1: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

veunes4mo ago

The Let it crash concept is perfect for deterministic bugs, but does it work for probabilistic errors?

We don't just need Supervision Trees, we need Semantic Supervision Trees that can change strategy on restart. BEAM doesn't give this out of the box, you still code it manually

nottorp4mo ago

I don't understand. Shouldn't we just task the "AI" agents to improve themselves?

What's that about years of experience? That's obsolete thinking!

codethief4mo ago

@dang I think the original title is much better (more specific):

> Your Agent Framework Is Just a Bad Clone of Elixir: Concurrency Lessons from Telecom to AI

vinnymac4mo ago

Been using Elixir with agents for over a year. Seemed like an obvious choice to me.

Node is great, but scaling Elixir threads is more so.

walletdrainer4mo ago

Every single post on this blog is AI generated spam, all commenters seem completely oblivious.

Are you guys okay? WTF is going on with HN?

There’s one interesting detail about this blog though, you can see how the LLM-generated spam improves over the years as models get better.

koakuma-chan4mo ago

You Should Just Use Rust

j / k navigate · click thread line to collapse