Monty: A minimal, secure Python interpreter written in Rust for use by AI

(github.com)

323 points · dmpetrov · 4mo ago · 164 comments

103 comments · 34 top-level

avaer4mo ago· 21 in thread

This feels like the time I was a Mercurial user before I moved to Git.

Everyone was using git for reasons that seemed bandwagon-y to me, when Mercurial just had such a better UX and mental model.

Now, everyone is writing agent `exec`s in Python, when I think TypeScript/JS is far better suited for the job (it was always fast + secure, not to mention more reliable and information dense b/c of typing).

But I think I'm gonna lose this one too.

miki1232114mo ago

3 reasons why Python is much better than JS for this IMO.

1. Large built-in standard library (CSV, sqlite3, xml/json, zipfile).

2. In Python, whatever the LLM is likely to do will probably work. In JS, you have the Node / Deno split, far too many libraries that do the same thing (XMLHttpRequest / Axios / fetch), many mutually-incompatible import syntaxes (E.G. compare tsx versus Node's native ts execution), and features like top-level await (very important for small scripts, and something that an LLM is likely to use!), which only work if you pray three times on the day of the full moon.

3. Much better ecosystem for data processing (particularly csv/pandas), partially resulting from operator overloading being a thing.

Tade04mo ago

> In JS, you have the Node / Deno split,

You do? Deno is maybe a single digit percentage of the market, just hyped tremendously.

> E.G. compare tsx versus Node's native ts execution

JSX/TSX, despite what React people might want you to believe, are not part of the language.

> which only work if you pray three times on the day of the full moon.

It only doesn't work in some contexts due to legacy reasons. Otherwise it's just elaborate syntax sugar for `Promise`.

pjmlp4mo ago

In Python you also have plenty of implementations to choose from, and incidentally many of them have even better performance than CPython.

63stack4mo ago

>In Python, whatever the LLM is likely to do will probably work.

Do you not realize how this sounds?

>many mutually-incompatible import syntaxes

Do you think there are 22 competing package managers in python because the package/import system "just works"?

giancarlostoro4mo ago

Having been doing Python for over a decade, and JavaScript too, I would pick Python any day of the week over JavaScript. JavaScript is beautiful, and also the most horrific programming language, all at once. It still feels incomplete, and there are too many oddities I've run into over the years: checking for null, empty, and undefined values, for example, is inconsistent all around because different libraries behave differently.

whilenot-dev4mo ago

TBF is the Python ecosystem any different? None and dict everywhere, requirements.txt without pinned versions... I'm not complaining either, as I wouldn't expect a unified typed experience in ecosystems where multiple competing type checkers and package managers have been introduced gradually. How could any library from the python3.4 era foresee dataclasses or the typing module?

Such changes take time, and I favor an "evolution trumps revolution"-approach for such features. The JS/TS ecosystem has the advantage here, as it has already been going through its roughest time since es2015. In hindsight, it was a very healthy choice and the type system with TS is something to be left desired in many programming languages.

If it weren't for its rich standard library and uv, I would still clearly favor TS and a runtime like bun or deno. Python still suffers from spread out global state and some multi-paradigm approach when it comes to concurrency (if concurrency has even been considered by the library author). Python being the first programming language for many scientists shows its toll too: rich libraries of dubious quality in various domains. Whereas JS' origins in browser scripting contributed to the convention to treat global state as something to be frowned upon.

I wish both systems had good object schema validation built into the standard library. Python has the upper hand here with dataclasses, but it still follows a "take it or throw" approach rather than supporting customization of validations.
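To make the dataclasses point concrete, here is a minimal sketch of what the standard library does and does not do today. The class names `Unchecked` and `Checked` and the port-range rule are made up for illustration; the only stdlib behaviour relied on is that `dataclasses` does not enforce type hints and that `__post_init__` runs after the generated `__init__`.

```python
from dataclasses import dataclass


@dataclass
class Unchecked:
    # The annotation is documentation only; dataclasses does not enforce it.
    port: int


@dataclass
class Checked:
    port: int

    def __post_init__(self) -> None:
        # Hand-written validation: the stdlib offers the hook, not the rules.
        if isinstance(self.port, bool) or not isinstance(self.port, int):
            raise TypeError(f"port must be an int, got {type(self.port).__name__}")
        if not 0 < self.port < 65536:
            raise ValueError(f"port out of range: {self.port}")


loose = Unchecked(port="8080")  # accepted silently, the value stays a str

try:
    Checked(port="8080")
    checked_error = None
except TypeError as exc:
    checked_error = str(exc)

print(type(loose.port).__name__, "|", checked_error)
```

Anything beyond this (coercion, nested models, custom error reporting) has to be written by hand or pulled in from a third-party library.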

nine_k4mo ago

For historical reasons (FFI), Python has access to excellent vector / tensor mathematics (numpy / scipy / pandas / polars) and ML / AI libraries, from OpenCV to PyTorch. Hence the prevalence of Python in science and research. "Everybody knows Python".

I do like Typescript (not JS) better, because of its highly advanced type system, compared to Python's.

TS/JS is not inherently fast, it just has a good JIT compiler; Python still ships without one. Regarding security, each interpreter is about as permissive as the other, and both can be sealed off from environment pretty securely.

shoeb00m4mo ago

A big benefit of letting agents run code is they can process data without bloating their context.

LLMs are really good at writing Python for data processing. I would suspect it's due to Python having a really good ecosystem around this niche.

And the type safety/security issues can hopefully be mitigated by ty and pyodide (already used by cf’s python workers)

https://pyodide.org/en/stable/

https://github.com/astral-sh/ty

DouweM4mo ago

(Pydantic AI lead here) That’s exactly what we built this for: we’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 which will use Monty by default, with abstractions to use other runtimes / sandboxes.

Monty’s overhead is so low that, assuming we get the security / capabilities tradeoff right (Samuel can comment on this more), you could always have it enabled on your agents with basically no downsides, which can’t be said for many other code execution sandboxes which are often over-kill for the code mode use case anyway.

For those not familiar with the concept, the idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see (all of) the intermediate value. Every step that depends on results from an earlier step requires a new LLM turn, limiting parallelism and adding a lot of overhead, expensive token usage, and context window bloat.

With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.
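The contrast between traditional tool calling and code mode described above can be sketched in plain Python. This is only an illustration: `list_orders` is a hypothetical tool with made-up data, and nothing here uses Monty's or Pydantic AI's actual API. It just shows why filtering inside model-written code sends less back to the LLM.

```python
import json

# Made-up data standing in for a verbose (MCP) tool result.
ORDERS = {
    "c-1": [
        {"id": "o-1", "total": 40.0, "status": "shipped", "notes": "x" * 500},
        {"id": "o-2", "total": 15.5, "status": "pending", "notes": "y" * 500},
        {"id": "o-3", "total": 99.0, "status": "shipped", "notes": "z" * 500},
    ]
}


def list_orders(customer_id: str) -> list:
    """Hypothetical tool: returns full order records."""
    return ORDERS[customer_id]


def traditional_tool_call(customer_id: str) -> str:
    """Traditional calling: the entire tool result goes back into the context."""
    return json.dumps(list_orders(customer_id))


def code_mode_script(customer_id: str) -> str:
    """Code mode: model-written code picks out the needed fields first."""
    shipped = [o for o in list_orders(customer_id) if o["status"] == "shipped"]
    summary = {
        "shipped_count": len(shipped),
        "shipped_total": sum(o["total"] for o in shipped),
    }
    return json.dumps(summary)


full = traditional_tool_call("c-1")
small = code_mode_script("c-1")
print(f"traditional: {len(full)} chars, code mode: {len(small)} chars")
```

In a real agent the body of `code_mode_script` would be generated by the model and executed in a sandbox, with `list_orders` exposed as an external function.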

pjmlp4mo ago

Agreed, however AI adoption is finally putting pressure on CPython to have a JIT in the box, so there is that.

And on GPU side, the existing libraries provide DSL based JITs, thus for many scenarios the performance is not much different from C++.

Now NVIDIA is also in the game with the new tile-based architecture, with first-party support for writing kernels in Python even.

verdverm4mo ago

Don't extrapolate

There is a ton of wheel reinvention going on right now cause everyone wants to be cool in the age of ai

Use boring tech, you'll thank me and yourself later

Which in this case means, just use regular python. Your devops team is unlikely to allow knock off python in production. TS is fine too, I mainly write Go

rzerowan4mo ago

Tangentially, I wonder if the recent changes to the GIL will percolate to Mercurial as improvements.

Yep, still using good old hg for personal repos - interop for outside projects defaults to git since almost all the hg hosts withered.

trenchgun4mo ago

Python has uv, ruff, ty

odiroot4mo ago

For me it's the opposite. I'm actively looking for tools in Python because at least they're gonna be lightweight and easy for me to debug.

Really tired of every AI-related tool released as of late being a half-GB node behemoth with hundreds of library dependencies.

Or alternatively some cryptic academic Rust codebase.

woadwarrior014mo ago

I remember the time when Python was the underdog and most AI/ML code was written in Matlab or Lua (torch). People would roll their eyes when you told them that you were doing deep learning with Python (theano).

piskov4mo ago

Can we please make as little js as possible?

Why would one drag this god forsaken abomination on server-side is beyond me.

Even effing C# nowadays can be run in a script-like manner from a single file.

—

Even the latest Codex UI app is Electron. The one that is supposed to write itself with AI wonders but couldn’t manage native swiftui, winui, and qt or whatever is on linux these days.

aryonoco4mo ago

My favourite languages are F# and OCaml, and from my perspective, TypeScript is a far better language than C#.

Typescript’s types are far more adaptable and malleable, even with the latest C# 15 which is belatedly adding Sum Types. If I set TypeScript to its most strict settings, I can even make it mimic a poor man’s Haskell and write existential types or monoids.

And JS/TS have by far the best libraries and utilities for JSON and xml parsing and string manipulation this side of Perl (the difference being that the TypeScript version is actually readable), and maybe Nushell but I’ve never used Nushell in production.

Recently I wrote a Linux CLI tool for managing Podman/Quadlet containers in TypeScript, and it was a joy to use. The Effect library gave me proper Error types and immutable data types, and the Bun Shell makes writing shell commands in TS nearly as easy as Bash. And I got it to compile to a single self-contained binary which I can run on any server and which has a lower memory footprint and faster startup time than any equivalent .NET code I’ve ever written.

And yes, had I written it in Rust it would have been faster and probably even safer, but for a quick and dirty tool, development speed matters, and I can tell you that I really appreciated not having to think about ownership and fighting the borrow checker the whole time.

TypeScript might not be perfect, but it is a surprisingly good language for many domains and is still undervalued IMO given what it provides.

mcintyre19944mo ago

> and qt or whatever is on linux these days.

When you put it like that I can see why people end up with electron!

IshKebab4mo ago

I would say the same about Python, a language that has clearly got far too big for its boots.

bee_rider4mo ago

Python has the advantage that everybody sort of knows it is bad and slow, which is an important trait for a glue language. This increases the incentive to do the right thing: call a library written in C or Fortran or something.

wiseowise4mo ago

It might be slow, but it is definitely not bad. On the contrary, it is a great language. The closest to pseudocode you can get in a mainstream language.

simonw4mo ago· 9 in thread

I got a WebAssembly build of this working and fired up a web playground for trying it out: https://simonw.github.io/research/monty-wasm-pyodide/demo.ht...

It doesn't have class support yet!

But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.

Notes on how I got the WASM build working here: https://simonwillison.net/2026/Feb/6/pydantic-monty/

jstanley4mo ago

> But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.

This is true in a sense, but every little papercut at the lower levels of abstraction degrades performance at higher levels as the LLM needs to spend its efforts on hacking around jank in the Python interpreter instead of solving the real problem.

qwertox4mo ago

It is a workaround, so we can assume that this will be temporary and in the future the ai will then start using them once it can. Probably just like we would do.

saberience4mo ago

I really don't understand the use-case here.

My models are writing code all day in 3/4 different languages, why would I want to:

a) Restrict them to Python

b) Restrict them to a cutdown, less-useful version of Python?

My models write me Typescript and C# and Python all day with zero issues. Why do I need this?

falcor844mo ago

For extremely rapid iteration - they can run a quick script with this in under 1ms - it removes a significant bottleneck, especially for math-heavy reasoning

srcreigh4mo ago

It’s a sandbox. If your model generates and runs a script for each email in your inbox and has access to sensitive information, you want to make sure it can’t communicate externally.

zahlman4mo ago

For sandboxing, as described in the README.

vghaisas4mo ago

This is very cool, but I'm having some trouble understanding the use cases.

Is this mostly just for codemode where the MCP calls instead go through a Monty function call? Is it to do some quick maths or pre/post-processing to answer queries? Or maybe to implement CaMeL?

It feels like the power of terminal agents is partly because they can access the network/filesystem, and so sandboxed containers are a natural extension?

16bitvoid4mo ago

It's right there in the README.

> Monty avoids the cost, latency, complexity and general faff of using full container based sandbox for running LLM generated code.

> Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.

otabdeveloper44mo ago

> and rewrite their code to not use classes instead

Only if the training data has enough Python code that doesn't use classes.

(We're in luck that these things are trained on Stackoverflow code snippets.)

OutOfHere4mo ago· 7 in thread

It is absurd for any user to use a half baked Python interpreter, also one that will always majorly lag behind CPython in its support. I advise sandboxing CPython instead using OS features.

bityard4mo ago

Python already has a lot of half-baked (all the way up to nearly-fully-baked) interpreters, what's one more?

https://en.wikipedia.org/wiki/List_of_Python_software#Python...

simonw4mo ago

How do I sandbox CPython using OS features?

(Genuine question, I've been trying to find reliable, well documented, robust patterns for doing this for years! I need it across macOS and Linux and ideally Windows too. Preferably without having to run anything as root.)

nickpsecurity4mo ago

It could be difficult. My first thought would be a SELinux policy like this article attempted:

https://danwalsh.livejournal.com/28545.html

One might have different profiles with different permissions. A network service usually wouldn't need your home directory while a personal utility might not need networking.

Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to main process over a specific channel. Nothing else. People usually also meter their CPU, RAM, etc.
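The subprocess-style approach above (a main process talking to a metered child over one channel) can be sketched with the standard library on POSIX systems. This is not a sandbox by itself: it only caps CPU time and address space and uses stdout as the single channel. Filesystem and network restrictions would still need an OS policy such as SELinux or seccomp. The helper names `_cap`, `_limit_resources`, and `run_limited`, and the specific limits, are illustrative assumptions.

```python
import resource
import subprocess
import sys


def _cap(limit: int, value: int) -> None:
    """Lower a resource limit to `value`, never above the current hard limit."""
    _soft, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def _limit_resources() -> None:
    # Runs in the child between fork and exec (POSIX only).
    _cap(resource.RLIMIT_CPU, 2)        # at most 2 seconds of CPU time
    _cap(resource.RLIMIT_AS, 1 << 30)   # at most 1 GiB of address space


def run_limited(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run `code` in a separate, resource-limited CPython process."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode
        capture_output=True,                 # stdout/stderr are the only channel back
        text=True,
        timeout=timeout,
        preexec_fn=_limit_resources,         # avoid in multi-threaded parents
    )


result = run_limited("print(sum(range(10)))")
print(result.returncode, result.stdout.strip())
```

A child that spins forever is terminated by the kernel once it exceeds the CPU limit, and the wall-clock `timeout` covers a child that merely sleeps.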

INTEGRITY RTOS had language-specific runtimes, esp Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with it. Then, some middleware for inter-process communication let them talk to each other.

OutOfHere4mo ago

Docker and other container runners allow it. https://containers.dev/ allows it too.

https://github.com/microsoft/litebox might somehow allow it too if a tool can be built on top of it, but there is no documentation.

avaer4mo ago

The repo does make a case for this, namely speed, which does make sense.

sd2k4mo ago

True, but while CPython does have a reputation for slow startup, completely re-implementing isn't the only way to work around it - e.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm and pre-compile it, to get real CPython starting in ~15ms, without compromising on language features. It's doable!

[1] https://github.com/eryx-org/eryx

OutOfHere4mo ago

Speed is not a feature if there isn't even syntax parity with CPython.

imfing4mo ago· 4 in thread

This is a really interesting take on the sandboxing problem. This reminds me of an experiment I worked on a while back (https://github.com/imfing/jsrun), which embedded V8 into Python to allow running JavaScript with tightly controlled access to the host environment. Similar in goal to run untrusted code in Python.

I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing sufficient language capabilities to handle non-trivial snippets that LLMs generate to do complex tasks

scolvin4mo ago

Can't be sure where this might end, but the primary goal is to enable codemode/programmatic tool calling, using the external function call mechanism for anything more complicated.

I think in the near term we'll add support for classes, dataclasses, datetime, json. I think that should be enough for many use cases.

digdugdirk4mo ago

I'd love to see this paired up with Pydantic for a lightweight pydantic based configuration "language". Similar to CUElang, but using pydantic to describe the configuration models themselves.

ushakov4mo ago

there’s no way around VMs for secure, untrusted workloads. everything else, like Monty, has too many tradeoffs that make it non-viable for any real workloads

disclaimer: i work at E2B, opinions my own

scolvin4mo ago

As discussed on twitter, v8 shows that's not true.

But to be clear, we're not even targeting the same "computer use" use case I think e2b, daytona, cloudflare, modal, fly.io, deno, google, aws are going after - we're aiming to support programmatic tool calling with minimal latency and complexity - it's a fundamentally different offering.

Chill, e2b has its use case, at least for now.

zahlman4mo ago· 3 in thread

> Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.

Perhaps if the interpreter is in turn embedded in the executable and runs in-process, but even a do-nothing `uv` invocation takes ~10ms on my system.

I like the idea of a minimal implementation like this, though. I hadn't even considered it from an AI sandboxing perspective; I just liked the idea of a stdlib-less alternative upon which better-thought-out "core" libraries could be stacked, with less disk footprint.

Have to say I didn't expect it to come out of Pydantic.

preciousoo4mo ago

Pydantic + FastAPI are my two favorite python shops right now, they’re always dropping fun new projects

Cyphase4mo ago

uv is written in Rust, not Python.

zahlman4mo ago

Yes. That's why I compare it (a compiled Rust executable) to Monty (a compiled Rust executable). The point is that loading large compiled executables into memory takes long enough to raise an objection to the "startup times measured in single digit microseconds not hundreds of milliseconds" claim.
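The distinction being argued here, in-process execution versus launching an executable, is easy to measure. The sketch below times CPython itself, not Monty, so the absolute numbers say nothing about Monty's claims. It only shows the gap between running a snippet inside an already-loaded interpreter and starting a fresh process for it. The snippet and run counts are arbitrary choices.

```python
import subprocess
import sys
import time

SNIPPET = "x = sum(range(100))"


def time_in_process(runs: int = 100) -> float:
    """Average seconds to compile and exec the snippet inside this process."""
    start = time.perf_counter()
    for _ in range(runs):
        exec(compile(SNIPPET, "<snippet>", "exec"), {})
    return (time.perf_counter() - start) / runs


def time_new_process(runs: int = 5) -> float:
    """Average seconds to start a fresh interpreter process for the same snippet."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", SNIPPET], check=True)
    return (time.perf_counter() - start) / runs


in_proc = time_in_process()
new_proc = time_new_process()
print(f"in-process: {in_proc * 1e6:.1f} us, new process: {new_proc * 1e3:.1f} ms")
```

Microsecond-scale startup is therefore only plausible for an interpreter embedded in the host process, which is the caveat raised above.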

kodablah4mo ago· 3 in thread

I'm of the mind that it will be better to construct more strict/structured languages for AI use than to reuse existing ones.

My reasoning is 1) AIs can comprehend specs easily, especially if simple, 2) it is only valuable to "meet developers where they are" if really needing the developers' history/experience which I'd argue LLMs don't need as much (or only need because lang is so flexible/loose), and 3) human languages were developed to provide extreme human subjectivity which is way too much wiggle-room/flexibility (and is why people have to keep writing projects like these to reduce it).

We should be writing languages that are super-strict by default (e.g. down to the literal ordering/alphabetizing of constructs, exact spacing expectations) and only having opt-in loose modes for humans and tooling to format. I admit I am toying w/ such a lang myself, but in general we can ask more of AI code generations than we can of ourselves.

bityard4mo ago

I think the hard part about that is you first have to train the model on a BUTT TON of that new language, because that's the only way they "learn" anything. They already know a lot of Python, so telling them to write restricted and sandboxed Python ("you can only call _these_ functions") is a lot easier.

But I'd be interested to see what you come up with.

kodablah4mo ago

> that's the only way they "learn" anything

I think skills and other things have shown that a good bit of learning can be done on-demand, assuming good programming fundamentals and no surprise behavior. But agreed, having a large corpus at training time is important.

I have seen that, given a solid spec for a never-before-seen lang, modern models can do a great job of writing code in it. I've done no research on the ability to leverage a large stdlib/ecosystem this way though.

> But I'd be interested to see what you come up with.

Under active dev at https://github.com/cretz/duralade, super POC level atm (work continues in a branch)

Terretta4mo ago

> you first have to train the model on a BUTT TON of that new language

Tokenization joke?

c2xlZXB54mo ago· 3 in thread

Maybe a dumb question, but couldn't you use seccomp to limit/deny the amount of syscalls the Python interpreter has access to? For example, if you don't want it messing with your host filesystem, you could just deny it from using any filesystem related system calls? What is the benefit of using a completely separate interpreter?

oofbey4mo ago

Yours is a valid approach. But you always gotta wonder if there’s some way around it. When you start with a runtime that has ways of accessing every aspect of your system, there are a lot of ways an attacker might try to defeat the blocks you put in place. The point of starting with something super minimal is that the attack surface is tiny. Really hard to see how anything could break out.

ushakov4mo ago

agree. you still need a secure boundary like VM to isolate the tenants in case the model breaks out of the sandbox.

everything that you don’t want your agent to access should live outside of the sandbox.

thundergolfer4mo ago

https://github.com/butter-dot-dev/bvisor is pushing in that direction

krick4mo ago· 3 in thread

I don't quite understand the purpose. Yes, it's clearly stated, but, what do you mean "a reasonable subset of Python code" while "cannot use the standard library"? 99.9% of Python I write for anything ever uses standard library and then some (requests?). What do you expect your LLM-agent to write without that? A pseudo-code sorting algorithm sketch? Why would you even want to run that?

impulser_4mo ago

They plan to use it for "Code Mode", which means the LLM will use this to run Python code that it writes to call tools, instead of having to load the tools up front into the LLM context window.

DouweM4mo ago

(Pydantic AI lead here) We’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 with support for Monty and abstractions to use other runtimes / sandboxes.

The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.

With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.

notepad0x904mo ago

It's pydantic: they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level, where they analyze the intended data flow and find logical bugs, or code with good syntax and typing that isn't what was intended.

For example, incorrect levels of indentation. Let me use dots instead of space because of HN formatting:

for key,val in mydict.items():

..if key == "operation":

....logging.info("Executing operation %s",val)

..if val == "drop_table":

....self.drop_table()

This uses good syntax, and the logging call needs the stdlib `logging` module, which isn't available, so I assume it would ignore it or replace it with dummy code? That shouldn't prevent it from analyzing that loop and determining that the second if-block was intended to be under the first, and that the way it is written now, the key check isn't done.

In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.

EDIT: I think my speculation was wrong, it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai , I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.

JoshPurtell4mo ago· 2 in thread

Monty is the missing link that's made me ship my rust-based RLM implementation - and I'm certain it'll come in handy in plenty of other contexts.

Just beware of panics!

JoshPurtell4mo ago

rlm-rs: https://crates.io/crates/rlm-rs src: https://github.com/synth-laboratories/Horizons

scolvin4mo ago

Please report any panics, we'll fix them!

geysersam4mo ago· 2 in thread

Is ai running regular python really a problem? I see that in principle there is an issue. But in practice I don't know anyone who's had security issues from this. Have you?

scolvin4mo ago

No one is going to let an LLM get prompted by end users to write python code I just run on my server, there's no real debate on that.

ushakov4mo ago

i think there’s a confusion around what use-case Monty is solving (i was confused as well). this seems to isolate in a scope of execution like function calls, not entire Python applications

theanonymousone4mo ago· 2 in thread

I wish someone commanded their agent to write a Python "compiler" targeting WASM. I'm quite surprised there is still no such thing in this day and age...

johndough4mo ago

Not sure if this is what you are looking for, but here is Python compiled to WASM: https://pyodide.org/en/stable/

Web demo: https://pyodide.org/en/stable/console.html

theanonymousone4mo ago

No it's not. It's an "interpreter": The whole interpreter binary (in wasm) as well as the Python source is transferred to the client to be executed.

andai4mo ago· 2 in thread

Doesn't the agent already have bash though?

My current security model is to give it a separate Linux user.

So it can blow itself up and... I think that's about it?

zahlman4mo ago

> Doesn't the agent already have bash though?

You don't have to give it bash, depending on your tools at least.

> So it can blow itself up and... I think that's about it?

And exfiltrate data via the Internet, fill up disk space...

andai4mo ago

It can already exfiltrate stuff in a VM though right? Like people will run this thing in a sandboxed environment in docker in a VM but then hook it up to GMail and also feed it random web content (web search tool, Twitter integration etc.).

I saw at least some interest in a better security model where for example instead of giving it the API keys, there's a broker that rewrites the curl requests and injects keys so the agent doesn't see them.

I'm not sure what that looks like for your emails or web content though, since using placeholders there would defeat the purpose.
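The broker idea mentioned above can be sketched as a small function that sits between the agent and the network: the agent only ever writes a placeholder, and the broker swaps in the real key for hosts it trusts. Everything here is hypothetical (the `Request` type, the `{{API_KEY}}` placeholder, the host names), and a real broker would be a separate process or proxy rather than a function in the agent's own address space.

```python
from dataclasses import dataclass, field, replace
from urllib.parse import urlsplit

PLACEHOLDER = "{{API_KEY}}"  # the only "credential" the agent ever sees


@dataclass(frozen=True)
class Request:
    method: str
    url: str
    headers: dict = field(default_factory=dict)


def inject_credentials(request: Request, secrets: dict) -> Request:
    """Return a copy of `request` with the placeholder replaced by the real key.

    `secrets` maps host name -> API key; requests to any other host are refused,
    so a prompt-injected agent cannot redirect the key to an arbitrary server.
    """
    host = urlsplit(request.url).hostname
    if host not in secrets:
        raise PermissionError(f"no credentials may be sent to {host!r}")
    headers = {
        name: value.replace(PLACEHOLDER, secrets[host])
        for name, value in request.headers.items()
    }
    return replace(request, headers=headers)


secrets = {"api.example.com": "s3cret"}
agent_request = Request(
    method="GET",
    url="https://api.example.com/v1/items",
    headers={"Authorization": f"Bearer {PLACEHOLDER}"},
)
outgoing = inject_credentials(agent_request, secrets)
print(outgoing.headers["Authorization"])
```

As the comment notes, this protects keys but not content: email bodies or fetched web pages still have to be shown to the model in full.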

rienbdj4mo ago· 2 in thread

If we’re going to have LLMs write the code, why not something more performant? Like pages and pages of Java maybe?

scolvin4mo ago

this is pretty performant for short scripts if you measure time "from code to rust" which can be as low as 1us.

Of course it's slow for complex numerical calculations, but that's not the primary use case.

I think the consensus is that LLMs are very good at writing python and ts/js, generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using python/js/ts.

catlifeonmars4mo ago

Seems like we should fix the LLMs instead of bending over backwards no?

spacedatum4mo ago· 2 in thread

There is no reason to continue writing Python in 2026. Tell Claude to write Rust a priori. Your future self will thank you.

JoshPurtell4mo ago

I do both and compile times are very unfriendly to AI!

spacedatum4mo ago

Compile times, I can live with. You can run previous models on the gpu while your new model is compiling. Or switch from cargo to bazel if it is that bad.

throwa3562624mo ago· 1 in thread

I really like this!

Claude Code always resorts to running small python scripts to test ideas when it gets stuck.

Something like this would mean I don't need to approve every single experiment it performs.

stingraycharles4mo ago

Didn’t Anthropic recently acquire some JavaScript engine, though?

I figured that that was because they want tighter integration and a safer execution environment for code written by the LLM. And sandboxing is already very common for JavaScript in browsers.

ontouchstart4mo ago· 1 in thread

I wonder when the title will be upgraded to “A minimal, secure Rust interpreter written in Python for use by AI”.

Any human or AI want to take the challenge?

ontouchstart4mo ago

We already have a starting point:

https://play.rust-lang.org

https://github.com/rust-lang/rust-playground

Retr0id4mo ago· 1 in thread

I'm enjoying watching the battle for where to draw the sandbox boundaries (and I don't have any answers, either!)

ushakov4mo ago

best answer is probably to have a layered approach - use this to limit what the generated code can do, wrap it in a secure VM to prevent leaking out to other tenants.

wewewedxfgdf4mo ago· 1 in thread

If I say my code is secure does that make it secure?

Or is all Rust code secure unquestionably?

maxbond4mo ago

Of course not, especially when the security model is about access to resources like file systems that are outside the scope of what the Rust compiler can verify. While you won't have a data race in safe Rust you absolutely can have data races accessing the file system in any language.

Their security model, as explained in the README, is in not including the standard library and limiting all access to the environment to functions you write & control. Does that make it secure? I'll leave it to you to evaluate that in the context of your use case/threat model.
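The shape of that security model (no ambient capabilities, with every effect going through functions the host hands in) can be shown with a toy evaluator. To be clear, this is not Monty's implementation and is nowhere near a real interpreter. It is a minimal sketch using the stdlib `ast` module that accepts only constants, four arithmetic operators, and calls to host-provided functions. The name `evaluate` and the `double` function are made up.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(source: str, host_functions: dict) -> object:
    """Evaluate a tiny expression language; anything not listed is rejected."""

    def visit(node: ast.AST) -> object:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and not node.keywords:
            # The only way to reach the outside world: a function the host passed in.
            if node.func.id not in host_functions:
                raise NameError(f"function {node.func.id!r} is not exposed by the host")
            return host_functions[node.func.id](*[visit(arg) for arg in node.args])
        # Attribute access, imports, subscripts, lambdas, etc. all end up here.
        raise SyntaxError(f"unsupported construct: {type(node).__name__}")

    return visit(ast.parse(source, mode="eval"))


value = evaluate("double(3) + 1", {"double": lambda x: x * 2})
print(value)
```

Calls such as `open('/etc/passwd')` or `__import__('os')` fail with `NameError` because the host never exposed those names. Resource exhaustion (for example a huge string multiplication) is not addressed here and would need separate limits.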

It would appear to me that they used Rust primarily because a.) they want to deliver very fast startup times and b.) they want it to be accessible from a variety of host languages (like Python and JavaScript). Those are things Rust does well, though not to the exclusion of C or other GC-free compiled languages. They certainly do not claim that Rust is pixie dust you sprinkle on a project to make it secure. That would clearly be cargo culting.

I find this language war tiring. Don't you? Let's make 2026 the year we all agree to build cool stuff in whatever language we want without this pointless quarreling. (I've personally been saying this for three years at this point.)

dvershinin4mo ago

The no-stdlib limitation is the elephant in the room. Most useful Python isn't pure computation — it's reading files, making HTTP requests, parsing JSON. Without that, you've basically built a safe eval() for math and string manipulation.

The security argument makes sense in theory, but in practice the moment your agent needs to do anything interesting you're back to running real Python with real syscalls. seccomp + namespaces already solve this on Linux without rewriting the interpreter.

matheus-rr4mo ago

Interesting trade-off: build a minimal interpreter that's "good enough" for AI-generated code rather than trying to match CPython feature-for-feature.

The security angle is probably the most compelling part. Running arbitrary AI-generated Python in a full CPython runtime is asking for trouble — the attack surface is enormous. Stripping it down to a minimal subset at least constrains what the generated code can do.

The bet here seems to be that AI-generated code can be nudged to use a restricted subset through error feedback loops, which honestly seems reasonable for most tool-use scenarios. You don't need metaclasses and dynamic imports to parse JSON or make API calls.
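As a rough illustration of that feedback loop, the sketch below uses Python's standard `ast` module to flag constructs outside a restricted subset and produce messages that could be sent back to the model. The deny-list is an assumption made up for this example, not Monty's actual rules, and `check_subset` is a hypothetical helper.

```python
import ast

# Hypothetical deny-list for illustration; not Monty's actual rules.
UNSUPPORTED = {
    ast.ClassDef: "class definitions",
    ast.Import: "import statements",
    ast.ImportFrom: "import statements",
}


def check_subset(source: str) -> list[str]:
    """Return one error message per unsupported construct found in source."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        for node_type, label in UNSUPPORTED.items():
            if isinstance(node, node_type):
                errors.append(f"line {node.lineno}: {label} are not supported")
    return errors
```

An agent loop would append any returned messages to the next prompt and ask the model to rewrite the script; an empty list means the script stays inside the subset.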

nudpiedo4mo ago

Serious question: why not JUST use SELinux on generated scripts?

The scripts would have access to the original runtimes and ecosystems, and SELinux can’t be tampered with; it’s well tested, and no amount of forks and tricky indirections will get around the syscall restrictions.

Such runtimes come with a bill of technical debt: no support, their own documentation, and missing ecosystem and language features. And let’s hope it isn’t abandoned in two years.

The same could be said for docker or nix Linux, or isolated containers, etc… the level of security should be good enough for LLMs, even if it isn’t secure against threats directed by humans (specialist hackers).

_joel4mo ago

Well I love the name, so definitely trying this out later, but first...

And now for something, completely different.

SafeDusk4mo ago

Sandboxing is going to be of growing interest as more agents go “code mode”.

Will explore this for https://toolkami.com/, which allows plug and play advanced “code mode” for AI agents.

globular-toast4mo ago

I don't get what "the complexity of a sandbox" is. You don't have to use Docker. I've been running agents in bubblewrap sandboxes since they first came out.[0]

If the agent can only use the Python interpreter you choose then you could just sandbox regular Python, assuming you trust the agent. But I don't trust any of them because they've probably been vibe coded, so I'll continue to just sandbox the agent using bubblewrap.

[0] https://blog.gpkb.org/posts/ai-agent-sandbox/
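For readers who have not used bubblewrap, here is a hedged sketch of what such an invocation can look like, built as an argument list in Python. It is not the configuration from the linked post; the mount layout assumes a merged-`/usr` distribution, and `bwrap_argv` is a hypothetical helper. The flags themselves (`--ro-bind`, `--symlink`, `--proc`, `--dev`, `--tmpfs`, `--bind`, `--chdir`, `--unshare-all`, `--share-net`, `--die-with-parent`) are standard `bwrap` options.

```python
def bwrap_argv(workdir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a bwrap command line: read-only system, one writable work directory."""
    argv = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",      # system files, read-only
        "--symlink", "usr/bin", "/bin",   # assumes a merged-/usr layout
        "--symlink", "usr/lib", "/lib",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",                # private scratch space
        "--bind", workdir, workdir,       # the only writable host path
        "--chdir", workdir,
        "--unshare-all",                  # new namespaces, including network
        "--die-with-parent",
    ]
    if allow_network:
        argv.append("--share-net")        # opt back in to the host network
    return argv + command
```

The resulting list can be passed to `subprocess.run`; for example, `bwrap_argv("/home/me/project", ["python3", "script.py"])` runs the script with no network and no view of the rest of the home directory.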

hypertexthero4mo ago

Potentially unrelated tangent thought:

The Man Who Listens to Horses (1997) is an excellent book by Monty Roberts about learning the language of horses and observing and listening to animals: https://www.biblio.com/search.php?stage=1&title=The+Man+Who+...

Video demonstration of the above: https://www.youtube.com/watch?v=vYtTz9GtAT4

iandanforth4mo ago

Totally reasonable project for many reasons but fast tools for AI always makes me chuckle. Imagine your job is delivering packages and along the delivery route one of your coworkers is a literal glacier. It doesn't really matter how fast you walk, run, bike, or drive. If part of your delivery chain tops out at 30 meters per day you're going to have a slow delivery service. The ratio between the speed of code execution and AI "thinking" is worse than this analogy.
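A back-of-the-envelope check of that last claim, with loudly assumed numbers: a 5 km/h nonstop walking pace for the courier and a 2 second model turn are both assumptions made for this sketch, while the 0.06 ms start latency is the figure quoted elsewhere in the thread.

```python
# Analogy: walking courier versus a glacier moving 30 m/day.
GLACIER_M_PER_DAY = 30
WALK_M_PER_DAY = 5_000 * 24          # assumed 5 km/h, walking nonstop
analogy_ratio = WALK_M_PER_DAY / GLACIER_M_PER_DAY

# Reality: interpreter start-up versus one model turn.
STARTUP_S = 0.06e-3                  # 0.06 ms start latency quoted in the thread
LLM_TURN_S = 2.0                     # assumption: one model response takes about 2 s
latency_ratio = LLM_TURN_S / STARTUP_S

print(f"courier vs glacier: {analogy_ratio:,.0f}x")
print(f"model turn vs start-up: {latency_ratio:,.0f}x")
```

Under these assumptions the gap between model latency and interpreter start-up is roughly an order of magnitude wider than the gap in the analogy.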

vghaisas4mo ago

This is very cool, but I'm having some trouble understanding the use cases.

Is this mostly just for codemode where the MCP calls instead go through a Monty function call? Is it to do some quick maths or pre/post-processing to answer queries? Or maybe to implement CaMeL?

It feels like the power of terminal agents is partly because they can access the network/filesystem, and so sandboxed containers are a natural extension?

tucnak4mo ago

I really like this for CodeAct, but like with other similar tools it's unclear how to implement data pipelining to leverage, like, lockstep batching to remote providers, or paged attention-like optimisations. Basically, let's say I want to run agent for every row in the table, I would probably want to batch most calls...

It's something, I think, missing from smolagents ecosystem anyway!

saberience4mo ago

I actually have no idea why this is needed. I want my models to have access to full libraries/sdks/apis and this is when they become actually useful.

I also want my models to be able to write typescript, python, c# etc, or any language and run it.

Having the model have access to a completely minimal version of python just seems like a waste of time.

bigcat123456784mo ago

It seems that AI finally gives true pure-blood systems software the space to unleash its potential.

Pretty much all modern software tooling, once you remove the parts that aim to appeal to humans, becomes much more reliable. But it's not clear whether the performance will be better or not.

wiradikusuma4mo ago

"To run code written by agents" vs "What Monty cannot do: Use the standard library, ..., Use third party libraries."

But most real world code needs to use a (standard/3rd party) library, no? Or is this for AI's own feedback loop?

dmpetrovOP4mo ago

I like the idea a lot but it's still unclear from the docs what the hard security boundary is once you start calling LLMs - can it avoid "breaking out" into the host env in practice?

digdugdirk4mo ago

How does this compare to the sPy project [1]?

[1] - https://github.com/spylang/spy

falcor844mo ago

Wow, a start latency of 0.06ms
