Can LLMs model real-world systems in TLA+?

(sigops.org)

123 points · mad · 1mo ago · 33 comments

29 comments · 12 top-level

tombert · 1mo ago · 7 in thread

Claude has certainly been getting better with TLA+. It's not perfect yet but for laughs I got it to model the rules of Monopoly last night [1]. I haven't done any exhaustive checking on it yet, but it certainly looks passable.

It's pretty impressive how good it's gotten at this, in a relatively short amount of time no less. I still usually write my specs by hand, but who knows how much longer I'll be doing that.

[1] https://pdfhost.io/v/KU2j37YKrP_Monopoly

ofrzeta · 1mo ago

It looks quite complicated and I have no idea what it is doing. Obviously, since I don't know about TLA+. But what about someone who knows TLA+? It still seems hard to make sure it is valid. And it's just for a relatively simple game.

comex · 1mo ago

Well, for one thing:

> Decline to buy: property stays with bank (auction abstracted out)

Ignoring an entire game mechanic is really stretching the definition of “abstracted out”…

Also, at the bottom it defines a “Liveness: someone eventually wins” property which I believe cannot be proven. Monopoly doesn’t have any rules forcing the game to end eventually. There is only a probabilistic guarantee, and even that only applies if the players are trying to win; if the players are conspiring to prevent the game from ending then they’re unlikely to fail.
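To make the liveness objection concrete, here is a minimal sketch in Python rather than TLA+. It is not the Monopoly spec linked above: the two-player money-passing game and the names `successors` and `find_cycle` are hypothetical, made up purely for illustration. The idea is that "someone eventually wins" fails as soon as the reachable state graph contains a cycle of non-terminal states, because the players can simply follow that cycle forever.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[int, int]  # (cash of player A, cash of player B), toy model only


def successors(state: State) -> List[State]:
    """Possible next states: one player pays the other one dollar."""
    a, b = state
    if a == 0 or b == 0:
        return []  # terminal: somebody is bankrupt, so the other player has won
    return [(a - 1, b + 1), (a + 1, b - 1)]


def find_cycle(initial: State) -> Optional[List[State]]:
    """Depth-first search for a reachable cycle of non-terminal states.

    Returns the cycle as a list whose first and last entries are the same
    state, or None if every behavior must reach a terminal state.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[State, int] = {}
    path: List[State] = []

    def dfs(s: State) -> Optional[List[State]]:
        color[s] = GREY
        path.append(s)
        for t in successors(s):
            c = color.get(t, WHITE)
            if c == GREY:
                # t is on the current path, so we have closed a loop
                return path[path.index(t):] + [t]
            if c == WHITE:
                found = dfs(t)
                if found is not None:
                    return found
        path.pop()
        color[s] = BLACK
        return None

    return dfs(initial)


if __name__ == "__main__":
    print(find_cycle((2, 2)))
```

On this toy game the search finds a loop in which the players hand the same dollar back and forth, which is the "conspiring players" scenario in miniature. A model checker would flag the same kind of looping behavior as a counterexample to an unconditional "eventually someone wins" property.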

_doctor_love · 1mo ago

There is a nice guide to TLA+ from Hillel Wayne here: https://learntla.com/

PlusCal is recommended as the gentler on-ramp to TLA+ for first learning.

_flux · 1mo ago

I've played a bit with a workflow where Claude generates a TLA+ model, I review it, Claude reviews it, and then Claude checks it for alignment with the actual system. The code in the system can be annotated with comments to link back to the model, and vice versa, to make human and AI review easier.

I think it's highly promising. I might even call it a success. For example, I was able to express a bug in the system, and Claude was able to find a property that would reveal it in the model, then revise the model to eliminate the bug, and then revise the code accordingly.

Additionally (with OpenAI), for a small guest management system, I had the LLM generate a TLA+ model of the API: which endpoints a client can request, which tokens it learns from them, and what data it can or cannot access with those tokens. There are session ids, event ids, guest ids and invitation ids that interact in interesting ways, as guests are not required to have an invitation to the system and can have only a magic-link protected account, resulting in different visibility into the contact details of others.

I still haven't manually verified it, but it seems rather promising as well.

randusername · 1mo ago

What's the advantage of provable correctness if it's apparently not easy to prove even for people who understand TLA+? I'm not trying to be a party pooper, just curious.

Isn't logical incorrectness less of a problem in software than failures of imagination or conscientiousness in modeling the domain?

tclancy · 1mo ago

My thought (as someone interested in formal verification but unable to grok the math) is it exists as a canary in any sufficiently-complex codebase I let AI create. Even if it's wrong, knowing that something "we" changed breaks an agreement things currently assume is valuable.

NooneAtAll3 · 1mo ago

> I haven't done any exhaustive checking on it yet, but it certainly looks passable.

Isn't that exactly the kind of failure LLMs produce the most? First-glance-passable nonsense?

iFire · 1mo ago · 6 in thread

I don't use TLA+ to model real-world systems anymore. Claude is able to model systems in Lean 4, and the binary executable can handle real input, or I can directly generate C / Rust from proofs over numeric types that have ring structure (integers, rationals, bits).

https://github.com/lambdaclass/truth_research_zk

pron · 1mo ago

The point of TLA+ is easily readable specs with refinements. That's why it was designed in a novel way rather than in the older executable-with-more-complex-logic style that Lean (and other maths/spec languages) offer. I'm not saying you should prefer TLA+'s approach, but that's what it's meant to accomplish and LLMs don't change that.

thomasahle · 1mo ago

I'm currently choosing the right formalization for a big hardware project.

I'm deciding between SVA, TLA+ and Lean, with the first being more domain-specific and the last more general.

Do you think we'll move towards "Lean for everything" or do domain-specific formalisms still make sense?

kown7 · 1mo ago

Have you considered P? It feels like a good abstraction for engineers as it's "proper" code.

https://github.com/p-org/P

NooneAtAll3 · 1mo ago

what's SVA?

dmos62 · 1mo ago

Do you find Lean 4 sufficient for highly async systems?

iFire · 1mo ago

I haven't made money on it yet, but I'm trying to model a WebTransport (HTTP/3, QUIC) system for massively multiplayer VR games.

See https://aws.amazon.com/builders-library/challenges-with-dist... for how async relates to distributed systems.

atomicnature · 1mo ago · 2 in thread

Just a question to people who may know better than me about this.

I thought the whole point of trying to write out TLA+ is so that you get a better idea of what you want and put it into formal language?

I get that an LLM can assist/help with expressing what we want in formal language a bit, but if one automates all this there is no human intent/design anymore.

If the LLM generates both the design (TLA+) and writes an arbitrary program that satisfies said design -- what exactly have we proved?

What assurance do humans get if the human doesn't know or cannot specify what they want?

majormajor · 1mo ago

An LLM-generated TLA+ model can be verified for certain things in a way that LLM-generated code can't. It's infamously hard to exhaustively unit-test concurrency.

Whether or not you're modeling the right things or verifying the right things, of course... that's always left as an exercise for the user. ;)

(How to prove that the implementation code is guaranteed to match the spec is a trick I haven't seen generalized yet, either.)
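As a rough illustration of what "exhaustive" means here, the sketch below enumerates every interleaving of two threads that each do a non-atomic increment (a read step followed by a write step) on a shared counter. It is plain Python, not TLA+, and the names `interleavings`, `run`, and `final_values` are hypothetical helpers invented for this example. An ordinary unit test usually exercises only one schedule; enumerating all of them is, in miniature, what a model checker does.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Schedule = Tuple[int, ...]  # sequence of thread ids, one entry per step


def interleavings(n_threads: int, steps_per_thread: int) -> List[Schedule]:
    """All distinct orders in which the threads' steps can be scheduled."""
    ids = [t for t in range(n_threads) for _ in range(steps_per_thread)]
    return sorted(set(permutations(ids)))


def run(schedule: Schedule) -> int:
    """Execute one schedule of two-step increments and return the counter."""
    counter = 0
    local: Dict[int, int] = {}  # value each thread read from the counter
    pc: Dict[int, int] = {}     # program counter per thread (0 = read, 1 = write)
    for t in schedule:
        if pc.get(t, 0) == 0:
            local[t] = counter       # step 0: read the shared counter
            pc[t] = 1
        else:
            counter = local[t] + 1   # step 1: write back the stale value + 1
            pc[t] = 2
    return counter


def final_values(n_threads: int = 2) -> Set[int]:
    """Set of final counter values over every possible interleaving."""
    return {run(s) for s in interleavings(n_threads, 2)}


if __name__ == "__main__":
    print(final_values(2))
```

With two threads there are only six schedules, and the enumeration shows that some of them lose an update. Checking a property such as "the final counter equals the number of threads" against every schedule is the part that is hard to do with hand-written tests.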

kiwicopple · 1mo ago

> It's infamously hard to exhaustively unit-test concurrency.

a useful example from last week where TLA+ found a bug in pg_rewind:

https://multigres.com/blog/2026/05/04/tla-pg-rewind

tmaly · 1mo ago · 1 in thread

I remember NVIDIA sponsored a TLA+ challenge last year: https://foundation.tlapl.us/challenge/index.html

uptodatenews · 1mo ago

Whoa didn't even know cool

pzoln · 1mo ago · 1 in thread

Sorry, this must be a very naive question, but what if you give an LLM just the source code (maybe even obfuscating names like Raft and Etcd) and ask it to create a TLA+ spec from that?

_doctor_love · 1mo ago

This is already being done by some folks, reverse-engineering existing source into a TLA+ spec. Like other commenters have mentioned, the challenge is in ensuring that the spec and code match each other.

simplegeek · 1mo ago

I feel LLMs are indeed getting better at writing models. But, in my experience, they struggle to come up with correct safety and liveness properties unless you closely work with them. And of these two, they struggle the most with correct liveness properties.

Also for some problems I observe that models produced by LLMs often cause state space explosion. For simpler models they can fix this when you guide them though.

I’m sure LLMs will get even better.

That said, I take a slightly different approach. Lamport said “If you're thinking without writing, you only think you're thinking.” So, taking that advice, I always try to write the first draft by hand, and once I have the final shape in place I turn to an LLM for further exploration and experimentation if I have to.
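For readers unfamiliar with the state space explosion mentioned above, here is a small sketch of why it happens. It is Python rather than TLA+, and `count_reachable_states` is a hypothetical helper, not part of any tool: it does a breadth-first search over a toy system of independent processes, each of which just advances through a fixed number of program locations. Even with no shared data at all, the number of reachable states is the number of locations raised to the number of processes.

```python
from collections import deque
from typing import Deque, Set, Tuple

State = Tuple[int, ...]  # one program location per process


def count_reachable_states(n_procs: int, n_locations: int) -> int:
    """Breadth-first search over all reachable combinations of locations."""
    init: State = (0,) * n_procs
    seen: Set[State] = {init}
    queue: Deque[State] = deque([init])
    while queue:
        state = queue.popleft()
        for i, pc in enumerate(state):
            if pc + 1 < n_locations:
                # process i takes one step; every other process stays put
                nxt = state[:i] + (pc + 1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, count_reachable_states(n, 3))
```

Each extra process multiplies the state count, which is why small-looking changes to a model's constants can turn a check that finishes in seconds into one that does not finish at all.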

dgacmu · 1mo ago

This post reads like an accidental advertisement for approaches like Verus [1], which couple the implementation and verification so you can't end up with a model that diverges from the actual implementation. I'm personally much more optimistic about the Verus approach, but I freely admit that's my builder bias speaking.

[1] https://github.com/verus-lang/verus

Kab1r · 1mo ago

I've been building a TLA style Temporal Logic library for Verus (using LLMs). My experience so far is that LLMs are surprisingly useful at generating the mechanical proof scaffolding (when they're not occasionally trying to cheat with `assume(false)` statements), but they are not a substitute for knowing what property you actually want.

asxndu · 1mo ago

>... we asked Claude to write a TLA+ specification (spec) for Etcd’s Raft implementation. It passed syntax checks, ran through the TLC model checker, and at first glance looked like a polished formal model.

This is a mistaken use of TLA+!

Leslie Lamport insists that he invented it to be a way of creating "blueprints" for systems.

- You are supposed to go from TLA+ Spec to System (codebase or hardware).

- Not codebase to TLA+ like the author has done.

Otherwise, you may simply model an existing bug faithfully and then pass all the checks, since the spec was derived from the implementation.

He (Leslie Lamport) insists that the value AI can provide is in compiling TLA+ specs to a code base.

alhazrod · 1mo ago

Hopefully some people find this interesting too:

TLAiBench[0]: A dataset and benchmark suite for evaluating Large Language Models (LLMs) on TLA+ formal specification tasks, featuring logic puzzles and real-world scenarios.

[0]: https://github.com/tlaplus/TLAiBench

gr71 · 1mo ago

Aren't these benchmark test cases already in the training data? How do LLMs perform on spec design for novel, complex systems?

radarkilat · 1mo ago

hmm
