It is pretty impressive at how good it's gotten at this, in a relatively short amount of time no less. I still usually write my specs by hand, but who knows how much longer I'll be doing that.
> Decline to buy: property stays with bank (auction abstracted out)
Ignoring an entire game mechanic is really stretching the definition of “abstracted out”…
Also, at the bottom it defines a “Liveness: someone eventually wins” property which I believe cannot be proven. Monopoly doesn’t have any rules forcing the game to end eventually. There is only a probabilistic guarantee, and even that only applies if the players are trying to win; if the players are conspiring to prevent the game from ending then they’re unlikely to fail.
PlusCal is recommended as the gentler on-ramp to TLA+ for first learning.
I think it's highly promising. I might even call it a success. For example, I was able to express a bug in the system and Claude was able to find me a property that would reveal that in the model; and then revise the model to eliminate that bug, and then revise the code accordinly.
Additionally (with OpenAI) I have a small guest management system I had the LLM generate a TLA+ model about the API which endpoints a client can request from it, and which tokens it learns from it (there are session ids, event ids, guest ids and invitation ids that interact in interesting ways, as guests are not required to have an invitation to the system, and they can have only magic-link protected account, resulting in different visibility on the contact details of others), and what data it can or cannot access with those tokens.
I still haven't manually verified, but it seems rather promising as well.
Isn't logical incorrectness less of a problem in software than failures of imagination or conscientiousness in modeling the domain?
isn't that exactly the kind of fails LLMs do the most? first-glance-passable nonsense?
I'm considering between SVA, TLA+ and Lean. With the former being more domain specific and the later more general.
Do you think we'll move towards "Lean for everything" or do domain specific formalisms still make sense?
See https://aws.amazon.com/builders-library/challenges-with-dist... for how async related to distributed systems.
I thought the whole point of trying to write out TLA+ is so that you get a better idea of what you want and put it into formal language?
I get that an LLM can assist/help with expressing what we want in formal language a bit, but if one automates all this there is no human intent/design anymore.
If the LLM generates both the design (TLA+) and writes an arbitrary program that satisfies said design -- what exactly have we proved?
What assurance do humans get since human doesn't know or cannot specify what they want.
Whether or not you're modeling the right things or verifying the right things, of course... that's always left as an exercise for the user. ;)
(How to prove the implementation code is guaranteed to match the spec is a trick I haven't seen generalized yet, either, too.)
a useful example from last week where TLA+ found a bug in pg_rewind:
Also for some problems I observe that models produced by LLMs often cause state space explosion. For simpler models they can fix this when you guide them though.
I’m sure LLMs will get even better.
That said, I take slightly different approach. Lamport said “If you're thinking without writing, you only think you're thinking.” So taking that advice I always try to write the first draft with hand and once I have the final shape in place I then turn to an LLM for further exploration and experimentation if I have to.
This is a mistaken use of TLA+!
Leslie Lamport insists that he invented it to be a way of creating "blueprints" for systems.
- You are supposed to go from TLA+ Spec to System (codebase or hardware).
- Not codebase to TLA+ like the author has done.
Otherwise, you may simply model an existing bug properly and the pass all the checks based of its implementation.
He (Leslie Lamport) insists that the value AI can provide is in compiling TLA+ specs to a code base.
TLAiBench[0]: A dataset and benchmark suite for evaluating Large Language Models (LLMs) on TLA+ formal specification tasks, featuring logic puzzles and real-world scenarios.