ProofOfThought: LLM-based reasoning using Z3 theorem proving

(github.com)

326 points · barthelomew · 8mo ago · 175 comments

94 comments · 23 top-level

chrchr8mo ago· 13 in thread

I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS system to solve a system of equations, and the CAS system wasn't working as I expected. After a couple back and forths with Gemini about the CAS system, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.

TrainedMonkey8mo ago

Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.

emporas8mo ago

Small steps of nondeterministic computation, checked thoroughly with deterministic computation every so often, and the sky is the limit.

That's when A.I. starts advancing itself and needs humans in the loop no more.

3 more replies

jonplackett8mo ago

Maybe the number crunching program the text generation program creates will, with enough effort, become good at generating text, and will in turn make another number crunching computer and then…

1 more reply

idiotsecant8mo ago

Parent post is talking about symbolic manipulation, not rote number crunching, which is exactly what we're supposed to be good at and machines are supposed to be bad at.

patcon8mo ago

I love this kind of thought. Thanks.

29athrowaway8mo ago

We do plenty of number crunching all the time, just not consciously.

Like the inverse kinematics required for your arm and fingers to move.

2 more replies

anotherpaulg8mo ago

I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.

The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.

I use helper functions to easily render from the sympy code to latex, etc.

A lot of the math behind this quantum eraser experiment was done this way.

https://github.com/paul-gauthier/entangled-pair-quantum-eras...

selinkocalar8mo ago

The combination of LLMs and formal verification tools is pretty interesting. We've been thinking about this for compliance automation - there are a lot of regulatory requirements that could theoretically be expressed as formal constraints. Curious about the performance though. Z3 can be really slow on complex problems, and if you're chaining that with LLM calls, the latency could get rough for interactive use cases.

fennecfoxy8mo ago

Yeah it feels like these early LLMs are pretty decent at the coming up with a plan and executing a plan part.

Probably the main deficiencies are confusion as the context grows (therefore confusion as task complexity grows).

jansan8mo ago

How did that work? Did Gemini call sympy on your machine, or is access to sympy built-in and available through normal chat?

77341288mo ago

https://cloud.google.com/vertex-ai/generative-ai/docs/multim...

DrewADesign8mo ago

I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.

afiori8mo ago

I feel like a better analogy is trying to rent a truck to move to a new apartment and after repeated failures of trucks not working they just hire a moving company for you to get you to leave

1 more reply

sigmoid108mo ago· 12 in thread

I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured json format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can literally write inference engines that adhere to an arbitrary schema (i.e. not limited to json behind the scenes) yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.

barthelomewOP8mo ago

Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.

When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.

sigmoid108mo ago

What are you talking about? OpenAI has supported structured json output in the API since 2023. Only the current structured output API was introduced by OpenAI in summer 2024, but it was primarily a usability improvement that still runs json behind the scenes.

2 more replies

atrus8mo ago

I wouldn't find it amazing, there are so many new models, features, ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed by you.

sigmoid108mo ago

I would agree if you are a normal dev who doesn't work in the field. But even then reading the documentation once a year would have brought you insane benefits regarding this particular issue. And for ML researchers there is no excuse for stuff like that at this point.

jssmith8mo ago

I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM but validation is still up to the caller. Wondering if others see this too.

barthelomewOP8mo ago

Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!

sigmoid108mo ago

Depends on how strictly you define your types. Are you using pydantic to pass the information to the API? There are a few pitfalls with this, because not everything is fully supported and it gets turned into json behind the scenes. But in principle, the autoregressive engine will simply not allow tokens that break the supplied schema.
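
A toy sketch of the masking idea described above, for readers who have not seen it: at each step, tokens that would break the schema are given a score of negative infinity, so the decoder cannot pick them. Everything here is an assumption made for illustration: the tiny vocabulary, the hand-written state machine for `{"ok":true}` / `{"ok":false}`, and the `fake_logits` stand-in for a model. It is not how any particular provider implements structured outputs.

```python
import math

# Hypothetical grammar: state -> {allowed token: next state}.
# It accepts exactly {"ok":true} or {"ok":false}.
ALLOWED = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}
VOCAB = ["{", "}", '"ok"', ":", "true", "false", "maybe", "hello"]


def fake_logits(state):
    """Stand-in for a model: it deliberately prefers invalid tokens."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores["hello"] = 5.0  # unconstrained favourite in every state
    scores["maybe"] = 4.0
    scores["false"] = 1.0  # best valid choice in the "value" state
    return scores


def constrained_decode():
    """Greedy decoding where schema-breaking tokens are masked out."""
    state, out = "start", []
    while state != "done":
        scores = fake_logits(state)
        # Mask every token the grammar does not allow in this state.
        masked = {
            tok: (score if tok in ALLOWED[state] else -math.inf)
            for tok, score in scores.items()
        }
        token = max(masked, key=masked.get)
        out.append(token)
        state = ALLOWED[state][token]
    return "".join(out)


print(constrained_decode())
```

The mask guarantees well-formed output, but it says nothing about whether the chosen value is the right one, which is the distinction the rest of this thread keeps returning to.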

1 more reply

eric-burel8mo ago

Yep from time to time.

IanCal8mo ago

I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not Z3's native structure) than writing code in something else. Decent models can write pretty good code and for a lot of mistakes can fix them, plus you get testing/etc setups for free.

eric-burel8mo ago

It's a relatively new feature, also people need actual professional training to become true LLM developers using them to their fullest and not just developers that happen to call an LLM API here and there. Takes a lot of time and effort.

retinaros8mo ago

yes this can also improve the said reasoning.

sigmoid108mo ago

The secret the big companies don't want to tell you is that you can turn all their models into reasoning models that way. You even have full control over the reasoning process and can make it adhere to a specific format, e.g. the ones used in legal settings. I've built stuff like that using plain old gpt-4o and it was even better than the o series.

everdrive8mo ago· 8 in thread

I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?

konmok8mo ago

My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".

everdrive8mo ago

>It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain.

But this is an incredibly interesting problem!

1 more reply

chpatrick8mo ago

Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.

moffkalast8mo ago

It would be like logging a bunch of random noise from anyone's perspective except the LLM's.

everdrive8mo ago

I guess I'm also just confused. I get that this is _difficult_ to do, but I would think that computer scientists would be utterly dissatisfied that AI was "non-deterministic" and would poke at the problem until it could be understood.

nickpsecurity8mo ago

There are people doing both types. Look up surveys of mechanistic interpretability of language models and surveys of explainable AI for neural networks. Those will give you many techniques for illustrating what's happening.

You'll also see why their applications are limited compared to what you probably hoped for.

NotGMan8mo ago

Chat GPT-4 allegedly has 1.8 trillion parameters.

Imagine having a bunch of 2D matrices with a combined 1.8 trillion total numbers, from which you pick out blocks of numbers in a loop and finally merge them and combine them to form a token.

Good luck figuring out what number represents what.

everdrive8mo ago

Wouldn't that mean it's totally impractical for day-to-day usage, but a researcher or team of researchers could solve this?

2 more replies

zwnow8mo ago· 6 in thread

Reasoning? LLMs can not reason, why is it always assumed they reason? They mimic reasoning.

elcomet8mo ago

How can you know?

measurablefunc8mo ago

By thinking about what a computer is actually doing & realizing that attributing thought to an arithmetic gadget leads to all sorts of nonsensical consequences like an arrangement of dominoes & their cascade being a thought. The metaphysics of thinking computers is incoherent & if you study computability theory you'll reach the same conclusion.

1 more reply

moffkalast8mo ago

It's so funny to me that people are still adamant about this like two years after it's become a completely moot point.

emp173448mo ago

Moot point? As far as I know, it’s still intensely debated, and there are some excellent papers out there providing evidence that LLMs truly are just statistical prediction machines. It’s far from an unreasonable position.

zwnow8mo ago

Experts are adamant about this. Just take a look at https://youtu.be/iRqpsCHqLUI

1 more reply

Terr_8mo ago

The normative importance of a fact may increase when more people start willfully ignoring it for shorter-term profit.

Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."

2 more replies

measurablefunc8mo ago· 6 in thread

This is proof of verifiable logic. Computers can not think so calling it proof of thought misrepresents what's actually happening.

aSanchezStern8mo ago

I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.

measurablefunc8mo ago

If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent. If you don't fall for the marketing hype & study both the philosophical & mathematical literature on computation then it is obvious that computers (or any mechanical gadget for that matter) can not qualify for any reasonable definition of "thinking" unless you agree that all functionally equivalent manifestations of arithmetic must be considered "thinking", including cascading dominoes that correspond to the arithmetic operations in an LLM.

2 more replies

encyclopedism8mo ago

The jury may be out on how to judge what 'thought' actually is. However what it is not is perhaps easier to perceive. My digital thermometer does not think when it tells me the temperature.

My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) also does not think.

I am surprised so many in the HN community have so quickly taken to assuming as fact that LLMs think or reason. Even anthropomorphising LLMs to this end.

For a group inclined to quickly calling out 'God of the gaps' they have quite quickly invented their very own 'emergence'.

Terr_8mo ago

> this whole "computers can't think" thing is making LLM skepticism seem very unscientific.

It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."

1 more reply

chpatrick8mo ago

Do you understand human thinking well enough to determine what can think and what can't? We have next to no idea how an organic brain works.

measurablefunc8mo ago

I understand computers, software, & the theory of computation well enough to know that there is no algorithm or even a theoretical algorithmic construction that can be considered thought. Unless you are willing to concede that thinking is nothing more than any number of models equivalent to a Turing machine, e.g. lambda calculus, Post systems, context aware grammars, carefully laid out dominoes, permutations of bit strings, etc. then you must admit that computers are not thinking. If you believe computers are thinking then you must also admit dominoes are thinking when falling in a cascading chain.

2 more replies

LASR8mo ago· 5 in thread

This is an interesting approach.

My team has been prototyping something very similar with encoding business operations policies with LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.

Then we run the solver to verify consistency.

When a wiki page is changed, the process is run again and it's essentially a linter for process.

Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.

But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.

barthelomewOP8mo ago

The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper : https://arxiv.org/abs/2505.20047 .

If you ever feel like chatting and discussing more details, happy to chat!

viraptor8mo ago

Could you share an example of such policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.

chandureddyvari8mo ago

For anyone curious about what LEAN is, like me, here’s the explanation: Lean Theorem Prover is a Microsoft project. You can find it here: https://www.microsoft.com/en-us/research/project/lean/

ashandoak8mo ago

Lean has been under development over the last 13 years, part of that while chief architect Leo de Moura was employed by Microsoft Research (he's now at AWS). However, Lean is an open source project, not exclusively a Microsoft project. More accurately, see here: https://lean-lang.org/

pbronez8mo ago

That’s pretty cool. It would be super useful to identify contradictory guidance systematically.

nakamoto_damacy8mo ago· 5 in thread

LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post-hoc, you're not "ensuring the correctness of your LLMs reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not, where the LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.

barthelomewOP8mo ago

Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)

Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)

You can do uncertainty quantification with selective verification to reduce the "risk", e.g. as shown by the Area Under the Risk Coverage Curve in Tab 4.
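
For readers unfamiliar with the metric mentioned above, here is a minimal sketch of a risk-coverage curve and its area. It uses one common formulation (answer the most confident items first, report the error rate among the answered items at each coverage level, and average those error rates). The data and function names are hypothetical, inputs are assumed non-empty, and the paper's exact definition may differ.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) points, answering the most confident items first."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # coverage = fraction answered so far, risk = error rate among those
        points.append((k / len(order), errors / k))
    return points


def aurc(confidences, correct):
    """Area under the risk-coverage curve, approximated as the mean risk."""
    points = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in points) / len(points)


# Hypothetical data: four answers, each with a confidence and a correctness flag.
conf = [0.9, 0.8, 0.6, 0.3]
ok = [True, True, False, True]
curve = risk_coverage_curve(conf, ok)
area = aurc(conf, ok)
print(curve, area)
```

A lower area means the confidence score is better at pushing the wrong answers to the end of the queue, so abstaining on low-confidence items removes more of the errors.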

YeGoblynQueenne8mo ago

Well, if you understand that this is a "genuine problem" then what have you done to solve it? A quick look at the abstract of your follow up paper does not reveal an answer.

And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.

It's like research into the efficacy of homeopathic medicine and there's a lot of that indeed, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is not. So what can change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.

It's the same thing with LLMs. Unless you find a way to make an LLM that can generate text that is conditioned on some measure of factuality, then you can verify the output all you like, the whole thing will remain meaningless.

avmich8mo ago

Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.

nakamoto_damacy8mo ago

You're thinking along the right track but without formalization it goes nowhere fast. By layering of differential geometry on top of probability and then maybe category theoretic logic on top of that, each layer constraining the one below it, and all layers cohering, you get somewhere... There is work that's been done in this area, and I was recently interviewed by a journalist who published a high level article on it on Forbes (Why LLMs are failing) and it links to the actual technical work (at first to my high level presentation then Prof. L. Thorne McCarty's work): https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...

nakamoto_damacy8mo ago

Why is this being down voted? I believe the author acknowledged and responded. Anything wrong?

nextos8mo ago· 4 in thread

This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.

I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.

I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934

barthelomewOP8mo ago

Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.

Although that work is not public, you can play with the generally available product here!

[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

CuriouslyC8mo ago

Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities get better LLM as a judge will artificially cap your performance, you need to graduate to either expert human judges or deterministic oracles.

fnordpiglet8mo ago

LLMs display a form of abductive reasoning which is not the same as judgement. The only thing in the universe we know that can display judgement is a human. However many tasks we presume to require human judgement do not and abductive reasoning will perform as well as a human. This in theory acts as a filter if used right reducing the tasks of human judgement to those that can’t be automated with similar or better precision and recall. The trick then is using LLMs and other techniques to reduce the problem space for the human to the kernel of quandary that requires human judgement and to isolate the salient information to reduce the cognitive load as much as possible. Many many mundane tasks can be automated in this way, and many complex tasks can be facilitated to greatly magnify the effectiveness of the human in the middle’s time.

jebarker8mo ago

Why does this have to be true? For example, if you have a different LLM that is judging than the one being judged then their biases could at least be different. Also, as their reasoning abilities improve wouldn't LLM judges approach the abilities of human judges?

2 more replies

tannhaeuser8mo ago· 3 in thread

LLMs are statistical language models (d'uh) not reasoners after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well, though [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.

[1]: https://quantumprolog.sgml.net/llm-demo/part1.html

[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax

barthelomewOP8mo ago

Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with PROLOG, as we tested at the behest of reviewers at NeurIPS. I would assume that this should also work with datalog [2].

[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...

larodi8mo ago

Neurosymbolic systems are very likely the future, as has been mentioned here many times already.

a3w8mo ago

I cannot use wolframalpha most of the time since the syntax is not that natural. WolframAlpha is good AI, it never lies.

Calculators are good AI, they rarely lie (due to floating-point arithmetic rounding). And yes, Wikipedia says calculators are AI tech, since a Computer was once a person, and now it is a tool that shows the intelligent trait of doing math with numbers or even functions/variables/equations.

Querying a calculator or wolfram alpha like symbolic AI system with LLMs seems like the only use for LLMs except for text refactoring that should be feasible.

Thinking LLMs know anything on their own is a huge fallacy.

0xWTF8mo ago· 3 in thread

Am I reading this right? Statistical LLM outputs pushed through a formal logic model? Wouldn't that be a case of "crap in, crap out"?

avmich8mo ago

Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.

varispeed8mo ago

That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.

baq8mo ago

You assume it’s all crap when it clearly isn’t often enough to be useful.

ivanbakel8mo ago· 3 in thread

The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.

Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.

The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?

The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.

barthelomewOP8mo ago

Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.

Illustration of this can be seen in Fig 14, 15 on Page 29.

In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.

This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

bhk8mo ago

Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?

cerved8mo ago

Who makes the rules?

tonerow8mo ago· 2 in thread

Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.

barthelomewOP8mo ago

Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)

pstoll8mo ago

Upvoting the comment that the git repo would be way more stand-alone if it had an intro to the DSL.

dehsge8mo ago· 1 in thread

LLMs and their output are bounded by Rice's theorem. This is not going to ensure correctness; it’s just going to validate that the model can produce an undecidable result.

ogogmad8mo ago

Errr, checking correctness of proofs is decidable.

nakamoto_damacy8mo ago

I posted about my year long development effort of this very method on reddit 25 days ago. My comment elsewhere in this thread provides a cautionary tale, and the author's response to the basic issue I raised is incomplete in that it leaves out that certain problems simply cannot be solved with LLMs (this requires logical constraints in the generative process, but LLMs lack that layer). So I've since pivoted to something else (also mentioned in my comment elsewhere in this thread).

https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...

sytse8mo ago

So the core idea is to use an LLM to draft reasoning as a structured, JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).

Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
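
The "entailed, or you get a counterexample" behaviour can be illustrated with a deliberately tiny stand-in. The sketch below is propositional rather than first-order and brute-forces every truth assignment instead of calling Z3, so it only shows the shape of the result, not the project's actual pipeline. The rain/wet knowledge base and the `entails` helper are hypothetical.

```python
from itertools import product


def entails(variables, premises, goal):
    """Check premises |= goal by enumerating every truth assignment.

    Returns (True, None) if the goal holds in every model of the premises,
    otherwise (False, counterexample), where the counterexample satisfies
    all premises but falsifies the goal.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not goal(env):
            return False, env
    return True, None


# Hypothetical knowledge base: "rain implies wet" and "it rains".
variables = ["rain", "wet"]
premises = [
    lambda e: (not e["rain"]) or e["wet"],  # rain -> wet
    lambda e: e["rain"],
]


def goal(e):
    return e["wet"]


proved = entails(variables, premises, goal)
# Dropping the second premise leaves the goal unproven and yields a counterexample.
refuted = entails(variables, premises[:1], goal)
print(proved, refuted)
```

Note that the check is only as good as the premises: if the LLM writes down a false premise, the goal can still come out as entailed, which is the autoformalization concern raised elsewhere in the thread.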

derekcheng088mo ago

Interesting. I wonder if you could implement tool calling with this approach so the LLM calls the tool with the formal specification and gets back the result. Just like a coding agent can run a compiler, get back errors and then self-correct.
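
A rough sketch of the compile-and-retry loop suggested above. Both the checker and the model are stubs invented for illustration: `check_spec` stands in for a solver or interpreter call and only looks for undeclared variables, and `fake_model` stands in for an LLM that fixes its draft once it sees feedback. No real API is being described.

```python
def check_spec(spec):
    """Stand-in for a solver call: return a list of error messages."""
    errors = []
    declared = set(spec.get("variables", []))
    for name in spec.get("constraints", {}):
        if name not in declared:
            errors.append(f"undeclared variable: {name}")
    return errors


def fake_model(question, feedback):
    """Stand-in for an LLM: the first draft has a bug, the retry fixes it."""
    if not feedback:
        return {"variables": ["x"], "constraints": {"x": 1, "y": 2}}
    return {"variables": ["x", "y"], "constraints": {"x": 1, "y": 2}}


def solve_with_feedback(question, max_rounds=3):
    """Ask for a spec, check it, and feed any errors back until it passes."""
    feedback = []
    for attempt in range(1, max_rounds + 1):
        spec = fake_model(question, feedback)
        feedback = check_spec(spec)
        if not feedback:
            return spec, attempt
    raise RuntimeError(f"no valid spec after {max_rounds} rounds: {feedback}")


spec, attempts = solve_with_feedback("toy question")
print(spec, attempts)
```

Capping the number of rounds matters in practice, since each round would add an LLM call plus a solver call to the latency.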

westurner8mo ago

ScholarlyArticle: "Proof of thought: Neurosymbolic program synthesis allows robust and interpretable reasoning" (2024) https://arxiv.org/abs/2409.17270 .. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=%22...

jadelcastillo8mo ago

Interesting approach, but I guess there is still a lot of work to be done. I tried with this question:

"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"

But the generated program is not very useful:

{
  "sorts": [],
  "functions": [],
  "constants": {},
  "variables": [
    {"name": "num_brothers_of_alice", "sort": "IntSort"},
    {"name": "num_sisters_of_alice", "sort": "IntSort"},
    {"name": "sisters_of_alice_brother", "sort": "IntSort"}
  ],
  "knowledge_base": [
    "num_brothers_of_alice == 60",
    "num_sisters_of_alice == 212",
    "sisters_of_alice_brother == num_sisters_of_alice + 1"
  ],
  "rules": [],
  "verifications": [
    {
      "name": "Alice's brother has 213 sisters",
      "constraint": "sisters_of_alice_brother == 213"
    }
  ],
  "actions": ["verify_conditions"]
}

renshijian8mo ago

This is fascinating! An AI that doesn't just think out loud, but keeps a verifiable diary. It's like a philosopher with a cryptographic notary public living in its brain. Amazing work!

Yoric8mo ago

That is exactly the kind of thing that I hope LLMs will help us achieve before the next AI winter.

Western08mo ago

I need the same with Mizar https://wiki.mizar.org/

maiuki8mo ago

What industrial problems would this solve?

hamonrye8mo ago

RHEL knife-edge rolling kernel distribution for the proof of concept.


But this is an incredibly interesting problem!

1 more reply

chpatrick8mo ago

Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.

moffkalast8mo ago

It would be like logging a bunch of random noise from anyone's perspective except the LLM's.

everdrive8mo ago

nickpsecurity8mo ago

You'll also see why their applications are limited compared to what you probably hoped for.

NotGMan8mo ago

Chat GPT-4 has alegedly 1.8 trillion parameters.

Imagine having a bunch of 2D matrices with a combined 1.8 trillion total numbers, from which you pick out a blocks of numbers in a loop and finally merge them and combine them to form a token.

Good luck figuring out what number represents what.

everdrive8mo ago

Wouldn't that mean it's totally impractical for day-to-day usage, but a researcher or team of researchers could solve this?

2 more replies

zwnow8mo ago· 6 in thread

Reasoning? LLMs can not reason, why is it always assumed they reason? They mimic reasoning.

elcomet8mo ago

How can you know?

measurablefunc8mo ago

1 more reply

moffkalast8mo ago

It's so funny to me that people are still adamant about this like two years after it's become a completely moot point.

emp173448mo ago

zwnow8mo ago

Experts are adamant about this. Just take a look at https://youtu.be/iRqpsCHqLUI

1 more reply

Terr_8mo ago

The normative importance of a fact may increase when more number of people start willfully ignoring it for shorter-term profit.

Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."

2 more replies

measurablefunc8mo ago· 6 in thread

This is proof of verifiable logic. Computers can not think so calling it proof of thought misrepresents what's actually happening.

aSanchezStern8mo ago

measurablefunc8mo ago

2 more replies

encyclopedism8mo ago

The jury maybe out on how to judge what 'thought' actually is. However what it is not is perhaps easier to perceive. My digital thermometer does not think when it tells me the temperature.

My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) also does not think.

I am surprised so many in the HN community have so quickly taken to assuming as fact that LLM's think or reason. Even anthropomorphising LLM's to this end.

For a group inclined to quickly calling out 'God of the gaps' they have quite quickly invented their very own 'emergence'.

Terr_8mo ago

> this whole "computers can't think" thing is making LLM skepticism seem very unscientific.

It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."

1 more reply

chpatrick8mo ago

Do you understand human thinking well enough to determine what can think and what can't? We have next to no idea how an organic brain works.

measurablefunc8mo ago

2 more replies

LASR8mo ago· 5 in thread

This is an interesting approach.

Then we run the solver to verify consistency.

When a wiki page is changed, the process is run again and it's essentially a linter for process.

Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.

But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.

barthelomewOP8mo ago

If you ever feel like chatting and discussing more details, happy to chat!

viraptor8mo ago

Could you share an example of such policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.

chandureddyvari8mo ago

For anyone curious about what LEAN is, like me, here’s the explanation: Lean Theorem Prover is a Microsoft project. You can find it here: https://www.microsoft.com/en-us/research/project/lean/

ashandoak8mo ago

pbronez8mo ago

That’s pretty cool. It would be super useful to identify contradictory guidance systematically.

nakamoto_damacy8mo ago· 5 in thread

barthelomewOP8mo ago

Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)

Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)

You can do uncertainty quantification with selective verification to reduce the "risk", for e.g. shown as the Area Under the Risk Coverage Curve in Tab 4.

YeGoblynQueenne8mo ago

Well, if you understand that this is a "genuine problem" then what have you done to solve it? A quick look at the abstract of your follow up paper does not reveal an answer.

avmich8mo ago

nakamoto_damacy8mo ago

Why is this being down voted? I believe the author acknowledged and responded. Anything wrong?

nextos8mo ago· 4 in thread

This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.

I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.

I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934

barthelomewOP8mo ago

Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.

Although that work is not public, you can play with the generally available product here!

[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

CuriouslyC8mo ago

fnordpiglet8mo ago

jebarker8mo ago

2 more replies

tannhaeuser8mo ago· 3 in thread

[1]: https://quantumprolog.sgml.net/llm-demo/part1.html

[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax

barthelomewOP8mo ago

[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...

larodi8mo ago

Neuralsymbolic systems are very likely the future as so many times mentioned here already.

a3w8mo ago

I cannot use wolframalpha most of the time since the syntax is not that natural. WolframAlpha is good AI, it never lies.

Querying a calculator or wolfram alpha like symbolic AI system with LLMs seems like the only use for LLMs except for text refactoring that should be feasible.

Thinking LLMs know anything on their own is a huge fallacy.

0xWTF8mo ago· 3 in thread

Am I reading this right? Statistical LLM outputs pushed through a formal logic model? Wouldn't that be a case of "crap in, crap out"?

avmich8mo ago

varispeed8mo ago

That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.

baq8mo ago

You assume it’s all crap when it clearly isn’t often enough to be useful.

ivanbakel8mo ago· 3 in thread

The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.

barthelomewOP8mo ago

Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.

Illustration of this can be seen in Fig 14, 15 on Page 29.

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

bhk8mo ago

Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?

cerved8mo ago

Who makes the rules?

tonerow8mo ago· 2 in thread

Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.

barthelomewOP8mo ago

Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)

pstoll8mo ago

Upvoting the comment that the gitrepo would be way more self stand-alone if it had an intro of the DSL.

dehsge8mo ago· 1 in thread

LLMs and its output are bounded by Rices theorem. This is not going to ensure correctness it’s just going to validate that the model can produce an undecidable result.

ogogmad8mo ago

Errr, checking correctness of proofs is decidable.

nakamoto_damacy8mo ago

https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...

sytse8mo ago

Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.

derekcheng088mo ago

westurner8mo ago

jadelcastillo8mo ago

Interesting approach, but I guess still lot of work to be done. I tried with this question:

"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"

But the generated program is not very useful:

renshijian8mo ago

This is fascinating! An AI that doesn't just think out loud, but keeps a verifiable diary. It's like a philosopher with a cryptographic notary public living in its brain. Amazing work!

Yoric8mo ago

That is exactly the kind of things that I hope LLM will help us achieve before the next AI winter.

Western08mo ago

I need this same with Mizar https://wiki.mizar.org/

maiuki8mo ago

What industrial problems would this solve?

hamonrye8mo ago

RHEL knife-edge rolling kernel distribition for the proof of concept.

j / k navigate · click thread line to collapse