undefined | Better HN

0 pointspron13d ago0 comments

First, it's not a question of decidability but of tractability. Verifying programs in a language that has nothing but boolean variables, no subroutines, and loops at depth of at most 2 - far, far, from Turing-completeness - is already intractable (reduction from TQBF).

Second, it's very easy to have some specs decided tractably, at least in many practical instances, but they are far too weak to specify most correctness properties programs need. You mentioned the Rust type system, and it cannot specify properties with interleaved quantifiers, which most interesting properties require.

And as for HoTT - or any of the many equivalent rich formalisms - checking their proofs is tractable, but not finding them. The intractability of verification of even very limited languages (again TQBF) holds regardless of how the verification is done.

I think it's best to take it step by step, and CodeSpeak's approach is pragmatic.

0 comments

nyrikki12d ago

I think there is a bit of the map territory relation here.

> First, it's not a question of decidability but of tractability

The question of decidability is a form of many-to-one, reduction. In fact RE-complete is defined by many-to-one reductions.

In a computational complexity sense, tractability is a far stronger notion. Basically an algorithm efficient if its time complexity at most PTIME for any size n input. A problem is "tractable" if there is an efficient algorithm that solves it.

You are correct, if you limit your expressiveness to PTIME, where because P == co-P, PEM/tight apartness/Omniscience principles hold.

But the problem is that Church–Rosser Property[0] (proofs ~= programs) and Brouwer–Heyting–Kolmogorov Interpretation[1] (Propositions at types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different than just BSAT.

But CodeSpeak doesn't have formal specifications, so this is irrelevant. Their example code output produced code with path traversal/resource exhaustion risks and correctness issues and is an example.

My personal opinion is that we will need to work within the limitations of the systems, and while it is trivial to come up with your own canary, I would recommend playing with [3] before the models directly target it.

Generating new code from a changed spec will be less difficult, specifically when the mess of real world specs comes into play. You can play with the example on CodeSpeak's front page, trying to close the various holes the software has with malformed/malicious input, giving the LLM the existing code base and you will see that "brown m&m"[3] problem arise quickly. At least for me if I prompt it to look at the changed natural language spec, generating new code it was more successful.

But for some models like the qwen3 coder next, the style resulted in far less happy path protections, which that model seems to have been trained on to deliver by default in some cases.

[0] https://calhoun.nps.edu/entities/publication/015f1bab-6642-4... [1] https://www.cs.cornell.edu/courses/cs6110/2017sp/lectures/le... [2] https://www.cambridge.org/core/journals/journal-of-functiona... [3] https://codemanship.wordpress.com/2025/10/03/llms-context-wi...

pronOP12d ago

> But the problem is that Church–Rosser Property[0] (proofs ~= programs) and Brouwer–Heyting–Kolmogorov Interpretation[1] (Propositions at types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different than just BSAT.

The computational limits imposed on program verification are independent of the logical theory used, and depend only on its expressive strength. Many if not most interesting program properties require interleaved properties (forall-exists or forall-exists-forall etc.) which are intractable to verify.

> But CodeSpeak doesn't have formal specifications, so this is irrelevant.

The lack of formalism also doesn't matter to the limitations on correctness. If you wish to know, with certainty, that a program satisfies some property, that knowledge has a cost. But both humans and LLMs write programs at least in part based on inductive rather than deductive reasoning, which cannot be rigorously given a level of confidence. That may not be what we want, but that's what computational complexity says we can have, so we're forced to work in this way.

We should complain about what we don't have, but demanding things we can't have isn't going to help. There's no fundamental reason why AI shouldn't, someday, be able to program as well as humans, but there are fundamental limitations to producing the software we wish we had. Humans can rigorously use deductive methods, with mechanical help, and AI could possibly use it, too. But there's no reason to believe AI could break the size barrier of such problems. People have been able to rigorously verify only very small programs, and maybe AI could do a little better if only because of its tenacity, but if we expect to produce perfect programs by any means, we'll be waiting for a long time.

j / k navigate · click thread line to collapse