https://news.ycombinator.com/item?id=44163194
https://news.ycombinator.com/item?id=44068943
It doesn't optimize "good programs". It interprets "humans interpretation of good programs." More accurately, "it optimizes what low paid over worked humans believe are good programs." Are you hiring your best and brightest to code review the LLMs?Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well defined and measurable thing.
You can definitely still run into some of the problems eluded to in the first link. Think hacking unit tests, deception, etc -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof. Current RL environments aren't as watertight as Lean 4, but there's certainly work to make them more watertight.
This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.
> Current paradigms are shifting towards RLVR, which absolutely can optimize good programs
I think you've misunderstood. RL is great. Hell, RLHF has done a lot of good. I'm not saying LLM are useless.But no, it's much more complex than you claim. RLVM can optimize for correct answers in the narrow domains where there are correct answers but it can't optimize good programs. There's a big difference.
You're right that Lean, Coq, and other ATPs can prove mathematical statements, but they also don't ensure that a program is good. There's frequently an infinite number of proofs that are correct, but most of those are terrible proofs.
This is the same problem all the coding benchmarks face. Even if the LLM isn't cheating, testing isn't enough. If it was we'd never do code review lol. I can pass a test with an algorithm that's O(n^3) despite there being an O(1) solution.
You're right that it makes it better, but it doesn't fix the underlying problem I'm discussing.
Not everything is verifiable.
Verifiability isn't enough.
If you'd like to prove me wrong in the former you're going to need to demonstrate that there are provably true statements to lots of things. I'm not expecting you to defy my namesake, nor will I ask you prove correctness and solve the related halting problem.
You can't prove an image is high fidelity. You can't prove a song sounds good. You can't prove a poem is a poem. You can't prove this sentence is English. The world is messy as fuck and most things are highly subjective.
But the problem isn't binary, it is continuous. I said we're using Justice Potter optimization, you can't even define what porn is. These definitions change over time, often rapidly!
You're forgetting about the tyrannical of metrics. Metrics are great, powerful tools that are incredibly useful. But if you think they're perfectly aligned with what you intend to measure then they become tools that work against you. Goodhart's Law. Metrics only work as guides. They're no different than any other powerful tool, if you use it wrong you get hurt.
If you really want to understand this I really encourage you to deep dive into this stuff. You need to get into the math. Into the weeds. You'll find a lot of help with metamathematics (i.e. my namesake), metaphysics (Ian Hacking is a good start), and such. It isn't enough to know the math, you need to know what the math means.
If the former, I still think that the vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL. Whether the resulting optimized programs would be considered "good" depends on your definition of "good." I suspect mine is more utilitarian than yours (as even after some thought I can't conceive of what a "terrible" proof might look like), but I am skeptical that your code review will prove to be a better measure of goodness than a broad suite of unit tests/verifiers/metrics -- which, to my original last point, are only getting more robust! And if these aren't enough, I suspect the addition of LLM-as-a-judge (potentially ensembles) checking for readability/maintainability/security vulnerabilities will eventually put code quality above that of what currently qualifies as "good" code.
Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me -- can you point to categories of extant software that could not be hillclimbed via RL? Or is this just a fundamental disagreement about what it means for software to be "good"? At any rate, I think we can agree that the original claim that "The LLM has one job, to make code that looks plausible. That's it. There's no logic gone into writing that bit of code" is wrong in the context of RL.
I very strongly disagree with this and think this reflects a misunderstanding of model capabilities. This sort of agentic loop with access to ground truth model has been tried in one form or another ever since GPT-3 came out. For four years they didn't work. Models would very quickly veer into incoherence no matter what tooling you gave them.
Only in the last year or so have models gotten capable enough to maintain coherence over long enough time scales that these loops work. And future model releases will tighten up these loops even more and scale them out to longer time horizons.
This is all to say that progress in code production has been essentially driven by progress in model capabilities, and agent loops are a side effect of that rather than the main driving force.
> I don't know if any of this applies to the arguments
> with access to ground truth
There's the connection. You think you have ground truth. No such thing existsYou can talk about how meaningful those exit codes and error messages are or aren't, but the point is that they are profoundly different than the information an LLM natively operates with, which are atomized weights predicting next tokens based on what an abstract notion of a correct line of code or an error message might look like. An LLM can (and will) lie to itself about what it is perceiving. An agent cannot; it's just 200 lines of Python, it literally can't.
In medical AI, where I'm currently working, "ground truth" is usually whatever human experts say about a medical image, and is rarely perfect. The goal is always to do better than whatever the current ground truth is.
Correctness.
> and meets my requirements
It can't do that. "My requirements" wasn't part of the training set.
> It can't do that. "My requirements" wasn't part of the training set.
Neither are mine, the art of building these models is that they are generalisable enough that they can tackle tasks that aren't in their dataset. They have proven, at least for some classes of tasks, they can do exactly that.
> If the model can write code that passes tests
You think tests make code good? Oh my sweet summer child. TDD has been tried many times and each time it failed worse than the last.