I don't agree with this at all. Writing new code is trivially easy; doing a full, in-depth review takes significantly more brain power. You have to fully ascertain and insert yourself into someone else's thought process. That's way more work than following your own thought process.
They basically achieve over 80% agreement with human evaluators [1]. This level of agreement is similar to the consensus rate between two human evaluators, making LLM-as-a-judge a scalable and reliable proxy for human judgment.
[1] https://arxiv.org/abs/2306.05685 (2023)
It sounds nice, but it means at least 1 in 5 judgments are bad. Those are worse odds than rolling a 1 on a d6. You'll be tripping over mistakes constantly.
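The arithmetic checks out, and (assuming independent errors, a simplification not claimed in the thread) it also shows how fast a 20% per-review miss rate compounds over a batch of reviews:

```python
# "Worse odds than rolling a 1 on a d6": 20% vs ~16.7%.
p_bad = 1 - 0.80          # 80% agreement -> 20% disagreement per review
p_d6 = 1 / 6              # chance of rolling a 1 on a fair die
print(p_bad > p_d6)       # True: 0.20 > 0.1667

# Probability that at least one of N reviews is mishandled,
# assuming independence (an illustrative simplification):
for n in (5, 10, 20):
    print(n, round(1 - 0.80 ** n, 3))
# 5 -> 0.672, 10 -> 0.893, 20 -> 0.988
```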
Sure, maybe there's no bug in how the logic is defined in the CR, or even in the context of the project; maybe it won't throw an exception.
But the LLM won't know that the query is iterating over an unindexed field in a DB table that holds tens of millions of rows in prod. The LLM won't know that even though the code says the button should be red and the comments say the button should be red, the corporate style guide says red should be a very specific hex code that this isn't.
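The unindexed-field case is easy to demonstrate: the code is logically correct either way, and only the query plan reveals the problem. A minimal sketch with SQLite's stdlib bindings and a hypothetical `orders` table:

```python
import sqlite3

# Hypothetical schema: the query below is "correct" with or without an
# index, but without one the plan is a full table scan -- harmless at
# 1k rows, painful at tens of millions in prod.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# No index on customer_id: SQLite must scan every row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(plan[3])  # e.g. "SCAN orders"

# With the index, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(plan[3])  # e.g. "SEARCH orders USING INDEX idx_orders_customer ..."
```

Nothing in the diff itself distinguishes the two plans; a reviewer only catches it by knowing the production table size.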
Oh goodness that's like trusting one kid to tell you whether or not his friend lied.
Where trust matters, it's a recipe for disaster.
Give it another year and HN comments will be very different.
Writing tests already works now. It's usually easier to read tests than to read convoluted logic.
Mmmhmm. And you think this "growing up" doesn't have biases to lie in circumstances where it matters? Consider politics. Politics matter. It's inconceivable that a magic algorithm would lie to us about various political concerns, right? Right...?
A magic algorithm lying to us about anything would be extremely valuable to liars. Do you think it's possible that liars are guiding the direction of these magic algorithms?
I notice a distinct lack of blockchain hegemony.