I don't agree with this at all. Writing new code is trivially easy; doing a full, in-depth review takes significantly more brain power. You have to fully ascertain and insert yourself into someone else's thought process. That's way more work than following your own thought process.
They basically achieve over 80% agreement with human evaluators [1]. This level of agreement is similar to the consensus rate between two human evaluators, making LLM-as-a-judge a scalable and reliable proxy for human judgment.
[1] https://arxiv.org/abs/2306.05685 (2023)
It sounds nice, but it means at least 1 in 5 judgments are bad. Those are worse odds than rolling a 1 on a d6. You'll be tripping over mistakes constantly.
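The arithmetic checks out, and (assuming independent errors, a simplification not claimed in the thread) it also shows how fast a 20% per-review miss rate compounds over a batch of reviews:

```python
# "Worse odds than rolling a 1 on a d6": 20% vs ~16.7%.
p_bad = 1 - 0.80          # 80% agreement -> 20% disagreement per review
p_d6 = 1 / 6              # chance of rolling a 1 on a fair die
print(p_bad > p_d6)       # True: 0.20 > 0.1667

# Probability that at least one of N reviews is mishandled,
# assuming independence (an illustrative simplification):
for n in (5, 10, 20):
    print(n, round(1 - 0.80 ** n, 3))
# 5 -> 0.672, 10 -> 0.893, 20 -> 0.988
```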
Sure, maybe there's no bug in how the logic is defined in the CR, or even in the context of the project; maybe it won't throw an exception.
But the LLM won't know that the query is iterating over an unindexed field in a DB table that holds tens of millions of rows in prod. The LLM won't know that even though the code says the button should be red and the comments say the button should be red, the corporate style guide says red should be a very specific hex code that this isn't.
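The unindexed-field case is easy to demonstrate: the code is logically correct either way, and only the query plan reveals the problem. A minimal sketch with SQLite's stdlib bindings and a hypothetical `orders` table:

```python
import sqlite3

# Hypothetical schema: the query below is "correct" with or without an
# index, but without one the plan is a full table scan -- harmless at
# 1k rows, painful at tens of millions in prod.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# No index on customer_id: SQLite must scan every row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(plan[3])  # e.g. "SCAN orders"

# With the index, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(plan[3])  # e.g. "SEARCH orders USING INDEX idx_orders_customer ..."
```

Nothing in the diff itself distinguishes the two plans; a reviewer only catches it by knowing the production table size.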
Oh goodness that's like trusting one kid to tell you whether or not his friend lied.
Where trust matters, it's a recipe for disaster.
Give it another year and HN comments will be very different.
Writing tests already works now. It's usually easier to read tests than to read convoluted logic.
Mmmhmm. And you think this "growing up" doesn't have biases to lie in circumstances where it matters? Consider politics. Politics matter. It's inconceivable that a magic algorithm would lie to us about various political concerns, right? Right...?
A magic algorithm lying to us about anything would be extremely valuable to liars. Do you think it's possible that liars are guiding the direction of these magic algorithms?
I notice a distinct lack of blockchain hegemony.