mwigdahl 8mo ago

The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.

I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. I used it on a small toy problem, but one complex enough that the agents couldn't one-shot it and needed error correction.

It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.

https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...

8 comments · 1 top-level

potatolicious 8mo ago · 7 in thread

What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.

So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.

This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.

marcosdumay 8mo ago

It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.

It applies to using LLMs too. I guess the single largest difference here is that LLMs are being pushed by few enough companies, with abundant enough money, that running a test like this would be trivial for them. So the fact that they aren't doing it also says a lot.

oblio 8mo ago

> What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

> "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

Heh, I'd rephrase the first part to:

> What you're getting at is the heart of the problem with software development though, isn't it?

simonw 8mo ago

The UK government ran a study with thousands of developers quite recently: https://www.gov.uk/government/publications/ai-coding-assista...

redhale 8mo ago

I don't necessarily think the conclusions are wrong, but this relies entirely on self-reported survey results to measure productivity gains. That's too easy to poke holes in, and I think studies like this are unlikely to convince real skeptics in the near term.

b_e_n_t_o_n 8mo ago

Woah, finally something with actual metrics instead of vibes!

> Trial participants saved an average of 56 minutes a working day when using AICAs

That feels accurate to me, but again I'm just going on vibes :P

troupo 8mo ago

Before you get into the expensive part, how do you get past "non-deterministic blackbox with unknown layers in between imposed by vendors"?

potatolicious 8mo ago

You can measure probabilistic systems that you can't examine! I don't want to throw the baby out with the bathwater here - before LLMs became the all-encompassing elephant in the room we did this routinely.

You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
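
A minimal sketch of the loaded-die point, under stated assumptions: the die is treated as a black box that can only be rolled, and its bias is estimated purely from observed outcomes. The hidden weights, the roll count, the seed, and the function names (`roll_loaded_die`, `wilson_interval`, `estimate_bias`) are all hypothetical and chosen for illustration; the Wilson score interval is one common way to put an approximate 95% interval on each face's probability, not the only one.

```python
import math
import random
from collections import Counter


def roll_loaded_die(rng, weights):
    """Black box: returns a face 1..6. The estimator never reads `weights`."""
    return rng.choices(range(1, 7), weights=weights, k=1)[0]


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def estimate_bias(n_rolls=20_000, seed=0):
    """Roll the die many times and return an interval per face."""
    rng = random.Random(seed)
    # Hypothetical hidden bias: a six is twice as likely as any other face.
    hidden_weights = [1, 1, 1, 1, 1, 2]
    counts = Counter(
        roll_loaded_die(rng, hidden_weights) for _ in range(n_rolls)
    )
    return {
        face: wilson_interval(counts[face], n_rolls) for face in range(1, 7)
    }


intervals = estimate_bias()
for face, (low, high) in intervals.items():
    print(f"face {face}: {low:.3f} to {high:.3f} (fair die = {1 / 6:.3f})")
```

The same shape applies to an agent evaluation: replace the die roll with "run the task once and record pass or fail", and the interval width shows how many runs are needed before a difference between two tools is distinguishable from noise.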