I tested this once: I gave the same task to a model right after release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about the jobs of software engineers. The second attempt was a nightmare, like a butcher acting as a junior developer performing surgery on a horse.
Is this empirical evidence?
And this is not only my experience.
Calling this psychological is gaslighting.
Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.
But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
Unless he was able to sample at temperature 0 (and got fully deterministic results both times), this could just be random chance. And experience as a SWE doesn't imply experience with statistics and experiment design.
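To make the point concrete, here's a rough sketch of what even a minimal controlled comparison would look like, assuming an OpenAI-style chat API (the model id, prompt, and sample count are all made up for illustration). The takeaway: you'd need many samples per version, not one anecdote per version.

```python
# A minimal sketch, assuming the OpenAI Python SDK (v1+). The model id,
# prompt, and sample count are placeholders, not anything from the thread.
from openai import OpenAI

client = OpenAI()

def sample(prompt: str, n: int = 20) -> list[str]:
    """Collect n completions of the same prompt at temperature 0."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # greedy-ish decoding, but see the comment below
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

# Even at temperature 0, served models are not guaranteed to be
# deterministic (batching and floating-point nondeterminism), so two
# single samples taken weeks apart tell you almost nothing about
# whether the model itself changed.
before = sample("Write a function that parses ISO 8601 timestamps.")
print(len(set(before)), "distinct outputs in 20 runs")
```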
Well, if we see it this way, this is true for Anthropic's benchmarks as well.
Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”
So what I described is the exact definition of empirical.
Users are not reliable model evaluators. It's a lesson the industry will, I'm afraid, have to learn and relearn over and over again.
Whether something is a bug or feature.
Whether the right thing was built.
Whether the thing is behaving correctly in general.
Whether, at a given moment, it's better that the thing occasionally works for a whole range of stuff or that it works perfectly for a small subset.
Whether fast results are more important than absolutely correct results for a given context.
Yes, all of the things above are also related to each other.
The most we have for LLMs is tallying up each user's experience using an LLM over a period of time for a wide range of "compelling" use cases (though the pairings of their prompts and results are empirical, right?).
This should be no surprise, as humans often can't agree on an end-all-be-all intelligence test for humans either.
Why? Because humans suck.
The only thing that matters, and the only way to evaluate performance, is the end result.
But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models at any time. Why don't they do it?
Why in the world, if I'm paying the loss leader price for "unlimited" usage of these models, would any of these companies literally respect my preference to have unfettered access to the most expensive inference?
Especially since one of the hallmark features of GPT-5 was a fancy router system that automatically decides when to use more or less inference, I'm very wary of those `/model` settings.
The way this works is:
1) x% of users have an exceptional first experience by chance. Nobody who has a meh first experience bothers to try a second time.
2) (x%)² of users (x% of that x%) also have an exceptional second experience by chance.
3) So a lot of people with a great first experience think the model started off great and then suddenly got worse.
Suppose 25% have a really great first experience. 25% of them have a great second experience too, but the other 75% see a sudden decline in quality and decide it must be intentional. After the third experience, the population that has "seen a decline" grows again.
So by pure chance and sampling bias you end up convincing a bunch of people that the model used to be great but has gotten worse, while the population who thought it was terrible but got better is much smaller, because most of that group gave up early.
This is not in their heads: they really did see declining success. But they experienced it without any changes to the model at all.
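A quick simulation makes that arithmetic concrete. Everything here is hypothetical: the 25% per-session success rate is just the figure assumed above, and the population size is arbitrary.

```python
# Simulate the sampling-bias story: quality never changes, each session
# independently "feels great" with probability p, and users who have a
# bad first session never come back. All numbers are made up.
import random

random.seed(0)
p = 0.25            # chance any single session feels exceptional
users = 100_000

stayed = got_worse = still_great = 0
for _ in range(users):
    first = random.random() < p
    if not first:
        continue    # meh first experience -> gives up, is never heard from
    stayed += 1
    second = random.random() < p
    if second:
        still_great += 1
    else:
        got_worse += 1

print(f"{stayed} users came back after a great first session")
print(f"{got_worse} of them ({got_worse / stayed:.0%}) saw a 'sudden decline'")
print(f"{still_great} ({still_great / stayed:.0%}) still think it's great")
```

Run it and roughly 75% of the returning users report a decline, even though the "model" in the simulation literally never changed.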
The first time you see a dog that can make pancakes, you’re really focused on the fact that a dog is making pancakes.
After a few weeks of having them for breakfast, you start to notice that the pancakes are actually kind of overcooked and don’t taste that good. Sure it’s impressive that a dog made them, but what use are sub-par pancakes? You’re naturally more focused on what it can’t do than what it can.