Yet the data doesn’t show all that much difference between SOTA models. So I have a hard time believing it.
Like, I think there's definitely value in prompting a dozen LLMs with a detailed description of a CMS you want built with 12 specific features, a unit testing suite and mobile support, and then timing them to see how long they take and grading their results. But that's not how most developers use an LLM in practice.
Until LLMs become reliable one-shot machines, the thing I care most about is how well they augment my problem-solving process as I work through a problem with them. I have no earthly idea how to measure that, and I'm highly skeptical of anyone who claims they do. In the absence of empirical evidence, we have to fall back on intuition.
I found this worked surprisingly well. I was certain Claude was best, while they liked Grok and someone else liked ChatGPT. Some AIs just end up fitting best with how you like to chat, I think. I also definitely find Claude best for coding.
Thing is, everyone knows the benchmarks are being gamed. Exactly how is beside the point. In practice, anecdotally, Opus 4.5 is noticeably better than 4, and GPT 5.2 has also noticeably improved. So maybe the real question is: why do you believe this data when it seems at odds with observations by humans in the field?
> Jeff Bezos: When the data and the anecdotes disagree, the anecdotes are usually right.
https://articles.data.blog/2024/03/30/jeff-bezos-when-the-da...
Most of what I can do now with them I could do half a year to a year ago. And all the mistakes and fail loops are still there, across all models.
What changed is the number of magical incantations we throw at these models in the form of "skills" and "plugins" and "tools" hoping that this will solve the issue at hand before the context window overflows.
Unfortunately, I think the overlap between actual model improvements and what people perceive as "better" is quite small. Combine this with the fact that most people desperately want to have a strong opinion even when the factual basis is very weak... "But I can SEE it is X now".
I would think this also correlates with the type of person who hasn't done enough data analysis themselves to understand all the lies and misleading half-truths "data" often tells. Conversely, experience with data inoculates one to some degree against a bullshitting LLM, so it is probably easier to get value from the model.
I would imagine there are all kinds of factors like this that multiply, so some people are having vastly different experiences with the models than others.