jaccola 1mo ago (0 points)

All of the latest models I've tried actually pass this test. What I found interesting was all of the success cases were similar to:

e.g. "Drive. Most car washes require the car to be present to wash,..."

Only most?!

They have an inability to have a strong "opinion" probably because their post-training, and maybe the internet in general, prefer hedged answers...

Waterluvian 1mo ago

Here’s my take: boldness requires the risk of being wrong sometimes. If we decide being wrong is very bad (which I think we generally have agreed is the case for AIs) then we are discouraging strong opinions. We can’t have it both ways.

idonotknowwhy 1mo ago

Last year's models were bolder. E.g. Sonnet-3.7 (thinking) got it right 10 times without hedging:

>You should drive your car to the car wash. Even though it's only 50 meters away (which is very close), you'll need your car physically present at the car wash to get it washed. If you walk there, you'll arrive without your car, which wouldn't accomplish your goal of getting it washed.

>You'll need to drive your car to the car wash. While 50 meters is a very short distance (just a minute's walk), you need your car to actually be at the car wash to get it washed. Walking there without your car wouldn't accomplish your goal!

etc. The reasoning never second-guesses it either.

A shame they're turning it off in 2 days.

dudefeliciano 1mo ago

Yet the LLMs seem to be extremely bold when they are completely wrong (two Rs in strawberry and so on).

hansmayer 1mo ago

> They have an inability to have a strong "opinion" probably

What opinion? Its evaluation function simply returned the word "Most" as being the most likely first word in similar sentences it was trained on. It's a perfect example showing how dangerous this tech could be in a scenario where the prompter is less competent in the domain they are looking for an answer in. Let's not do the work of filling in the gaps for the snake oil salesmen of the "AI" industry by trying to explain its inherent weaknesses.

wilg 1mo ago

Presumably the OP scare-quoted "opinion" precisely to avoid having to get into this tedious discussion.

lkeskull 1mo ago

This example worked in 2021; it's 2026. Wake up. These models are not just "finding the most likely next word based on what they've seen on the internet".

strix_varius 1mo ago

Well, yes, definitionally they are doing exactly that.

It just turns out that there's quite a bit of knowledge and understanding baked into the relationships of words to one another.

LLMs are heavily influenced by preceding words. It's very hard for them to backtrack on an earlier branch. This is why all the reasoning models use "stop phrases" like "wait", "however", "hold on...". It's literally just text injected in order to make the autocomplete more likely to revise previous bad branches.
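To make "predicting the next token" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `TOY_LOGITS` is a made-up lookup table keyed on only the previous token (a real LLM conditions on the whole context), and the scores are invented, not measured from any model. It only shows the mechanism: turn scores into probabilities, sample one token, then feed that token back in, so an early choice constrains everything that follows.

```python
import math
import random

# Hypothetical toy "model": previous token -> invented scores (logits)
# for the next token. Not taken from any real LLM.
TOY_LOGITS = {
    "<start>": {"Drive.": 2.0, "Walk.": 1.0},
    "Drive.": {"Most": 1.5, "All": 0.5},
    "Walk.": {"It's": 1.0, "Wait,": 1.0},
}


def next_token_probs(prev, temperature=1.0):
    """Softmax over the toy scores for the token that follows `prev`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: s / temperature for tok, s in TOY_LOGITS[prev].items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next(prev, rng, temperature=1.0):
    """Draw one next token at random, weighted by its probability."""
    probs = next_token_probs(prev, temperature)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(rng, temperature=1.0):
    """Sample two tokens; the second depends on whichever first was drawn."""
    first = sample_next("<start>", rng, temperature)
    second = sample_next(first, rng, temperature)
    return [first, second]


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(" ".join(generate(rng)))
```

Once the toy model has emitted "Walk." it can only continue from the "Walk." row, which is the branch-commitment problem described above in miniature. Lowering `temperature` pushes the sampler toward always picking the highest-scoring token.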

jaccola (OP) 1mo ago

The person above was being a bit pedantic, and zealous in their anti-anthropomorphism.

But they are literally predicting the next token. They do nothing else.

Also, if you think they were just predicting the next token in 2021: there has been no fundamental architecture change since then. All gains have come via scale and efficiency optimisations (not to discount that; there is an awful lot of complexity in both of these).

csomar 1mo ago

Unless LLM architecture has changed, that is exactly what they are doing. You might need to learn more about how LLMs work.

andersmurphy 1mo ago

Did you try several times per model? In my experience it's luck of the draw. All the models I tried managed to get it wrong at least once.

The models that had access to search got it right. But then we're just dealing with an indirect version of Google.

(And they got it right for the wrong reasons... i.e. this is a known question designed to confuse LLMs.)

jl6 1mo ago

I guess it didn’t want to rule out the existence of ultra-powerful water jets that can wash a car in sniper mode.

madeofpalk 1mo ago

I enjoyed the DeepSeek response that said “If you walk there, you'll have to walk back anyway to drive the car to the wash.”

There’s a level of earnestness here that tickles my brain.

nozzlegear 1mo ago

Opus 4.6 answered with "Drive." Opus 4.6 in incognito mode (or whatever they call it) answered with "Walk."

AstroBen 1mo ago

They pass it because it went viral a week ago and has been patched.

deevus 1mo ago

I tried with Opus 4.6 Extended and it failed. LLMs are non-deterministic, so I'm guessing if I try a couple of times it might succeed.
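A rough way to put numbers on "try a couple of times": if each attempt were independent and had the same chance `p_single` of producing the right answer, the chance of seeing at least one success in `attempts` tries would be `1 - (1 - p_single) ** attempts`. Both conditions are assumptions, and the rates below are hypothetical, since nobody in this thread measured a per-attempt success rate. The same formula with the failure rate plugged in covers the "got it wrong at least once" observation.

```python
def p_at_least_one(p_single, attempts):
    """Chance of at least one hit in `attempts` independent tries.

    Assumes every try has the same hit probability `p_single`, which is
    a simplification; real model sampling need not behave this way.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, for illustration only.
    for p in (0.2, 0.5, 0.9):
        row = ", ".join(f"{n} tries: {p_at_least_one(p, n):.3f}" for n in (1, 3, 10))
        print(f"p={p}: {row}")
```

Under these assumptions, even a model that succeeds only half the time will almost always pass at least once in ten tries, which is why a single pass or a single fail says little on its own.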

linsomniac 1mo ago

>Only most?!

There is such a thing as "mobile car wash" where they come to you, so "most" does seem appropriate.

zeroonetwothree 1mo ago

Right, I use it all the time.

sneak 1mo ago

There are car wash services that will come to where your car is and wash it. It’s not wrong!

GuB-42 1mo ago

Kind of like this: https://xkcd.com/1368/

And it is the kind of thing a (cautious) human would say.

For example, this could be my reasoning: it sounds like a stupid question, but the guy looked serious, so maybe there are some types of car washes that don't require you to bring your car. Maybe you hand over the keys and they pick up your car, wash it, and put it back in its parking spot while you are doing your groceries or something. I am going to say "most" just to be sure.

Of course, if I expected trick questions, I would have reacted accordingly, but LLMs are most likely trained to take everything at face value, as it is more useful this way. Usually, when people ask LLMs questions, they want a factual answer, not for the LLM to be witty. Furthermore, LLMs are known to hallucinate very convincingly, and hedged answers may be a way to counteract this.

yanis_t 1mo ago

> Most car washes...

I read it as a slightly sarcastic answer.

dyauspitr 1mo ago

There are mobile car washes that come to your house.

andersmurphy 1mo ago

Do they involve you walking to them first?

learingsci 1mo ago

You could, but presumably most people call. I know of such a place. They wash cars on the premises but you could walk in and arrange to have a mobile detailing appointment later on at some other location.

Loocid 1mo ago

That still requires a car present to be washed though.

column 1mo ago

But you can walk over to them and tell them to go wash the car that is 50 meters away. No driving involved.

beaugunderson 1mo ago

Opus 4.6 Extended still fails.

YetAnotherNick 1mo ago

> Only most?!

I mean, I can imagine a scenario where they have a 50 m pipe, which is readily available commercially?

Puts 1mo ago

> Only most?!

What if AI developed sarcasm without us knowing… xD

polynomial 1mo ago

That's the problem with sarcasm...

Hnrobert42 1mo ago

Sure it did.

antonis-gr 1mo ago

Once I asked ChatGPT "it takes 9 months for a woman to make one baby. How long does it take 9 women to make one baby?". The response was "it takes 1 month".

I guess it gives the correct answer now. I also guess that these silly mistakes are patched and these patches compensate for the lack of a comprehensive world model.

These "trap" questions don't prove that the model is silly. They only prove that the user is a smartass. I asked the question about pregnancy only to show a friend that his opinion that LLMs have PhD-level intelligence is naive and anthropomorphic. LLMs are great tools regardless of their ability to understand physical reality. I don't expect my wrenches to solve puzzles or show emotions.
