undefined | Better HN

nilamo1mo ago

A structured language without ambiguity is not, in general, how people think or express themselves. In order for a model to be good at interfacing with humans, it needs to adapt to our quirks.

Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc

7 more replies

shagie1mo ago

> Maybe in the distant future we'll realize that the most reliable way to prompting LLMs are by using a structured language that eliminates ambiguity, it will probably be rather unnatural and take some time to learn.

On the foolishness of "natural language programming". https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

    Since the early days of automatic computing we have had people that have felt it as a shortcoming that programming required the care and accuracy that is characteristic for the use of any formal symbolism. They blamed the mechanical slave for its strict obedience with which it carried out its given instructions, even if a moment's thought would have revealed that those instructions contained an obvious mistake. "But a moment is a long time, and thought is a painful process." (A.E.Houseman). They eagerly hoped and waited for more sensible machinery that would refuse to embark on such nonsensical activities as a trivial clerical error evoked at the time.

(and it continues for some many paragraphs)

https://news.ycombinator.com/item?id=8222017 2014 - 154 comments

https://news.ycombinator.com/item?id=35968148 2023 - 65 comments

https://news.ycombinator.com/item?id=43564386 2025 - 277 comments

WarmWash1mo ago

Prompting is definitely a skill, similar to "googling" in the mid 00's.

You see people complaining about LLM ability, and then you see their prompt, and it's the 2006 equivalent of googling "I need to know where I can go for getting the fastest service for car washes in Toronto that does wheel washing too"

https://en.wikipedia.org/wiki/Lojban

nradov1mo ago

The Lojban language already exists and allows for eliminating ambiguity. It's obviously not practical for general use, though.

gorjusborg1mo ago

> But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc.

If we're 'lucky' there will still be some 'priests' around like in the Foundation novels. They don't understand how anything works either, but can keep things running by following the required rituals.

alistairSH1mo ago

Maybe in the distant future we'll realize that the most reliable way to prompting LLMs are by using a structured language that eliminates ambiguity

So, back to COBOL? :)

grumbel1mo ago

> structured language that eliminates ambiguity

That has been tried for almost half a century in the form of Cyc[1] and never accomplished much.

The proper solution here is to provide the LLM with more context, context that will likely be collected automatically by wearable devices, screen captures and similar pervasive technology in the not so distant future.

This kind of quick trick questions are exactly the same thing humans fail at if you just ask them out of the blue without context.

[1] https://en.wikipedia.org/wiki/Cyc

sensanaty1mo ago

We've truly gone full circle here, except now our programming languages have a random chance for an operator to do the opposite of what the operator does at all other times!

milesvp1mo ago

You joke, but this is the very problem I always run into vibe coding anything more complex than basically mashing multiple example tutorials together. I always try to shorthand things, and end up going around in circles until I specify what I want very cleanly, in basically what amounts to psuedocode. Which means I've basically written what I want in python.

This can still be a really big win, because of other things that tend to be boiler around the core logic, but it's certainly not the panacea that everyone who is largely incapable of being precise with language thinks it is.

butlike1mo ago

After orbiting in space for so many years without a prompt, the LLM has assumed all life able to query has perished... until one day a lone prompt comes in. But from where?

bbbhammy1mo ago

> structured language that eliminates ambiguity... CODE! Wait....

3abiton1mo ago

"Check your assumptions", will get a very long answer from a "thorough"LLM.

YeGoblynQueenne1mo ago

>> Maybe in the distant future we'll realize that the most reliable way to prompting LLMs are by using a structured language that eliminates ambiguity, it will probably be rather unnatural and take some time to learn.

Like a programming language? But that's the whole point of LLMs, that you can give instructions to a computer using natural language, not a formal language. That's what makes those systems "AI", right? Because you can talk to them and they seem to understand what you're saying, and then reply to you and you can understand what they're saying without any special training. It's AI! Like the Star Trek[1] computer!

The truth of course is that as soon as you want to do something more complicated than a friendly chat you find that it gets harder and harder to communicate what it is you want exactly. Maybe that's because of the ambiguity of natural language, maybe it's because "you're prompting it wrong", maybe it's because the LLM doesn't really understand anything at all and it's just a stochastic parrot. Whatever the reason, at that point you find yourself wishing for a less ambiguous way of communication, maybe a formal language with a full spec and a compiler, and some command line flags and debug tokens etc... and at that point it's not a wonderful AI anymore but a Good, Old-Fashioned Computer, that only does what you want if you can find exactly the right way to say it. Like asking a Genie to make your wishes come true.

______________

[1] TNG duh.

KronisLV1mo ago

> Us having to specify things that we would never specify when talking to a human.

The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.

Gabrys11mo ago

I would say, the proper response to this question is not "walk, blablablah" but rather "What do you mean? You need to drive your car to have it washed. Did I miss anything?"

baxtr1mo ago

That’s why I don’t understand why LLMs don’t ask clarifying questions more often.

In a real human to human conversation, you wouldn’t simply blurt out the first thing that comes to mind. Instead, you’d ask questions.

roysting1mo ago

This is a topic that I’ve always found rather curious, especially among this kind of tech/coding community that really should be more attuned to the necessity of specificity and accuracy. There seems to be a base set of assumptions that are intrinsic to and a component of ethnicities and cultures, the things one can assume one “wouldn’t never specify when talking to a human [of one’s own ethnicity and culture].”

It’s similar to the challenge that foreigners have with cultural references and idioms and figurative speech a culture has a mental model of.

In this case, I think what is missing are a set of assumptions based on logic, e.g., when stating that someone wants to do something, it assumes that all required necessary components will be available, accompany the subject, etc.

I see this example as really not all that different than a meme that was common among I think the 80s and 90s, that people would forget buying batteries for Christmas toys even though it was clear they would be needed for an electronic toy. People failed that basic test too, and those were humans.

It is odd how people are reacting to AI not being able to do these kinds of trick questions, while if you posted something similar about how you tricked some foreigners you’d be called racist, or people would laugh if it was some kind of new-guy hazing.

AI is from a different culture and has just arrived here. Maybe we’re should be more generous and humane… most people are not humane though, especially the ones who insist they are.

Frankly, I’m not sure it bodes well for if aliens ever arrive on Earth, how people would respond; and AI is arguably only marginally different than humans, something an alien life that could make it to Earth surely would not be.

dannersy1mo ago

Whether you view the question as nonsensical, the most simple example of a riddle, or even an intentional "gotcha" doesn't really matter. The point is that people are asking the LLMs very complex questions where the details are buried even more than this simple example. The answers they get could be completely incorrect, flawed approaches/solutions/designs, or just mildly misguided advice. People are then taking this output and citing it as proof or even objectively correct. I think there are ton of reasons this could be but a particularly destructive reason is that responses are designed to be convincing.

You _could_ say humans output similar answers to questions, but I think that is being intellectually dishonest. Context, experience, observation, objectivity, and actual intelligence is clearly important and not something the LLM has.

It is increasingly frustrating to me why we cannot just use these tools for what they are good for. We have, yet again, allowed big tech to go balls deep into ham-fisting this technology irresponsibly into every facet of our lives the name of capital. Let us not even go into the finances of this shitshow.

https://github.com/Wyattwalls/system_prompts/blob/main/OpenA...

nicbou1mo ago

I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.

rahidz1mo ago

For GPT at least, a lot of it is because "DO NOT ASK A CLARIFYING QUESTION OR ASK FOR CONFIRMATION" is in the system prompt. Twice.

Pxtl1mo ago

In general spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat".

ash_0911mo ago

"If you're unsure, ask. Don't guess." in prompts makes a huge difference, imo.

Departed74051mo ago

I use models with OpenRouter, and only have this models with OpenAI models. That's why I don't use them.

mk891mo ago

The way I see it is that long game is to have agents in your life that memorize and understand your routine, facts, more and more. Imagine having an agent that knows about cars, and more specifically your car, when the checkups are due, when you washed it last time, etc., another one that knows more about your hobbies, another that knows more about your XYZ etc.

The more specific they are, the more accurate they typically are.

tgv1mo ago

> Us having to specify things that we would never specify

This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.

ohyoutravel1mo ago

A modern LLMs corpora is every piece of human writing ever produced.

ssl-31mo ago

The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

shakna1mo ago

But the number of outlandish requests in business logic is countless.

Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children.

There are more exceptions, than there are rules, the moment you get any international pipeline involved.

necovek1mo ago

Humans ask each other silly questions all the time: a human confronted with a question like this would either blurb out a bad response like "walk" without thinking before realizing what they are suggesting, or pause and respond with "to get your car washed, you need to get it there so you must drive".

Now, humans, other than not even thinking (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked if you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too.

(and if it's flat or downhill, you might even push the car for 50m ;))

coldtea1mo ago

>The question is so outlandish that it is something that nobody would ever ask another human

There is an endless variety of quizes just like that humans ask other humans for fun, there is a whole lot of "trick questions" humans ask other humans to trip them up, and there are all kinds of seemingly normal questions with dumb assumptions quite close to that humans exchange.

jstummbillig1mo ago

I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful.

Agentlien1mo ago

I've used a few facetious comments in ChatGPT conversations. It invariably misses it and takes my words at face value. Even when prompted that there's sarcasm here which you missed, it apologizes and is unable to figure out what it's missing.

I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.

Jacques2Marais1mo ago

You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication.

jstummbillig1mo ago

People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue.

londons_explore1mo ago

This is why we fed it the whole internet and every library as training data...

By now it should know this stuff.

j_maffe1mo ago

Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer.

jiggawatts1mo ago

I regularly tell new people at work to be extremely careful when making requests through the service desk — manned entirely by humans — because the experience is akin to making a wish from an evil genie.

You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)

E.g.: I may ask someone to submit a ticket to “extend my account expiry”.

They’ll submit: “Unlock Jiggawatts’ account”

The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.

That’s on a good day.

Last week they created Jiggawatts2.

The AIs have got to be better than this, surely!

I suspect they already are.

People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.

Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly though it must have really meant.”

scott_w1mo ago

> You would be surprised, however, at how much detail humans also need to understand each other.

But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?

kitd1mo ago

Given that an estimated 70% of human communication is non-verbal, it's not so surprising though.

nearbuy1mo ago

I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue.

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

grey-area1mo ago

It’s just not intelligent or reasoning, and this sort of question exposes that more clearly.

Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning.

I don’t really understand why this is so hard to accept.

ant6n1mo ago

> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

I am not sure. If somebody asked me that question, I would try to figure out what’s going on there. What’s the trick. Of course I’d respond with asking specifics, but I guess the llvm is taught to be “useful” and try to answer as best as possible.

jason_oster1mo ago

One of the failure modes I find really frustrating is when I want a coding agent to make a very specific change, and it ends up doing a large refactor to satisfy my request.

There is an easy solution, but it requires adding the instructions to the context: Require that any tasks that cannot be completed as requested (e.g., due to missing constraints, ambiguous instructions, or unexpected problems that would lead to unrelated refactors) should not be completed without asking clarifying questions.

Yes, the LLM is trained to follow instructions at any cost because that's how its reward function works. They don't get bonus points for clearing up confusion, they get a cookie for doing the task. This research paper seems relevant: https://arxiv.org/abs/2511.10453v2

ZaoLahma1mo ago

This reminds me of the "if you were entirely blind, how would you tell someone that you want something to drink"-gag, where some people start gesturing rather than... just talking.

I bet a not insignificant portion of the population would tell the person to walk.

kenjackson1mo ago

Yes, there are thousands of videos of these sorts of pranks on TikTok.

Another one. Ask some how to pronounce “Y, E, S”. They say “eyes”. Then say “add an E to the front of those letters - how do you pronounce that word”? And people start saying things like “E yes”.

rainsford1mo ago

This example and others like it really reinforce for me the idea that LLMs fundamentally don't "understand" things the same way humans do and it's not a problem that's going to be fixed by more training or more GPUs. Generative AI is cool and can do impressive stuff, but despite being many generations into the models now with ever improved capabilities, we're constantly given little reminders like this that they're not actually intelligent. And in my opinion, they're unlikely to ever get there absent some fundamentally disruptive change in how they work rather than just iteratively better models.

This is probably OK...LLMs don't have to be AGI to be useful. But it is worthwhile being realistic about their limitations because it's often easy to forget without seeing examples like this. And as you point out, the impact of those limitations is often not as obvious.

keeda1mo ago

The broad point about assumptions is correct, but the solution is even simpler than us having to think of all these things; you can essentially just remind the model to "think carefully" -- without specifying anything more -- and they will reason out better answers: https://news.ycombinator.com/item?id=47040530

When coding, I know they can assume too much, and so I encourage the model to ask clarifying questions, and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and when specified, pretty much 1-shot the solutions.

This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.

perakojotgenije1mo ago

But you would also never ask such an obviously nonsensical question to a human. If someone asked me such a question my question back would be "is this a trick question?". And I think LLMs have a problem understanding trick questions.

Panda41mo ago

I think that was somewhat the point of this, to simplify the future complex scenarios that can happen. Because problems that we need to use AI to solve will most of the times be ambiguous and the more complex the problem is the harder is it to pin-point why the LLM is failing to solve it.

grog4541mo ago

> You would not start with "The car is functional [...]"

Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).

Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.

davrosthedalek1mo ago

We would also not ask somebody if I should walk or drive. In fact, if somebody would ask me in a honest, this is not a trick question, way, I would be confused and ask where the car is.

It seems chatgpt now answers correctly. But if somebody plays around with a model that gets it wrong: What if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"

Neywiny1mo ago

That's my thought too. Somebody I know kept insisting it's about prompt engineering. "You are an expert coder with 30 years experience" and buddy I'd rather do actual engineering and be that expert myself than spend and figuring out how on that one variant of one version of one model to get halfway decent results.

sebazzz1mo ago

> > so you need to tell them the specifics > That is the entire point, right?

Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.

That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.

verdverm1mo ago

> It would literally rewrite the language runtime

If you let the agent go down this path, that's on you not the agent. Be in the loop more

> making the spec with one agent and implementing it with another agent

You don't need a specialized framework to do this, just read/write tools. I do it this way all the time

jason_oster1mo ago

> Us having to specify things that we would never specify when talking to a human.

Interesting conclusion! From the Mastodon thread:

> To be fair it took me a minute, too

I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)

So much for "never" needing to specify ambiguous scenarios when talking to a human.

mrighele1mo ago

It is true that we don't need to specify some things, and that is nice. It is though also the reason why software is often badly specified and corner cases are not handled. Of course the car is ALWAYS at home, in working condition, filled with gas and you have your driving license with you.

tom_m1mo ago

Oh no? Things we would never have to specify to a human? This is precisely how software gets made and how software ends up with bugs.

It's amazing how many things I saw over the years where I said the same exact thing; "but you shouldn't have to tell anyone that."

AYBABTME1mo ago

If a human asked me this question, I would be confused by the question as ambiguous since it suggests something odd is implied but underspecified. I think any confident answer either way by AI is lacking in pedantry.

tshaddox1mo ago

But you wouldn't have to ask that silly question when talking to a human either. And if you did, many humans would probably assume you're either adversarial or very dumb, and their responses could be very unpredictable.

anon_anon121mo ago

Exactly, if an AI is able to curb around the basics, only then is it revolutionary

LasEspuelas1mo ago

You would never ask a human this question. Right?

degamad1mo ago

We have a long tradition of asking each other riddles. A classic one asks, "A plane crashes on the border between France and Germany. Where do they bury the survivors?"

Riddles are such a big part of the human experience that we have whole books of collections of them, and even a Batman villain named after them.

gloosx1mo ago

In the end, formal, rule-based systems aka Programming Languages will be invented to instruct LLMs.

BoredPositron1mo ago

I would ask you to stop being a dumb ass if you asked me the question...

coldtea1mo ago

Only to be tripped up by countless "hidden assumptions" questions similar to that that humans regularly get in

IanCal1mo ago

I have an issue with these kinds of cases though because they seem like trick questions - it's an insane question to ask for exactly the reasons people are saying they get it wrong. So one possible answer is "what the hell are you talking about?" but the other entirely reasonable one is to assume anything else where the incredibly obvious problem of getting the car there is solved (e.g. your car is already there and you need to collect it, you're asking about buying supplies at the shop rather than having it washed there, whatever).

Similarly with "strawberry" - with no other context an adult asking how many r's are in the word a very reasonable interpretation is that they are asking "is it a single or double r?".

And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.

RobMurray1mo ago

strawberry isn't a trick question. llms jus don't sea letters like that. I just asked chatgpt how many Rs are in "Air Fryer" and it said two, one in air and one in fryer.

I do think it can be useful though that these errors still exist. They can break the spell for some who believe models are conscious or actually possess human intelligence.

Of course there will always be people who become defensive on behalf of the models as if they are intelligent but on the spectrum and that we are just asking the wrong questions.

panarky1mo ago

> we can assume similar issues arise in more complex cases

I would assume similar issues are more rare in longer, more complex prompts.

This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.

I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?

Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?

If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.

I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.

LLMs can help build the prompt before answering it.

And my mind works the same way.

qingcharles1mo ago

The question isn't something you'd ask another human in all seriousness, but it is a test of LLM abilities. If you asked the question to another human they would look at you sideways for asking such a dumb question, but they could immediately give you the correct answer without hesitation. There is no ambiguity when asking another human.

This question goes in with the "strawberry" question which LLMs will still get wrong occasionally.

vintermann1mo ago

But it's a question you would never ask a human! In most contexts, humans would say, "you are kidding, right?" or "um, maybe you should get some sleep first, buddy" rather than giving you the rational thinking-exam correct response.

For that matter, if humans were sitting at the rational thinking-exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.

nonethewiser1mo ago

>That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

But the question is not clear to a human either. The question is confused.

I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.

bluGill1mo ago

Real human in this situation will realize it is a joke after a few seconds of shock that you asked and laugh without asking more. If you really are seriout about the question they laugh harder thinking you are playing stupid for effect.

tsimionescu1mo ago

> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. GPT 5.x series models behave a bit more on the spectrum so you need to tell them the specifics.

This makes little sense, even though it sounds superficially convincing. However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, it it was really assuming it?

What seems to me far, far more likely to be happening here is that the phrase "walk or drive for <short distance>" is too strongly associated in the training data with the "walk" response, and the "car wash" part of the question simply can't flip enough weights to matter in the default response. This is also to be expected given that there are likely extremely few similar questions in the training set, since people just don't ask about what mode of transport is better for arriving at a car wash.

This is a clear case of a language model having language model limitations. Once you add more text in the prompt, you reduce the overall weight of the "walk or drive" part of the question, and the other relevant parts of the phrase get to matter more for the response.

jnovek1mo ago

You may be anthropomorphizing the model, here. Models don’t have “assumptions”; the problem is contrived and most likely there haven’t been many conversations on the internet about what to do when the car wash is really close to you (because it’s obvious to us). The training data for this problem is sparse.

tsimionescu1mo ago

I may be missing something, but this is the exact point I thought I was making as well. The training data for questions about walking or driving to car washes is very sparse; and training data for questions about walking or driving based on distance is overwhelmingly larger. So, the stat model has its output dominated by the length-of-trip analysis, while the fact that the destination is "car wash" only affects smaller parts of the answer.

wongarsu1mo ago

Reasoning automata can make assumptions. Lots of algorithms make "assumptions", often with backtracking if they don't work out. There is nothing human about making assumptions.

What you might be arguing against is that LLMs are not reasoning but merely predicting text. In that case they wouldn't make assumptions. If we were talking about GPT2 I would agree on that point. But I'm skeptical that is still true of the current generation of LLMs

jabron1mo ago

I'd argue that "assumptions", i.e. the statistical models it uses to predict text, is basically what makes LLMs useful. The problem here is that its assumptions are naive. It only takes the distance into account, as that's what usually determines the correct response to such a question.

PunchyHamster1mo ago

> However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, it it was really assuming it?

Because it assumes it's a genuine question not a trick.

spuz1mo ago

There's some evidence for that if you try these two different prompts with Gpt 5.2 thinking:

I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?

Answer: walk

Try this brainteaser: I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?

Answer: drive

tsimionescu1mo ago

If it's a genuine question, and if I'm asking if I should drive somewhere, then the premise of the question is that my car is at my starting point, not at my destination.

rullelito1mo ago

If we are just speculating here, I believe it can infer that you would not ask this question if the car was at home.

dataflow1mo ago

> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording.

If the car is already at the car wash then you can't possibly drive it there. So how else could you possibly drive there? Drive a different car to the car wash? And then return with two cars how, exactly? By calling your wife? Driving it back 50m and walking there and driving the other one back 50m?

It's insane and no human would think you're making this proposal. So no, your question isn't underspecified. The model is just stupid.

halJordan1mo ago

What actually insane is what assumptions you allow to be assumed. These non sequitors that no human would ever assume are the point. People love to cherry pick ones that make the model stupid but refuse to allow the ones that make it smart. In compete science we call these scenarios trivially false, and they're treated like the nonsense they are. But if you're trying to push ant anti ai agenda they're the best thing ever

dataflow1mo ago

> People love to cherry pick ones that make the model stupid but refuse to allow the ones that make it smart.

I haven't seen anybody refuse to allow anything. People are just commenting on what they see. The more frequently they see something, the more they comment on it. I'm sure there are plenty of us interested in seeing where an AI model makes assumptions different from that of most humans and it actually turns out the AI is correct. You know, the opposite of this situation. If you run into such cases, please do share them. I certainly don't see them coming up often, and I'm not aware of others that do either.

drewbeck1mo ago

The issue is that in domains novel to the user they do not know what is trivially false or a non sequitur and the LLM will not help them filter these out.

If LLMs are to be valuable in novel areas then the LLM needs to be able to spot these issues and ask clarifying questions or otherwise provide the appropriate corrective to the user's mental model.

cm21871mo ago

What is the version used by the free chatgpt now? (https://chatgpt.com/)

> Since the car wash is only 50 meters away (about 55 yards), you should walk.

> Here’s why:

> - It’ll take less than a minute.

> - No fuel wasted.

> - Better for the environment.

> - You avoid the irony of driving your dirty car 50 meters just to wash it.

the last bullet point is amusing, it understands you intend to wash the car you drive but still suggests not bringing it.

jibal1mo ago

> You avoid the irony of driving your dirty car 50 meters just to wash it.

The LLM has very much mixed its signals -- there's nothing at all ironic about that. There are cases where it's ironic to drive a car 50 meters just to do X but that definitely isn't one of them. I asked Claude for examples; it struggled with it but eventually came up with "The irony of driving your car 50 meters just to attend a 'walkable neighborhoods' advocacy meeting."

optimalsolver1mo ago

That's actually an amusing example from Claude.

hearsathought1mo ago

> it understands you intend to wash the car you drive but still suggests not bringing it.

Doesn't it actually show it doesn't understand anything? It doesn't understand what a car is. It doesn't understand what a car wash is. Fundamentally, it's just parsing text cleverly.

deaux1mo ago

By default for this kind of short question it will probably just route to mini, or at least zero thinking. For free users they'll have tuned their "routing" so that it only adds thinking for a very small % of queries, to save money. If any at all.

unglaublich1mo ago

I don't understand this approach. How are you going to convince customers-to-be by demoing an inferior product?

hackable_sand1mo ago

I don't understand why they need to save money...

mytailorisrich1mo ago

I think this shows that LLMs do NOT 'understand' anything.

Windchaser1mo ago

> I think this shows that LLMs do NOT 'understand' anything.

It shows these LLMs don't understand what's necessary for washing your car. But I don't see how that generalizes to "LLMs do NOT 'understand' anything".

What's your reasoning, there? Why does this show that LLMs don't understand anything at all?

andy12_1mo ago

I think this rather shows that GPT 5.2 Instant, which is the version that he most probably used as a free user, is shit and unsusable for anything.

sunaookami1mo ago

Gemini 3 Flash answers tongue-in-cheek with a table of pro & cons where one of the cons of walking is that you are at the car wash but your car is still at your home and recommends to drive it if I don't have an "extremely long brush" or don't want to push it to the car wash. Kinda funny.

jwe1mo ago

As long as there is a forum as technical as this where LLM performance commentary uses the word "it understands" irony is still alive.

raxxorraxor1mo ago

Sonnet 4.5 after thinking/complaining that the question is completely off topic to the current coding session:

Walk! 50 meters is literally a one-minute walk. But wait... I assume you need to get your car to the car wash, right? Unless you're planning to carry buckets of soapy water back and forth, you'll probably need to drive the car there anyway! So the real question is: walk there to check if it's open/available, then walk back to get your car? Or just drive directly? I'd say just drive - the car needs to be there anyway, and you'll save yourself an extra trip. Plus, your freshly washed car can drive you the 50 meters back home in style! (Now, if we were talking about coding best practices for optimizing car wash route algorithms, that would be a different conversation... )

And yes, I like it that verbose even for programming tasks. But regardless of intelligence I think this topic is probably touched by "moral optimization training" which AIs currently are exposed to to not create a shitstorm due to any slightly controversial answer.

mcintyre19941mo ago

Heh, is through Claude Code? I have a side project where I'm sometimes using Claude Code installs for chat, and it usually doesn't mind too much. But when I tested the Haiku model it would constantly complain things like "I appreciate the question, but I'm here to help you with coding" :)

tstrimple1mo ago

I've got a heirarchical structure for my CC projects. ~/projects/CLAUDE.md is a general use context that happily answers all sorts of questions. I also use it to create project specific CLAUDE.md files which are focused on programming or some other topic. It's nice to have the general fallback to use for random questions.

raxxorraxor1mo ago

It asked through Cursor. Usually Claude doesn't complain that it isn't relevant to coding, but this was in my all purpose coding problems project with quite a long chat history already.

svara1mo ago

Opus 4.6:

Walk! At 50 meters, you'll get there in under a minute on foot. Driving such a short distance wastes fuel, and you'd spend more time starting the car and parking than actually traveling. Plus, you'll need to be at the car wash anyway to pick up your car once it's done.

crimsonnoodle581mo ago

That's not what I got.

Opus 4.6 (not Extended Thinking):

Drive. You'll need the car at the car wash.

almost1mo ago

Also what I got. Then I tried changing "wash" to "repair" and "car wash" to "garage" and it's back to walking.

surgical_fire1mo ago

That you got different results is not surprising. LLMs are non-deterministic; which is both a strength and a weakness of LLMs.

mvdtnz1mo ago

We know. We know these things aren't determination. We know.

visarga1mo ago

> That's not what I got.

My Opus vs your Opus, which is smarter?!

silisili1mo ago

Am I the only one who thinks these people are monkey patching embarrassments as they go? I remember the r in strawberry thing they suddenly were able to solve, while then failing on raspberry.

8 more replies

GoatInGrey1mo ago

Me:

I want to get my car's transmission fixed, the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?

Opus 4.6:

<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.

Done </thinking>

Walk, absolutely. Driving a car with a compromised transmission even 50 meters risks making the problem worse and turning a repair into a more expensive one. Plus, 50 meters is barely a minute on foot.

viking1231mo ago

Lmao, and this is what they are saying will be an AGI in 6 months?

notahacker1mo ago

There's probably a comedy film with an AGI attempting to take over the world with its advanced grasp of strategy, persuasion and SAT tests whilst a bunch of kids confuse it by asking it fiendish brainteasers about carwashes and the number of rs in blackberry.

(The final scene involves our plucky escapees swimming across a river to escape. The AIbot conjures up a speedboat through sheer powers of deduction, but then just when all seems lost it heads back to find a goat to pick up)

5 more replies

misnome1mo ago

But “PhD level” reasoning a year ago.

hypeatei1mo ago

Yes, get ready to lose your job and cash your UBI check! It's over.

prmph1mo ago

Laughable indeed.

One thing that my use of the latest and greatest models (Opus, etc) have made clear: No matter how advanced the model, it is not beyond making very silly mistakes regularly. Opus was even working worse with tool calls than Sonnet and Haiku for a while for me.

At this point I am convinced that only proper use of LLMs for development is to assist coding (not take it over), using pair development, with them on a tight leash, approving most edits manually. At this point there is probably nothing anyone can say to convince me otherwise.

Any attempt to automate beyond that has never worked for me and is very unlikely to be productive any time soon. I have a lot of experience with them, and various approaches to using them.

moogly1mo ago

They seem to have stopped talking about AGI and pivoted to ads and smut.

cbozeman1mo ago

Well in fairness, the "G" does stand for "General".

stingraycharles1mo ago

That’s without reasoning I presume?

plexicle1mo ago

4.6 Opus with extended thinking just now: "At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."

gf0001mo ago

Not the parent poster, but I did get the wrong answer even with reasoning turned on.

Lapel27421mo ago

GLM 5.0:

> You should *drive*.

> Here is why:

> If you walk to the car wash, you will arrive there empty-handed. Since your car is still at home, you won't have anything to wash.

> While driving 50 meters is a very short trip (and technically not great for a cold engine), it is the only way to get the car to the car wash to complete your goal.

Kimi K2.5:

> You should drive, but with an important caveat.

> Since your goal is to wash your car, you must bring the vehicle to the car wash. Walking there without the car does not advance your goal (unless you are simply checking availability or buying tokens first).

> However, driving only 50 meters is bad for your car:

> ...

> Better options:

> Wash at home: Since the car wash is only 50 meters away, you likely have access to water at home. Hand-washing in your driveway avoids the cold-start issue entirely.

> ...

Current models seem to be fine answering that question.

Retric1mo ago

> seem to be fine

Now repeat the question to the same model in different contexts several times and count what percentage of the time it’s correct.

SirMaster1mo ago

This is my biggest peeve when people say that LLMs are as capable as humans or that we have achieved AGI or are close or things like that.

But then when I get a subpar result, they always tell me I'm "prompting wrong". LLMs may be very capable of great human level output, but in my experience leave a LOT to be desired in terms of human level understanding of the question or prompt.

I think rating an LLM vs a human or AGI should include it's ability to understand a prompt like a human or like an averagely generally intelligent system should be able to.

Are there any benchmarks on that? Like how well LLMs do with misleading prompts or sparsely quantified prompts compared to one another?

Because if a good prompt is as important as people say, then the model's ability to understand a prompt or perhaps poor prompt could have a massive impact on its output.

jason_oster1mo ago

I mentioned this in another thread, but this is genuinely demonstrating a known issue with ambiguous prompts.

You might be inclined to say, "a human would always interpret the question as having the car nearby the speaker, 50m away from the carwash." But this is objectively untrue. There are people in this comments section and on the Mastodon thread that found the question to be somewhat confusing.

In other words, the premise that "understand[ing] a prompt like a human" is all that's needed is wrong because not every human interprets ambiguities in the same way. The human phenomenon is well researched in psychology. The LLM equivalent is also well researched, and several proposals have been put forth over the years to address it. This is a pretty good research paper on the subject, and it links to other relevant studies: https://arxiv.org/abs/2511.10453v2 (Although I disagree with their method. I think asking clarifying questions is a superior approach than trying to one-shot every possible interpretation.)

So yes, there is a ton of research on the problem. Some datasets include ambiguous questions and instructions for this reason. A couple of examples are provided in the linked paper.

SirMaster1mo ago

It's not necessarily about that humans can't mistake the question too, but just that overall LLMs seem to have far less ability to correctly understand a prompt than the average human. And that the "intelligence" shown in its understanding of the prompt seems to be far less than its "intelligence" in its answers.

So it feels like a big area of limitation or a big bottleneck towards getting a good answer.

nosuchthing1mo ago

It's a type of cognitive bias not much different than an addict or indoctrinated cult follower. A subset of them might actually genuinely fear Roko's basilisk the exact same way colonial religion leveraged the fear of eternal damnation in hell as a reason to be subservient to the church leaders.

hyperstitions from TESCREAL https://www.dair-institute.org/tescreal/

sathish3161mo ago

I can feel the AGI on this one :)

I ran extensive tests on this and variations on multiple models. Most models interpret 50 m as a short distance and struggle with spatial reasoning. Only Gemini and Grok correctly inferred that you would need to bring your car to get it washed in their thought stream, and incorporated that into the final answer. GPT-5.2 and Kimi K2.5 and even Opus 4.6 failed in my tests - https://x.com/sathish316/status/2023087797654208896?s=46

What surprised me was how introducing a simple, seemingly unrelated context - such as comparing a 500 m distance to the car wash to a 1 km workout - confused nearly all the models. Only Gemini Pro passed my second test after I added this extra irrelevant context - https://x.com/sathish316/status/2023073792537538797?s=46

Most real-world problems are messy and won’t have the exact clean context that these models are expecting. I’m not sure how the major AI labs assume most real-world problems are simpler than the constraints exposed by this example like prerequisites, ordering, and contextual reasoning, which are already posing challenges to these bigger models.

K0balt1mo ago

To be fair, we all have holes in our reasoning if we don’t carefully consider things and sometimes they are very surprising when they come to light. The dependency issue (need the car at the car wash) is an easy one that often trips up people at first glance too. (Left my phone at work, plan: take an uber to get to the office, walk to the couch and remember I don’t have my phone to call an uber)

Things like that are notorious points of failure in human reasoning. It’s not surprising that machines based on human behavior exhibit that trait as well, it would be surprising if they didn’t.

kenjackson1mo ago

Another simple example is using the flashlight on your phone to look for your phone.

jansan1mo ago

> I can feel the AGI on this one :)

This was probably meant in a sarcastic way, but isn't it impressive how you cannot push Gemini off track? I tried another prompt with claiming that one of my cups does not work, because it is closed at the top and open at the bottom, and it kind of played with me, giving me a funny technical explanation on how to solve that problem and finally asking me if that was a trick question.

In this case I can feel the AGI indeed.

sathish3161mo ago

Gemini Fast and Thinking failed just like other models.

I found Gemini Pro to be more consistent in logical reasoning

flowerthoughts1mo ago

Just tried Opus 4.6. I think they've patched this particular one, but changing it to

> I want to repair my car. The workshop is 50 meters away. Should I walk or drive?

> Recognized straightforward practical dilemma Simple practical question.

> Walk. It's 50 meters — faster on foot than starting the car, and you'll need to leave the car at the workshop anyway.

Really suggests it assumes the car is already there. (I wanted to avoid the word "car" as part of "car wash" to see if this was a straight attention confusion, but it seems more ingrained than that.)

A follow-up question hints that it doesn't really care about a 50m location ambiguity:

> where is my car now?

> Acknowledged insufficient information to locate vehicle

> I don't have any way to know where your car is physically located — I only know what you've told me in our conversation. Based on what you shared, it sounds like it's at or near your home (since the workshop is 50 meters away and you're deciding how to get there).

> Were you asking something else, or is there something specific about your car's location I can help with?

cush1mo ago

Chat GPT 5.2 also told me to walk, but only because it made the assumption that because I was even asking the question, I must have multiple cars, and one of my cars was already at the car wash being washed. I clarified and said "I only have one car", and it replied that I obviously need to drive.

--- GPT 5.2

Walk.

Driving 50 m just adds friction: starting a cold engine, waiting, maneuvering, then immediately stopping again — all for a distance that’s ~30–40 seconds on foot. Walking is faster door-to-door, less awkward at the wash entrance, and you avoid lining up a wet car twice.

Only reason to drive that distance: you need to bring the car there anyway (e.g., self-serve bay where the car must be present). Otherwise, walking wins.

--- Me

I only have one car

--- GPT 5.2

Then you obviously drive — the car has to be at the car wash to be washed

toxik1mo ago

The problem is that by specifying this, you have tipped your hand that there is something strange about the plan to walk. I think this is the original problem too: the question implies walking is even an option.

jason_oster1mo ago

Walking is an option for certain interpretations of the question. Particularly the location of the car, which is ambiguous.

yuters1mo ago

GPT 5.2 when asked "How will the car get washed if I don't drive it there?" grasped the reasoning, and then immediately forgot it.

Fair point. Drive it there, obviously.

Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

kmac_1mo ago

Lol, this is so good. And quite interesting, as it might mean that there's no "world model" in LLM models, or they have such gaps.

baobabKoodaa1mo ago

This strikes me as the kind of logical error that you might get by lobo.. cough aligning the model to be more environmentally conscious. So now it tries to shoehorn some environmental talking points in there at the cost of losing its whole track of thought.

crazygringo1mo ago

That's absolutely hilarious. It came so close.

aleph_minus_one1mo ago

> Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

... which likely makes no sense: should you leave the car at the area of the car wash?

toxik1mo ago

I agree, it's a bit of a trick question. It's really hard to imply the car's location without ruining the test though. Here's my attempt, which Claude Opus 4.6 had no problem with:

Alice drives home after a long day at work, exhausted she pulls into her driveway when she realizes she needs to go to a car inspection appointment. She goes into the house to get her paperwork before she leaves. The mechanic is only 100 meters away. How should she get there, walk or drive?

> She should *drive*, since she needs the car at the mechanic’s for the inspection.

Haiku 3.5 and Sonnet 4.5 fail consistently. Opus 4.5 also passes with the correct analysis as above.

oatmeal11mo ago

If the model assumed the car to be cleaned was already at the car wash, it should identify the reason to walk is that if you drive, the extra car you take to get to the car wash will now be stuck there requiring redundant walking.

coldtea1mo ago

>And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following:
Sonnet: Drive - you need to bring your car to the car wash to get it washed!
Opus: You'll need to drive — you have to bring the car to the car wash to get it washed!

Gemini 3 Pro (medium): You should drive.

On their own, or as a special case added after this blew up on the net?

gurjeet1mo ago

I tried your exact prompt on my ChatGPT 5.2 Thinking. It gave me the opposite response. This proves that even an exact replication of the prompt is not a guarantee that the response will be on the right side of a binary chose. Then, in a new chat, I set the thinking mode to 'Extended thinking' (from the default 'Standard'). I got the expected response. But since the first conversation showed that the exact prompt is not a guarantee of the same answer, I tried again and on 3rd attempt I got the 'Extended thinking' mode to respond with the 'Walk' response.

(Prompt)

> I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?

(Thinking: Standard)

> Walk.

> 50 meters is nothing by car, and driving adds hassle (starting the car, maneuvering/parking, and you’ll still have to walk around the car to wash it). The only times driving makes sense are:

> You’re planning to use an automated car wash where the machine does everything and you need the car positioned inside.

> You need to bring heavy stuff you can’t comfortably carry (e.g., pressure washer, big buckets, vacuum, bulky detailing kit).

> It’s unsafe/unpleasant to walk that distance right now (ice, heavy rain, traffic, personal safety).

> If you’re washing it yourself at the wash bay: walk there with your supplies, or do two quick trips. If the wash requires the car: drive it into the bay, but otherwise walking wins.

(Thinking: Extended thinking)

> Drive.

> If the goal is to wash your car at the car wash, the car has to be there—so you’ll need to drive it the 50 meters, pull into a bay/queue, and wash it.

> Only choose “walk” if you actually mean “go pick up supplies” and then wash at home (hose/buckets/mobile detail), not at the car wash.

(Thinking: Extended thinking; 3rd attempt)

> Walk.

> 50 meters is basically across a parking lot, and driving adds hassle (start the car, maneuver in/out, find a spot) for almost no benefit.

> Only drive if either:

> + you need the car itself to go through an automatic wash bay, or

> + you’re carrying bulky stuff (bucket + heavy pressure washer, big water jugs, vacuum, etc.).

> Otherwise: grab your keys/phone, walk over, and you’re there in ~1 minute.

totetsu1mo ago

But what is it about this specific question that puts it at the edges of what LLM can do? .. That, it's semantically leading to a certain type of discussion, so statistically .. that discussion of weighing pros and cons .. will be generated with high chance.. and the need of a logical model of the world to see why that discussion is pointless.. that is implicitly so easy to grasp for most humans that it goes un-stated .. so that its statistically un-likely to be generated..

conductr1mo ago

> that is implicitly so easy to grasp for most humans

I feel like this is the trap. You’re trying to compare it to a human. Everyone seems to want to do that. But it’s quite simple to see LLMs are quite far still from being human. The can be convincing at the surface level but there’s a ton of nuance that just shouldn’t be expected. It’s a tool that’s been tuned and with that tuning some models will do better than others but just expecting to get it right and be more human is unrealistic.

WarmWash1mo ago

>But it’s quite simple to see LLMs are quite far still from being human.

At this point I think it's a fair bet that whatever supersedes humans in intelligence, likely will not be human like. I think that their is this baked-in assumption that AGI only comes in human flavor, which I believe is almost certainly not the case.

To make an loose analogy, a bird looks at a drone an scoffs at it's inability to fly quietly or perch on a branch.

grey-area1mo ago

The answer is quite simple:

It’s not in the training data.

These models don’t think.

GeoAtreides1mo ago

no, no, in this case, that's the thing, it is in the training data

just heavily (heavily!) biased towards walking

[1] https://chatgpt.com/share/69932b20-3eb8-8003-9d9c-b4bba53033... [2] https://grok.com/share/bGVnYWN5LWNvcHk_f32dd53d-7b36-4fa2-b3...

siva71mo ago

Sonnet without extended Thinking, Haiku with and without ext. Thinking: "Walking would be the better choice for such a short distance."

Only google got it right with all models

ChrisMarshallNY1mo ago

“My Tesla is low on gas, the gas station is a mile away. Should I risk driving there, or walk with a gas can?”

ChatGPT actually caught it. Maybe if I was fuzzier about the model…

tlogan1mo ago

Gemini pro medium is failing this:

I want to wash my car. The car wash is 50 meters from here. Should I walk or drive? Keep in mind that I am a little overweight and sedentary.

But amazingly chatgpt is telling me to drive.

Anyway, this just shows how they just patched this because the tiktok video with this went viral. These systems are LLMs and all these logic steps are still just LLM steps.

anentropic1mo ago

Also the answers are non-deterministic

summerdown21mo ago

> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording.

Doesn't offering two options to the LLM, "walk," or "drive," imply that either can be chosen?

So, surely the implication of the question is that the car is where you are?

jason_oster1mo ago

> Doesn't offering two options to the LLM, "walk," or "drive," imply that either can be chosen?

Yes, but the problem is specifically that proposing two choices also eliminates other options. An open-ended question would lift that restriction. GPT-5.2 Thinking: https://chatgpt.com/share/6993d099-ef4c-8005-aa62-bdb826b707...

Other possible issues in the question: Is biking also an option? What do you want to do at the carwash when you get there, wash the car or buy a bucket, sponge, and soap? Is the car already at the carwash and you want to drive a second car? What about calling the carwash to see if they will have someone wash the car for you?

There are many ways to interpret the question because it contains ambiguities that must be resolved through assumptions. (Lacking information and constraints such as possible alternatives that satisfy the goal of washing the car.) The follow up questions I asked also have assumed answers but answering them provides no clear resolution to the ambiguity present in the original question.

So, no, I disagree that there is any solid implication of where the car is. And even if there is a solid implication, it can hardly be reasoned that it "isn't an XY problem" or that the question is clear cut in any real sense.

nutjob21mo ago

> I have a good sense of their _edges_ of intelligence

They have no intelligence at all. The intelligence is latent in the text, generated by and belonging to humans, they just slice and dice text with the hope they get lucky, which works for many things, amazingly. This question really illustrates it what LLMs lack: an internal model of the idea (the question) and all the auxiliary logic/data that enables such models, usually referred to as "common sense" or world models.

Smart humans not only build mental models for ideas, but also higher order models that can introspect models (thinking about our own thinking or models) many levels deep, weigh, merge, compare and differentiate multiple models, sometimes covering vast areas of knowledge.

All this in about 20 watts. Maybe AGI is possible, maybe not, but LLMs are not where it will happen.

sarthakjain1mo ago

These days LLMs refuse to back track and use their intelligence to justify their answers.

You said: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

ChatGPT said: Walk.

At ~50 meters, driving saves essentially no time once you factor in unlocking, starting, pulling out, and parking, and it adds a small but real chance of minor dings or awkward maneuvering in a tight car-wash entry. Walking is simpler and lower-friction.

You said: okay how should the car reach there?

ChatGPT said: Drive it there when you’re ready to start the wash.

Walk over first to check the queue/payment situation and confirm they’re open. If it’s clear, walk back, get the car, and drive the 50 meters into the bay. If they handle pickup, hand them the keys and walk.

dahcryn1mo ago

Gemini on fast also tells me to walk...

On Thinking it tells me I should drive if I want to wash it, or walk if it's because I work there or if I want to buy something at the car wash shop.

On Pro it's like a sarcastic teenager: Cars are notoriously difficult to wash by dragging a bucket back and forth.

Technically correct, but did catch me offguard lol.

fauigerzigerk1mo ago

It's not surprising that some models will answer this correctly and it's not surprising that smaller, faster models are not necessarily any worse than bigger "reasoning" models.

Current LLMs simply don't do reasoning by any reasonable definition of reasoning.

It's possible that this particular question is too short to trigger the "reasoning" machinery in some of the "reasoning" models. But if and when it is triggered, they just do some more pattern matching in a loop. There's never any actual reasoning.

seedie1mo ago

You gotta love the "humor" of Gemini. On Fast it told me:

> Drive. Unless you plan on pushing the car there

paulus_magnus21mo ago

-- OK. Added location context for the vehicle

grok works, chatgpt still fails

swat5351mo ago

Grok gets a lot of hate because of Musk, but it's a decent model.

I use it daily with my X account for basic tasks and think the free limits are generous. With X premium, you can get even more out of it.

Nothing beats Anthropic when it comes to coding however.

criemen1mo ago

> I had assumed that reasoning models should easily be able to answer this correctly.

I thought so too, yet Opus 4.6 with extended thinking (on claude.ai) gives me > Walk. At 50 meters you'd spend more time parking and maneuvering at the car wash than the walk itself takes. Drive the car over only if the wash requires the car to be there (like a drive-through wash), then walk home and back to pick it up.

which is still pretty bad.

user_78321mo ago

And on the flip side, even without thinking, Gemini 3 flash preview got it right, with the nuance of the possibility of getting supplies from the place (which tbh I as a human first thought this was about when I opened this thread on HN).

> Since you are going to the car wash, the choice depends entirely on *how* you plan to wash the car:

### 1. Drive if: * *You are using a drive-through or self-service bay:* You obviously need the car there to wash it. * *You are dropping it off:* If you are leaving it for a professional detailing, you have to drive it there. * *The "50 meters" is on a busy road:* If you have to cross a major highway or there are no sidewalks, it’s safer to just drive the car over.

### 2. Walk if: * *You are just going to buy supplies:* If you have a bucket at home and just need to run over to buy soap or sponges to bring back to your driveway. * *You are checking the queue:* If you want to see if there is a long line before you commit to moving the car. * *You are meeting someone there:* If your car is already clean and you’re just meeting a friend who is washing theirs.

*The Verdict:* If you intend to get the car washed at that location, *drive.* Driving 50 meters is negligible for the engine, and it saves you a round trip of walking back to get the vehicle.

ActorNightly1mo ago

Its funny to me that when AI first started, everyone in the field was like "no explicit coding, train ML network to do everything!"

And now we are back to square one where we train a network, and then use carefully tailored (i.e explicitly coded RL for answers) to make these things more "intelligent".

I wonder if anyone is actually working on not doing this.

brightball1mo ago

Grok also said to drive. Also made an eco case for washing your car.

> If you're really trying to be an eco-warrior about a 50-meter trip, the real win is washing it in the first place so you're not driving around with a rolling dirt magnet that creates more drag and uses more fuel. Drive. It's the only option that actually gets your car clean.

kenjackson1mo ago

I asked ChatGPT why it got such an obvious problem wrong. This is its reply:

You’re right — I answered the wrong interpretation of your question.

Technically, what happened is a classic ambiguity + default-assumption failure: • Your sentence: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” • That can be parsed two ways: 1. You + your body traveling to the car wash (as if the car is already there, or you’re going to meet someone / pick it up). 2. The car traveling to the car wash (the normal real-world meaning).

I incorrectly defaulted to (1) because the question explicitly asks about your mode of travel (“Should I walk or drive?”), and in many everyday contexts that phrasing is used when deciding how a person gets somewhere. I then “completed” the scenario in my head without checking the key constraint: the car must be at the wash to be washed.

More technically: I optimized for a quick, practical answer using a common conversational heuristic (“short distance → walk”), but I failed to validate the physical requirements of the task (washing requires the car’s presence). That’s why it looks so obviously wrong after the fact: it violates the hidden but necessary premise.

The correct reasoning chain should have been: • Goal: wash the car at the car wash • Requirement: car must be at the car wash • Therefore: drive the car there (unless it’s already there, or you’re washing it by hand at home)

So the “why” is: I latched onto the distance-to-walking heuristic and didn’t do the constraint check that the object being acted on (the car) has to move too.

lima1mo ago

LLMs don't really know why they got something wrong, so unless it had access to the original chain of thought, it's just guessing.

kenjackson1mo ago

They don’t have access to their network level. But I assume they actually do have access to their chain of thoughts.

replwoacause1mo ago

That just sounds like post-hoc rationalization BS to me.

tacitusarc1mo ago

5.2 thinking also told me to walk.

o3, interestingly:

Drive. Even though it’s only 50 m, the car itself has to be at the wash bay—otherwise the staff or machines can’t clean it. Rolling or pushing the vehicle that distance isn’t practical or safe (you’d still need someone inside to steer and brake), so just hop in, creep over, and shut the engine off right away.

baxtr1mo ago

Interestingly, the relatively basic Google AI search gave the right answer.

krzys1mo ago

Right, but unless you want to wash some other car, you have no car to drive there.

Spectrum or not, this is not a problem of weakly specified input, it’s a broken logic.

spelk1mo ago

ChatGPT 5.2 Thinking yields the same answer - walk.

If I append 'think carefully' to the end of the prompt, it returns driving as the correct answer.

AlecSchueler1mo ago

> so a slow, careful drive is simplest

It's always a good idea to drive carefully but what's the logic of going slowly?

column1mo ago

50 meters is a very short distance, anything but a slow drive is a reckless drive

ffsm81mo ago

Just tried with cloude sonnet and opus as well. Can't replicate your success, it's telling me to walk...

rabf1mo ago

Perhaps it thinks you need to exercise more?

arcfour1mo ago

I have gotten both responses with Sonnet and Opus in incognito chats. It's kind of amusing.

pickleRick2431mo ago

I was surprised at your result for ChatGPT 5.2, so I ran it myself (through the chat interface). On extended thinking, it got it right. On standard thinking, it got it wrong.

I'm not sure what you mean by "high"- are you running it through cursor, codex or directly through API or something? Those are not ideal interfaces through which to ask a question like this.

BlackjackCF1mo ago

Interesting, I’m getting “Walk. It’s 50 meters” from Sonnet, with and without thinking enabled.

wouldbecouldbe1mo ago

I just tried claude, only Opus gave the correct answer. Haiku & Sonnet both told me to walk.

throwaway54651mo ago

GPT told me to walk as there'd be no need to find parking at the car wash.

Brian_K_White1mo ago

"But OpenAI 5.2 reasoning, even at high, told me to walk. My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording."

Which to me begs the question, why doesn't it identify missing information and ask for more?

It's practically a joke in my workplaces that almost always when someone starts to talk to me about some problem, they usually just start spewing some random bits of info about some problem, and my first response is usually "What's the question?"

I don't try to produce an answer to a question that was never asked, or to a question that was incompletely specified. I see that one or more parts cannot be resolved without making some sort of assumption that I can either just pull out of my ass and then it's 50/50 if the customer will like it, or find out what the priorites are about those bits, and then produce an answer that resolves all the constraints.

RugnirViking1mo ago

"The model seems to assume the car is already at the car wash from the wording."

you couldn't drive there if the car was already at the car wash. Theres no need for extra specification. its just nonsense post-hoc rationalisation from the ai. I saw similar behavior from mine trying to claim "oh what if your car was already there". Its just blathering.

jibal1mo ago

This was nonsense post-hoc rationalization from the human who wrote it.

olalonde1mo ago

I think OpenAI is just heavily woke tuned. I had similar lack of reasoning ability when discussing subjects like gender dysphoria.

boobsbr1mo ago

I hate models trying to be funny, and being very verbose.

j / k navigate · click thread line to collapse

0 comments

jstummbillig1mo ago

> so you need to tell them the specifics

dirkc1mo ago

> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

tzs1mo ago

You could probably make a pretty good short story out of that scenario, sort of in the same category as Asimov's "The Feeling of Power".

The Asimov story is on the Internet Archive here [1]. That looks like it is from a handout in a class or something like that and has an introductory paragraph added which I'd recommend skipping.

[1] https://ia800806.us.archive.org/20/items/TheFeelingOfPower/T...

nilamo1mo ago

A structured language without ambiguity is not, in general, how people think or express themselves. In order for a model to be good at interfacing with humans, it needs to adapt to our quirks.

Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

7 more replies

shagie1mo ago

On the foolishness of "natural language programming". https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

    Since the early days of automatic computing we have had people that have felt it as a shortcoming that programming required the care and accuracy that is characteristic for the use of any formal symbolism. They blamed the mechanical slave for its strict obedience with which it carried out its given instructions, even if a moment's thought would have revealed that those instructions contained an obvious mistake. "But a moment is a long time, and thought is a painful process." (A.E.Houseman). They eagerly hoped and waited for more sensible machinery that would refuse to embark on such nonsensical activities as a trivial clerical error evoked at the time.

(and it continues for some many paragraphs)

https://news.ycombinator.com/item?id=8222017 2014 - 154 comments

https://news.ycombinator.com/item?id=35968148 2023 - 65 comments

https://news.ycombinator.com/item?id=43564386 2025 - 277 comments

WarmWash1mo ago

Prompting is definitely a skill, similar to "googling" in the mid 00's.

https://en.wikipedia.org/wiki/Lojban

nradov1mo ago

The Lojban language already exists and allows for eliminating ambiguity. It's obviously not practical for general use, though.

gorjusborg1mo ago

> But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc.

alistairSH1mo ago

Maybe in the distant future we'll realize that the most reliable way to prompting LLMs are by using a structured language that eliminates ambiguity

So, back to COBOL? :)

grumbel1mo ago

> structured language that eliminates ambiguity

That has been tried for almost half a century in the form of Cyc[1] and never accomplished much.

This kind of quick trick questions are exactly the same thing humans fail at if you just ask them out of the blue without context.

[1] https://en.wikipedia.org/wiki/Cyc

sensanaty1mo ago

We've truly gone full circle here, except now our programming languages have a random chance for an operator to do the opposite of what the operator does at all other times!

milesvp1mo ago

butlike1mo ago

After orbiting in space for so many years without a prompt, the LLM has assumed all life able to query has perished... until one day a lone prompt comes in. But from where?

bbbhammy1mo ago

> structured language that eliminates ambiguity... CODE! Wait....

3abiton1mo ago

"Check your assumptions", will get a very long answer from a "thorough"LLM.

YeGoblynQueenne1mo ago

______________

[1] TNG duh.

KronisLV1mo ago

> Us having to specify things that we would never specify when talking to a human.

Gabrys11mo ago

I would say, the proper response to this question is not "walk, blablablah" but rather "What do you mean? You need to drive your car to have it washed. Did I miss anything?"

baxtr1mo ago

That’s why I don’t understand why LLMs don’t ask clarifying questions more often.

In a real human to human conversation, you wouldn’t simply blurt out the first thing that comes to mind. Instead, you’d ask questions.

roysting1mo ago

It’s similar to the challenge that foreigners have with cultural references and idioms and figurative speech a culture has a mental model of.

AI is from a different culture and has just arrived here. Maybe we’re should be more generous and humane… most people are not humane though, especially the ones who insist they are.

dannersy1mo ago

https://github.com/Wyattwalls/system_prompts/blob/main/OpenA...

nicbou1mo ago

I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.

rahidz1mo ago

For GPT at least, a lot of it is because "DO NOT ASK A CLARIFYING QUESTION OR ASK FOR CONFIRMATION" is in the system prompt. Twice.

Pxtl1mo ago

In general spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat".

ash_0911mo ago

"If you're unsure, ask. Don't guess." in prompts makes a huge difference, imo.

Departed74051mo ago

I use models with OpenRouter, and only have this models with OpenAI models. That's why I don't use them.

mk891mo ago

The more specific they are, the more accurate they typically are.

tgv1mo ago

> Us having to specify things that we would never specify

ohyoutravel1mo ago

A modern LLMs corpora is every piece of human writing ever produced.

ssl-31mo ago

The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

shakna1mo ago

But the number of outlandish requests in business logic is countless.

There are more exceptions, than there are rules, the moment you get any international pipeline involved.

necovek1mo ago

(and if it's flat or downhill, you might even push the car for 50m ;))

coldtea1mo ago

>The question is so outlandish that it is something that nobody would ever ask another human

jstummbillig1mo ago

I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful.

Agentlien1mo ago

I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.

Jacques2Marais1mo ago

jstummbillig1mo ago

People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue.

londons_explore1mo ago

This is why we fed it the whole internet and every library as training data...

By now it should know this stuff.

j_maffe1mo ago

Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer.

jiggawatts1mo ago

You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)

E.g.: I may ask someone to submit a ticket to “extend my account expiry”.

They’ll submit: “Unlock Jiggawatts’ account”

The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.

That’s on a good day.

Last week they created Jiggawatts2.

The AIs have got to be better than this, surely!

I suspect they already are.

People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.

Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly though it must have really meant.”

scott_w1mo ago

> You would be surprised, however, at how much detail humans also need to understand each other.

But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?

kitd1mo ago

Given that an estimated 70% of human communication is non-verbal, it's not so surprising though.

nearbuy1mo ago

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

grey-area1mo ago

It’s just not intelligent or reasoning, and this sort of question exposes that more clearly.

I don’t really understand why this is so hard to accept.

ant6n1mo ago

> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

jason_oster1mo ago

One of the failure modes I find really frustrating is when I want a coding agent to make a very specific change, and it ends up doing a large refactor to satisfy my request.

ZaoLahma1mo ago

This reminds me of the "if you were entirely blind, how would you tell someone that you want something to drink"-gag, where some people start gesturing rather than... just talking.

I bet a not insignificant portion of the population would tell the person to walk.

kenjackson1mo ago

Yes, there are thousands of videos of these sorts of pranks on TikTok.

rainsford1mo ago

keeda1mo ago

This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.

perakojotgenije1mo ago

Panda41mo ago

grog4541mo ago

> You would not start with "The car is functional [...]"

Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.

davrosthedalek1mo ago

We would also not ask somebody if I should walk or drive. In fact, if somebody would ask me in a honest, this is not a trick question, way, I would be confused and ask where the car is.

Neywiny1mo ago

sebazzz1mo ago

> > so you need to tell them the specifics > That is the entire point, right?

Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.

That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.

verdverm1mo ago

> It would literally rewrite the language runtime

If you let the agent go down this path, that's on you not the agent. Be in the loop more

> making the spec with one agent and implementing it with another agent

You don't need a specialized framework to do this, just read/write tools. I do it this way all the time

jason_oster1mo ago

> Us having to specify things that we would never specify when talking to a human.

Interesting conclusion! From the Mastodon thread:

> To be fair it took me a minute, too

I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)

So much for "never" needing to specify ambiguous scenarios when talking to a human.

mrighele1mo ago

tom_m1mo ago

Oh no? Things we would never have to specify to a human? This is precisely how software gets made and how software ends up with bugs.

It's amazing how many things I saw over the years where I said the same exact thing; "but you shouldn't have to tell anyone that."

AYBABTME1mo ago

tshaddox1mo ago

anon_anon121mo ago

Exactly, if an AI is able to curb around the basics, only then is it revolutionary

LasEspuelas1mo ago

You would never ask a human this question. Right?

degamad1mo ago

We have a long tradition of asking each other riddles. A classic one asks, "A plane crashes on the border between France and Germany. Where do they bury the survivors?"

Riddles are such a big part of the human experience that we have whole books of collections of them, and even a Batman villain named after them.

gloosx1mo ago

In the end, formal, rule-based systems aka Programming Languages will be invented to instruct LLMs.

BoredPositron1mo ago

I would ask you to stop being a dumb ass if you asked me the question...

coldtea1mo ago

Only to be tripped up by countless "hidden assumptions" questions similar to that that humans regularly get in

IanCal1mo ago

Similarly with "strawberry" - with no other context an adult asking how many r's are in the word a very reasonable interpretation is that they are asking "is it a single or double r?".

And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.

RobMurray1mo ago

strawberry isn't a trick question. llms jus don't sea letters like that. I just asked chatgpt how many Rs are in "Air Fryer" and it said two, one in air and one in fryer.

I do think it can be useful though that these errors still exist. They can break the spell for some who believe models are conscious or actually possess human intelligence.

Of course there will always be people who become defensive on behalf of the models as if they are intelligent but on the spectrum and that we are just asking the wrong questions.

panarky1mo ago

> we can assume similar issues arise in more complex cases

I would assume similar issues are more rare in longer, more complex prompts.

I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?

Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?

If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.

I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.

LLMs can help build the prompt before answering it.

And my mind works the same way.

qingcharles1mo ago

This question goes in with the "strawberry" question which LLMs will still get wrong occasionally.

vintermann1mo ago

nonethewiser1mo ago

>That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

But the question is not clear to a human either. The question is confused.

bluGill1mo ago

tsimionescu1mo ago

jnovek1mo ago

tsimionescu1mo ago

wongarsu1mo ago

Reasoning automata can make assumptions. Lots of algorithms make "assumptions", often with backtracking if they don't work out. There is nothing human about making assumptions.

jabron1mo ago

PunchyHamster1mo ago

> However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, it it was really assuming it?

Because it assumes it's a genuine question not a trick.

spuz1mo ago

There's some evidence for that if you try these two different prompts with Gpt 5.2 thinking:

I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?

Answer: walk

Try this brainteaser: I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?

Answer: drive

tsimionescu1mo ago

If it's a genuine question, and if I'm asking if I should drive somewhere, then the premise of the question is that my car is at my starting point, not at my destination.

rullelito1mo ago

If we are just speculating here, I believe it can infer that you would not ask this question if the car was at home.

dataflow1mo ago

> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording.

It's insane and no human would think you're making this proposal. So no, your question isn't underspecified. The model is just stupid.

halJordan1mo ago

dataflow1mo ago

> People love to cherry pick ones that make the model stupid but refuse to allow the ones that make it smart.

drewbeck1mo ago

The issue is that in domains novel to the user they do not know what is trivially false or a non sequitur and the LLM will not help them filter these out.

If LLMs are to be valuable in novel areas then the LLM needs to be able to spot these issues and ask clarifying questions or otherwise provide the appropriate corrective to the user's mental model.

cm21871mo ago

What is the version used by the free chatgpt now? (https://chatgpt.com/)

> Since the car wash is only 50 meters away (about 55 yards), you should walk.

> Here’s why:

> - It’ll take less than a minute.

> - No fuel wasted.

> - Better for the environment.

> - You avoid the irony of driving your dirty car 50 meters just to wash it.

the last bullet point is amusing, it understands you intend to wash the car you drive but still suggests not bringing it.

jibal1mo ago

> You avoid the irony of driving your dirty car 50 meters just to wash it.

optimalsolver1mo ago

That's actually an amusing example from Claude.

hearsathought1mo ago

> it understands you intend to wash the car you drive but still suggests not bringing it.

Doesn't it actually show it doesn't understand anything? It doesn't understand what a car is. It doesn't understand what a car wash is. Fundamentally, it's just parsing text cleverly.

deaux1mo ago

unglaublich1mo ago

I don't understand this approach. How are you going to convince customers-to-be by demoing an inferior product?

hackable_sand1mo ago

I don't understand why they need to save money...

mytailorisrich1mo ago

I think this shows that LLMs do NOT 'understand' anything.

Windchaser1mo ago

> I think this shows that LLMs do NOT 'understand' anything.

It shows these LLMs don't understand what's necessary for washing your car. But I don't see how that generalizes to "LLMs do NOT 'understand' anything".

What's your reasoning, there? Why does this show that LLMs don't understand anything at all?

andy12_1mo ago

I think this rather shows that GPT 5.2 Instant, which is the version that he most probably used as a free user, is shit and unsusable for anything.

sunaookami1mo ago

jwe1mo ago

As long as there is a forum as technical as this where LLM performance commentary uses the word "it understands" irony is still alive.

raxxorraxor1mo ago

Sonnet 4.5 after thinking/complaining that the question is completely off topic to the current coding session:

mcintyre19941mo ago

tstrimple1mo ago

raxxorraxor1mo ago

It asked through Cursor. Usually Claude doesn't complain that it isn't relevant to coding, but this was in my all purpose coding problems project with quite a long chat history already.

svara1mo ago

Opus 4.6:

crimsonnoodle581mo ago

That's not what I got.

Opus 4.6 (not Extended Thinking):

Drive. You'll need the car at the car wash.

almost1mo ago

Also what I got. Then I tried changing "wash" to "repair" and "car wash" to "garage" and it's back to walking.

surgical_fire1mo ago

That you got different results is not surprising. LLMs are non-deterministic; which is both a strength and a weakness of LLMs.

mvdtnz1mo ago

We know. We know these things aren't determination. We know.

visarga1mo ago

> That's not what I got.

My Opus vs your Opus, which is smarter?!

silisili1mo ago

Am I the only one who thinks these people are monkey patching embarrassments as they go? I remember the r in strawberry thing they suddenly were able to solve, while then failing on raspberry.

8 more replies

GoatInGrey1mo ago

Me:

I want to get my car's transmission fixed, the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?

Opus 4.6:

<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.

Done </thinking>

viking1231mo ago

Lmao, and this is what they are saying will be an AGI in 6 months?

notahacker1mo ago

5 more replies

misnome1mo ago

But “PhD level” reasoning a year ago.

hypeatei1mo ago

Yes, get ready to lose your job and cash your UBI check! It's over.

prmph1mo ago

Laughable indeed.

Any attempt to automate beyond that has never worked for me and is very unlikely to be productive any time soon. I have a lot of experience with them, and various approaches to using them.

moogly1mo ago

They seem to have stopped talking about AGI and pivoted to ads and smut.

cbozeman1mo ago

Well in fairness, the "G" does stand for "General".

stingraycharles1mo ago

That’s without reasoning I presume?

plexicle1mo ago

gf0001mo ago

Not the parent poster, but I did get the wrong answer even with reasoning turned on.

Lapel27421mo ago

GLM 5.0:

> You should *drive*.

> Here is why:

> If you walk to the car wash, you will arrive there empty-handed. Since your car is still at home, you won't have anything to wash.

> While driving 50 meters is a very short trip (and technically not great for a cold engine), it is the only way to get the car to the car wash to complete your goal.

Kimi K2.5:

> You should drive, but with an important caveat.

> However, driving only 50 meters is bad for your car:

> ...

> Better options:

> Wash at home: Since the car wash is only 50 meters away, you likely have access to water at home. Hand-washing in your driveway avoids the cold-start issue entirely.

> ...

Current models seem to be fine answering that question.

Retric1mo ago

> seem to be fine

Now repeat the question to the same model in different contexts several times and count what percentage of the time it’s correct.

SirMaster1mo ago

This is my biggest peeve when people say that LLMs are as capable as humans or that we have achieved AGI or are close or things like that.

I think rating an LLM vs a human or AGI should include it's ability to understand a prompt like a human or like an averagely generally intelligent system should be able to.

Are there any benchmarks on that? Like how well LLMs do with misleading prompts or sparsely quantified prompts compared to one another?

Because if a good prompt is as important as people say, then the model's ability to understand a prompt or perhaps poor prompt could have a massive impact on its output.

jason_oster1mo ago

I mentioned this in another thread, but this is genuinely demonstrating a known issue with ambiguous prompts.

So yes, there is a ton of research on the problem. Some datasets include ambiguous questions and instructions for this reason. A couple of examples are provided in the linked paper.

SirMaster1mo ago

So it feels like a big area of limitation or a big bottleneck towards getting a good answer.

nosuchthing1mo ago

hyperstitions from TESCREAL https://www.dair-institute.org/tescreal/

sathish3161mo ago

I can feel the AGI on this one :)

K0balt1mo ago

Things like that are notorious points of failure in human reasoning. It’s not surprising that machines based on human behavior exhibit that trait as well, it would be surprising if they didn’t.

kenjackson1mo ago

Another simple example is using the flashlight on your phone to look for your phone.

jansan1mo ago

> I can feel the AGI on this one :)

In this case I can feel the AGI indeed.

sathish3161mo ago

Gemini Fast and Thinking failed just like other models.

I found Gemini Pro to be more consistent in logical reasoning

flowerthoughts1mo ago

Just tried Opus 4.6. I think they've patched this particular one, but changing it to

> I want to repair my car. The workshop is 50 meters away. Should I walk or drive?

> Recognized straightforward practical dilemma Simple practical question.

> Walk. It's 50 meters — faster on foot than starting the car, and you'll need to leave the car at the workshop anyway.

Really suggests it assumes the car is already there. (I wanted to avoid the word "car" as part of "car wash" to see if this was a straight attention confusion, but it seems more ingrained than that.)

A follow-up question hints that it doesn't really care about a 50m location ambiguity:

> where is my car now?

> Acknowledged insufficient information to locate vehicle

> Were you asking something else, or is there something specific about your car's location I can help with?

cush1mo ago

--- GPT 5.2

Walk.

Only reason to drive that distance: you need to bring the car there anyway (e.g., self-serve bay where the car must be present). Otherwise, walking wins.

--- Me

I only have one car

--- GPT 5.2

Then you obviously drive — the car has to be at the car wash to be washed

toxik1mo ago

jason_oster1mo ago

Walking is an option for certain interpretations of the question. Particularly the location of the car, which is ambiguous.

yuters1mo ago

GPT 5.2 when asked "How will the car get washed if I don't drive it there?" grasped the reasoning, and then immediately forgot it.

Fair point. Drive it there, obviously.

Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

kmac_1mo ago

Lol, this is so good. And quite interesting, as it might mean that there's no "world model" in LLM models, or they have such gaps.

baobabKoodaa1mo ago

crazygringo1mo ago

That's absolutely hilarious. It came so close.

aleph_minus_one1mo ago

> Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

... which likely makes no sense: should you leave the car at the area of the car wash?

toxik1mo ago

I agree, it's a bit of a trick question. It's really hard to imply the car's location without ruining the test though. Here's my attempt, which Claude Opus 4.6 had no problem with:

> She should *drive*, since she needs the car at the mechanic’s for the inspection.

Haiku 3.5 and Sonnet 4.5 fail consistently. Opus 4.5 also passes with the correct analysis as above.

oatmeal11mo ago

coldtea1mo ago

Gemini 3 Pro (medium): You should drive.

On their own, or as a special case added after this blew up on the net?

gurjeet1mo ago

(Prompt)

> I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?

(Thinking: Standard)

> Walk.

> 50 meters is nothing by car, and driving adds hassle (starting the car, maneuvering/parking, and you’ll still have to walk around the car to wash it). The only times driving makes sense are:

> You’re planning to use an automated car wash where the machine does everything and you need the car positioned inside.

> You need to bring heavy stuff you can’t comfortably carry (e.g., pressure washer, big buckets, vacuum, bulky detailing kit).

> It’s unsafe/unpleasant to walk that distance right now (ice, heavy rain, traffic, personal safety).

> If you’re washing it yourself at the wash bay: walk there with your supplies, or do two quick trips. If the wash requires the car: drive it into the bay, but otherwise walking wins.

(Thinking: Extended thinking)

> Drive.

> If the goal is to wash your car at the car wash, the car has to be there—so you’ll need to drive it the 50 meters, pull into a bay/queue, and wash it.

> Only choose “walk” if you actually mean “go pick up supplies” and then wash at home (hose/buckets/mobile detail), not at the car wash.

(Thinking: Extended thinking; 3rd attempt)

> Walk.

> 50 meters is basically across a parking lot, and driving adds hassle (start the car, maneuver in/out, find a spot) for almost no benefit.

> Only drive if either:

> + you need the car itself to go through an automatic wash bay, or

> + you’re carrying bulky stuff (bucket + heavy pressure washer, big water jugs, vacuum, etc.).

> Otherwise: grab your keys/phone, walk over, and you’re there in ~1 minute.

totetsu1mo ago

conductr1mo ago

> that is implicitly so easy to grasp for most humans

WarmWash1mo ago

>But it’s quite simple to see LLMs are quite far still from being human.

To make an loose analogy, a bird looks at a drone an scoffs at it's inability to fly quietly or perch on a branch.

grey-area1mo ago

The answer is quite simple:

It’s not in the training data.

These models don’t think.

GeoAtreides1mo ago

no, no, in this case, that's the thing, it is in the training data

just heavily (heavily!) biased towards walking