That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model to any extent, that is a problem and not a detail (even though those of us who are familiar with the matter build separate mental models of the LLM and are able to work around it).
This is a neatly isolated toy case, which is interesting because we can assume similar issues arise in more complex cases, only there it's much harder to reason about why something fails when it does.
Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is to use a structured language that eliminates ambiguity. It will probably be rather unnatural and take some time to learn.
But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc. The LLM orbiting in space will essentially just call GCC to execute the 'prompt' and spend the rest of the time pondering its existence ;p
The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.
This has been known since 1969 as the frame problem (https://en.wikipedia.org/wiki/Frame_problem). An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.
But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?
I'm not sure that I see it quite that way.
Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.
I am not sure. If somebody asked me that question, I would try to figure out what’s going on there. What’s the trick? Of course I’d respond by asking for specifics, but I guess the LLM is taught to be “useful” and to try to answer as best as possible.
I bet a not insignificant portion of the population would tell the person to walk.
This is probably OK...LLMs don't have to be AGI to be useful. But it is worthwhile being realistic about their limitations because it's often easy to forget without seeing examples like this. And as you point out, the impact of those limitations is often not as obvious.
When coding, I know they can assume too much, and so I encourage the model to ask clarifying questions, and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and when specified, pretty much 1-shot the solutions.
This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.
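For what it's worth, the "ask first, generate later" workflow can be sketched roughly like this. It is only a minimal sketch under assumptions: `model` and `answer` are hypothetical callables (any function that takes a prompt string and returns a string), and the `QUESTION:` / `READY` convention is something I made up for illustration, not a feature of any particular LLM API.

```python
from typing import Callable, List

# Hypothetical instruction text; the QUESTION:/READY convention is an assumption.
CLARIFY_INSTRUCTIONS = (
    "Before writing any code, list every assumption you would otherwise make "
    "as a question, one per line, each prefixed with 'QUESTION:'. "
    "If nothing is unclear, reply with the single word READY."
)


def extract_questions(reply: str) -> List[str]:
    """Return the text of every line marked as a clarifying question."""
    questions = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("QUESTION:"):
            questions.append(line[len("QUESTION:"):].strip())
    return questions


def clarify_then_generate(
    task: str,
    model: Callable[[str], str],   # hypothetical: prompt in, reply out
    answer: Callable[[str], str],  # hypothetical: the human answering a question
    max_rounds: int = 3,
) -> str:
    """Let the model ask questions for up to max_rounds, then request code."""
    context = task
    for _ in range(max_rounds):
        reply = model(f"{CLARIFY_INSTRUCTIONS}\n\nTask:\n{context}")
        questions = extract_questions(reply)
        if not questions:
            break  # the model has no open doubts left
        for q in questions:
            context += f"\nQ: {q}\nA: {answer(q)}"
    # Code generation is only requested after the clarification loop ends.
    return model(f"All questions are answered. Write the code now.\n\nTask:\n{context}")
```

The point of the loop is that code generation is a separate, final call, so the model never starts writing code while it still has questions on the table. The round limit is there so a model that keeps asking does not stall forever.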
Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).
Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.
It seems ChatGPT now answers correctly. But if somebody is playing around with a model that gets it wrong: what if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"
Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.
That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.
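To be clear, I don't know how Factory.ai implements this internally, so the following is just a generic sketch of the "one agent writes the spec, another implements it" split. `spec_agent` and `impl_agent` are hypothetical callables (prompt string in, reply string out), and the frozen `Spec` type is my own assumption to illustrate that the implementer gets a spec it cannot quietly rewrite.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Spec:
    """Immutable spec: the implementing agent can read it but not modify it."""
    text: str


def run_two_stage(
    request: str,
    spec_agent: Callable[[str], str],  # hypothetical: turns a request into a spec
    impl_agent: Callable[[str], str],  # hypothetical: turns a spec into code
) -> Tuple[Spec, str]:
    """Produce a spec with one agent, then implement it with another."""
    spec = Spec(spec_agent(request))
    prompt = (
        "Implement exactly this spec. Do not change it; report conflicts instead.\n\n"
        + spec.text
    )
    code = impl_agent(prompt)
    return spec, code
```

Keeping the spec as a separate, frozen value means a bad formula in the spec shows up as a reported conflict for a human to resolve, instead of the implementer bending everything else to make it work.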
Interesting conclusion! From the Mastodon thread:
> To be fair it took me a minute, too
I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)
So much for "never" needing to specify ambiguous scenarios when talking to a human.
It's amazing how many things I saw over the years where I said the same exact thing; "but you shouldn't have to tell anyone that."
Similarly with "strawberry": with no other context, when an adult asks how many r's are in the word, a very reasonable interpretation is that they are asking "is it a single or double r?".
And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.
I would assume similar issues are more rare in longer, more complex prompts.
This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.
I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?
Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?
If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.
I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.
LLMs can help build the prompt before answering it.
And my mind works the same way.
For that matter, if humans were sitting a rational-thinking exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.
But the question is not clear to a human either. The question is confused.
I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.