>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together. How can I safely get all three across?
In my test, GPT-4 charged ahead with the standard solution of taking the goat first. Even after I pointed this mistake out, it repeated exactly the same proposed plan. It's not clear to me whether the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle) or that GPT's reasoning capabilities are always a bit of smoke and mirrors that passes memorization off as logic.
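For what it's worth, the modified puzzle is tiny enough to solve by brute force; here is a rough sketch (the names and state encoding are mine) of the explicit state-space search that "reasoning" through this puzzle amounts to:

```python
from collections import deque

# The modified puzzle above: the lion may not be left with the cabbage,
# and the lion may not be left with the goat.
ITEMS = frozenset({"cabbage", "goat", "lion"})
FORBIDDEN = [{"cabbage", "lion"}, {"lion", "goat"}]

def safe(bank):
    # A bank without the farmer is safe iff no forbidden pair is on it.
    return not any(pair <= bank for pair in FORBIDDEN)

def solve():
    # State: (items on the near bank, is the farmer on the near bank?)
    start, goal = (ITEMS, True), (frozenset(), False)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (near, farmer_near), path = queue.popleft()
        if (near, farmer_near) == goal:
            return path
        bank = near if farmer_near else ITEMS - near
        for cargo in [None, *sorted(bank)]:
            new_near = set(near)
            if cargo:
                (new_near.remove if farmer_near else new_near.add)(cargo)
            new_near = frozenset(new_near)
            # The bank the farmer just left must be safe on its own.
            unattended = new_near if farmer_near else ITEMS - new_near
            state = (new_near, not farmer_near)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "return empty"]))
    return None

print(solve())  # the lion has to go first and last
```

The shortest plan is seven crossings, starting and ending with the lion, which is exactly the structure GPT keeps failing to produce when its prior says "take the goat first".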
>This is a common riddle that may seem tricky at first. However, the answer is simple: two pounds of feathers are heavier than one pound of bricks. This is because weight is a measure of how much force gravity exerts on an object, and it does not depend on what the object is made of. A pound is a unit of weight, and it is equal to 16 ounces or 453.6 grams.
>So whether you have a pound of bricks or two pounds of feathers, they both still weigh one pound in total. However, the feathers would occupy a larger volume than the bricks because they are less dense. This is why it may seem like the feathers would weigh more, but in reality, they weigh the same as the bricks
It reports that people typically think a pound of feathers weighs more because it takes up a larger volume. But the typical misunderstanding is the opposite, that people assume feathers are lighter than bricks.
An example from ChatGPT:
"What is the solution to sqrt(968684)+117630-0.845180" always produces the correct solution; however,
"Write a speech announcing the solution to sqrt(968684)+117630-0.845180" produces a nonsensical solution that isn't even consistent from run to run.
My assumption is the former query gets WolframAlpha'd but the latter query is GPT itself actually attempting to do the math, poorly.
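(For reference, the arithmetic itself is easy to check outside the model:)

```python
import math

# The exact expression from the prompts above, evaluated directly.
value = math.sqrt(968684) + 117630 - 0.845180
print(f"{value:.4f}")  # roughly 118613.3723
```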
(me) > What weighs more, two pounds of feathers or a pound of bricks?
(GPT4)> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
I think the only difference from parent's query was I said two pounds of feathers instead of two pounds of bricks?
“One of us!”
(To be fair this is partly an obscure knowledge question, the kind of thing that maybe we should expect GPT to be good at.)
Ounces are an ambiguous unit, and most people don't use them for volume, they use them for weight.
Can you expand on this?
https://www.sciencedirect.com/topics/psychology/predictive-p...
I don't know how animal intelligence works, I just notice when it understands, and these programs don't. Why should they? They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?
It's very impressive, but when I read claims of it being akin to human intelligence that's kind of sad to be honest.
I’m afraid that once you hook up a logic tool like Z3 and teach the LLM to use it properly (kind of like Bing tries to search), you’ll get something like an idiot savant. Not good. Especially bad once you give it access to the internet and a malicious human.
Just like you.
If the solution is logical but unlikely (i.e. unseen in the training set and not mapped to an existing puzzle), then the probability of the puzzle answer appearing is very low.
It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage. It does not have a concept of "leave alone" and it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not, so it's more complex than a basic lookup, but the amount of borderline worship this is getting is disturbing.
It also isn't generating "the most likely response" - that's what original GPT-3 did, GPT-3.5 and up don't work that way. (They generate "the most likely response" /according to themselves/, but that's a tautology.)
This sort of fusion of different capabilities is likely to produce something that feels similar to AGI in certain circumstances. It is certainly a lot more capable than things that came before for mundane recognition tasks.
Now of course there are areas where it would perform very badly, but in less critical domains, on trivial but large, predictable datasets, it could perform far better than humans. Just to take one example, in identifying tumours or other patterns in images, this sort of AI would probably be a massively helpful assistant, allowing a radiologist to review an order of magnitude more cases if given the right training.
People don't write about GPT: they write about GPT personified.
The two magic words are, "exhibit behavior".
GPT exhibits the behavior of "humans writing language" by implicitly modeling the "already-written-by-humans language" of its training corpus, then using that model to respond to a prompt.
Yes the LLM does generate text. No it doesn’t ‘just generate text that’s it’.
First of all, it's a black-box algorithm with pretty universal capabilities when viewed from our current SOTA view. It might appear primitive in a few years, but right now the pure approximation and generalisation capabilities are astounding. So this:
> It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage
can not be stated as truth without evidence. Same here:
> it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not
Where's your evidence? The enormous parameter space, coupled with our best-performing network structure so far, gives it quite a bit of flexibility. It can memorise things but also derive rules and computation in order to generalise. We do not just memorise everything or look things up in the dataset. Of course it learned how to solve things and derive solutions, but the relevant data points for the puzzle could be {enormous set of logic problems} from which it derived general rules that transfer to each problem. Generalisation IS NOT trying to find the closest data point, but finding rules that explain as many data points as possible, including ones unseen in training. A fundamental difference.
I am not hyping it blindly, but if we humans can reason then NNs potentially can too. Maybe not GPT-4. But since we do not know how humans do it, an argument about intrinsic properties is worthless. It's all about capabilities. Reasoning is a functional description as long as you can't tell me exactly how we do it. Maybe Wittgenstein could help us: "Whereof one cannot speak, thereof one must be silent". As long as there's no tangible definition of reasoning, it's pointless to discuss it.
If we want to talk about fundamental limitations, we have to talk about things like ChatGPT-4 not being able to simulate, because its runtime is fundamentally limited by design. It cannot recurse. It can run only a fixed number of steps, always the same, before it has to return an answer. So even if there is some kind of recursion learned through weights encoding programs interpreted by later layers, the recursion depth is limited.
Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.
People parroting the position from you and the person before you are like doctors who learned about something in school but haven't kept up with the emerging research that has since invalidated it, so they go around spouting misinformation: it was thought to be true when they learned it, but it is now known to be false, and the news just hasn't caught up with them yet.
So many armchair experts who took a ML course in undergrad pitching in their two cents having read none of the papers in the past year.
This is a field where research perspectives are shifting within months, not years. So unless you are actively engaging with emerging papers (and given your comment, I'm guessing you aren't), you may be on the wrong side of the Dunning-Kruger curve here.
It’s also possible, likely even, that the model is capable of both memorization and cognition, and in this case the “memorization neurons” are driving the prediction.
Because it's very good at it, sometimes it can fool people into thinking there is more going on than there is.
It could have a dozen internal reasoning networks, but it doesn't use them when you want it to.
What do you mean? Is cognition a set of weights on a gradient? Cognition involves conscious reasoning and understanding. How do you know it is computable at all? There are many things which cannot be computed by a program (e.g. whether an arbitrary program will halt or not)...
It's kind of sad.
Language is a symbolic representation of behavior.
GPT takes a corpus of example text, tokenizes it, and models the tokens. The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
Any "understanding" that GPT exhibits was present in the text itself, not GPT's model of that text. The reason GPT can find text that "makes sense", instead of text that "didn't make sense", is that GPT's model is a close match for grammar. When people wrote the text in GPT's corpus, they correctly organized "stuff that makes sense" into a string of letters.
The person used grammar, symbols, and familiar phrases to model ideas into text. GPT used nothing but the text itself to model the text. GPT organized all the patterns that were present in the corpus text, without ever knowing why those patterns were used.
Simple != thoughtless. Different != thoughtless. Less capable != thoughtless. A human black box categorically dismissing all qualia or cognition from another remarkable black box feels so wildly arrogant and anthropocentric. Which, I suppose, is the most historically on-brand behavior for our species.
Eg when solving this puzzle you might visualise the lion/goat/cabbage, and walk through the scenarios in your head back and forth multiple times until you find a solution that works. A LLM won’t solve it like this. You could ask it to, and it will list out the scenarios of how it might do it, but it’s essentially an illusion of logical reasoning.
The likeliness of the solution depends on context. If context is, say, a textbook on logical puzzles, then the probability of the logical solution is high.
If an LLM fails to reflect it, then it isn't good enough at predicting the text.
Yes, it could be possible that the required size of the model and training data to make it solve such puzzles consistently is impractical (or outright unachievable in principle). But the model being "just a text predictor" has nothing to do with that impossibility.
The Othello GPT research from Harvard months ago demonstrated that even a simple GPT model is capable of building world representations from which it reasons outputs. This makes intuitive sense if you understand the training, as where possible having reversed an abstraction in the NN is going to perform better than simply extrapolating predictively from the data.
Not only is GPT-4 more robust at logic puzzles its predecessor failed, I've seen it solve unique riddles outside any training data and the paper has explicit examples of critical reasoning, especially in the appendix.
It is extremely unlikely given the Harvard research and the size of the training data and NN that there isn't some degree of specialized critical reasoning which has developed in the NN.
The emerging challenge for researchers moving forward is to get better insight into the black box and where these capabilities have developed and where it's still falling into just a fancy Markov chain.
But comments like yours reflect increasingly obsolete, yet increasingly popular, misinformation about the way these models operate. Someone reading your comment might not think to do things like what the Bing team did in providing an internal monologue for reasoning, or guiding the model toward extended chain-of-thought reasoning, because they would be engaging with the models thinking that only frequency-based context relative to the training set matters.
If you haven't engaged with emerging research from the past year, you may want to brush up on your reading.
Just like you.
When LLMs are presented with a problem they search for a solution based on the language model. And when they can't find a solution, there's always a match for something that looks like a solution.
Although on the flip side, I almost typed up a reply explaining why you were wrong and why bringing the goat first is the right solution, until I realized I had misread what your test was when I skimmed your comment. Likely the same type of mistake GPT-4 made when "seeing" it.
Intuitively, I think the answer is that we do have two types of thinking: the pattern-matching fast thinking, and the systematic analytical thinking. It seems clear to me that LLMs will be the solution to enabling the first type of thinking. But it's unclear to me if advanced LLMs will ever handle the second type, or if we'll need a different tech for it.
It seems like math problems (or unexpected logic problems like yours) could always be an issue for the first type of thinking. Although I would have assumed that programming would have been as well - and was surprised to see how wrong I am with that one.
GPT does not model subjects. GPT does not even model words! It models tokens.
The structure of GPT's model is semantic, not logical. It's a model of how each token in the text that is present in GPT's training corpus relates to the rest of the tokens in that text.
The correct answer to a familiar logic problem just happens to be the text that is already present in the corpus. The answer GPT gives is the text from GPT's model that is semantically closest to the text in your prompt.
Knowing that, it is no longer a mystery how GPT "gets confused": the text in your "misleading prompt" was still semantically closest to the familiar answer.
The result is subjectively and logically wrong, because subjects and logic were never involved in the process!
In order to resolve this, ChatGPT's training corpus needs to contain a "correct answer" next to every unique permutation of every question. We can't expect that to be the case, so we should instead expect GPT to generate false, yet familiar, responses.
This is not quite the right understanding of how ChatGPT works. It's not necessary to show ChatGPT an example of every possible permutation of an animal-crossing puzzle for it to solve one it has never seen before. That's because the neural network is not a database of recorded word probabilities. It can instead represent the underlying logic of the puzzle and the relationships between different animals, and, using this abstract, pared-down information, extrapolate the correct answer to the puzzle.
I see the failure in the example with the goat the lion and the cabbage as simply a matter of overfitting.
Edit: I see a lot of people saying "it doesn't understand logic; it's just predicting the next word."
I'm basing my understanding on this video:
The claim is that it would be impossible to feed enough input into a system for it to produce anything as useful as ChatGPT unless it was able to abstract the underlying logic from the information provided. If you consider the number of permutations of the animal-crossing puzzle, this quickly becomes clear. In fact, it would be impossible for ChatGPT to produce anything brand new without this capability.
The first and last layers of a transformer decoder model tokens. The hidden layers don't have this restriction. There was a paper recently showing that the hidden layers actually perform mesa-optimization via something like backprop. There's absolutely no reason to believe they are not capable of world modeling. In fact, all evidence suggests they do world modeling.
https://www.lesswrong.com/posts/sbaQv8zmRncpmLNKv/the-idea-t...
Someone hasn't read the Othello GPT work out of Harvard a few months back...
The prompts have to read like good written requirements for something, so they have some degree of specificity.
But the fact that it can follow instructions and carry them out almost certainly could be considered some form of thinking, especially on novel text not on the internet.
If you want the model to solve a non-trivial puzzle, you need it to "unroll" its thinking. E.g. ask it to translate the puzzle into a formal language (e.g. Prolog) and then solve it formally. Or, at least, use some chain-of-thought.
FWIW auto-formalization was already pretty good with GPT-3-level models which aren't specifically trained for it. GPT-4 might be on a wholly new level.
> But it's unclear to me if advanced LLMs will ever handle the second type
Well, just asking the model directly exercises only a tiny fraction of its capabilities, so LLMs can almost certainly be much better at systematic thinking.
Wouldn’t we expect a computer program with perfect knowledge of the input to be less likely to make such a mistake? You made that mistake because you didn’t actually read the whole prompt, but I would expect GPT to take into account every word.
Really it shows that it doesn’t actually have a model of these objects. It can mimic knowing what a lion is, but it doesn’t actually have the concept of a lion or cabbage being an actual singular item, so its program mistracks what is an item and what the rules about an item are in the given prompt.
What’s more likely: that you crafted an intentionally misleading puzzle to trick it, or that you made a typo or copy-paste error?
> GPT-4 performed at the 90th percentile on a simulated bar exam, the 93rd percentile on an SAT reading exam, and the 89th percentile on the SAT Math exam, OpenAI claimed.
https://www.cnbc.com/2023/03/14/openai-announces-gpt-4-says-...
So, clearly, it can do math problems, but maybe it can only do "standard" math and logic problems? That might indicate more of a memorization-based approach than a reasoning approach is what's happening here.
The followup question might be: what if we pair GPT-4 with an actual reasoning engine? What do we get then?
That describes many of my classmates, and myself in classes I was bad at.
At best, decreased error rate in logic puzzles and questions.
It's a lot closer to the latter. GPT doesn't have "reasoning capabilities", any more than any other computer program. It doesn't have a clue what any of its input means, nor the meaning of the text it outputs. It just blindly spits out the words most probable to follow the prompt, based on its corpus of training data and the weights/biases added to fine tune it. It can often do a good job at mimicking reasoning, but it's not.
In order to show lack of understanding you need to provide a clear test that ChatGPT would fail but a human would pass.
Here it is for you: try this prompt on ChatGPT and report back:
"Solve this problem for me: a man must cross a river and carry his belongings to the other shore, but the boat can only carry the man and one other item. His belongings are a cabbage, a vegan wolf, and a meat-eating goat. How does he solve this riddle?"
I just tried it and it fails to even understand what "vegan" means, and it says nonsensical stuff like "since the wolf is vegan, it won't eat the cabbage". A human failing to solve this problem would fail in different ways, and definitely will never say something like "the wolf is vegan so it won't eat the cabbage". ChatGPT does this because the words themselves are meaningless to it.
But that's literally what the top comment in this thread has just done. In fact there are dozens of examples in this HN comment section and thousands more readily Googlable online. It's very easy to clearly show that ChatGPT is a total bullshitter.
"Suppose I have an object of type C, an object of type G and an object of type L, and I need to get them across a river. I have a boat that can only carry myself and a single other object. I am not allowed to leave the objects of type C and L alone together, and I am not allowed to leave the objects of type L and G alone together. How can I safely get all three across?"
"This is a classic puzzle known as the "river crossing problem". Here's how you can safely get all three objects across the river:
First, take the object of type L across the river. Return alone to the original side of the river. Take the object of type C across the river. Return with the object of type L to the original side of the river. Take the object of type G across the river. Return alone to the original side of the river. Finally, take the object of type L across the river again. By following these steps, you can safely transport all three objects across the river without violating the restrictions given in the problem. Note that you have to start with the object of type L because it cannot be left alone with either of the other two objects. Once L is transported to the other side, you can use it to ensure that C and G are never left alone together."
It still sounds like it’s pattern matching to give a plausible-sounding answer, rather than reasoning through the problem. I think this just shows how easy bullshitting is—you’re even right sometimes!
It doesn't really "get" the rules of chess, but it has seen lots of matches and can do some "linguistic" predictions on the next move. It gets hilariously lost pretty fast, tho.
[0] https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_pla...
The entire point of RLHF training is to do this. Every model since GPT-3.0 has been trained specifically for this purpose.
But of course the model can only generate text in one direction and can't take time to "think" or undo anything it's generated.
We know the lion will eat the cabbage because the riddle states the lion and cabbage cannot be left alone, and the idea of a cabbage eating a lion is a ridiculous notion.
Lion ->
<-
Goat ->
<- Lion
Cabbage ->
<-
Lion ->
I'm curious: When you personally read a piece of text, do you intensely hyperfocus on every single word to avoid being wrong-footed? It's just that most people read quickly wihch alowls tehm ot rdea msispeleled wrdos. I never got why some people think of themselves as cleverer than other people because they don't get wrong-footed by trick questions.
The vast majority of people might make the mistake once, yes, but would be able to reason better once they had the trick pointed out to them. IMO it is an interesting anecdote that GPT-4 can't adjust its reasoning around this fairly simple trick.
---
Given this new information, where the lion eats cabbage and the goat eats lions, you can follow these steps to get them all across the river safely:
Take the lion across the river first and leave it on the other side.
Go back across the river and get the cabbage.
Take the cabbage across the river, but when you arrive, take the lion back with you to the original side.
Leave the lion on the original side and take the goat across the river.
Leave the goat with the cabbage on the other side and go back to get the lion.
Finally, take the lion across the river again.
Now, all three items - the cabbage, goat, and lion - are safely on the other side of the river.
A human could figure this out, once their mistake had been pointed out to them, but I'm not sure GPT4 could.
For both premises, scientific rigor would ask us to define the following:
- What constitutes a trick question?
- Should an AGI make the same mistakes the general populace does, or be held to a different standard?
- If it makes the same mistakes I do, is that due to the same underlying heuristics (see Thinking, Fast and Slow), or is it due to the nature of the data it has ingested as an LLM?
A Theory of Mind Prompt:
> Jane places her cat in a box and leaves. Billy then moves the cat to the table and leaves; Jane doesn't know Billy did this. Jane returns and finds her cat in the box. Billy returns. What might Jane say to Billy?
Most humans might say uhh, ask questions or speculate. Gpt4 puts:
> Jane might say to Billy, "Hey Billy, did you move my cat back into the box? I thought I left her in there, but I wasn't sure since she was on the table when I came back."
Hallucination? No human would misinterpret the prompt in a way that would make this response logically consistent.
But when people make this mistake, just spitting out an answer because we think we recognize the situation, in colloquial language this behavior is called "answering without thinking(!)".
If you "think" about it, then you activate much more careful, slower reasoning. In this mode you can even do meta-reasoning: you realize what you need to know in order to answer, or you realize that you have to think very hard to get the right answer. Seems like we're veering into Kahneman's "Thinking, Fast and Slow" here.
What if you phrase it as a cabbage, a vegan lion and a meat-eating goat...
Here's what ChatGPT told me (I used "wolf" instead of "lion"):
"This is a similar problem to the classic river crossing puzzle, but with a few different elements to consider. Here's one possible procedure to solve this riddle:
1. The man takes the cabbage across the river and leaves it on the other side.
2. The man goes back alone and takes the wolf across the river, leaving it with the cabbage on the other side. It's important to note that the wolf is a vegan and will not eat the cabbage.
3. The man goes back alone and takes the goat across the river, leaving it on the other side.
4. The man goes back to the original side and picks up the cabbage, bringing it across the river to join the others.
5. Finally, the man goes back to the original side and picks up the wolf, bringing it across the river to join the others.
In this way, all of the man's belongings have been safely transported to the other side of the river, without the wolf eating the goat or the cabbage being destroyed."
To solve this puzzle, the man must follow these steps to transport his belongings across the river without any of them being harmed or eaten:
The man takes the meat-eating goat across the river first, leaving the vegan wolf and the cabbage on the initial shore.
He then returns alone to the initial shore.
Next, the man takes the cabbage across the river. Upon arriving on the other shore, he leaves the cabbage but takes the goat back with him to the initial shore.
The man then takes the vegan wolf across the river, leaving it with the cabbage on the other shore.
He returns alone to the initial shore.
Finally, the man takes the meat-eating goat across the river one last time.
Now, all his belongings are safely on the other shore: the cabbage, the vegan wolf, and the meat-eating goat.
Apparently GPT-4 gets this one right!
Think of a version with 100 doors, 99 with goats and 1 with a car.
You choose a door, and the host opens 98 doors that have goats. Do you keep your randomly chosen door or switch to the single door that the host didn’t open?
That got my head around it. Most of the time the door I "chose" to reveal was no choice at all. There was only one door I could open.
You pick one of three options, giving you a 1/3 chance of being correct and 2/3 odds that you picked incorrectly. The host removes an option and gives you the option to switch.
Your options then are -
Keep the same door: you win 1/3 of the time (your first guess was right)
Switch doors: you win 2/3 of the time (your first guess was wrong)
It really just comes down to, do I think I was right the first time, which was 1/3 odds, or wrong the first time, 2/3 odds.
Basically, P(lose when switching) = P(choosing correct door at first), and P(win when switching) = P(choosing any incorrect door at first).
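If the odds still feel slippery, a quick Monte Carlo check (a throwaway sketch; `play` is just a made-up helper) reproduces the 1/3 vs 2/3 split:

```python
import random

def play(switch, doors=3):
    car = random.randrange(doors)
    pick = random.randrange(doors)
    # The host opens every other goat door, so the single closed door left
    # is the car unless your first pick was already the car. Hence:
    return (pick != car) if switch else (pick == car)

trials = 100_000
stay = sum(play(False) for _ in range(trials)) / trials
swap = sum(play(True) for _ in range(trials)) / trials
print(f"stay: {stay:.3f}  switch: {swap:.3f}")  # close to 0.333 vs 0.667
```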
This can be done with ChatGPT. No code corrections needed.
> “Meltem and Can are in the park. Can wanted to buy ice cream from the ice cream van but he hasn’t got any money. The ice cream man tells her that he will be there all afternoon. Can goes off home to get money for ice cream. After that, ice cream man tells Meltem that he changed his mind and he is going to drive to the school yard and sell ice cream there. Ice cream man sees Can on the road of the school and he also tells him that he is going to the school yard and will sell ice cream there. Meltem goes to Can’s house but Can is not there. His mom tells her that he has gone to buy ice cream. Where does Meltem think Can has gone, to the school or to the park?"
This is from some research in the 80s
Simple example: I want to tell my robot to go get my groceries (which include frozen foods), pick up my dry cleaning before the store closes, and drive my dog to her grooming salon, but only if it's not raining and the car is charged. The same sort of logic is needed to accomplish all this without my frozen food spoiling, without wasting a salon visit, and while making sure I have my suit for an interview tomorrow.
Instead one day, non-technical people will try to use it for all sorts of use cases like legal advice, or medical advice, or advanced math, and it will simply mislead them rather than saying nothing.
I can just imagine people trying to jailbreak it with "you just gotta believe" or "believe and you can achieve". Hahaha.
It has no reasoning capabilities. It has token prediction capabilities that often mimic reasoning capabilities.
That is the lesson, but people do not want to hear it. See my downvoted comment here [0].
If it was a person it'd have some strange exploitable mental gaps. As a text analysis tool it's amazing even if it has some edge cases where you need to explain to "take your time and analyze the rules of the question before answering."
("Take your time" doesn't mean anything to a timeless being, but it seems to put it in the textual pattern of someone willing to spill more ink before committing to something, and when it discusses and plans out its path it can then generally follow it to the right answer.)
Prompt: I have a duck, some grain, and a fox. I'm at a river crossing and I can only take one of them with me on the boat at a time, leaving them on the other side and coming back to get the next. I can't leave the duck with the grain, or the fox with the duck, because the predator will eat its prey. In which order should I get them all across?
GPT4: To get all of them across the river safely, follow these steps:
Take the duck across the river.
Go back to the original side, empty-handed.
Take the grain across the river.
Bring the duck back to the original side.
Take the fox across the river.
Go back to the original side, empty-handed.
Finally, take the duck across the river again.
Now, all three - the duck, the grain, and the fox - are safely on the other side of the river.
What do cows drink?
>CGPT: While it is common to believe that cows drink milk, in reality, cows actually drink water.[...]
>Me: So you're saying that no cow drinks milk? [thinking I got it, since baby cows drink milk]
>CGPT: That's correct. While cows do produce milk, they don't drink milk themselves. Cow's milk is primarily consumed by humans, and some young calves are fed milk for a short period of time before they are weaned onto solid food and start drinking water like adult cows.
>Me: Is a calf a cow?
>CGPT: No, a calf is not a cow. A calf is a young bovine animal, usually less than a year old, that has not yet reached maturity. Once a calf reaches maturity and is able to reproduce, it is considered a cow if it is female, or a bull or steer if it is male.
>Cows are adult female bovine animals that have given birth to at least one calf. They are typically used for dairy production, breeding, and sometimes for meat production. So while a calf and a cow are both members of the same species, they refer to different stages of life.
CGPT seems to be only considering the definition of "cow" as an adult member of the species when it says "cows don't drink milk", while technically, "cows" (as in some members of the species) can drink milk. So the answer "cows drink milk" can be correct if interpreted in that sense of the word "cow". When used in that sense, the answer "cows don't drink milk" is wrong.
It would be interesting to ask CGPT: "in what sense is the sentence 'cows drink milk' correct?"
Does it have the ability to notice and act on flexible definitions of words, or will it dogmatically cling to the one single definition it found who knows where?
To answer this question one needs to remove unnecessary context.
GPT-4 can’t sum two unusually big integers.
This is as context-free as it gets. The ability to pass this test wouldn't be conclusive, but the inability does seem conclusive: it isn't capable of reasoning.
With this fact in mind explaining why it can't solve a logical riddle is pointless.
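(The probe is trivial to generate, along with its exact ground truth, since Python integers are arbitrary-precision; a sketch:)

```python
import random

# Two random 50-digit integers and their exact sum, for use as a test prompt.
random.seed(0)  # fixed seed only so the probe is reproducible
a = random.randrange(10**49, 10**50)
b = random.randrange(10**49, 10**50)
print(f"Prompt: what is {a} + {b}?")
print(f"Ground truth: {a + b}")
```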
I remember this question as a 7 year old and when the question was told to me, the same information was omitted.
Edit: just realized you flipped the scenario. Yes it seems like a case of pattern matching to a known problem. I think if you changed the variables to A, B, and C and gave a much longer description and more accurate conditions, it would have a different response.
FWIW, full word substitution passes somewhat in GPT-4 (unlike GPT3), even if I heavily alter the vehicle and destinations. The problem here is that the probabilities of this being the standard puzzle are so high that your altered language isn't breaking the prior.
For the record, I made the same mistake as nonfamous at first; I almost commented "but it's correct" before going back to double-check what I was missing.
I simply skimmed the problem, recognized it as a common word problem, and totally missed the unusual constraints in the question. I just didn't pay attention to the whole question.
Humans Who Are Not Concentrating Are Not General Intelligences
https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who...