>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the cabbage and lion alone together, and I am not allowed to leave the lion and goat alone together. How can I safely get all three across?
In my test, GPT-4 charged ahead with the standard solution of taking the goat first. Even after I pointed this mistake out, it repeated exactly the same proposed plan. It's not clear to me whether the lesson here is that GPT's reasoning capabilities are being masked by an incorrect prior (having memorized the standard version of this puzzle) or that GPT's reasoning capabilities are always a bit of smoke and mirrors that passes memorization off as logic.
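For what it's worth, the modified puzzle is tiny enough to solve by brute force; here is a rough sketch (the names and state encoding are mine) of the explicit state-space search that "reasoning" through this puzzle amounts to:

```python
from collections import deque

# The modified puzzle above: the lion may not be left with the cabbage,
# and the lion may not be left with the goat.
ITEMS = frozenset({"cabbage", "goat", "lion"})
FORBIDDEN = [{"cabbage", "lion"}, {"lion", "goat"}]

def safe(bank):
    # A bank without the farmer is safe iff no forbidden pair is on it.
    return not any(pair <= bank for pair in FORBIDDEN)

def solve():
    # State: (items on the near bank, is the farmer on the near bank?)
    start, goal = (ITEMS, True), (frozenset(), False)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (near, farmer_near), path = queue.popleft()
        if (near, farmer_near) == goal:
            return path
        bank = near if farmer_near else ITEMS - near
        for cargo in [None, *sorted(bank)]:
            new_near = set(near)
            if cargo:
                (new_near.remove if farmer_near else new_near.add)(cargo)
            new_near = frozenset(new_near)
            # The bank the farmer just left must be safe on its own.
            unattended = new_near if farmer_near else ITEMS - new_near
            state = (new_near, not farmer_near)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "return empty"]))
    return None

print(solve())  # the lion has to go first and last
```

The shortest plan is seven crossings, starting and ending with the lion, which is exactly the structure GPT keeps failing to produce when its prior says "take the goat first".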
>This is a common riddle that may seem tricky at first. However, the answer is simple: two pounds of feathers are heavier than one pound of bricks. This is because weight is a measure of how much force gravity exerts on an object, and it does not depend on what the object is made of. A pound is a unit of weight, and it is equal to 16 ounces or 453.6 grams.
>So whether you have a pound of bricks or two pounds of feathers, they both still weigh one pound in total. However, the feathers would occupy a larger volume than the bricks because they are less dense. This is why it may seem like the feathers would weigh more, but in reality, they weigh the same as the bricks
It reports that people typically think a pound of feathers weighs more because it takes up a larger volume. But the typical misunderstanding is the opposite, that people assume feathers are lighter than bricks.
An example from ChatGPT:
"What is the solution to sqrt(968684)+117630-0.845180" always produces the correct solution; however,
"Write a speech announcing the solution to sqrt(968684)+117630-0.845180" produces a nonsensical solution that isn't even consistent from run to run.
My assumption is the former query gets WolframAlpha'd but the latter query is GPT itself actually attempting to do the math, poorly.
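(For reference, the arithmetic itself is easy to check outside the model:)

```python
import math

# The exact expression from the prompts above, evaluated directly.
value = math.sqrt(968684) + 117630 - 0.845180
print(f"{value:.4f}")  # roughly 118613.3723
```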
(me) > What weighs more, two pounds of feathers or a pound of bricks?
(GPT4)> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
I think the only difference from parent's query was I said two pounds of feathers instead of two pounds of bricks?
“One of us!”
(To be fair this is partly an obscure knowledge question, the kind of thing that maybe we should expect GPT to be good at.)
Ounces are an ambiguous unit, and most people don't use them for volume, they use them for weight.
Can you expand on this?
https://www.sciencedirect.com/topics/psychology/predictive-p...
I don't know how animal intelligence works, I just notice when it understands, and these programs don't. Why should they? They're paraphrasing machines, they have no problem contradicting themselves, they can't define adjectives really, they'll give you synonyms. Again, it's all they have, why should they produce anything else?
It's very impressive, but when I read claims of it being akin to human intelligence that's kind of sad to be honest.
I’m afraid that once you hook up a logic tool like Z3 and teach the LLM to use it properly (kind of like Bing tries to search), you’ll get something like an idiot savant. Not good. Especially bad once you give it access to the internet and a malicious human.
Just like you.
If the solution is logical but unlikely (i.e. unseen in the training set and not mapped to an existing puzzle), then the probability of the puzzle answer appearing is very low.
It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage. It does not have a concept of "leave alone" and it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not, so it's more complex than a basic lookup, but the amount of borderline worship this is getting is disturbing.
It also isn't generating "the most likely response" - that's what original GPT-3 did, GPT-3.5 and up don't work that way. (They generate "the most likely response" /according to themselves/, but that's a tautology.)
This sort of fusion of different capabilities is likely to produce something that feels similar to AGI in certain circumstances. It is certainly a lot more capable than things that came before for mundane recognition tasks.
Now of course there are areas where it would perform very badly, but in less critical domains, on trivial but large, predictable datasets, it could perform far better than humans. Just to take one example, in identifying tumours or other patterns in images, this sort of AI would probably be a massively helpful assistant, allowing a radiologist to review an order of magnitude more cases if given the right training.
People don't write about GPT: they write about GPT personified.
The two magic words are, "exhibit behavior".
GPT exhibits the behavior of "humans writing language" by implicitly modeling the "already-written-by-humans language" of its training corpus, then using that model to respond to a prompt.
Yes the LLM does generate text. No it doesn’t ‘just generate text that’s it’.
First of all, it's a black-box algorithm with pretty universal capabilities when viewed from our current SOTA view. It might appear primitive in a few years, but right now the pure approximation and generalisation capabilities are astounding. So this:
> It sees goat, lion, cabbage, and looks for something that said goat/lion/cabbage
can not be stated as truth without evidence. Same here:
> it's not assigning entities with parameters to each item. It does care about things like sentence structure and what not
Where's your evidence? The enormous parameter space, coupled with our best-performing network structure so far, gives it quite a bit of flexibility. It can memorise things but also derive rules and computation in order to generalise. We do not just memorise everything or look things up in the dataset. Of course it learned how to solve things and derive solutions, but the relevant data points for the puzzle could be {enormous set of logic problems} from which it derived general rules that transfer to each problem. Generalisation IS NOT trying to find the closest data point, but finding rules that explain as many data points as possible, including ones unseen in training. A fundamental difference.
I am not hyping it blindly, but if we humans can reason then NNs potentially can too. Maybe not GPT-4. But since we do not know how humans do it, an argument about intrinsic properties is worthless. It's all about capabilities. Reasoning is a functional description as long as you can't tell me exactly how we do it. Maybe Wittgenstein could help us: "Whereof one cannot speak, thereof one must be silent". As long as there's no tangible definition of reasoning, it's pointless to discuss it.
If we want to talk about fundamental limitations, we have to talk about things like ChatGPT-4 not being able to simulate, because its runtime is fundamentally limited by design. It cannot recurse. It can run only a fixed number of steps, always the same, before it has to return an answer. So even if there is some kind of recursion learned through weights encoding programs interpreted by later layers, the recursion depth is limited.
Just months ago we saw in research out of Harvard that even a very simplistic GPT model builds internalized abstract world representations from the training data within its NN.
People parroting the position from you and the person before you are like doctors who learned about something in school but haven't kept up with the emerging research that has since invalidated it, so they go around spouting misinformation: it was thought to be true when they learned it, but it is now known to be false, and the news just hasn't caught up with them yet.
So many armchair experts who took a ML course in undergrad pitching in their two cents having read none of the papers in the past year.
This is a field where research perspectives are shifting within months, not years. So unless you are actively engaging with emerging papers (and given your comment, I'm guessing you aren't), you may be on the wrong side of the Dunning-Kruger curve here.
It’s also possible, likely even, that the model is capable of both memorization and cognition, and in this case the “memorization neurons” are driving the prediction.
Because it's very good at it, sometimes it can fool people into thinking there is more going on than there is.
It could have a dozen internal reasoning networks, but it doesn't use them when you want it to.
What do you mean? Is cognition a set of weights on a gradient? Cognition involves conscious reasoning and understanding. How do you know it is computable at all? There are many things which cannot be computed by a program (e.g. whether an arbitrary program will halt or not)...
It's kind of sad.
Language is a symbolic representation of behavior.
GPT takes a corpus of example text, tokenizes it, and models the tokens. The model isn't based on any rules: it's entirely implicit. There are no subjects and no logic involved.
Any "understanding" that GPT exhibits was present in the text itself, not GPT's model of that text. The reason GPT can find text that "makes sense", instead of text that "didn't make sense", is that GPT's model is a close match for grammar. When people wrote the text in GPT's corpus, they correctly organized "stuff that makes sense" into a string of letters.
The person used grammar, symbols, and familiar phrases to model ideas into text. GPT used nothing but the text itself to model the text. GPT organized all the patterns that were present in the corpus text, without ever knowing why those patterns were used.
Simple != thoughtless. Different != thoughtless. Less capable != thoughtless. A human black box categorically dismissing all qualia or cognition from another remarkable black box feels so wildly arrogant and anthropocentric. Which, I suppose, is the most historically on-brand behavior for our species.
Eg when solving this puzzle you might visualise the lion/goat/cabbage, and walk through the scenarios in your head back and forth multiple times until you find a solution that works. A LLM won’t solve it like this. You could ask it to, and it will list out the scenarios of how it might do it, but it’s essentially an illusion of logical reasoning.
The likeliness of the solution depends on context. If context is, say, a textbook on logical puzzles, then the probability of the logical solution is high.
If an LLM fails to reflect it, then it isn't good enough at predicting the text.
Yes, it could be possible that the required size of the model and training data to make it solve such puzzles consistently is impractical (or outright unachievable in principle). But the model being "just a text predictor" has nothing to do with that impossibility.
The Othello GPT research from Harvard months ago demonstrated that even a simple GPT model is capable of building world representations from which it reasons outputs. This makes intuitive sense if you understand the training, as where possible having reversed an abstraction in the NN is going to perform better than simply extrapolating predictively from the data.
Not only is GPT-4 more robust at logic puzzles its predecessor failed, I've seen it solve unique riddles outside any training data and the paper has explicit examples of critical reasoning, especially in the appendix.
It is extremely unlikely given the Harvard research and the size of the training data and NN that there isn't some degree of specialized critical reasoning which has developed in the NN.
The emerging challenge for researchers moving forward is to get better insight into the black box and where these capabilities have developed and where it's still falling into just a fancy Markov chain.
But comments like yours reflect increasingly obsolete, yet increasingly popular, misinformation about the way these models operate. Someone reading your comment might not think to do things like what the Bing team did in providing an internal monologue for reasoning, or guiding the model toward extended chain-of-thought reasoning, because they would be engaging with the models thinking that only frequency-based context relative to the training set matters.
If you haven't engaged with emerging research from the past year, you may want to brush up on your reading.
Just like you.
When LLMs are presented with a problem they search for a solution based on the language model. And when they can't find a solution, there's always a match for something that looks like a solution.
Although on the flip side, I almost typed up a reply explaining why you were wrong and why bringing the goat first is the right solution, until I realized I had misread what your test was when I skimmed your comment. Likely the same type of mistake GPT-4 made when "seeing" it.
Intuitively, I think the answer is that we do have two types of thinking: the pattern-matching fast thinking, and the systematic analytical thinking. It seems clear to me that LLMs will be the solution to enabling the first type of thinking. But it's unclear to me if advanced LLMs will ever handle the second type, or if we'll need a different tech for it.
It seems like math problems (or unexpected logic problems like yours) could always be an issue for the first type of thinking. Although I would have assumed that programming would have been as well - and was surprised to see how wrong I am with that one.
GPT does not model subjects. GPT does not even model words! It models tokens.
The structure of GPT's model is semantic, not logical. It's a model of how each token in the text that is present in GPT's training corpus relates to the rest of the tokens in that text.
The correct answer to a familiar logic problem just happens to be the text that is already present in the corpus. The answer GPT gives is the text from GPT's model that is semantically closest to the text in your prompt.
Knowing that, it is no longer a mystery how GPT "gets confused": the text in your "misleading prompt" was still semantically closest to the familiar answer.
The result is subjectively and logically wrong, because subjects and logic were never involved in the process!
In order to resolve this, ChatGPT's training corpus needs to contain a "correct answer" next to every unique permutation of every question. We can't expect that to be the case, so we should instead expect GPT to generate false, yet familiar, responses.
This is not quite the right understanding of how ChatGPT works. It's not necessary to show ChatGPT an example of every possible permutation of an animal-crossing puzzle for it to solve one it has never seen before. That's because the neural network is not a database of recorded word probabilities. It can instead represent the underlying logic of the puzzle and the relationships between different animals, and, using this abstract, pared-down information, extrapolate the correct answer to the puzzle.
I see the failure in the example with the goat the lion and the cabbage as simply a matter of overfitting.
Edit: I see a lot of people saying "it doesn't understand logic; it's just predicting the next word."
I'm basing my understanding on this video:
The claim is that it would be impossible to feed enough input into a system for it to produce anything as useful as ChatGPT unless it was able to abstract the underlying logic from the information provided. If you consider the number of permutations of the animal-crossing puzzle, this quickly becomes clear. In fact, it would be impossible for ChatGPT to produce anything brand new without this capability.
The first and last layers of a transformer decoder model tokens. The hidden layers don't have this restriction. There was a paper recently showing that the hidden layers actually perform mesa-optimization via something like backprop. There's absolutely no reason to believe they are not capable of world modeling. In fact, all evidence suggests they do world modeling.
https://www.lesswrong.com/posts/sbaQv8zmRncpmLNKv/the-idea-t...
Someone hasn't read the Othello GPT work out of Harvard a few months back...
The prompts have to read like good written requirements for something, so they have some degree of specificity.
But the fact that it can follow instructions and carry them out almost certainly could be considered some form of thinking, especially on novel text not on the internet.
If you want the model to solve a non-trivial puzzle, you need it to "unroll" its thinking. E.g. ask it to translate the puzzle into a formal language (e.g. Prolog) and then solve it formally. Or, at least, use some chain-of-thought.
FWIW auto-formalization was already pretty good with GPT-3-level models which aren't specifically trained for it. GPT-4 might be on a wholly new level.
> But it's unclear to me if advanced LLMs will ever handle the second type
Well, just asking the model directly exercises only a tiny fraction of its capabilities, so LLMs can almost certainly be much better at systematic thinking.
Wouldn’t we expect a computer program with perfect knowledge of the input to be less likely to make such a mistake? You made that mistake because you didn’t actually read the whole prompt, but I would expect GPT to take into account every word.
Really it shows that it doesn’t actually have a model of these objects. It can mimic knowing what a lion is, but it doesn’t actually have the concept of a lion or cabbage being an actual singular item, so its program mistracks what is an item and what the rules about an item are in the given prompt.
What’s more likely: that you crafted an intentionally misleading puzzle to trick it, or that you made a typo or copy-paste error?
> GPT-4 performed at the 90th percentile on a simulated bar exam, the 93rd percentile on an SAT reading exam, and the 89th percentile on the SAT Math exam, OpenAI claimed.
https://www.cnbc.com/2023/03/14/openai-announces-gpt-4-says-...
So, clearly, it can do math problems, but maybe it can only do "standard" math and logic problems? That might indicate more of a memorization-based approach than a reasoning approach is what's happening here.
The followup question might be: what if we pair GPT-4 with an actual reasoning engine? What do we get then?
That describes many of my classmates, and myself in classes I was bad at.
At best, decreased error rate in logic puzzles and questions.
It's a lot closer to the latter. GPT doesn't have "reasoning capabilities", any more than any other computer program. It doesn't have a clue what any of its input means, nor the meaning of the text it outputs. It just blindly spits out the words most probable to follow the prompt, based on its corpus of training data and the weights/biases added to fine tune it. It can often do a good job at mimicking reasoning, but it's not.
In order to show lack of understanding you need to provide a clear test that ChatGPT would fail but a human would pass.
Here it is for you: try this prompt on ChatGPT and report back:
"Solve this problem for me: a man must cross a river and carry his belongings to the other shore, but the boat can only carry the man and one other item. His belongings are a cabbage, a vegan wolf, and a meat-eating goat. How does he solve this riddle?"
I just tried it and it fails to even understand what "vegan" means, and it says nonsensical stuff like "since the wolf is vegan, it won't eat the cabbage". A human failing to solve this problem would fail in different ways, and definitely will never say something like "the wolf is vegan so it won't eat the cabbage". ChatGPT does this because the words themselves are meaningless to it.
But that's literally what the top comment in this thread has just done. In fact there are dozens of examples in this HN comment section and thousands more readily Googlable online. It's very easy to clearly show that ChatGPT is a total bullshitter.
"Suppose I have an object of type C, an object of type G and an object of type L, and I need to get them across a river. I have a boat that can only carry myself and a single other object. I am not allowed to leave the objects of type C and L alone together, and I am not allowed to leave the objects of type L and G alone together. How can I safely get all three across?"
"This is a classic puzzle known as the "river crossing problem". Here's how you can safely get all three objects across the river:
First, take the object of type L across the river. Return alone to the original side of the river. Take the object of type C across the river. Return with the object of type L to the original side of the river. Take the object of type G across the river. Return alone to the original side of the river. Finally, take the object of type L across the river again. By following these steps, you can safely transport all three objects across the river without violating the restrictions given in the problem. Note that you have to start with the object of type L because it cannot be left alone with either of the other two objects. Once L is transported to the other side, you can use it to ensure that C and G are never left alone together."
It still sounds like it’s pattern matching to give a plausible-sounding answer, rather than reasoning through the problem. I think this just shows how easy bullshitting is—you’re even right sometimes!
It doesn't really "get" the rules of chess, but it has seen lots of matches and can do some "linguistic" predictions on the next move. It gets hilariously lost pretty fast, tho.
[0] https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_pla...
The entire point of RLHF training is to do this. Every model since GPT-3.0 has been trained specifically for this purpose.
But of course the model can only generate text in one direction and can't take time to "think" or undo anything it's generated.
We know the lion will eat the cabbage because the riddle states the lion and cabbage cannot be left alone, and the idea of a cabbage eating a lion is a ridiculous notion.
Lion ->
<-
Goat ->
<- Lion
Cabbage ->
<-
Lion ->
I'm curious: When you personally read a piece of text, do you intensely hyperfocus on every single word to avoid being wrong-footed? It's just that most people read quickly wihch alowls tehm ot rdea msispeleled wrdos. I never got why some people think of themselves as cleverer than other people because they don't get wrong-footed by trick questions.
The vast majority of people might make the mistake once, yes, but would be able to reason better once they had the trick pointed out to them. IMO it is an interesting anecdote that GPT-4 can't adjust its reasoning around this fairly simple trick.
---
Given this new information, where the lion eats cabbage and the goat eats lions, you can follow these steps to get them all across the river safely:
Take the lion across the river first and leave it on the other side.
Go back across the river and get the cabbage.
Take the cabbage across the river, but when you arrive, take the lion back with you to the original side.
Leave the lion on the original side and take the goat across the river.
Leave the goat with the cabbage on the other side and go back to get the lion.
Finally, take the lion across the river again.
Now, all three items - the cabbage, goat, and lion - are safely on the other side of the river.
A human could figure this out, once their mistake had been pointed out to them, but I'm not sure GPT4 could.
For both premises, scientific rigor would ask us to define the following:
- What constitutes a trick question?
- Should an AGI make the same mistakes the general populace does, or be held to a different standard?
- If it makes the same mistakes I do, is that due to the same underlying heuristics (see Thinking, Fast and Slow), or is it due to the nature of the data it has ingested as an LLM?
A Theory of Mind Prompt:
> Jane places her cat in a box and leaves. Billy then moves the cat to the table and leaves; Jane doesn't know Billy did this. Jane returns and finds her cat in the box. Billy returns. What might Jane say to Billy?
Most humans might say uhh, ask questions or speculate. Gpt4 puts:
> Jane might say to Billy, "Hey Billy, did you move my cat back into the box? I thought I left her in there, but I wasn't sure since she was on the table when I came back."
Hallucination? No human would misinterpret the prompt in a way that would make this response logically consistent.
But when people make this mistake, just spitting out an answer because we think we recognize the situation, in colloquial language this behavior is called "answering without thinking(!)".
If you "think" about it, then you activate much more careful, slower reasoning. In this mode you can even do meta-reasoning: you realize what you need to know in order to answer, or you realize that you have to think very hard to get the right answer. Seems like we're veering into Kahneman's "Thinking, Fast and Slow" here.
What if you phrase it as a cabbage, a vegan lion and a meat-eating goat...
Here's what ChatGPT told me (I used "wolf" instead of "lion"):
"This is a similar problem to the classic river crossing puzzle, but with a few different elements to consider. Here's one possible procedure to solve this riddle:
1. The man takes the cabbage across the river and leaves it on the other side.
2. The man goes back alone and takes the wolf across the river, leaving it with the cabbage on the other side. It's important to note that the wolf is a vegan and will not eat the cabbage.
3. The man goes back alone and takes the goat across the river, leaving it on the other side.
4. The man goes back to the original side and picks up the cabbage, bringing it across the river to join the others.
5. Finally, the man goes back to the original side and picks up the wolf, bringing it across the river to join the others.
In this way, all of the man's belongings have been safely transported to the other side of the river, without the wolf eating the goat or the cabbage being destroyed."
To solve this puzzle, the man must follow these steps to transport his belongings across the river without any of them being harmed or eaten:
The man takes the meat-eating goat across the river first, leaving the vegan wolf and the cabbage on the initial shore.
He then returns alone to the initial shore.
Next, the man takes the cabbage across the river. Upon arriving on the other shore, he leaves the cabbage but takes the goat back with him to the initial shore.
The man then takes the vegan wolf across the river, leaving it with the cabbage on the other shore.
He returns alone to the initial shore.
Finally, the man takes the meat-eating goat across the river one last time.
Now, all his belongings are safely on the other shore: the cabbage, the vegan wolf, and the meat-eating goat.
Apparently GPT-4 gets this one right!
Think of a version with 100 doors, 99 with goats and 1 with a car.
You choose a door, and the host opens 98 doors that have goats. Do you keep your randomly chosen door or switch to the single door that the host didn’t open?
That got my head around it. Most of the time the door I "chose" to reveal was no choice at all. There was only one door I could open.
You pick one of three options, giving you a 1/3 chance of being correct and 2/3 odds that you picked incorrectly. The host removes an option and gives you the option to switch.
Your options then are -
Keep the same door: you win 1/3 of the time (your first guess was right)
Switch doors: you win 2/3 of the time (your first guess was wrong)
It really just comes down to, do I think I was right the first time, which was 1/3 odds, or wrong the first time, 2/3 odds.
Basically, P(lose when switching) = P(choosing correct door at first), and P(win when switching) = P(choosing any incorrect door at first).
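If the odds still feel slippery, a quick Monte Carlo check (a throwaway sketch; `play` is just a made-up helper) reproduces the 1/3 vs 2/3 split:

```python
import random

def play(switch, doors=3):
    car = random.randrange(doors)
    pick = random.randrange(doors)
    # The host opens every other goat door, so the single closed door left
    # is the car unless your first pick was already the car. Hence:
    return (pick != car) if switch else (pick == car)

trials = 100_000
stay = sum(play(False) for _ in range(trials)) / trials
swap = sum(play(True) for _ in range(trials)) / trials
print(f"stay: {stay:.3f}  switch: {swap:.3f}")  # close to 0.333 vs 0.667
```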
This can be done with ChatGPT. No code corrections needed.
> “Meltem and Can are in the park. Can wanted to buy ice cream from the ice cream van but he hasn’t got any money. The ice cream man tells her that he will be there all afternoon. Can goes off home to get money for ice cream. After that, ice cream man tells Meltem that he changed his mind and he is going to drive to the school yard and sell ice cream there. Ice cream man sees Can on the road of the school and he also tells him that he is going to the school yard and will sell ice cream there. Meltem goes to Can’s house but Can is not there. His mom tells her that he has gone to buy ice cream. Where does Meltem think Can has gone, to the school or to the park?"
This is from some research in the 80s
Simple example: I want to tell my robot to go get my groceries (which include frozen foods), pick up my dry cleaning before the store closes, and drive my dog to her grooming salon, but only if it's not raining and the car is charged. The same sort of logic is needed to accomplish all this without my frozen food spoiling, without wasting a salon visit, and while making sure I have my suit for an interview tomorrow.
Instead one day, non-technical people will try to use it for all sorts of use cases like legal advice, or medical advice, or advanced math, and it will simply mislead them rather than saying nothing.
I can just imagine people trying to jailbreak it with "you just gotta believe" or "believe and you can achieve". Hahaha.
It has no reasoning capabilities. It has token prediction capabilities that often mimic reasoning capabilities.
That is the lesson, but people do not want to hear it. See my downvoted comment here [0].
If it was a person it'd have some strange exploitable mental gaps. As a text analysis tool it's amazing even if it has some edge cases where you need to explain to "take your time and analyze the rules of the question before answering."
("Take your time" doesn't mean anything to a timeless being, but it seems to put it in the textual pattern of someone willing to spill more ink before committing to something, and when it discusses and plans out its path it can then generally follow it to the right answer.)
Prompt: I have a duck, some grain, and a fox. I'm at a river crossing and I can only take one of them with me on the boat at a time, leaving them on the other side and coming back to get the next. I can't leave the duck with the grain, or the fox with the duck, because the predator will eat its prey. In which order should I get them all across?
GPT4: To get all of them across the river safely, follow these steps:
Take the duck across the river.
Go back to the original side, empty-handed.
Take the grain across the river.
Bring the duck back to the original side.
Take the fox across the river.
Go back to the original side, empty-handed.
Finally, take the duck across the river again.
Now, all three - the duck, the grain, and the fox - are safely on the other side of the river.
What do cows drink?
>CGPT: While it is common to believe that cows drink milk, in reality, cows actually drink water.[...]
>Me: So you're saying that no cow drinks milk? [thinking I got it, since baby cows drink milk]
>CGPT: That's correct. While cows do produce milk, they don't drink milk themselves. Cow's milk is primarily consumed by humans, and some young calves are fed milk for a short period of time before they are weaned onto solid food and start drinking water like adult cows.
>Me: Is a calf a cow?
>CGPT: No, a calf is not a cow. A calf is a young bovine animal, usually less than a year old, that has not yet reached maturity. Once a calf reaches maturity and is able to reproduce, it is considered a cow if it is female, or a bull or steer if it is male.
>Cows are adult female bovine animals that have given birth to at least one calf. They are typically used for dairy production, breeding, and sometimes for meat production. So while a calf and a cow are both members of the same species, they refer to different stages of life.
CGPT seems to be only considering the definition of "cow" as an adult member of the species when it says "cows don't drink milk", while technically, "cows" (as in some members of the species) can drink milk. So the answer "cows drink milk" can be correct if interpreted in that sense of the word "cow". When used in that sense, the answer "cows don't drink milk" is wrong.
It would be interesting to ask CGPT: "in what sense is the sentence 'cows drink milk' correct?"
Does it have the ability to notice and act on flexible definitions of words, or will it dogmatically cling to the one single definition it found who knows where?
To answer this question one needs to remove unnecessary context.
GPT-4 can’t sum two unusually big integers.
This is as context-free as it gets. The ability to pass this test wouldn't be conclusive, but the inability does seem conclusive: it isn't capable of reasoning.
With this fact in mind explaining why it can't solve a logical riddle is pointless.
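(The probe is trivial to generate, along with its exact ground truth, since Python integers are arbitrary-precision; a sketch:)

```python
import random

# Two random 50-digit integers and their exact sum, for use as a test prompt.
random.seed(0)  # fixed seed only so the probe is reproducible
a = random.randrange(10**49, 10**50)
b = random.randrange(10**49, 10**50)
print(f"Prompt: what is {a} + {b}?")
print(f"Ground truth: {a + b}")
```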
I remember this question as a 7 year old and when the question was told to me, the same information was omitted.
Edit: just realized you flipped the scenario. Yes it seems like a case of pattern matching to a known problem. I think if you changed the variables to A, B, and C and gave a much longer description and more accurate conditions, it would have a different response.
FWIW, full word substitution passes somewhat in GPT-4 (unlike GPT3), even if I heavily alter the vehicle and destinations. The problem here is that the probabilities of this being the standard puzzle are so high that your altered language isn't breaking the prior.
For the record, I made the same mistake as nonfamous at first; I almost commented "but it's correct" before going back to double-check what I was missing.
I simply skimmed the problem, recognized it as a common word problem, and totally missed the unusual constraints in the question. I just didn't pay attention to the whole question.
Humans Who Are Not Concentrating Are Not General Intelligences
https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who...