Are emergent abilities of large language models a mirage?

(arxiv.org)

154 points · chewxy · 3y ago · 130 comments

74 comments · 20 top-level

_8j503y ago· 17 in thread

I had someone much more knowledgeable on this topic than myself claim ChatGPT and the like "understand" stuff. My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.

What concerns me very much is how greatly the harms that can be caused by LLMs have been underreported.

I imagine someone with the right power and access telling ChatGPT "find all people that would vote against this candidate in real time and devise ad content and social media messaging and bot interaction to change their minds or discourage them from voting". Heck, any intel org of a major country is probably already working on this. No more whistleblowing or posting anonymously on social media; companies would even share models based on private email and conversations you had so other companies could use LLMs to identify everything you posted elsewhere and to have LLMs designate a score for how hireable you are. Police can crack down on crime better but also crack down on dissent or any police reforms.

And we aren't even talking about war time use of LLMs or what happens when you marry something like ChatGPT with Dall-E and make it all real-time.

I am warning anyone who will listen. Smartphones are the most dangerous things out there. Any service or interaction that depends on them is detrimental to the peace and liberty of the masses long term. People have not learned a thing from Snowden or the 2016 elections.

And why are all the smart journalists asleep on the job on this topic? Where are the unreasonable scaremongers when you need them!

nl3y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

This doesn't seem to make sense. If anything the opposite is true - if the things that are hallucinated make sense (even if not true) it means there is some "understanding" or a world model.

_8j503y ago

No, hallucinations are similar to but not quite the expected result, which shows they are approximations.

For example, you tell Dall-E to draw a picture of a man smoking a pipe, but then it draws the pipe coming out of the man's butt, and instead of a head the man has a leg on his neck. This is approximation. Now, if the pipe looked wrong or it was a trumpet, or the guy's head looked weird but was still a head, then maybe it knows what a head is and just made a mistake, and it knows that the pipe goes into the mouth so it will be somewhere near the face.

Understanding is a 3yr old child drawing terribly; approximation is drawing really well but all wrong.

All this applies to LLMs; I used Dall-E because pictures are easier to talk about.

nl3y ago

I haven't used dall-e much but I've never seen stable diffusion or midjourney make an error like that, unless of course deliberately prompted.

You can see this because of the big deal people made about image gen tools getting hands wrong: it was the most significant error that was systematically occurring.

p-e-w3y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything

Alexander Grothendieck, one of the greatest mathematicians of the 20th century, answered "57" when asked in a public lecture to provide an example of a prime number. 57 = 3 * 19 is not a prime number.

According to your argument, this would imply Grothendieck did not understand what a prime number is. Which is laughable.
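
The arithmetic in the anecdote is easy to check. A minimal sketch using plain trial division; `smallest_factor` is a hypothetical helper name introduced only for this illustration:

```python
def smallest_factor(n: int) -> int:
    """Return the smallest factor of n greater than 1 (n itself if n is prime)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    d = 2
    # Trial division: a composite n always has a factor no larger than sqrt(n).
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


factor = smallest_factor(57)
print(factor, 57 // factor)  # the two factors of 57
print(smallest_factor(59) == 59)  # True means 59 has no smaller factor
```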

_8j503y ago

My argument is not that it shouldn't make a mistake, but that it cannot recognize its mistake and correct it; if it understood, it would have. That mathematician, I am sure, would recognize that 57 isn't a prime number if you asked him again.

Also mathematics is not the right field to compare this to because there are rules. In languages and object recognition/synthesis it is all subjective. Understanding here means understanding context and human-subjective interpretation.

When I say "that sunset is beautiful" you understand what I mean, ML models simply approximate based on what they see other humans do or say. I am not calling the sunset beautiful because other people are, it is my own subjective interpretation.

vixen993y ago

Good point. Within organisms there are a number of survival mechanisms even in juveniles. Braitenberg hypothesized simple mechanical vehicles that evolved survival tactics. Not understanding the latter will ultimately be a fatal flaw irrespective of any insightful understanding of other problems. If Grothendieck had known he faced execution for failing to give a correct response, he would certainly have survived. I'd be interested to see AI configurations linked to a number of direct externalities (not linked to human derived/directed information) that might then determine their fate.

https://en.wikipedia.org/wiki/Braitenberg_vehicle

Samuel Johnson: “Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully.”

deafpolygon3y ago

Humans are not infallible. It's a mistake to anthropomorphize LLMs.

SanderNL3y ago

LLM wrong => unredeemable and fatally flawed

human wrong => no problem, mistakes happen

flangola73y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

> These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.

My counterargument is that humans hallucinate too, and often. As just one small example, eyewitness testimony is stupefyingly unreliable. Neurological research and even basic behavioral research show our brains act as bullshit machines, constantly fabricating satisfying narratives. Not to even get into the fact that the word hallucination still has a non-AI meaning, and that dreams exist. As I see it, GPT models simply hallucinate more often and, more noticeably, in a different manner than humans. The hallucination frequency need not reach zero, only human equivalent or better, and GPT-4 is already much better than GPT-3.

I agree with everything else you said fully. "True" reasoning machines or not, society will be catastrophically destabilized. Amongst the chaos I expect plenty of "normal" conventional and nuclear war to go on.

ChatGTP3y ago

What's the difference between "hallucinating" and just getting something wrong?

Hallucinations are where you hear, see, smell, taste or feel things that appear to be real but only exist in your mind. Get medical help if you or someone else have hallucinations.

Why do we say LLMs hallucinate, and why do people keep parroting the same thing, "oh humans hallucinate"? Do we really hallucinate all the time? I can think of only one time a healthy adult hallucinates and it's not sitting around drinking a cup of tea.

SanderNL3y ago

The word has a slightly different meaning here (confabulation) and is more akin to the witness example: confidently telling what you think is true, when it turns out to be complete shit.

It’s not about humans intentionally lying or truly hallucinating like on drugs. It’s about confidently thinking they are right about something which turns out not to be true.

I mean, if I say it like this, it is like humanity’s core business.

Mike_123453y ago

> I had someone much more knowledgeable on this topic than myself claim ChatGPT and the like "understand" stuff. My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

ChatGPT models semantic relationships in the data. That's what your smart buddy means by "understanding". That is a high dimensional model of the data set which infers abstract semantic relationships / concepts. But he would not claim that those semantic relationships are exactly the same as any human interpretation of the data (which you refer to as the "real" response).

Language models also have limited reasoning abilities. They are capable of misunderstanding as much as understanding.

_8j503y ago

When I say understanding I mean the meaning and context around it. It is one thing to know how to respond to different inputs surrounding a context, it is another to understand why!

Now, for all the disagreements in this thread, I have not seen anyone claim it knows and reasons why things are the way they are. I don't disagree that it knows what to do for different inputs, but it is not processing the input and making a reasonable decision; it is more like mapping the input to the most approximate output. Is that incorrect of me to say?

Mike_123453y ago

> When I say understanding I mean the meaning and context around it.

Yes it is actually modelling the meaning and context around it. ChatGPT is a ~100 layer deep neural network that was trained specifically to solve "natural language understanding tasks". The term "understanding" is an important term in NLP research along with the concept of "semantic similarity".

I believe it is more powerful and ultimately (eventually) more dangerous than you fear it is.

PaulDavisThe1st3y ago

Not bad, but:

> Language models also have limited reasoning abilities.

They have no reasoning abilities at all.

visarga3y ago

But language itself does, it's not the model, it's the data that has this ability. It's in the language patterns. Humans use that too - 99.99% of what reasoning we do is just replaying older ideas adapted in context. Being truly original and improving on the best ideas is a rare thing.

Mike_123453y ago

"Language Models Perform Reasoning via Chain of Thought"

Posted by Jason Wei and Denny Zhou, Research Scientists, Google Research, Brain team

https://ai.googleblog.com/2022/05/language-models-perform-re...

p-e-w3y ago· 13 in thread

I'm quite skeptical of analyses like this one, because I doubt the metrics themselves. Emergence is something that is intuitively noticed by human observers. The desire to quantify everything then leads to the creation of (imperfect) metrics designed to capture what the observers already know. Those same metrics are then taken as the definition of the properties said to be emergent, and articles like this one are among the consequences of that choice.

The paper's claim is essentially "these metrics which appear to demonstrate emergence can be replaced by other metrics that also represent model behavior, but that do not have scale discontinuities, so emergence isn't a real phenomenon".

But an equally valid interpretation would be "none of these metrics actually capture the properties we are truly interested in". Which, given the complexity of what we are dealing with here, seems entirely reasonable. It's not like we suddenly learned how to accurately quantify performance at language tasks. The whole reason LLMs are so great in the first place is because traditional 'mechanical' language models suck so bad.

Centigonal3y ago

I think the claim that "these metrics which appear to demonstrate emergence can be replaced by other metrics that also represent model behavior, but that do not have scale discontinuities (emergence)" is really powerful.

That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.

i.e.: there may exist a bijection between, say, a step-function `can_do_arithmetic(size)` and a smooth, continuous function `arithmetic_skill_metric(size)`

If we can use continuous metrics to back out the step-function equivalents, that'll help us predict when and how to get particular abilities to "emerge."

For example: If a change results in a steeper slope on the continuous metric, we can predict it would cause the associated capability to emerge at relatively smaller model sizes.
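
A minimal sketch of that prediction, under loudly stated assumptions: `arithmetic_skill_metric` is modelled here as a made-up curve that rises linearly in log10(model size), and the 0.5 pass mark is arbitrary. Neither comes from the paper. Thresholding the smooth curve gives the step function, and a steeper slope moves the crossing to a smaller size.

```python
THRESHOLD = 0.5  # assumed pass mark, purely illustrative


def arithmetic_skill_metric(log10_size: float, slope: float, start: float = 8.0) -> float:
    """Hypothetical smooth metric: linear in log10(size) above `start`, clipped to [0, 1]."""
    return max(0.0, min(1.0, slope * (log10_size - start)))


def can_do_arithmetic(log10_size: float, slope: float) -> bool:
    """Step-function view of the same curve: has the metric crossed the threshold?"""
    return arithmetic_skill_metric(log10_size, slope) >= THRESHOLD


def emergence_point(slope: float, log10_sizes: list[float]):
    """Smallest listed size at which the thresholded ability appears, or None."""
    for size in log10_sizes:
        if can_do_arithmetic(size, slope):
            return size
    return None


log10_sizes = [8.0 + 0.5 * i for i in range(9)]  # 8.0, 8.5, ..., 12.0
shallow = emergence_point(0.125, log10_sizes)
steep = emergence_point(0.25, log10_sizes)
print(f"shallow slope crosses at 10^{shallow}, steep slope at 10^{steep}")
```

The point of the design is that `can_do_arithmetic` carries no information that is not already in the smooth curve plus a threshold, so any prediction about the step can be made from the slope.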

p-e-w3y ago

> That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.

Or it might mean that the metrics used are worthless for describing high-level model behavior. That's my whole point. Emergent behavior was observed, which is why we want those metrics so we can try to understand what is going on. But just because we have metrics that exhibit discontinuities at scales where humans observe emergence, doesn't mean that those metrics really represent the hard-to-define behavioral changes we have observed.

Centigonal3y ago

I'm not sure I understand. I think you're suggesting that the metrics currently being used to assess emergence of new capabilities in LLMs are imperfect and potentially worthless, but I don't understand what is missing.

Using the example of 4 digit multiplication in the source paper: The researcher wants to know if the model has developed the ability to multiply two four-digit integers, so they generate a battery of such problems, e.g. "what is 4363*1285? output only your answer." The metric is what percentage of the problems the LLM answers correctly.

This is pretty much the same way a human observer would identify the same emergent behavior, and also how we assess it in other humans. It's not some contrived metric that's detached from the emergent ability in question.
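
To make the metric concrete, here is a minimal sketch of exact-match scoring next to a per-digit score on the same answers. The model outputs below are invented for illustration, and `per_digit_accuracy` is a hypothetical helper, not the paper's metric.

```python
def exact_match(predicted: str, expected: str) -> float:
    """All-or-nothing score: 1.0 only if the whole answer is right."""
    return 1.0 if predicted.strip() == expected.strip() else 0.0


def per_digit_accuracy(predicted: str, expected: str) -> float:
    """Fraction of character positions that agree (shorter string padded with spaces)."""
    predicted, expected = predicted.strip(), expected.strip()
    width = max(len(predicted), len(expected))
    if width == 0:
        return 1.0
    pairs = zip(predicted.ljust(width), expected.ljust(width))
    return sum(1 for p, e in pairs if p == e) / width


expected = str(4363 * 1285)
# Hypothetical model outputs: one correct, one with a single wrong digit.
outputs = [expected, expected[:-2] + "6" + expected[-1]]

for out in outputs:
    print(out, exact_match(out, expected), round(per_digit_accuracy(out, expected), 3))
```

The near miss scores zero under exact match but most of the way to one per digit, which is the gap between the two kinds of metric the thread is arguing about.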

freehorse3y ago

> Emergence is something that is intuitively noticed by human observers.

The takeaway from this paper is actually that this says more about the human observers than the LLMs themselves. Humans observe "emergence" using some loose metrics themselves. Whether we formalise the metrics in a quantitative way or we stay in the realm of human intuition is not important; we still use some criteria to analyse things. In the former case these are more clear and amenable to critique; in the latter case they are obscure and can evade critique by moving goalposts and ground easily.

At the end of the day, whether we talk about quantification or just qualitative observations, it is the same phenomenon we observe with the same qualities. The problem is that human intuition uses a lot of discontinuous metrics; we judge whether an LLM passes or fails what we ask of it, but it is harder to judge the underlying tokenisation process in itself. For this reason, and considering the findings of this paper, the observations and claims of emergence in LLMs carry less value now, imo.

p-e-w3y ago

Calling human intuition "just another metric" ignores the fact that human intuition performs spectacularly well at many high-level tasks.

Whatever "insight" and "understanding" actually mean, there is no denying that they are immensely useful, which is why we want AIs that can replicate them.

When trying to understand complex systems that don't yield to quantitative analysis in an obvious way, the starting point should be to assume that the intuitive evaluation is (roughly) correct, until proven otherwise. Trying to cast this intuition into a simple metric and then using that metric (or other simple metrics) to demonstrate that the intuition is wrong is circular reasoning.

kelseyfrog3y ago

> human intuition performs spectacularly well at many high-level tasks.

It's also the number one most common cognitive bias. Humans are especially prone to reification - the confusion that the construction of a measure equates to an objective reality.

Humans often launder subjectivity through the creation of metrics without 1) knowing they've done so, and 2) staying calm when accused of having done so.

TheOnly923y ago

I agree that perhaps the metrics are not as useful themselves, but I think you're giving too little credit to the paper where maybe some credit is due.

I think the paper is correct that there are no "emergent abilities", i.e. abilities that might suddenly appear when the scale of the model is increased. And though it might not be accurate, the paper did make some effort to formalize this, and I think it is a good attempt to kind of prove the point.

However, as we recognize, there are still some weird discontinuities in which at one point the model is useless and suddenly it becomes very useful. This "discontinuity" IMHO is probably just perceptual, while the underlying metric is continuous.

JohnFen3y ago

> Emergence is something that is intuitively noticed by human observers.

The problem, of course, is that people's intuition is particularly awful for this sort of thing. We have a very strong tendency to anthropomorphize everything, and that illusion can be quite overpowering.

beders3y ago

> Emergence is something that is intuitively noticed by human observers.

If you can't quantify that, then it is just a hunch. What exactly do you think "observers already know"?

p-e-w3y ago

There are many fundamental qualities that can be observed but not quantified. General intelligence being one of them (intelligence tests measure some aspects of it, but not others).

If everything that is real could be quantified, we wouldn't need AI. Traditional computing is already absolutely phenomenal at dealing with quantifiable systems. The whole point of wanting "artificial intelligence" is because we don't know how to quantify the high-level properties of speech, thought, intellect, and consciousness. And not for lack of trying.

roguecoder3y ago

Lots of things are intuitively noticed by human observers, only some of which exist. We are heuristic pattern-finding machines who believe in ESP and fairies: why wouldn't we also find mythology in the machines we build?

amw-zero3y ago

You just described science, and why you don't believe in science.

p-e-w3y ago

If by "don't believe in science" you mean "don't believe that every metric claimed to be representing a phenomenon actually does represent that phenomenon", you are correct.

raydiatian3y ago· 10 in thread

“Emergent anything” is probably the most obnoxious buzzword in all of machine learning.

opportune3y ago

I’m sorry you find it obnoxious but emergent phenomena are everywhere in math and science and as annoying as it is to you, it also happens with AI.

The quest for more generalized models boils down to studying emergent behavior because we could never prescriptively define all the parameters/behaviors/requirements necessary for such a complex outcome. We don’t even understand how the relatively easily observed interactions between neurons in our own brains result in emergent intelligence.

What’s so impressive about LLMs is they understand the semantics of some concepts so well that they can consistently produce higher quality outputs for tasks like “explain this complicated concept with a nursery song from the perspective of a pirate” than humans could, with approximately no instances of that task in their training data. That is emergent behavior and it’s a pretty big deal.

raydiatian3y ago

I agree that emergent behaviors are real, and important.

I am skeptical, although not completely unconvinced, that LLMs like GPT are going to produce truly emergent phenomena, such as true first-principles logical reasoning. The limitations of the underlying transformer architecture itself are, in my opinion, the problem. The first problem is that the embedding space of the transformer needs to grow much, much larger, and it's already huge. This matters because you need to model the order of neurons in the brain. The second problem is that you're never going to train an LLM (as they're designed today) that is going to produce a truly good 'emergent-phenomena' answer without multiple network traversals. This is because the human mind constantly and autonomously refines its thoughts.

Perhaps a good counter-argument is that emergent phenomena are fundamentally a space-time domain concept.

I am aware that things like Conway's Game of Life are a fantastic counterargument to my "the transformer architecture doesn't support it" argument. But I agree that the definition of "emergent behavior" when it comes to machine learning is too easily corrupted to be novel rather than rigorous.

thomastjeffery3y ago

It gives finality to the idea that we don't understand the thing or where it came from.

Why? Did we just lose all interest in understanding things? Wasn't that the whole point in the first place?

Somehow, people are throwing up their hands and giving up at understanding the thing; yet at the same time they are acting like the thing will magically evolve into their wildest dreams!

The most fundamental feature of LLMs is that they cannot be literal. They can only infer, never define. Why is it that the people studying LLMs think they have to emulate that trait? It's like they are only allowing themselves to look at it as a mysterious black box: to infer its behavior from its results. Did they forget that they are the ones who wrote the damn thing?

amw-zero3y ago

The whole idea of machine learning / AI is to build functionality indirectly though, i.e. to build a system which evolves into another system over time. They are inherently meta-systems, so it does make sense to think of them differently.

opportune3y ago

Before blackbox AIs from deep learning there were basically a few different kinds of AI: one was basically “algorithm complicated enough that we thought it required intelligence”, another was “general problem solver” like you get by applying Constraint Satisfaction techniques and heuristics, and highly fine-tuned encodings of human knowledge and research (this is a decision tree clinicians use to perform a differential diagnosis of a fever, this is a function that finds edges in an image based on hand crafted CV algorithms). The first group is basically not AI, it was just assumed to be. The other two groups were fully explainable but required a ton of effort to get working outside of very tightly scoped situations. For a long time researchers thought that some combination of the two approaches would lead to more generalized models, but all attempts at morphing the two sucked ass because all knowledge had to be a hand crafted ontology of rules and atoms that could only explicitly encode relationships. Also, while computers can solve CSP/graph traversal algorithms impossibly fast compared to humans, those tasks are not good models for human cognition or tasks beyond stuff like crossword puzzles.

You should consider that despite considerable effort, human brains are themselves black boxes. And you know less about your own knowledge than you think. I do not know where I learned that Timbuktu is both a placeholder name and real place, though I could go find evidence for both. I don’t have to expend any effort to distinguish the sounds of different words, I don’t know why two things can both taste “good” but in completely different ways. Nobody ever taught me that newly met acquaintances tended to not care to discuss current events in the business world, I just figured it out based on a collection of experiences whose individual instances I can’t even remember. Even the best neuroscientist could not tell you why neurons interacting in a certain way makes it so I can both drive a car and sing, or why one person’s brain seems to be better at some tasks, or at generalized tasks, than another’s.

And, well, deep learning overturned the paradigm of handcrafting AI systems by automating the process of “have the model produce this output from this input” without requiring humans to define the “how” beyond tuning the shape of the model, which was itself a hugely important innovation in reducing the human time required to build an AI system. But it’s not just faster to make these models, it’s so ridiculously better at making AI models for things like “is there a dog in this picture” that nobody would even consider doing those things without deep learning.

You actually can fiddle with DNNs to get an idea of how they work similar to what we do with brains and CAT scans, you have it do some stuff with commonalities and you figure out which common parts get activated. This is easy to do with convolutional layers as they very commonly learn for themselves how to perform edge detection.
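
As a rough illustration of what such an edge-detecting filter does, here is a sketch that applies a hand-written Sobel kernel to a tiny image. The kernel is a stand-in chosen for this example, on the assumption that learned first-layer filters often resemble it; it is not extracted from any trained network.

```python
# Horizontal-gradient Sobel kernel: responds to left-to-right intensity changes.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    This is cross-correlation (no kernel flip), which is what deep learning
    libraries usually call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x6 image: dark left half, bright right half, so one vertical edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
response = correlate2d_valid(image, SOBEL_X)
for row in response:
    print(row)
```

The response is non-zero only in the columns whose window straddles the dark-to-bright boundary, which is the kind of activation pattern you would look for when probing a layer.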

Anyway, long story short, fully explainable AI utterly sucks ass at many tasks that are like a walk in the park for blackbox AI. And we cannot explain our own intelligence and knowledge except in terms of emergent phenomena, nor can we give the full provenance of some factoid or skill we have on demand (just like an LLM cannot tell you where it learned something) in many cases[0], so it seems reasonable that we’d be in the same situation with AI.

[0] The main difference is that we have memory of the various discrete experiences of our lives (which we can associate with some knowledge or skill), and there is no binary separation between “learning mode” and “doing mode” or “active memory” and “long term memory” for us like with AI. We can definitely associate some knowledge with a particular event, but this seems like it could be a false ontological representation of our knowledge because if the knowledge and event were unimportant (like what you had for breakfast on a particular day) we’d forget both of them; it’s actually all the subsequent cases in which the knowledge and memory of the event came in handy that contribute to us being able to explain it.

thomastjeffery3y ago

Most of the confusion here stems from the abuse of the word, "AI". That is the goal, and nothing else. "AI" does not (yet) exist. Every time we call something "an AI", we are telling a lie; and that lie turns the entire discussion away from logic and reason into magic and nonsense.

When we are dealing with a system that is made of logic and reason, we can use logic and reason to construct an understanding of that system. This is the explicit approach to understanding.

When we are dealing with a mysterious black box, we must take the implicit approach: using testing and inference to construct a model, we can construct an explicit understanding of that model. This is effectively the same process, but one step removed: we understand our model, not the system it applies to. That model may be incomplete and/or misaligned.

The human mind is a mysterious black box. We have made a lot of progress modeling that system, but our models are not complete or perfectly aligned.

While our study of the human mind is limited to the implicit approach, the human mind itself is capable of both implicit and explicit understanding.

So far, no software has been able to emulate that feature. Every tech that exists today uses only one of the two approaches.

> Anyway, long story short, fully explainable AI utterly sucks ass at many tasks that are like a walk in the park for blackbox AI.

What you have called "fully explainable AI" is any tech that uses the explicit approach. Because everything is explicitly defined, there is a clear place for logic to exist as part of the system. Because everything is explicitly defined, there is no room for ambiguity in the system.

What you have called "blackbox AI" is any tech that uses the implicit approach. Because nothing is literally defined, ambiguity can exist in the system. Because nothing is explicitly defined, there is no clear place for logic to exist in the system.

But is it really a black box? The program itself is explicitly defined! We should be able to use the explicit approach to understand it, just like we do any other software.

Mike_123453y ago

> Somehow, people are throwing up their hands and giving up at understanding the thing

Go tell that to the researchers who are working hard on studying emergent properties.

thomastjeffery3y ago

Yes: the effects of the thing, and not the thing itself.

joaquincabezas3y ago

all you need is emergent anything

raydiatian3y ago

Emergency emergent emergence

stilist3y ago· 3 in thread

I have zero technical understanding of the math or statistics, but looking at the graphs it seems suspicious that supposed jumps happen across unrelated tasks and models at the same scales--for example, in figure 1, the discontinuities are consistently in the 10^22 to 10^24 range. Obviously I'm just going by what the authors have chosen to include, but I'd expect more variation. At best I'd assume it's something about LLMs in general.

reubenmorais3y ago

The number of data points is tiny. There's only a handful of LLMs trained from scratch in the world, and sizes of models released in a "generation" tend to be close to each other somewhat. The field is very open source so people all over are building on top of the same shared literature. Plus I'm sure there are leaks very often and companies then rush to train their own pet architecture to whatever parameter size the competition is about to release.

kurthr3y ago

I think that's just because there are only 2-3 points between 10^22 and 10^24, which is more about the data available (and that they have just seen dramatic improvements) than the measures or models themselves.

mercer3y ago

Could that be something to do with the things I keep reading about how somehow knowledge from, say, an LLM for generative text somehow carries over (in some way) to an LLM for image generation? I'm obviously not very knowledgeable in this area :).

usgroup3y ago· 3 in thread

I think that part of the reason conclusions about emergence are tenable is the opaque nature of transformer architectures.

For example if it was possible to train a Hidden Markov Model with billions of hidden states on a trillion tokens, you could more literally look and see what was going on.

Other than not being able to scale HMMs to this kind of scale, is there any good reason to believe they would not perform equally well but without the magic?

kolinko3y ago

You're using billions and trillions loosely here.

The number of hidden states in an HMM would be (num_tokens ^ context size), so something like 60000^2000.
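
For a sense of scale, a quick sketch that works in logarithms rather than building the number itself. `log10_state_count` is a hypothetical helper, and the formula is simply the vocabulary-size-to-the-context-length estimate from the comment above.

```python
import math


def log10_state_count(vocab_size: int, context_length: int) -> float:
    """log10 of vocab_size ** context_length, computed without the huge integer."""
    return context_length * math.log10(vocab_size)


exponent = log10_state_count(60_000, 2_000)
print(f"60000^2000 is roughly 10^{exponent:.0f}")

# For comparison, "billions" and "trillions" of states are 10^9 and 10^12.
print(exponent > 12)
```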

usgroup3y ago

I'm not sure that calculation is correct, but say it is, perhaps a Variable Length Markov Chain then.

kolinko3y ago

Variable length markov chains would merge some states, sure, but it will still be a similar order of magnitude.

With anything longer than 4 tokens/words of context you bump into 30k^4 to 30k^10 states, cross the billion/trillion state boundary, and lose any chance of using markov chains.

Also - but here I may be wrong - there is no way to "train" markov chains to do generalisations - that is, if a given sentence didn't appear on the internet, it won't be available as a state for the chain. In this aspect they are more similar to a database than anything else.
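
A toy sketch of the database-like behaviour described here, assuming a plain count-based bigram chain with no smoothing or backoff (techniques that exist but are left out of this illustration). The function names and the corpus are made up for the example.

```python
from collections import defaultdict


def train_bigram_chain(tokens):
    """Count how often each token is followed by each other token."""
    transitions = defaultdict(lambda: defaultdict(int))
    for current, following in zip(tokens, tokens[1:]):
        transitions[current][following] += 1
    return transitions


def next_token_distribution(transitions, token):
    """Normalised next-token probabilities, or {} if the token was never seen as a context."""
    counts = transitions.get(token)
    if not counts:
        return {}
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


corpus = "the cat sat on the mat".split()
chain = train_bigram_chain(corpus)

seen = next_token_distribution(chain, "the")
unseen = next_token_distribution(chain, "dog")
print(seen)    # contexts from the corpus have a distribution
print(unseen)  # a context absent from the corpus has nothing to look up
```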

modeless3y ago· 2 in thread

The title of this paper is misleading. They are not arguing that the abilities are a mirage. They are arguing that the sudden ("emergent") appearance of unexpected abilities is not actually sudden, but gradual and predictable with model scale, if measured in an improved way.

TheDudeMan3y ago

"Emergent" doesn't mean sudden. (That's not on you but on them.)

Ygg23y ago

It does mean that certain properties not found in the constituents are found in the greater system (pilots in game of life can't do addition but can be used to make an adder that counts values).

ianbutler3y ago· 2 in thread

So if I'm reading this halfway correctly, quality isn't suddenly emergent, it's continuous and gradual based on size of the model. It only appears emergent when researchers pick bad metrics.

I, and I assumed a lot of people, already thought performance was a function of model size (# of parameters). Is this not what the prevailing thought is for DNN performance?

Agreed with the other posters that this title is misleading.

freehorse3y ago

> I, and I assumed a lot of people, already thought performance was a function of model size (# of parameters).

I guess the disagreement has been in whether this function is "continuous" or not.

I do not think the title is misleading, considering the article responds to quite specific claims in other articles. I agree it sounds misleading if you do not put it in that context.

sgt1013y ago

I think most people expect(ed) that performance vs. size would be (is) an S-curve. The surprise for most is that we have climbed up the slope so far and so fast. What the shape is, is not clear to me.

cjbprime3y ago· 2 in thread

As others have said, it's an awful title. Could instead be something like "Is the emergence aspect of Emergent Abilities in Large Language Models a Mirage?".

Like, there's supposed to be nothing academic researchers like more than re-using the same word in a title, or making it into a clever pun or quip -- it's like the Dad Jokiest subfield -- but instead we just get a title that implies one common argument that people make, and actually delivers an unpredictable different argument that seems plausible but not necessarily interesting.

mirekrusin3y ago

"Emergent Abilities in LLMs are not spontaneous"

jaidhyani3y ago

Could have gone with "More Comprehensive Metrics Are All You Need"

thomastjeffery3y ago· 1 in thread

The mistake is ascribing these abilities to the model itself, and not the content being modeled.

Text contains more data than language. Large Language Models work implicitly: they are not limited to finding language-specific patterns in the text that they model.

Humans look at LLMs through a lens of expectation. Any time we find a feature we did not expect, we categorize it after-the-fact. That's our biggest mistake: LLMs are not made of categories!

ChatGTP3y ago

This is a very interesting way of looking at it...

Animats3y ago· 1 in thread

OK, the system improves with scale. For some metrics which have thresholds of success, that looks like a discontinuity. But the discontinuity comes from the metric, not the improvement.

Anything measured by "winning" has this property. Small changes near the "winning" threshold result in large changes in wins. This is well known in sports.

Is there more to this issue than this amplification effect?

nintendo18893y ago

replying to your old message [1]. The openqnx monartis source code is on github:

https://github.com/vocho/openqnx

[1] https://news.ycombinator.com/item?id=26255095

derrickrburns3y ago

Here is an analogy.

Two softball players. One hits the ball 230 feet on average. The other hits the ball 210 feet on average. The homerun fence is at 220 feet.

One is considered a GREAT homerun hitter. The other is considered a poor one.

The measure is non-linear.

That takes nothing away from the GREAT homerun hitter.
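A small Python sketch of the analogy, for anyone who wants numbers. The 230 ft and 210 ft averages and the 220 ft fence come from the comment; the normal distribution and the spread values (5 ft and 15 ft) are assumptions added here purely for illustration.

```python
import math

def home_run_rate(mean_ft: float, sd_ft: float, fence_ft: float = 220.0) -> float:
    """P(distance > fence) when hit distance is normally distributed.

    The normal model and sd_ft are illustrative assumptions, not data.
    """
    z = (fence_ft - mean_ft) / sd_ft
    # 1 - Phi(z), written with the error function from the standard library.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

for sd in (5.0, 15.0):
    strong = home_run_rate(230.0, sd)
    weak = home_run_rate(210.0, sd)
    print(f"sd={sd:>4} ft: 230 ft hitter {strong:.3f}, 210 ft hitter {weak:.3f}")
```

The two averages differ by less than 10%, but the thresholded "home run" measure separates the hitters far more sharply, and the tighter the spread, the sharper the split.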

tunesmith3y ago

I've always taken emergence as just a word from the perspective of the beholder. It isn't anything essential to the thing itself. If you understand a complex system enough, emergence goes away and it's reductive again. But that's not to say that emergence as a concept isn't useful. It's very much about our relationship to our discoveries and how much we understand them.

6gvONxR4sf7o3y ago

This is interesting. There's another implication here. That reliability/usefulness is an "emergent" phenomenon as underlying abilities become more accurate.

It's the difference between Siri not understanding you 1 word out of 10 (very accurate!) and it basically just understanding you. It's a continuous accuracy function and a discontinuous usefulness function.
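The accuracy-versus-usefulness gap can be sketched in a few lines of Python. This assumes, as a simplification, that word errors are independent and that an utterance is only "understood" when every word is right; the function name and the accuracy values are illustrative.

```python
def all_words_correct(per_word_accuracy: float, n_words: int) -> float:
    """Chance that every word in an utterance is right, assuming
    independent per-word errors (a simplifying assumption)."""
    return per_word_accuracy ** n_words

for accuracy in (0.90, 0.99, 0.999):
    whole = all_words_correct(accuracy, 10)
    print(f"per-word accuracy {accuracy:.3f} -> 10-word utterance fully correct {whole:.3f}")
```

Per-word accuracy moves smoothly from 0.90 to 0.99, while the all-or-nothing measure for a 10-word utterance jumps from roughly a third to roughly nine in ten.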

seydor3y ago

People are right to doubt the claims of notOpenAI and others about the capabilities of their models. The nonlinear output gains do not mean that the quest for intelligence is over. It's already hard to steer them with RL to do proper math. It's more likely that the transformer will only be a part of a larger architecture.

derrickrburns3y ago

Here is an analogy.

Two softball players. One can hit the ball an average of 230 feet, 40% of the at bats. The other can hit the ball an average of 210 feet, 40% of the at bats. The homerun wall is 220 feet.

One is a GREAT homerun hitter. The other has a poor batting average.

The issue is that the success measure is non-linear.

colordrops3y ago

Any phenomenon that is not a fundamental property of reality is a mirage, or rather a fuzzy human construct on top of a conglomeration of phenomena without discrete boundaries. And even those "fundamental" properties are suspect.

pcrh3y ago

I'm not a mathematician, but it appears to me that "emergent" properties are being defined as those which do not appear in a minor form below a threshold.

However, many natural phenomena that are fully explainable from first principles show this property, giving rise to sigmoidal "S-curves", as shown in Figure 1.

m3kw93y ago

Is it that they don’t understand how the models derive outputs like “step by step reasoning”, and then say this is an emergent behaviour?

theonlybutlet3y ago

Is human consciousness a mirage? (I'd say yes, a complex arrangement of much simpler things).

Gordonjcp3y ago

Aren't LLMs basically just Eliza with a huge a priori dataset?


_8j503y ago· 17 in thread

These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.

What concerns me very much is how the harms that can be caused by LLMs have been so greatly underreported.

And we aren't even talking about war time use of LLMs or what happens when you marry something like ChatGPT with Dall-E and make it all real-time.

And why are all the smart journalists asleep on the job on this topic. Where are the unreasonable scaremongerers when you need them!

nl3y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

This doesn't seem to make sense. If anything the opposite is true - if the things that are hallucinated make sense (even if not true) it means there is some "understanding" or a world model.

_8j503y ago

No, hallucinations are similar to but not quite the expected result which shows they are approximations.

Understanding is a 3-year-old child drawing terribly; approximation is drawing really well but all wrong.

All this applies to LLMs; I used Dall-E because pictures are easier to talk about.

nl3y ago

I haven't used Dall-E much but I've never seen Stable Diffusion or Midjourney make an error like that, unless of course deliberately prompted.

You can see this because of the big deal people made about image gen tools getting hands wrong: it was the most significant error that was systematically occurring.

p-e-w3y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything

According to your argument, this would imply Grothendieck did not understand what a prime number is. Which is laughable.

_8j503y ago

vixen993y ago

https://en.wikipedia.org/wiki/Braitenberg_vehicle

Samuel Johnson: “Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully.”

deafpolygon3y ago

Humans are not infallible. It's a mistake to anthropomorphize LLMs.

SanderNL3y ago

LLM wrong => unredeemable and fatally flawed

human wrong => no problem, mistakes happen

flangola73y ago

> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.

> These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.

ChatGTP3y ago

What's the difference between "hallucinating" and just getting something wrong?

Hallucinations are where you hear, see, smell, taste or feel things that appear to be real but only exist in your mind. Get medical help if you or someone else have hallucinations.

SanderNL3y ago

The word has a slightly different meaning here (confabulation) and is more akin to the witness example: confidently telling what you think is true, but it turned out to be complete shit.

It’s not about humans intentionally lying or truly hallucinating like on drugs. It’s about confidently thinking they are right about something which turns out not to be true.

I mean, if I put it like this, it sounds like humanity’s core business.

Mike_123453y ago

Language models also have limited reasoning abilities. They are capable of misunderstanding as much as understanding.

_8j503y ago

When I say understanding I mean the meaning and context around it. It is one thing to know how to respond to different inputs surrounding a context, it is another to understand why!

Mike_123453y ago

> When I say understanding I mean the meaning and context around it.

I believe it is more powerful and ultimately (eventually) more dangerous than you fear it is.

PaulDavisThe1st3y ago

Not bad, but:

> Language models also have limited reasoning abilities.

They have no reasoning abilities at all.

visarga3y ago

Mike_123453y ago

"Language Models Perform Reasoning via Chain of Thought"

Posted by Jason Wei and Denny Zhou, Research Scientists, Google Research, Brain team

https://ai.googleblog.com/2022/05/language-models-perform-re...

p-e-w3y ago· 13 in thread

Centigonal3y ago

That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.

i.e.: there may exist a bijection between, say, a step-function `can_do_arithmetic(size)` and a smooth, continuous function `arithmetic_skill_metric(size)`

If we can use continuous metrics to back out the step-function equivalents, that'll help us predict when and how to get particular abilities to "emerge."

For example: If a change results in a steeper slope on the continuous metric, we can predict it would cause the associated capability to emerge at relatively smaller model sizes.
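A rough Python sketch of that prediction idea. It assumes, purely for illustration, that the continuous metric grows linearly in log10 of model size and that the ability counts as "emerged" once the metric passes a fixed threshold; the slopes, intercept and threshold are made-up numbers, not measurements from any model.

```python
import math

def skill(size: float, slope: float, intercept: float) -> float:
    """Hypothetical continuous metric: linear in log10(model size)."""
    return slope * math.log10(size) + intercept

def emergence_size(slope: float, intercept: float, threshold: float) -> float:
    """Model size at which the hypothetical metric reaches `threshold`."""
    if slope <= 0:
        raise ValueError("metric must improve with size for a crossing to exist")
    return 10 ** ((threshold - intercept) / slope)

shallow = emergence_size(slope=0.1, intercept=0.0, threshold=1.0)
steep = emergence_size(slope=0.2, intercept=0.0, threshold=1.0)
print(f"shallow slope crosses the threshold at ~{shallow:.0e} parameters")
print(f"steep slope crosses the threshold at ~{steep:.0e} parameters")
```

Under these assumptions, doubling the slope moves the threshold crossing to a much smaller model, which is the kind of prediction the step-function view alone cannot make.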

p-e-w3y ago

> That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.

Centigonal3y ago

freehorse3y ago

> Emergence is something that is intuitively noticed by human observers.

p-e-w3y ago

Calling human intuition "just another metric" ignores the fact that human intuition performs spectacularly well at many high-level tasks.

Whatever "insight" and "understanding" actually mean, there is no denying that they are immensely useful, which is why we want AIs that can replicate them.

kelseyfrog3y ago

> human intuition performs spectacularly well at many high-level tasks.

It's also the number one most common cognitive bias. Humans are especially prone to reification - the confusion that the construction of a measure equates to an objective reality.

Humans often launder subjectivity through the creation of metrics, 1) without knowing they've done so, and 2) becoming emotional when accused of having done so.

TheOnly923y ago

I agree that perhaps the metrics are not as useful themselves, but I think you're giving too little credit to the paper where maybe some credit is due.

JohnFen3y ago

> Emergence is something that is intuitively noticed by human observers.

beders3y ago

> Emergence is something that is intuitively noticed by human observers.

If you can't quantify that, then it is just a hunch. What exactly do you think "observers already know"?

p-e-w3y ago

There are many fundamental qualities that can be observed but not quantified. General intelligence being one of them (intelligence tests measure some aspects of it, but not others).

roguecoder3y ago

amw-zero3y ago

You just described science, and why you don't believe in science.

p-e-w3y ago

If by "don't believe in science" you mean "don't believe that every metric claimed to be representing a phenomenon actually does represent that phenomenon", you are correct.

raydiatian3y ago· 10 in thread

“Emergent anything” is probably the most obnoxious buzzword in all of machine learning.

opportune3y ago

I’m sorry you find it obnoxious but emergent phenomena are everywhere in math and science and as annoying as it is to you, it also happens with AI.

raydiatian3y ago

I agree that emergent behaviors are real, and important.

Perhaps a good counter-argument is that emergent phenomena are fundamentally a space-time domain concept.

thomastjeffery3y ago

It gives finality to the idea that we don't understand the thing or where it came from.

Why? Did we just lose all interest in understanding things? Wasn't that the whole point in the first place?

Somehow, people are throwing up their hands and giving up at understanding the thing; yet at the same time they are acting like the thing will magically evolve into their wildest dreams!

amw-zero3y ago

opportune3y ago

thomastjeffery3y ago

When we are dealing with a system that is made of logic and reason, we can use logic and reason to construct an understanding of that system. This is the explicit approach to understanding.

The human mind is a mysterious black box. We have made a lot of progress modeling that system, but our models are not complete or perfectly aligned.

While our study of the human mind is limited to the implicit approach, the human mind itself is capable of both implicit and explicit understanding.

So far, no software has been able to emulate that feature. Every tech that exists today uses only one of the two approaches.

> Anyway, long story short, fully explainable AI utterly sucks ass at many tasks that are like a walk on the park for blackbox AI.

But is it really a black box? The program itself is explicitly defined! We should be able to use the explicit approach to understand it, just like we do any other software.

Mike_123453y ago

> Somehow, people are throwing up their hands and giving up at understanding the thing

Go tell that to the researchers who are working hard on studying emergent properties.

thomastjeffery3y ago

Yes: the effects of the thing, and not the thing itself.

joaquincabezas3y ago

all you need is emergent anything

raydiatian3y ago

Emergency emergent emergence

stilist3y ago· 3 in thread

reubenmorais3y ago

kurthr3y ago

mercer3y ago
