Using the example of four-digit multiplication from the source paper: the researcher wants to know whether the model has developed the ability to multiply two four-digit integers, so they generate a battery of such problems, e.g. "what is 4363*1285? output only your answer." The metric is the percentage of problems the LLM answers correctly.
This is pretty much the same way a human observer would identify the same emergent behavior, and also how we assess it in other humans. It's not some contrived metric that's detached from the emergent ability in question.
Remember that, in this case, all I want to know is whether they have that capability. I don't care whether they can tell me how multiplication works, only whether they can multiply two arbitrary numbers. (Here the analogy to people breaks down, because people can walk you through their reasoning, whereas for an LLM, explaining how multiplication works and performing multiplication are very different tasks.)
There are weaknesses to this approach: for each problem you get a binary yes/no, and you have no idea how close they got. You can't tell if they just made a small math error or if they don't even know what numbers are. Going back to the LLM setting, this is why a continuous metric is more useful than one that exhibits step-function behavior.
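To make the yes/no versus "how close" distinction concrete, here is a minimal sketch comparing exact match with a per-digit score. The per-digit score is my own simple stand-in for a continuous metric, not necessarily the one the paper uses, and the function names are hypothetical.

```python
def exact_match(pred: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the whole answer is right."""
    return 1.0 if pred == target else 0.0


def digit_accuracy(pred: str, target: str) -> float:
    """Fraction of positions that agree, comparing right-aligned strings.

    Illustrative continuous metric (an assumption of this sketch).
    The shorter string is left-padded with '?', which never matches a digit.
    """
    width = max(len(pred), len(target))
    if width == 0:
        return 1.0  # two empty strings agree trivially
    p = pred.rjust(width, "?")
    t = target.rjust(width, "?")
    return sum(a == b for a, b in zip(p, t)) / width


target = str(4363 * 1285)   # the example problem from the comment above
near_miss = "5606465"       # one digit wrong
nonsense = "banana"         # not even a number

for answer in (target, near_miss, nonsense):
    print(answer, exact_match(answer, target), digit_accuracy(answer, target))
```

Exact match gives the near miss and the nonsense answer the same score of 0, while the per-digit score separates them.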
Right, but unless you know something about their process, the only way you can determine whether they have the ability to (correctly) multiply two arbitrary four-digit numbers is to have them demonstrate multiplying every combination of four-digit numbers. One can easily imagine a system that gets most answers correct but fails certain cases (e.g. only carrying between pairs of digits).
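To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes test problems are drawn uniformly at random and independently, which is my assumption and not something the paper states; `max_failure_rate` is a hypothetical helper name.

```python
# Size of the full test space: every ordered pair of four-digit integers.
n_operands = 9999 - 1000 + 1      # 9000 four-digit integers
ordered_pairs = n_operands ** 2   # 81,000,000 problems


def max_failure_rate(n_passed: int, confidence: float = 0.95) -> float:
    """Largest true failure rate still consistent with n_passed successes in a row.

    Solves (1 - f) ** n_passed = 1 - confidence for f, assuming
    independent, uniformly sampled test cases (an assumption of this sketch).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_passed)


print(f"exhaustive test size: {ordered_pairs:,}")
for n in (50, 1_000, 100_000):
    print(f"{n:>7} passes in a row -> failure rate could still be up to "
          f"{max_failure_rate(n):.6f}")
```

Under these assumptions, several dozen correct answers only bound the failure rate to a few percent, and the bound says nothing about whether the failures are concentrated in specific structured cases.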
The reason a non-exhaustive test (using only several dozen examples) works to some degree with schoolchildren is that we know something about their method: the algorithm for multiplication they've been exposed to (the one they're explicitly being taught) is an algorithm that we know to be correct.
That has a very large effect on the likelihood of different failure modes and the evidence required to have a particular degree of confidence.
Furthermore, for "system 2" activities (to borrow the term from Kahneman), we can reasonably expect a person's description of their process to match the process they actually performed. (There are exceptions: people will produce incorrect post hoc explanations for their behavior under sufficient duress when "I don't know" isn't perceived as an acceptable answer.) But I'm personally not aware of any reason to believe this about LLMs. I don't know why the network's actual process for performing multiplication should have anything to do with the text it produces after the fact when asked to explain its work.
So I'm not debating that a reasonable way to guess whether a person can (correctly) multiply two numbers is to "ask them to multiply two numbers and see if they get it right"; I'm disagreeing that this works with LLMs.
Did Kahneman get into this in his book?
Also, as for the point of contention, I think if you tell the LLM to show its work in mathematically formal notation, it's far more likely to produce correct answers (I think there was a post on here in the last week or so demonstrating that?). I think this makes the comparison to humans fairer, because humans do some sort of intermediate math in their heads for anything beyond trivial problems, and an LLM needs to be able to write out its steps explicitly to compete fairly with that (I speculate/propose).
The model doesn't "intend" to multiply two numbers together: it has been equipped to parrot what a person asked to multiply two numbers together might say. If we asked it how it performs multiplication, it is going to produce a process it thinks a human would claim to use to perform multiplication, but that doesn't mean it can actually apply that process.
The claimed "emergent" property is that building a model that can successfully fool humans into thinking they are talking to a person quasi-magically involves it becoming "good" (for some measure of "good") at the cognitive tasks humans are capable of. This paper suggests that the measures of "goodness" researchers have been using make the gains on those cognitive tasks look more dramatic than they would be if measured via linear metrics.
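A toy illustration of how a non-linear metric can make smooth gains look dramatic: if each token of an answer is right with probability p, and tokens are right or wrong independently (a simplifying assumption of mine), then getting all L tokens right has probability p ** L. The function name and the answer length of 7 are hypothetical choices for this sketch, not results from the paper.

```python
def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """Chance that every token is correct, assuming independent tokens."""
    return per_token_acc ** answer_len


ANSWER_LEN = 7  # hypothetical: e.g. a seven-digit product, one token per digit

# Per-token accuracy improves in equal steps; exact match does not.
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"per-token accuracy {p:.1f} -> "
          f"exact match {exact_match_rate(p, ANSWER_LEN):.3f}")
```

The per-token column rises in even steps while the exact-match column stays near zero for most of the range and then shoots up, which is the kind of curve that reads as a sudden jump.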
I suspect some of the disconnect expressed in these comments comes from the participatory nature of being lied to by these models. The reader is a full participant in creating meaning from the output of a model. Even when the improvement in models is linear, our willingness to suspend our disbelief is not. This is especially true when we want to be fooled: the model only has to avoid anything that would jar us out of our belief, rather than proactively and repeatably succeed at cognitive tasks, and that is a non-linear measure.
Citation needed.
When using GPT-2 vs GPT-3 vs GPT-4, a human can easily tell that each is leaps and bounds "better" than its predecessor, with "deeper" understanding of the input and "more human-like" responses and reasoning. There is a strong impression that those models aren't just progressing along a scale, but changing qualitatively.
I simply doubt that any of the proposed metrics captures this intuitively observable quality. Furthermore, I claim that it is this quality that actually matters. We don't need a language model in order to multiply two numbers. Any clearly defined and algorithmically solvable (and thus readily quantifiable) task is trivial for regular software.
I agree with you about all of this.
> I simply doubt that any of the proposed metrics captures this intuitively observable quality. Furthermore, I claim that it is this quality that actually matters.
I think there are important qualitative observations about LLMs that create understanding and guide future research, but I think it's bad science/engineering to rely on some indescribable quality of "goodness" when it comes to developing new models.
Metrics may not be perfect, but they are all we have. You could train a model, send it to a bunch of human crowdworkers, and ask them to rate it on how good it is (Anthropic does this!), but the result of that is... another metric.
That said, yeah, I think we definitely could use better metrics. OpenAI agrees, which is why they're pushing their Evals project super hard. The paper we're commenting on agrees, which is why they're proposing alternatives to these step-function metrics.
> We don't need a language model in order to multiply two numbers. Any clearly defined and algorithmically solvable (and thus readily quantifiable) task is trivial for regular software.
A language model multiplying two numbers is the whole point of emergence.
Transformers were designed to translate between different natural languages and trained on completing sentences with some words masked out. As we scaled them bigger and bigger, we found that the same model architecture was suddenly capable of more than completing sentences: it was answering questions about high school geography (MMLU), writing computer programs, and - yes - doing basic arithmetic. New capabilities were emerging with scale.
This means that a "language model," at sufficient scale, is more than a language model. This is the closest thing we have right now to generalized AI, where one model can complete a variety of disparate tasks. The arithmetic thing is exciting and unexpected because of what it represents, and understanding it should be a priority, even if we have better non-ML ways of doing that particular task.
A metric that is an aggregate of human intuition is not the same as the usual metrics though. It's just a semi-formalized way to capture those intuitive observations, rather than trying to replace them with much-simpler piecemeal mechanical evaluations.
> but I think it's bad science/engineering to rely on some indescribable quality of "goodness" when it comes to developing new models.
IMO, bad science is what is currently happening across the entire field. Astonishing high-level behavior is being observed from models, but the tools to analyze it don't exist, so instead, people are pushing out papers at a record pace that analyze every low-level property imaginable, as if such analysis would eventually yield high-level insights.
It's okay to not know. I wish every paper dealing with LLMs would start and end with the sentence "Overall, we have no idea what is happening." Instead, we get papers that add a few numbers and then wax philosophical about how LLMs supposedly do things (I'm exaggerating here, but the gist is accurate). Not a day goes by without a new article claiming that LLMs have reached their limit or similar, while we have no clue how they even work! This is really bad science, and yes, I know that much of it is coming from non-experts, but I've seen lots and lots of experts contribute to this nonsense by making completely unjustified claims of a similar nature.
I don't think that follows. We only get to see GPT-2, 3, 4, not 2.5, 3.33, 3.95. We have no way to assess whether LLM performance is continuous or discontinuous.
It's like if you gave me three different cars with 50 horsepower, 100 horsepower and 250 horsepower. I'd be tempted to say that these cars aren't "progressing along a scale" but "showing radical leaps in performance," when in fact top speed does scale with engine power.
To be fair though, the paper's discussion says they are not claiming that models cannot display any emergent abilities, just that some may be mirages.