Compare, "I like what you were wearing", "Pass me the salt", and "Have you been to London recently?" as generated by an LLM and as spoken by a person.
What is the reason each piece of text (in a whatsapp chat, say) is provided?
When the LLM generates each word, it does so because that word is, on average, the most common one in the corpus of text on which it was trained: "wearing" follows "I like what you were" because most people having these conversations, as captured in the training data, were talking about clothes.
When a person types those words on a keyboard, the following are the causes: the speaker's mental states of recollection, preference, taste; the speaker's affective/attachment states with respect to their friend; the speaker's habituation to social cues; the speaker's imagining through recall what their friend was wearing; the speaker's ability to abstract from their memories into identifying clothing; and so on.
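To make the "most common word in a corpus" picture concrete, here is a minimal sketch of a purely count-based predictor. The toy corpus and the function name `most_common_next` are invented for illustration; this shows the frequency-lookup idea described above and is not a claim about how any deployed LLM is implemented.

```python
from collections import Counter

# Toy corpus; the sentences are invented for illustration only.
corpus = [
    "i like what you were wearing",
    "i like what you were wearing yesterday",
    "i like what you were saying",
    "pass me the salt",
]

def most_common_next(prefix, sentences):
    """Return the word that most often follows `prefix` in `sentences`, or None."""
    prefix_words = prefix.lower().split()
    n = len(prefix_words)
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Slide over the sentence and count the word after each match of the prefix.
        for i in range(len(words) - n):
            if words[i:i + n] == prefix_words:
                counts[words[i + n]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(most_common_next("I like what you were", corpus))
```

A lookup like this returns nothing at all for a prefix it has never seen, which is the limitation both sides of the argument keep coming back to.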
Indeed, the cause of a person speaking is so vastly different to generating a word based on a historical frequency, that to suppose these are related seems incomprehensible.
The only reason the illusion of similarity is effective is because the training data is a text-based observation of the causal process in people: the training data is distributed by people talking (and so on). Insofar as you cannot just replay variations on these prior conversations, the LLM will fail and expose itself as actually insensitive to any of these things.
I'd encourage credulous fans of AI not to dehumanize themselves and others by the supposition that they speak because they are selecting an optimal word from a dictionary based on all prior conversations they were a part of. You aren't doing that.
> the most common word in a corpus of text on which it was trained
I think you are downplaying the fine grains of knowledge that can be encoded in a huge corpus of text. LLM-s are capable of taking context into account and encoding that too, not simply how often word A comes after B.
When I'm in a conversation I'm also selecting the optimal word from a predefined dictionary. That's precisely what it's like to speak in a given language. Sure, I'm thinking a bit ahead and I can tap into my memory, feelings and experiences, which influence everything.
But the optimal part is derived from context for me, it changes which word I use when I talk to a colleague, family or friend, but I might want to say the same thing. For stock LLM-s everything must be defined in the prompt if we are talking about zero-shot inference.
These models are offering good insight into how language works and I don't find that too dehumanising. There's still plenty of room for me to be human and do non-AI things.
I get the notion that if we understand fully how something works the magic is gone, this always happens to AI. Are we afraid that this might happen to us too?
I've experienced LLMs lacking spatial awareness, such as switching locations in a description despite no indication of moving to the new location. The same applies to other concepts that have a visual/spatial component.
I've also experienced LLMs struggling to get subtext, some metaphors, etc., especially when used in casual conversations instead of as a question/answer style prompt.
LLMs are great, but need more work to fix these gaps.
No, I'm more frustrated by the pseudoscience that models of frequency associations in text are explanations of people (or anything else). The choice isn't between a pseudoscientific behaviouralism where animals have no presence in the world, no mental faculties, and so on vs. "magic".
> When I'm in a conversation I'm also selecting the optimal word from a predefined dictionary
Consider it this way: the probability distribution over all possible words for you speaking is parameterized on space and time: Pyou(x, t; ...). And for the LLM generating text, Pllm(historical data).
So imagine plotting the live probability distributions of Pyou and Pllm for any given situation. As you think, imagine, move, recall, prefer, desire... the Pyou goes "wild" with dramatic discontinuous shifts in distribution brought about by these causes.
Whereas the Pllm remains the same. It never changes. It never reacts to anything at all.
The whole distribution over all prior text tokens, Pllm is a stationary model of frequency associations. Yours is not. This makes all the difference in the world when claiming that Pllm somehow models, or is even relevant to, Pyou.
In order to “predict the next word”, the LLM doesn’t just learn the most likely word from a corpus for the preceding string. If that were true, it would not generalise outside of its training set.
The LLM learns about the structure of the language, the context, and in the process of doing so constructs a model of the world as represented by words.
Admittedly the model is still limited, but it seems to me that there is something more insightful to be gleaned here: that given enough data, and sufficient pressure to learn, that excelling at scale on a relatively simple task leads indirectly to a form of intelligence.
For me the biggest takeaway of LLMs might be that “intelligence is pretty cheap, actually” and that the human brain is not so remarkable as we’d like to believe.
So technically the LLM is not doing P(next word | previous word) -- but rather, P(associated_words(next word) | associated_words(previous), associated_words(previous_-1), ...).
This means its search space for each conditional step is still extremely large in the historical corpus, and there's more flexibility to reach "across and between contexts" -- but it isn't sensitive to context; we just arranged the data that way.
Soon enough people with enough money will build diagnostic (XAI) models of LLMs that are powerful enough to show this process at work over its training data.
To visualize roughly, imagine you're in a library and you're asked a question. The first word selects a very large number of pages across many books (and whole books); the second word selects both other books and pages across the books you already have. Keep going: each additional word you're asked, you convert to a set of words, and find more pages and books and also get narrower paragraph samples from the ones you have. Finally, with the total set of pages and paragraphs you have to hand at the end of the question, you find the word most likely to follow the others.
This process will eventually be visualised properly, with a real-world LLM, but it'll take a significant investment to build this sort of explanatory model, since you need to reverse from weights to training data across the entire inference process.
Well, it IS pretty seamlessly integrated with a very impressive suite of sensors.
When I write some 100% bespoke code that is rather hastily composed and then paste it all into ChatGPT4 asking it to "refactor this code with a focus on testability and maintainability" and not only does it do so, but it does a pretty damn good job about it, it feels rather reductive to say "it's just providing the next most likely word".
I mean, maybe that's how it works, but that statistical output clearly involves modeling what my code does and what I want it to do. Rather than make me think LLMs are a cheap trick, it just has me thinking, "shit - maybe that's all I do too."
But ChatGPT 4 follows instructions passably well. For example I just asked it: "Construct a sentence of at least 10 words each of which is extremely grammatically unlikely to follow the word before it. (For example "be are isn't had" as each of those words is impossible after the word before it.) Do not give any explanation of how you have arrived at your answer, reply only with your answer. However, as you construct it ensure that you cannot think of any context in which each next word would ever come after the word before it. Reply with your constructed nonsense sentence only."
Indeed it replied with a good nonsense sentence: "Dogs swimming beautifully reads soft under Wednesday during sky oranges" ("sky oranges" is unlikely, "under Wednesday" is nonsensical and ungrammatical) and when I complained that "dogs swimming" could be sensible as can "swimming beautifully" it came up with an even more nonsense sentence "Apples slowly would butter river quickly seven whenever blue music".
Do you think "Wednesday" is really the most likely word to follow "under" and "river" is really the most likely to follow "butter", or isn't it obvious that it was, for lack of a better word, "trying to" follow my prompt?
https://chat.openai.com/share/50037af6-0f3e-4de3-aff7-53a7b9...
Nevertheless, roughly consider a dataset D for which we have an approximate stochastic model of its conditional frequency associations: P(next|previous..., D) etc.
Then if your prompt really got that reply, from this model, it would do so like this:
"Construct" is first projected to an encoding which replaces it, effectively, with a set of related words (Construct, Make, Create, Write...) all weighted by how they co-occur with construct.
Then we sample from D based on this word set, obtaining roughly, all conversations where these related words were used, call this Dc.
Next take "a sentence" and replace it with its word-set, say, (Sentence, Phrase, Words, ...) and sample conversations from Dc in which these occur, Dcs..
And so on. Since each token in your prompt actually corresponds to basically all possible words but weighted by association, each "filtering operation" actually selects vast amounts of the training data (space).
Finally, consider the reverse problem: what words could this system possibly produce from this process that weren't relevant to your prompt? Given enough data (PBs of text from all possible digitized conversations, books, etc.) then a sensible-seeming answer becomes the only plausible one to generate.
Now, I do think here PBs wouldn't be enough to generate a single statistical model that behaved this way -- so you need a mixture of them (i.e., ChatGPT) and I suspect you also need a system for regulating discrete constraints such as quantities. I suspect many deployed LLMs have improved in this area due to models trained to be specifically sensitive to quantities.
Thinking hard makes your brain hurt. It's exhausting. Most of the work we do, including programming, is not like that. Some of the work we do is fiendishly difficult but much of it is more like word-completion based on prior experience.
Evolutionary processes optimize for energy efficiency. We don't think at 100% brain power all the time because we can't afford to. It makes a ton of sense, in retrospect, that our brains have optimized for language to the point that very little compute is required. And even so, the brain still consumes 20% of our daily calories.
Hard thinking is the exception and casual thinking is the norm. Why is it so hard to persuade people of anything on the internet? Because we mostly engage in LLM-like auto-completion. Very little actual thinking is involved and very few calories are spent.
ChatGPT answers it correctly. abcdpqrs is perhaps not in the training set. If it is, we can pick some other name.
Predicting the probable token in a conversation requires predicting the probable subject of the conversation, predicting the interlocutors’ relationship and manner of speaking to one another, predicting the state of recollection, preference and taste of the speaker, predicting the speaker’s mental model…
If the LLM isn’t predicting all of those things then it will produce poor predictions of the next word; doing it well - and humans tend to agree that in a vast array of cases state of the art LLMs do predict tokens very well - requires that prediction model to predict all that context as well.
Suppose I take a billion images of all the coffee cups in the world, at a set of angles on the cup, and then build an associative (ie., frequency) statistical model of their pixels (ie., statistical AI). Consider generating one pixel at a time, in sequence, through the image. My associative model tells me P(col of next pixel | all previous).
Now, I can generate coffee cup images similar to any variation or combination of the images in the dataset. Now, you might say, "well you can only do that if you have a model of a coffee cup" (rather than of pixels) -- if so, just generate a coffee cup at one of the angles not in the dataset. This will not happen, because the model has not been provided with enough information to do so.
Namely, the model does not know the distance from the camera, the camera lens parameters, the angle to the coffee cup, etc. So there's literally a very very large infinity of possible objects at unseen angles. Consider that underneath a coffee cup, the bottom might be missing entirely, etc.
Now it will appear to know all of these things, because it's just generating images with these same parameters (camera, angle, distance, etc.). But as soon as you want "a coffee further away than has been seen before", or "a coffee using a macro lens", etc. the whole thing will fall over.
It is you, the viewer, who attributes 3D knowledge to the model because under ordinary circumstances the cause of a photo is features of a 3D environment.
This argument is backwards. Humans don't measure the next token prediction ability of the agents they speak to, human or AI. We rate speakers on whether they seem to understand what we say in context and respond by contributing useful information and analysis.
The attributes you're saying can be inferred from known superior next token prediction ability are the things we can actually detect and measure, at least qualitatively. Next token prediction quality is not measurable by humans in any human-meaningful way. Improving test cross entropy by 50% doesn't mean anything to us. It is irrelevant except as a mechanism to train LLMs.
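For anyone unfamiliar with the metric, here is a minimal sketch of what test cross entropy measures: the average negative log-probability a model assigned to the tokens that actually occurred. The probabilities for `model_a` and `model_b` are made-up numbers, and the helper names are hypothetical; the point is only to show what kind of quantity is being improved during training.

```python
import math

def cross_entropy(probs_of_actual_tokens):
    """Mean negative log-probability (in nats) given to the tokens that occurred."""
    total = sum(math.log(p) for p in probs_of_actual_tokens)
    return -total / len(probs_of_actual_tokens)

def perplexity(ce):
    """Exponentiated cross entropy: the 'effective number of choices' per token."""
    return math.exp(ce)

# Hypothetical probabilities two models assigned to the same four observed tokens.
model_a = [0.10, 0.05, 0.20, 0.10]
model_b = [0.30, 0.25, 0.45, 0.30]

for name, probs in [("A", model_a), ("B", model_b)]:
    ce = cross_entropy(probs)
    print(f"model {name}: cross entropy = {ce:.3f} nats, perplexity = {perplexity(ce):.2f}")
```

The output is a pair of numbers per model, and nothing in them says whether either model's replies would strike a reader as understanding the conversation.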
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
Yes. Now extend this to the context length of GPT-4 Turbo: about 240 pages of text. So from your description, "wearing" is just the "most common word" to follow those 240 pages of (previously unseen, unique) text according to its training data. Quite simple, nothing to see here, I suppose.
These two things cannot be compared or contrasted. It's very common to see people write something like "LLMs don't actually do <thing they obviously actually do>, they just do <dismissive description of the same thing>."
Typically, like here, the dismissive description just ignores the problem of why it manages to write complete novel sentences when it's only "guessing" subword tokens, why those sentences appear to be related to the question you asked, and why they are in the form of an answer to your question instead of another question (which is what base models would do).
This line of reasoning that LLMs "only predict" the next token is akin to saying humans can only think or speak one word at a time. Yes, we use one token/word at a time, but it is the aggregate thought that matters, regardless of what underlies it.
If there are 50K possible tokens and I don't have any other information, I could make a naive estimate that every token has equal probability and start generating text that is just gibberish. With the simple single-token Markov-chain example I would estimate probabilities based on the previous token, and that probability estimate would be much better. If you use it for generating text it will look like something that is almost, but not quite, entirely unlike human speech. [1]
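A rough sketch of those two estimators side by side, assuming a tiny invented corpus and whole words as tokens (a real tokenizer and a 50K vocabulary are out of scope here). The names `generate_uniform` and `generate_bigram` are made up for this example.

```python
import random
from collections import Counter, defaultdict

# Tiny invented corpus; a real model would use far more text and subword tokens.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat saw the dog .".split()
vocab = sorted(set(corpus))

# Count how often each token follows each previous token.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def generate_uniform(length, rng):
    """Every token equally likely, ignoring all context."""
    return [rng.choice(vocab) for _ in range(length)]

def generate_bigram(start, length, rng):
    """Sample each token in proportion to how often it followed the previous one."""
    out = [start]
    while len(out) < length:
        counts = followers[out[-1]]
        if not counts:  # dead end: this token was never seen with a successor
            break
        tokens, weights = zip(*counts.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return out

rng = random.Random(0)
print("uniform:", " ".join(generate_uniform(10, rng)))
print("bigram: ", " ".join(generate_bigram("the", 10, rng)))
```

The bigram output only ever contains word pairs that occurred in the corpus, so it is locally plausible while still wandering without any overall sense; the uniform output does not even manage that.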
The difference lies entirely in how accurately you model the world and what information you have available when estimating probabilities. Models like GPT4 happen to be very good at it because they encode a huge amount of knowledge about the world and take a lot of context into account when estimating the probability. That's not something to be taken lightly.
If you do beam search, RAG, tool usage, etc., then the whole system is no longer one.
It's really long! But worth the read.
What separates this from the following:
"I'll begin by clearing a big misunderstanding people have regarding how the human brain works. The assumption that most people make is that the brain can think, reason, and understand language, but in reality all it can do is process electrical and chemical signals."
Here's a cool animated transformer, also open source prvnsmpth.github.io/animated
Here's the "attention is all you need paper" with links to open source implementations paperswithcode.com/paper/attention-is-all
Yep. Nice article, though!