Where Is GPT in the Chomsky Hierarchy? (opens in new tab)

(fi-le.net)

55 pointsfi-le3mo ago76 comments

76 comments

The Chomsky Hierarchy isn't the right way to think about, in fact the Chomsky paradigm probably held back progress in language technology over the past 70 years.

In the Kuhnian sense, Syntactic Structures was the vanguard of a paradigm shift that let linguists do "normal science" in terms of positing a range of problems and writing papers about them. It was also useful in computer science for formally defining computer languages that are close enough to natural languages that people find them usable.

On the other hand, attempts to use the Chomsky theory to develop language technology failed miserably outside of very narrow domains and in the real world Markov models and conditional random fields often outperformed when the function was "understanding", "segmentation", etc.

From an engineering standpoint, the function that tells if a production is in the grammar that is central to the theory is not so interesting for language technology, I mean, if LLMs were -centric then an LLM would go on strike if you got the spelling or grammar wrong or correct you in a schoolmarmish William Safire way but rather it is more useful for LLMs to silently correct for obvious mistakes the same way that real language users do.

The other interesting thing is the mixing of syntax and semantics. For a long time I believed the "Language Instinct" idea of Pinker in which the ability to serialize and deserialize language is a peripheral that is attached to an animal and you need the rest of the animal and even the surrounding world to have "human" competence. Now LLMs come around and manage to show a lot of linguistic competence without the rest of the animal and boy we have egg on our face, and coming out of the shadow of the Chomsky theory, structuralism will rear it's head again. (e.g. "X is structured like a language" got hung up on the unspoken truth that "language is not structured like a language")

nabla93mo ago

You saw name "Noam Chomsky" and that started a process in your mind that generated the standard spiel about Syntactic Structures.

Chomsky Hierarchy is his more fundamental work that joins computer science and linguistics. It was published in IRE Transactions on Information Theory. Chomsky, Noam (1956). "Three models for the description of language" https://chomsky.info/wp-content/uploads/195609-.pdf

Type-3 grammar ≡ finite-state automaton

Type-2 grammar ≡ Non-deterministic pushdown automaton

Type-1 grammar ≡ Linear-bounded non-deterministic Turing machine

Type-0 grammar ≡ Turing machine

ps. Chomsky was already aware of Finite State Automata and Turing Machines and understood that they match Type-3 and Type-0. Pushdown Automata was invented later, and connection between Type-1 grammar and Linear Bounded Automata was made few years later.

PaulHoule3mo ago

An LLM is not type 0. It always finishes in finite time so it is not Turing complete.

I asked Copilot

   answer yes or no,  is the following Java method syntactically well formed
   
   static int aMethod() { return "5"; }

and got what I thought was the wrong answer

   No.
   It’s syntactically valid as Java code, but it will not compile because it returns a String where 
   an int is required.

because I hadn't specified clearly that I was talking about the type 2 CFG of the parser as opposed to the type 1 behavior of the compiler as a whole. [1] I had a good conversation with copilot about it and I'm sure I'd get better results with a better prompt... It would make a good arXiv paper to pose an LLM grammatical recognition problems by prompting

   here is a set of rules: ...  is the production ... in the grammar?

with a wide range of cases. Somebody who just tries a few examples might be impressed by its capability but if you were rigorous about it you would conclude that an LLM pretends to be able to recognize grammars but can't actually do it.

And that's true about everything they do, one could argue that "in an exact sense, LLM can't do anything at all") They'll make a try at a weird question like

   what percent of North Americans would recognize the kitsune hand gesture?

which is a public opinion research question similar in character to

   what is lowest mass eigenstate of the neutrino?

in that it could be answered rigorously (but still in terms of probability, even hep results have p-values)

[1] javac implements type 1 behavior in the java language which is a substrate

1 more reply

xg153mo ago

I think it could be useful to combine the two paradigms to maybe get a better understanding of what transformers can and cannot learn.

E.g. would it be possible to create an algorithm that takes a grammar (and maybe a desired context window size) as input and constructs a transformer network that generates sentences exactly from that grammar?

("Construct" meaning directly setting the weights, without any iterative training process)

1 more reply

yongjik3mo ago

Chomsky Hierarchy pertains to "languages" defined in the theory of computation: i.e., it is a subset of the set of all finite sequence of alphabets (for some fixed notion of "alphabets"). If a sentence (a particular finite sequence of alphabets) is in the subset, then it is a "valid" sentence of the language. Otherwise it is invalid.

It should be already clear from this that this notion of language is rather different from natural languages. For example, if there is a formal language that contains "Good morning" and "My hovercraft is full of eels" as valid sentences, then nothing distinguishes these sentences any more. (Of course you could add annotations and build semantic values but they are not essential to the discussion of formal languages.)

It gets a bit more ridiculous when you try to connect LLMs to the Chomsky hierarchy. Modern LLMs do not really operate on the principle of "is this a valid sentence?" yet provide vastly superior results when it comes to generating naturally sounding sentences.

I think LLMs have put an end to any hope that formal language theory (in the style of Chomsky Hierarchy) will be relevant to understanding human languages.

1 more reply

Legend24403mo ago

The trouble is that english doesn’t fit neatly into any of these categories. It has features that make it at least a context-free language, but can’t handle other features of context-free languages like unlimited nesting.

Ultimately these are categories of formal languages, and natural language is an entirely different kind of thing.

1 more reply

czzr3mo ago

I don’t think LLMs contradict the Pinker description - it just turns out that the output stream embodies a lot of information, and you can construct a model (of some reasonable fidelity) of the original sources from sufficient examples of the output stream.

ogogmad3mo ago

I think most PL parsers use recursive descent, which doesn't require any formal language theory. That said, it's possible that formal language theory has enabled regular expressions to enter widespread use - first via theory; then to make tokenisers; then included in text editors; and then to make GREP.

Maxatar3mo ago

You're going to have a very hard time parsing a language using recursive descent if you don't understand some formal language theory. Recursive descent algorithms have a lot of trouble with left-recursion, and can result in arbitrary backtracking if you're not careful, which can blow up memory and time. To fix left-recursion you need to rewrite your grammar, which means familiarity with formal language theory. To avoid backtracking you typically use an LL(N) or variations thereof, which once again require having a pretty good understanding of formal language theory.

1 more reply

suddenlybananas3mo ago

> For a long time I believed the "Language Instinct" idea of Pinker in which the ability to serialize and deserialize language is a peripheral that is attached to an animal and you need the rest of the animal and even the surrounding world to have "human" competence.

Yeah you just need 10 trillion tokens of data.

jesuslop3mo ago

In a complementary story, from light reading I gather that Zellig Harris, that advised Chomsky's dissertation, contributed decisively to distributional semantics, a forerunner of word embeddings, in turn a key to the door of neural NLP. The ML guys had a shiny toolbox ready to use for decades. Had it been more fashionable...

Shalgren, Distributional Legacy: The Unreasonable Effectiveness of Harris’s Distributional Program. https://doi.org/10.1080%2F00437956.2024.2414515

DiscourseFan3mo ago

Don't worry, the LLMs are already post-structuralist (at least the ones working on them are often familiar with those thinkers). Which is incredibly disturbing, I think, for many linguists who have spent their entire careers fighting against the French, only to have such philosophical and intellectual work become the most important in the development of this new technology. Hey, at least I'm employed! (And they said it "wasn't serious," that Chomskian Linguistics was the "real science"!)

lukev3mo ago

Well, technically, they're more structuralist than post-structuralist. Arguably they even prove that Saussure's original structuralist hypothesis was correct.

2 more replies

pohl3mo ago

Regular...albeit astronomically large (unless we're granting idealizations like infinite context, etc.)

Netcob3mo ago

Exactly, same as all real-world computers.

Although to be fair, nothing above regular (that I'm aware of, it's been a while) requires infinite space, just unlimited space... you can always make a real Turing machine as long as you keep adding more tape whenever it needs it.

pohl3mo ago

Yeah, meant to say unbounded

guerrilla3mo ago

This is a trivial category mistake.

aghilmort3mo ago

there’s decent work on computational reasoning power of transformers, SSMs, etc.

some approximate snippets that come to mind are that decoder-only transformers recognize AC^0 and think in TC^0, that encoder-decoders are strictly more powerful than decoder-only, etc.

Person with last name Miller iric if poke around on arXiv, a few others, been a while since was current top of mind so ymmv on exact correctness of above snippets

ctoa3mo ago

You are probably thinking of Merrill (whose work is referenced towards the end of the article).

aghilmort3mo ago

ah yes Merrill thx!

weinzierl3mo ago

"This covers practically all programming languages, because they conform to a syntax tree where each branch is an application of such a production rule."

Except that practically all practically used programming languages are not context-free.

xg153mo ago

Is that so?

Programming languages usually have a number of "higher-level semantic layers" on top of the syntax tree that impose further constrains - e.g. that identifiers must have consistency between usage and definition, that all expressions are valid according to the type system, etc.

But all that stuff happens after an AST is created and so doesn't really affect the basic grammar.

I think if you ignore those higher-level constraints, a number of languages have valid context-free grammars.

weinzierl3mo ago

In C, the token sequence

  T * x;

can be interpreted either as a declaration or as an expression (multiplication). Resolving this ambiguity requires context, namely whether T is a typedef name or an object name, which cannot be determined from syntax alone.

One could argue that it is acceptable to produce an ambiguous parse and defer the decision, and in practice compilers do rely on semantic information to resolve such cases.

However, this is not what context-free means in the Chomsky sense: a context-free grammar must be able to determine the syntactic structure without reference to prior declarations or semantic state.

1 more reply

krackers3mo ago

There is at least one turing complete programming language that is trivially context-free, Brainfuck, so at least for the purposes of expressivity it seems OK to consider context-free ones.

fi-leOP3mo ago

Thank you, corrected!

ursAxZA3mo ago

Honest question: if this hierarchy isn’t suitable for comparing humans with humans, what does it mean to apply it to machines? What, exactly, is being measured?

───

I’m not arguing it’s useless — I’m trying to understand its scope here.

Vosporos3mo ago

On Little Saint James, with Chomsky himself.

Animats3mo ago

> Let's grant our GPT an infinite context-window.

No. If you have infinite memory, you get into decidability issues, which the author does. Lots of people have been down this rathole. Penrose is probably the most prominent. It doesn't help.

The real issues are combinatoric. How hard is this? Is there an upper bound? Is this one of those many problems where the worst case limit is exponential but the average limit is much lower? Or where a "good enough" solution is far cheaper computationally than a perfect one. The Traveling Salesman problem is the classic example, but it comes up in other optimization contexts, such as the simplex method of linear programming.

Maxatar3mo ago

So unless I'm mistaken, the TL;DR is that GPTs inherently can not be Turing complete because they always terminate, ie. there is always a non-zero probability of the end-of-token character to be generated.

Seems like a fair argument to raise, although not too insightful.

As a nitpick, it really is not the case that most programming languages are context-free, but I can sympathize with what the author is trying to get at. Most programming languages have a context-free superset that is useful to use as one of many stages in the parsing pipeline, before type checking and name resolution etc... but apart from perhaps some dynamically typed programming languages, the vast majority of programming languages are context-sensitive, not context-free.

onraglanroad3mo ago

> GPTs inherently can not be Turing complete because they always terminate

I'm not sure that was intended entirely seriously. After all, humans always terminate too...

bigfishrunning3mo ago

To follow on logically, maybe humans aren't Turing complete? we are not equivalent to Turing machines...

4 more replies

arjvik3mo ago

As a quick hack, is this flaw fixed by using top-p or top-k or min-p or whatever the current hottest logit sampling algorithm is?

FeepingCreature3mo ago

The "flaw" is also fixed by simply not recognizing the end-of-sampling token. Even the limited size of the context window can be worked around with tool calls.

fi-leOP3mo ago

Thank you for the nitpick, corrected!

enriquepablo3mo ago

The Chomsky hierarchy is a classification of grammars. So it's about Syntax.

GPT is purely about semantics, about truth. To the extent that I think it can be said to be fundamentally informal in its dedication to truth, since it can easily be prompted to produce syntactically incorrect output - wrt the syntax of whatever language whose semantics determine the truth GPT is asked to reflect.

So I think this is a categorically incorrect question.

I mean, what is the fundamental syntax of the output of an LLM? It would have to be that any sequence of utf8 chars is valid.

suddenlybananas3mo ago

LLMs have absolutely nothing to do with truth. If anything the analogy is closer to syntax as LLMs are pure form, just language in absence of any context or reference.

enriquepablo3mo ago

LLMs tell you one thing and not its negation, this choice is semantic. "just language in absence of any context or reference" is a set that includes every valid text. The usefulness of LLMs is their ability to mostly respect whatever semantics you establish in the prompt in their choice of output.

1 more reply

babypistol3mo ago

What is up with this argument that LLMs cannot be Turing complete?

> There is a simple argument showing that GPTs, even with an infinite context window, cannot be Turing complete. Since the vocabulary size is finite, the rows of the embedding matrix are bounded, meaning each embedding is in a compact subset of Rn. By Tychonoff's Theorem, the product of the compact spaces the input embeddings live in is also compact. Because the transformer architecture does a bounded number of continuous operations, the output probabilities are bounded from above and below. In particular, the probability of emitting the end-of-sequence token (the special token that halts the completion) is bounded from below by some constant, so the probability of stopping in finite time is 1.

Even if the conclussion might be correct, the argument is pure nonsense.

First of all, why do we need to invoke compactness and Tychonoff's Theorem? Any serious discussion of computability must avoid discussing real numbers. And furthermore, here it is completely unnecessary. All sets are actually finite, so any mention of compactness is just fancy wording for finite in this case.

Second, a probability argument doesn't prove anything about turing completeness. Realistically we must consider LLM's as deterministic machines. The fact that we like to interpret the output of an ML model doesn't have anything to do with actual probability of what happens during execution of an LLM (again, it ia deterministic).

Third, even if we accept the probability argument, we can easily resolve it by just removing the end-of-sequence token and guarantee that the process never stops.

I think the real argument against Turing completeness of GPTs (even with infinite context) would be more along the lines of the fact that it must operate with finite number representations. And a consequence of this would be that it maps an infinite context into a finite set of states, which must result in periodic output at some point.

Finally, as a meta-point, why do so many people enjoy making such quasi-mathematical arguments? Often invoking theorems which obviously can't apply, if they just took a bit of time to clearly define the question. To me it seems like ML people know just enough math to notice some superficial correspondence, but lack the rigor to be able to trully evaluate it.

wolfgangbabad3mo ago

Next to Epstein.

j / k navigate · click thread line to collapse

76 comments

PaulHoule3mo ago

The Chomsky Hierarchy isn't the right way to think about, in fact the Chomsky paradigm probably held back progress in language technology over the past 70 years.

nabla93mo ago

You saw name "Noam Chomsky" and that started a process in your mind that generated the standard spiel about Syntactic Structures.

Type-3 grammar ≡ finite-state automaton

Type-2 grammar ≡ Non-deterministic pushdown automaton

Type-1 grammar ≡ Linear-bounded non-deterministic Turing machine

Type-0 grammar ≡ Turing machine

PaulHoule3mo ago

An LLM is not type 0. It always finishes in finite time so it is not Turing complete.

I asked Copilot

   answer yes or no,  is the following Java method syntactically well formed
   
   static int aMethod() { return "5"; }

and got what I thought was the wrong answer

   No.
   It’s syntactically valid as Java code, but it will not compile because it returns a String where 
   an int is required.

   here is a set of rules: ...  is the production ... in the grammar?

And that's true about everything they do, one could argue that "in an exact sense, LLM can't do anything at all") They'll make a try at a weird question like

   what percent of North Americans would recognize the kitsune hand gesture?

which is a public opinion research question similar in character to

   what is lowest mass eigenstate of the neutrino?

in that it could be answered rigorously (but still in terms of probability, even hep results have p-values)

[1] javac implements type 1 behavior in the java language which is a substrate

1 more reply

xg153mo ago

I think it could be useful to combine the two paradigms to maybe get a better understanding of what transformers can and cannot learn.

("Construct" meaning directly setting the weights, without any iterative training process)

1 more reply

yongjik3mo ago

I think LLMs have put an end to any hope that formal language theory (in the style of Chomsky Hierarchy) will be relevant to understanding human languages.

1 more reply

Legend24403mo ago

Ultimately these are categories of formal languages, and natural language is an entirely different kind of thing.

1 more reply

czzr3mo ago

ogogmad3mo ago

Maxatar3mo ago

1 more reply

suddenlybananas3mo ago

Yeah you just need 10 trillion tokens of data.

jesuslop3mo ago

Shalgren, Distributional Legacy: The Unreasonable Effectiveness of Harris’s Distributional Program. https://doi.org/10.1080%2F00437956.2024.2414515

DiscourseFan3mo ago

lukev3mo ago

Well, technically, they're more structuralist than post-structuralist. Arguably they even prove that Saussure's original structuralist hypothesis was correct.

2 more replies

pohl3mo ago

Regular...albeit astronomically large (unless we're granting idealizations like infinite context, etc.)

Netcob3mo ago

Exactly, same as all real-world computers.

pohl3mo ago

Yeah, meant to say unbounded

guerrilla3mo ago

This is a trivial category mistake.

aghilmort3mo ago

there’s decent work on computational reasoning power of transformers, SSMs, etc.

some approximate snippets that come to mind are that decoder-only transformers recognize AC^0 and think in TC^0, that encoder-decoders are strictly more powerful than decoder-only, etc.

Person with last name Miller iric if poke around on arXiv, a few others, been a while since was current top of mind so ymmv on exact correctness of above snippets

ctoa3mo ago

You are probably thinking of Merrill (whose work is referenced towards the end of the article).

aghilmort3mo ago

ah yes Merrill thx!

weinzierl3mo ago

"This covers practically all programming languages, because they conform to a syntax tree where each branch is an application of such a production rule."

Except that practically all practically used programming languages are not context-free.

xg153mo ago

Is that so?

But all that stuff happens after an AST is created and so doesn't really affect the basic grammar.

I think if you ignore those higher-level constraints, a number of languages have valid context-free grammars.

weinzierl3mo ago

In C, the token sequence

  T * x;

One could argue that it is acceptable to produce an ambiguous parse and defer the decision, and in practice compilers do rely on semantic information to resolve such cases.

However, this is not what context-free means in the Chomsky sense: a context-free grammar must be able to determine the syntactic structure without reference to prior declarations or semantic state.

1 more reply

krackers3mo ago

There is at least one turing complete programming language that is trivially context-free, Brainfuck, so at least for the purposes of expressivity it seems OK to consider context-free ones.

fi-leOP3mo ago

Thank you, corrected!

ursAxZA3mo ago

Honest question: if this hierarchy isn’t suitable for comparing humans with humans, what does it mean to apply it to machines? What, exactly, is being measured?

───

I’m not arguing it’s useless — I’m trying to understand its scope here.

Vosporos3mo ago

On Little Saint James, with Chomsky himself.

Animats3mo ago

> Let's grant our GPT an infinite context-window.

No. If you have infinite memory, you get into decidability issues, which the author does. Lots of people have been down this rathole. Penrose is probably the most prominent. It doesn't help.

Maxatar3mo ago

Seems like a fair argument to raise, although not too insightful.

onraglanroad3mo ago

> GPTs inherently can not be Turing complete because they always terminate

I'm not sure that was intended entirely seriously. After all, humans always terminate too...

bigfishrunning3mo ago

To follow on logically, maybe humans aren't Turing complete? we are not equivalent to Turing machines...

4 more replies

arjvik3mo ago

As a quick hack, is this flaw fixed by using top-p or top-k or min-p or whatever the current hottest logit sampling algorithm is?

FeepingCreature3mo ago

The "flaw" is also fixed by simply not recognizing the end-of-sampling token. Even the limited size of the context window can be worked around with tool calls.

fi-leOP3mo ago

Thank you for the nitpick, corrected!

enriquepablo3mo ago

The Chomsky hierarchy is a classification of grammars. So it's about Syntax.

So I think this is a categorically incorrect question.

I mean, what is the fundamental syntax of the output of an LLM? It would have to be that any sequence of utf8 chars is valid.

suddenlybananas3mo ago

LLMs have absolutely nothing to do with truth. If anything the analogy is closer to syntax as LLMs are pure form, just language in absence of any context or reference.

enriquepablo3mo ago

1 more reply

babypistol3mo ago

What is up with this argument that LLMs cannot be Turing complete?

Even if the conclussion might be correct, the argument is pure nonsense.

Third, even if we accept the probability argument, we can easily resolve it by just removing the end-of-sequence token and guarantee that the process never stops.

wolfgangbabad3mo ago

Next to Epstein.

j / k navigate · click thread line to collapse