> ...our results give: ... (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought.
It would be good to also prove that there is no task that becomes exponentially harder with chain-of-thought. [1]

[1] https://parameterfree.com/2020/12/06/neural-network-maybe-ev...
It sure looks and smells like good work, so I've added it to my reading list.
Nowadays I feel like my reading list is growing faster than I can go through it.
It sometimes sucks being in ML with 'only' a CS background. Feels like all the math and physics grads are running around having fun with their fancy mathematics, while I stand here, feeling dimwitted.
If the premise and conclusion don’t make sense on the fundamentals, the math isn’t likely to fix it. Most lines are literally equals signs, just walking you through some equivalences as proof: one large statement saying “If ABC, then … (and then … and then … and then …) and finally XYZ.”
The middle ‘and then’s aren’t really that important if the conclusion XYZ isn’t interesting. Or, much more commonly, the premise ABC is false anyway, so who cares.
Most readers, I’d wager, are not sitting there deciphering opaque gradient derivations in every single paper. Just skip the math unless the paper proves worthy of a closer read.
If you really want to go deep on a topic in ML, take a subtopic paper and read related works until you can connect the basics to the paper. This consists of not only writing down the math but also coding the concepts. ML students go through this process when they choose a topic and start working on it. Focus on one topic and learn it. Present it to yourself or others. Find holes in the literature and try exploring them. Science!
1. Depth is more important than breadth for making transformers smarter. That is, for a given model size, it will be more powerful with more, smaller layers than with fewer, bigger ones. Interestingly, Mistral just updated their small model yesterday with a big reduction in layers in order to improve performance. Among the more technical ways they say it, they state directly that "depth plays a more critical role than width in reasoning and composition tasks".
2. As I understand it, they claim to prove mathematically that chain of thought, as seen in the new DeepSeek R1 and GPT o1/o3 models, produces results that wouldn't be possible without it. The thought chain effectively acts as additional layers, and per the previous point, the more layers the better. "From a theoretical view, CoT provides Transformer with extra computation space, and previous work ... proved that log-precision Transformer with CoT could simulate any polynomial-time algorithm. Therefore, by further assuming certain complexity conjecture ... their results imply that constant depth Transformer with CoT could simulate poly-time algorithm, while constant depth Transform ... itself can not solve P-complete task."
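To make point 1 concrete, here's a back-of-the-envelope sketch of the depth-vs-width tradeoff. The ~12·d² parameters-per-block estimate (Q/K/V/output projections plus a 4x-expanded MLP) and the two configurations are illustrative assumptions of mine, not Mistral's actual settings; the point is just that quartering the layer count while doubling the width keeps the parameter budget identical, since parameters scale as layers × d².

```python
# Rough parameter count for one Transformer block: ~12 * d_model^2
# (4*d^2 for the attention Q/K/V/out projections, 8*d^2 for a 4x MLP).
# Configs below are hypothetical, chosen only to match total budgets.
def block_params(d_model: int) -> int:
    return 12 * d_model * d_model

deep_narrow  = 24 * block_params(1024)  # 24 layers, width 1024
shallow_wide =  6 * block_params(2048)  #  6 layers, width 2048

print(deep_narrow)   # 301989888 (~302M)
print(shallow_wide)  # 301989888 (~302M) -- same budget, 4x the depth
```

The paper's claim (and Mistral's observation) is that, at this fixed budget, the deep-narrow configuration tends to win on reasoning and composition tasks.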
Studying physics without learning math is impossible, but physics comes with a curricular selection of math that is highly relevant to machine learning.