> ...our results give: ... (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought.
It would be good to also prove that there is no task that becomes exponentially harder with chain-of-thought. [1]

[1] https://parameterfree.com/2020/12/06/neural-network-maybe-ev...
It sure looks and smells like good work, so I've added it to my reading list.
Nowadays I feel like my reading list is growing faster than I can go through it.
It sometimes sucks being in ML with 'only' a CS background. Feels like all the math and physics grads are running around having fun with their fancy mathematics, while I stand here, feeling dimwitted.
If the premise and conclusion don’t make sense on the fundamentals, the math isn’t likely to fix it. Most lines are literally equals signs, just walking you through some equivalences as proof: one large statement saying “If ABC, then … (and then … and then … and then …) and finally XYZ.”
The middle ‘and then’s aren’t really that important if the conclusion XYZ isn’t interesting. Or, much more commonly, the premise ABC is false anyway, so who cares.
Most readers, I’d wager, are not sitting there deciphering opaque gradient derivations in every single paper. Just skip the math unless the paper proves worthy of a closer read.
If you really want to go deep on a topic in ML, take a subtopic paper and read related works until you can connect the basics to the paper. This consists of not only writing down the math but also coding the concepts. ML students go through this process when they choose a topic and start working on it. Focus on one topic and learn it. Present it to yourself or others. Find holes in the literature and try exploring them. Science!
1. Depth is more important than breadth for making transformers smarter. That is, for a given model size, it will be more powerful with more, smaller layers than with fewer, bigger ones. Interestingly, Mistral just updated their small model yesterday with a big reduction in layers in order to improve performance. Among the more technical ways they say it, they state directly that "depth plays a more critical role than width in reasoning and composition tasks".
2. As I understand it, they claim to prove mathematically that chain of thought, as seen in the new DeepSeek R1 and GPT o1/o3 models, produces results that wouldn't be possible without it. The thought chain effectively acts as additional layers, and per the previous point, the more layers the better. "From a theoretical view, CoT provides Transformer with extra computation space, and previous work ... proved that log-precision Transformer with CoT could simulate any polynomial-time algorithm. Therefore, by further assuming certain complexity conjecture ... their results imply that constant depth Transformer with CoT could simulate poly-time algorithm, while constant depth Transform ... itself can not solve P-complete task."
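To make point 1 concrete, here's a back-of-the-envelope sketch of the depth-vs-width tradeoff. The ~12·d² parameters-per-block estimate (Q/K/V/output projections plus a 4x-expanded MLP) and the two configurations are illustrative assumptions of mine, not Mistral's actual settings; the point is just that quartering the layer count while doubling the width keeps the parameter budget identical, since parameters scale as layers × d².

```python
# Rough parameter count for one Transformer block: ~12 * d_model^2
# (4*d^2 for the attention Q/K/V/out projections, 8*d^2 for a 4x MLP).
# Configs below are hypothetical, chosen only to match total budgets.
def block_params(d_model: int) -> int:
    return 12 * d_model * d_model

deep_narrow  = 24 * block_params(1024)  # 24 layers, width 1024
shallow_wide =  6 * block_params(2048)  #  6 layers, width 2048

print(deep_narrow)   # 301989888 (~302M)
print(shallow_wide)  # 301989888 (~302M) -- same budget, 4x the depth
```

The paper's claim (and Mistral's observation) is that, at this fixed budget, the deep-narrow configuration tends to win on reasoning and composition tasks.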
Studying physics without learning math is impossible, but physics comes with a curricular selection of math that is highly relevant to machine learning.