Makes you wonder if we're training LLMs the hard way. For example, if computers had been invented before calculus, we'd have been using "numerical integration" (summing up thin slices to approximate areas, etc.) and "numerical differentiation" (ditto for estimating slopes).
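To make the analogy concrete, here's a minimal sketch of both "brute force" methods — trapezoid sums for integration and a central difference for differentiation (the function names are just mine):

```python
import math

def numerical_integral(f, a, b, n=100_000):
    """Approximate the area under f on [a, b] by summing thin trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def numerical_derivative(f, x, h=1e-6):
    """Approximate the slope of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# With calculus we'd just know these answers are 1/3 and 1:
print(numerical_integral(lambda x: x * x, 0.0, 1.0))  # ~0.3333
print(numerical_derivative(math.sin, 0.0))            # ~1.0
```

Both loops grind out an answer that a closed-form rule produces instantly — which is the "hard way" being gestured at.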
So I wonder if we're simply in a pre-calculus-like phase of NNs/perceptrons, where we haven't yet realized there's a mathematical way to "solve" a bunch of equations simultaneously and arrive at the best (or at least a locally optimal) set of model weights for a given NN architecture and set of training data.
From a theoretical standpoint it IS a black-box problem like this: the training data goes in, and an array of model weights comes out. If I were to guess, I'd bet there'll be some kind of "random seed" we can add as input, and for each seed we'd get a different local minimum for the model weights.
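The seed-to-local-minimum idea is already visible in miniature with plain gradient descent on a non-convex function — the seed picks the starting point, and each basin of attraction yields a different minimum (toy function and names are my own):

```python
import math, random

def f(x):  # a non-convex "loss" with several local minima
    return math.sin(3 * x) + 0.1 * x * x

def grad(x):
    return 3 * math.cos(3 * x) + 0.2 * x

def descend(seed, lr=0.01, steps=2000):
    rng = random.Random(seed)
    x = rng.uniform(-6, 6)  # the "random seed" just picks a starting point
    for _ in range(steps):
        x -= lr * grad(x)
    return round(x, 3)

minima = {descend(seed) for seed in range(10)}
print(minima)  # several distinct local minima, one per basin of attraction
```

The open question in the comment is whether the whole descent could be replaced by something closed-form, not whether seeds diversify the result — that part is routine.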
But I'm not a mathematician and there may be some sort of PROOF that what I just said can definitely never be done?
Maybe our only hope of doing LLM training runs in a tiny amount of time will be from Quantum Computing or even Photonic (wave-based) Computing.
I don't understand. The benefit of SSMs is better scalability than self-attention. Now this adds self-attention back?
So they're sort of reinventing the discrete-time differentiator from signal processing, but parameterized neurally?
See this video for a good discussion: https://youtu.be/-yo2672UikU
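For reference, the classic discrete-time differentiator is just the first-difference FIR filter y[n] = x[n] - x[n-1]; a "neural parameterization" would amount to making those two taps learnable. A quick sketch (the "learned" values are hypothetical):

```python
import numpy as np

# Classic discrete-time differentiator: y[n] = x[n] - x[n-1],
# i.e. a fixed FIR filter with kernel [1, -1].
fixed_kernel = np.array([1.0, -1.0])

# A neurally parameterized version is the same filter with learnable
# taps; training might nudge them toward (or away from) [1, -1].
learned_kernel = np.array([0.9, -1.1])  # hypothetical post-training values

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # samples of n^2
diff_fixed = np.convolve(x, fixed_kernel, mode="valid")
print(diff_fixed)  # [1. 3. 5. 7.] -- the discrete slope of n^2
```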
Is there a non-autoregressive future?
My own mental model for what Transformers must necessarily be doing, in order to be able to compute what they compute, given:
1. the primitives they're made of (for Transformers: matmul a learned matrix; vector-add a learned bias vector; normalize; softmax)
2. what those primitives can compute over a single layer
3. the low-ish total number of layers in a Transformer model
...is that they were already effectively "state space models" in practice. So this doesn't really surprise me!
(To be explicit, my assertion is that, for a given latent space between layers N and N+1 in a Transformer model, that latent space encodes a set of state variables [think CPU registers] used by the Nth serial computation step of an arbitrary set of learned algorithms — where these algorithms are limited to those whose every computation step can be encoded as a fused matmul-plus-vadd, such that the algorithm itself can be learned as a depthwise-extruded sequence of weights across the layers; where the learned algorithms can and do share state variables, both as inputs and as outputs; and where these state variables are all attenuated by an activation probability [in a Transformer: attention], such that the algorithms' outputs form a pre-multiplied conditional probability of the output given the confidence of the inputs — in turn such that the same state variable can be a low-confidence output for one algorithm and a high-confidence output for another, and the high-confidence component will swamp the low-confidence one.)
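To caricature that mental model in code — and this is purely a toy illustration of the claim, not how any real Transformer layer is implemented — a layer step would be: each "algorithm" proposes a register update via a fused matmul-plus-bias, and an attention-like softmax over confidences blends the proposals so high-confidence outputs swamp low-confidence ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # width of the latent "register file" between layers

def layer_step(state, weights, biases, confidences):
    # Each learned algorithm's step: a fused matmul-plus-vadd over the registers.
    proposals = [state @ W + b for W, b in zip(weights, biases)]
    # Attention-like gating: softmax over confidences.
    conf = np.exp(confidences) / np.exp(confidences).sum()
    # High-confidence proposals swamp low-confidence ones in the blend.
    return sum(c * p for c, p in zip(conf, proposals))

state = rng.standard_normal(d)
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
biases = [rng.standard_normal(d) * 0.1 for _ in range(3)]
confidences = np.array([2.0, -1.0, 0.5])  # stand-in for attention scores

state = layer_step(state, weights, biases, confidences)
print(state.shape)  # (8,)
```

Stacking such steps depthwise is what makes the whole thing look like a state space model evolving a hidden state, which is the point of the comment.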
> While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.