> RNNs are particularly suitable for sequence modelling settings such as those involving time series, natural language processing, and other sequential tasks where context from previous steps informs the current prediction.
I would like to draw an analogy to digital signal processing. If you think of the recurrent-style architectures as IIR filters and feedforward-only architectures as FIR filters, you will likely find many parallels.
The most obvious to me being that IIR filters typically require far fewer elements to produce the same response as an equivalent FIR filter. Granted, the FIR filter is often easier to implement/control/measure in practical terms (fixed-point arithmetic hardware == ML architectures that can run on GPUs).
I don't think we get to the exponential scary part of AI without some fundamentally recurrent architecture. I think things like LSTM are kind of an in-between hack in this DSP analogy - You could look at it as FIR with dynamic coefficients. Neuromorphic approaches seem like the best long term bet to me in terms of efficiency.
FIR filters are way simpler to design and can capture memory without hacks.
The most compelling and obvious one to me is hardware purpose-built to simulate spiking neural networks. In the happy case, SNNs are extremely efficient. Basically consuming no energy. You could fool yourself into thinking we can just do this on the CPU due to the sparsity of activations. I think there is even a set of problems this works well for. But, in the unhappy cases SNNs are impossible to simulate on existing hardware. Neuronal avalanches follow power law distribution and meaningfully-large ones would require very clever techniques to simulate with any reasonable fidelity.
> the system isn't just simulating neurons but involves a variety of methods and interactions across "agents" or sub-systems.
I think the line between "neuron" and "agent" starts to get blurry in this arena.
We have NO idea how the brain produces intelligence, and as long as that doesn't change, "neuromorphic" is merely a marketing term, like Neurotypical, Neurodivergent, Neurodiverse, Neuroethics, Neuroeconomics, Neuromarketing, Neurolaw, Neurosecurity, Neurotheology, Neuro-Linguistic Programming: the "neuro-" prefix is suggesting a deep scientific insight to fool the audience. There is no hope of us cracking the question of how the human brain produces high-level intelligence in the next decade or so.
Neuromorphic does work for some special purpose applications.
I would highly recommend it to people who love a good "near future" scifi book.
I’ve been thinking the same for a while, but I’m starting to wonder if giant context windows are good enough to get us there. I think recurrency is more neuromorphic, and possibly important in the longer run, but maybe not required for SI.
I’m also just a layman with just a surface level understanding of these things, so I may be completely ignorant and wrong.
And things have improved a lot since then.
I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.
The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than focusing on perfect weights. The higher dimensionality you can provide to a model, the greater its theoretical storage capacity. This could resemble a two-layer model—one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.
When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates these minima by reconfiguring the model when needed, we could theoretically develop a single model with near-infinite local minima—and therefore, higher-dimensional memory. This may sound wild, but consider the fact that the human brain potentially creates and disconnects thousands of new connections in a single day. Could it be that these connections steer our internal loss landscape between different minima we need throughout the day?
Models that change size as needed have been experimented with, but they are either too inefficient or difficult to optimize at a limited power budget. However, I agree that they are likely what is needed if we want to continue to scale upward in size.
I suspect the real bottleneck is a breakthrough in training itself. Backpropagation loss is too simplistic to optimize our current models perfectly, let alone future larger ones. But there is no guarantee a better alternative exists which may create a fixed limit to current ML approaches.
What I’m advocating is a substantial increase in this aspect—keeping model size the same while expanding dimensionality. The "curse of dimensionality" illustrates how a modest increase in dimensions leads to a significantly larger volume.
While I agree that backpropagation isn’t a complete solution, it’s ultimately just a stochastic search method. The key point here is that expanding the dimensionality of a model’s space is likely the only viable long-term direction. To achieve this, backpropagation needs to work within an increasingly multidimensional space.
A useful analogy is training a small model on random versus structured data. With structured data, we can learn an extensive amount, but with random data, we hit a hard limit imposed by the network. Why is that?
"Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances how curves are architected and fit to the dataset (including applying more computing power).
Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data as 'x -> a * sin x + b' about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right, these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independent of the model, eg augmenting your training data with noise. But some others are very model dependent.
I'm not sure how the 'agentic' approaches fit here.
I'm sure there are optimizations from the model shape as well, but I don't think that running the best algorithms we have today with hardware from five-ten years ago would have worked in any reasonable amount of time/money.
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPU's at scale.
This! Not just fastest but with the lowest resources in total.
Fully connected neural networks are universal functions. Technically we don’t need anything but a FNN, but memory requirements and speed would be abysmal far beyond the realm of practicality.
Not to him, he runs the ARC challenge. He wants a new approach entirely. Something capable of few-shot learning out of distribution patterns .... somehow
For example when CNNs took over computer vision that wasn't because they were doing something that dense networks couldn't do. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly transformers are great because they allow us to train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs a lot faster to train they are actually pretty good. Training speed and efficiency remains the big bottleneck, not the actual expressiveness of the architecture
Some links if interested:
[1] https://gpt3experiments.substack.com/p/understanding-neural-...
[2] https://gpt3experiments.substack.com/p/building-a-vector-dat...
On the other hand, while "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." is true, a sufficiently expressive mechanism may not be computationally or memory efficient. As both are constraints on what you can actually build, it's not whether the architecture can produce the result, but whether a feasible/practical instantiation of that architecture can produce the result.
You may be referring to Aidan Gomez (CEO of Cohere and contributor to the transformer architecture) during his Machine Learning Street Talk podcast interview. I agree, if as much attention had been put towards the RNN during the initial transformer hype, we may have very well seen these advancements earlier.
(Somewhat) fun and (somewhat) related fact: there's a whole cottage industry of "is all you need" papers https://arxiv.org/search/?query=%22is+all+you+need%22&search...
Another thing about the architecture is we inherently bias it with the way we structure the data. For instance, take a dataset of (car) traffic patterns. If you only track the date as a feature, you miss that some events follow not just the day-of-year pattern but also holiday patterns. You could learn this with deep learning with enough data, but if we bake it into the dataset, you can build a model on it _much_ simpler and faster.
So, architecture matters. Data/feature representation matters.
I think you need a hidden layer. I’ve never seen a universal approximation theorem for a single layer network.
I think probably future NNs will be maybe more adaptive than this perhaps where some Perceptrons use sine wave functions, or other kinds of math functions, beyond just linear "y=mx+b"
It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
Ax = y
Adding another layer is just multiplying a different set of weights times the output of the first, so
B(Ax)= y
If you remember your linear algebra course, you might see the problem: that can be simplified
(BA)x = y
Cx = y
Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.
To prevent this collapse, a non linear function must be introduced between each layer.
In Ye Olden days (the 90’s) we used to approximate non-linear models using splines or seperate slopes models - fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.
Neural networks “just” do this automatically, and faster.
Here's why.
A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history. But what is an RNN to do? While the length of its context is unlimited, the amount of information the model retains about it is bounded by whatever is in its hidden state at any given time.
Relevant: https://arxiv.org/abs/2402.01032
This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.
It's not just a matter of organizing things differently. Suppose your network dimension and sequence length are both X.
Then your memory usage (per layer) will be O(X^2), while your training update cost will be O(X^3). That's for both Transformers and RNNs.
However, at the end of the sequence, a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers.
Which isn't necessary. If you say "translate the following to german." Instead, all it needs is to remember the task at hand and a much smaller amount of recent input. Well, and the ability to output in parallel with processing input.
A model can decide to forget something that turns out to be important for some future prediction. A human can go back and re-read/listen etc, A transformer is always re-reading but a RNN can't and is fucked.
Consider the flood fill algorithm or union-find algorithm, which feels magical upon first exposure.
https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori...
Having 2 passes can enable so much more than a single pass.
Another alternative could be to have a first pass make notes in a separate buffer while parsing the input. The bandwidth of the note taking and reading can be much much lower than that required for fetching the billions of parameters.
But it's trained on translations, rather than the whole Internet.
Do we have solutions for these two problems now?
RNN are constantly updating and overwriting their memory. It means they need to be able to predict what is going to be useful in order to store it for later.
This is a massive advantage for Transformers in interactive use cases like in ChatGPT. You give it context and ask questions in multiple turns. Which part of the context was important for a given question only becomes known later in the token sequence.
To be more precise, I should say it's an advantage of Attention-based models, because there are also hybrid models successfully mixing both approaches, like Jamba.
A great thing about RNNs is they can easily fork the state and generate trees, it would be possible to backtrack and work on combinatorial search problems.
Also easier to cache demonstrations for free in the initial state, a model that has seen lots of data is not using more memory than a model starting from scratch.
https://www.semanticscholar.org/paper/Long-Short-Term-Memory...
I find it interesting that this knowledge seems to be all but forgotten now. Back in the day, ca. 2014, LSTMs were all the rage, e.g. see:
I see this stuff everywhere online and it's often taught this way so I don't blame folks for repeating it, but I think it's likely promulgated by folks who don't train LSTMs with long contexts.
LSTMs do add something like a "skip-connection" (before that term was a thing) which helps deal with the catastrophic vanishing gradients you get from e.g. Jordan RNNs right from the jump.
However (!), while this stops us from seeing vanishing gradients after e.g. 10s or 100s of time-steps, when you start seeing multiple 1000s of tokens, the wheels start falling off. I saw this in my own research, training on amino acid sequences of 3,000 length led to a huge amount of instability. It was only after tokenizing the amino acid sequences (which was uncommon at the time) which got us down to ~1500 timesteps on average, did we start seeing stable losses at training. Check-out the ablation at [0].
You can think of ResNets by analogy. ResNets didn't "solve" vanishing gradients, there's a practical limit of the depth of networks, but it did go a long way towards dealing with it.
EDIT: I wanted to add, while I was trying to troubleshoot this for myself, it was super hard to find evidence online of why I was seeing instability. Everything pertaining to "vanishing gradients" and LSTMs were blog posts and pre-prints which just merrily repeated "LSTMs solve the problem of vanishing gradients". That made it hard for me, a junior PhD at the time, to suss out the fact that LSTMs do demonstrably and reliably suffer from vanishing gradients at longer contexts.
[0] https://academic.oup.com/bioinformatics/article/38/16/3958/6...
Recent comments from him have said that any architecture can achieve transformer accuracy and recall, but we have devoted energy to refining transformers, due to the early successes.
I don't want to downplay the value of these models. Some people seem to be under the perception that transformers replaced or made them obsolete, which is faar from the truth.
This is very clever and very interesting. The paper continuously calls it a "decade-old architecture," but in practice, it's still used massively, thanks to its simplicity in adapting to different domains. Placing it as a "competitor" to transformers is also not quite fully fair, as transformers and RNNs are not mutually exclusive, and there are many methods that merge them.
Improvement in RNNs is an improvement in a lot of other surprising places. A very interesting read.
class MinGRU(nn.Module):
def __init__(self, token_size, hidden_state_size):
self.token_to_proposal = nn.Linear(token_size, hidden_size)
self.token_to_mix_factors = nn.Linear(token_size, hidden_size)
def forward(self, previous_hidden_state, current_token):
proposed_hidden_state = self.token_to_proposal(current_token)
mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
And since the proposed hidden states and mix factors for each layer are both only dependent on the current token, you can compute all of them in parallel if you know the whole sequence ahead of time (like during training), and then combine them in linear time using parallel scan.The fact that this is competitive with transformers and state-space models in their small-scale experiments is gratifying to the "best PRs are the ones that delete code" side of me. That said, we won't know for sure if this is a capital-B Breakthrough until someone tries scaling it up to parameter and data counts comparable to SOTA models.
One detail I found really interesting is that they seem to do all their calculations in log-space, according to the Appendix. They say it's for numerical stability, which is curious to me—I'm not sure I have a good intuition for why running everything in log-space makes the model more stable. Is it because they removed the tanh from the output, making it possible for values to explode if calculations are done in linear space?
EDIT: Another thought—it's kind of fascinating that this sort of sequence modeling works at all. It's like if I gave you all the pages of a book individually torn out and in a random order, and asked you to try to make a vector representation for each page as well as instructions for how to mix that vector with the vector representing all previous pages — except you have zero knowledge of those previous pages. Then, I take all your page vectors, sequentially mix them together in-order, and grade you based on how good of a whole-book summary the final vector represents. Wild stuff.
FURTHER EDIT: Yet another thought—right now, they're just using two dense linear layers to transform the token into the proposed hidden state and the lerp mix factors. I'm curious what would happen if you made those transforms MLPs instead of singular linear layers.
Mine worked, but it was very simple and dog slow, running on my old laptop. Nothing was ever going to run fast on that thing, but I remember my RNN being substantially slower than a feed-forward network would have been.
I was so confident that this was dead technology -- an academic curiosity from the 1980s and 1990s. It was bizarre to see how quickly that changed.
Emphasis on not necessarily.
>> The main conclusion is that RNN class networks can be trained as efficiently as modern alternatives but the resulting performance is only competitive at small scale.
Shouldn't the conclusion be "the resulting competitive performance has only been confirmed at small scale"?
We were able to build generators that could replicate any dataset they were trained on, and would produce unique deviations, but match the statistical underpinnings of the original datasets.
https://medium.com/capital-one-tech/why-you-dont-necessarily...
We built several text generators for bots that similarly had very good results. The introduction of the transformer improved the speed and reduced the training / data requirements, but honestly the accuracy changed minimal.
Compare with one human brain. Far more sophisticated, even beyond our knowledge. What does it take to power it for a day? Some vegetables and rice. Still fine for a while if you supply pure junk food -- it'll still perform.
Clearly we have a long, long way to go in terms of the energy efficiency of AI approaches. Our so-called neural nets clearly don't resemble the energy efficiency of actual biological neurons.
Based on my own experience, I would struggle to generate that much text without fries and a drink.
[1] https://www.theverge.com/24066646/ai-electricity-energy-watt...
But yes, food energy could be useful for AI. A little dystopian potentially too, if you think about it. Like DARPA's EATR robot, able to run on plant biomass (although potentially animal biomass too, including human remains):
https://en.wikipedia.org/wiki/Energetically_Autonomous_Tacti...
Maybe the future of AI is in organic neurons?
It's obvious why the newest toy from openai can solve problems better mostly by just being allowed to "talk to itself" for a moment before starting the answer that human sees.
Given that, modern incarnation of RNN can be vastly cheaper than transformers provided that they can be trained.
Convolutional neural networks get more visual understanding by "reusing" their capacity across the area of the image. RNN's and transformers can have better understanding of a given problem by "reusing" their capacity to learn and infer across time (across steps of iterative process really).
When it comes to transformer architecture the attention is a red herring. It's just more or less arbitrary way to partition the network so it can be parallelized. The only bit of potential magic is with "shortcut" links between non adjacent layers that help propagate learning back through many layers.
Basically the optimal network is deep, dense (all neurons connect with all belonging to all preceding layers) that is ran in some form of recurrence.
But we don't have enough compute to train that. So we need to arbitrarily sever some connections so the whole thing is easier to parallelized. It really doesn't matter which unless we do in some obviously stupid way.
Actual inventive magic part of LLMs possibly happens in token and positional encoders.
From theory the answer to the question should be "yes", they are Turing complete.
The real question is about how to train them, and the paper is about that.
In more practical terms, you would imagine that an advanced model contains some semblance of a CPU to be able to truly reason. Given that CPUs can be all NAND gates (which take 2 neurons to represent), and are structured in a recurrent way, you fundamentally have to rethink how to train such a network, because backprop obviously won't work to capture things like binary decision points.
1: https://www.sciencedirect.com/science/article/pii/0893965991...
2: https://en.wikipedia.org/wiki/Rice%27s_theorem?useskin=vecto...
Can RNNs be as good as Transformers at recalling information from previous tokens in a sequence?
Transformers excel at recalling info, likely because they keep all previous context around in an ever-growing KV cache.
Unless proponents of RNNs conclusively demonstrate that RNNs can recall info from previous context at least as well as Transformers, I'll stick with the latter.
This is obvious when one considers the connections between Transformers, RNNs, Hopfield networks and the Ising model, a model from statistical mechanics which is solved by calculating the partition function.
This interpretation provides us with some very powerful tools that are commonplace in math and physics but which are not talked about in CS & ML.
I'm working on a startup http://traceoid.ai which takes this exact view. Our approach enables faster training and inference, interpretability and also scalable energy-based models, the Holy Grail of machine learning.
Join the discord https://discord.com/invite/mr9TAhpyBW or follow me on twitter https://twitter.com/adamnemecek1
BPTT was their problem
1) Yoshua's reputation would take a hit if this paper were bullshit, so he has extrinsic motivation to make it good 2) Yoshua has enough experience in the field to know what is going on in the field, you don't have to ask if he forgot about a certain architecture or the work of a certain research group which would contradict his findings-- if such work exists and is credible, it is very likely to be discussed in the paper. 3) This test answers something a leader in the field thinks is important enough for them to work on, else he wouldn't be involved.
Also note, the poster said the paper shouldn't be taken lightly. That doesn't mean we need to take it blindly. It only means we cannot dismiss it out of hand, if we have a different view we would need substantive arguments to defend our view.
I've overturned the field leader several times in science, but that's only because I acknowledged what they got right and that they were indeed the person who got it right.
In 2016 my team from Salesforce Research published our work on the Quasi-Recurrent Neural Network[1] (QRNN). The QRNN variants we describe are near identical (minGRU) or highly similar (minLSTM) to the work here.
The QRNN was used, many years ago now, in the first version of Baidu's speech recognition system (Deep Voice [6]) and as part of Google's handwriting recognition system in Gboard[5] (2019).
Even if there are expressivity trade-offs when using parallelizable RNNs they've shown historically they can work well and are low resource and incredibly fast. Very few of the possibilities regarding distillation, hardware optimization, etc, have been explored.
Even if you need "exact" recall, various works have shown that even a single layer of attention with a parallelizable RNN can yield strong results. Distillation down to such a model is quite promising.
Other recent fast RNN variants such as the RWKV, S4, Mamba et al. include citations to QRNN (2016) and SRU (2017) for a richer history + better context.
The SRU work has also had additions in recent years (SRU++), doing well in speech recognition and LM tasks where they found similar speed benefits over Transformers.
I note this primarily as the more data points, especially when strongly relevant, the better positioned the research is. A number of the "new" findings from this paper have been previously explored - and do certainly show promise! This makes sure we're asking new questions with new insights (with all the benefit of additional research from ~8 years ago) versus missing the work from those earlier.
[1] QRNN paper: https://arxiv.org/abs/1611.01576
[2] SRU paper: https://arxiv.org/abs/1709.02755
[3]: SRU++ for speech recognition: https://arxiv.org/abs/2110.05571
[4]: SRU++ for language modeling: https://arxiv.org/abs/2102.12459
[5]: https://research.google/blog/rnn-based-handwriting-recogniti...
An AI Winter is not a great an idea, but an AI Autumn may be beneficial.
Just have no major AI conferences for ‘25, perhaps only accept really high tier literature reviews.
Opinions probably differ, for example, on John Backus's paper "Can programming be liberated from the Von Neumann style?" Many fans of functional programming would say the answer is yes, but Backus himself expressed less enthusiasm in interviews later in his life.
I think the important point, though, is that academic papers and newspaper articles are not the same, and titles in the form of questions function differently in the two domains. Journalists tend to use titles like these to dissemble and sensationalize. When academics use these kinds of titles for peer-reviewed articles, it's because they really are asking an honest question. Backus was doing it in his paper. The authors of this paper are doing the same. They end the paper by re-iterating the question before launching into a discussion of the limitations that prevent them from reaching any firm conclusions on the answer to this question.
Everything else is just details.
What LLMs have shown both Neuroscience and Computer Science is that reasoning is a mechanical process (or can be simulated by mechanical processes) and is not purely associated only with consciousness.