The supposedly dynamic/temporal nature of the model seems to be not applied for GPU execution, collapsing it into a single static computation equivalent to just applying a pre-calculated sparsity mask.
Perhaps a bit cynical of me, but it feels like wrapping standard sparse computing and operator fusion in complex, biological jargon...
There is something interesting in this post, namely that it's based on non-Nvidia GPUs, in this case MetaX [2]. I don't know how competitive MetaX are today, but I would not bet against China in the longer term.
[1] https://cointelegraph.com/news/neuromorphic-computing-breakt...
[2] https://en.wikipedia.org/wiki/MetaX
[3] K. S. Kendler, A history of metaphorical brain talk in psychiatry. https://www.nature.com/articles/s41380-025-03053-6
There is something refreshingly consistent in a VC that is laser focused on enterprise CRM dashboard for dashboards workflow optimization ChatGPT wrappers that also filters out the neuromorphicists.
Reminds me of how the Samurai were so used to ritual dueling and reading their lineages before battle but when the Mongolians encountered them they just shot the samurai mid-speech.
If we just look at spikes as a different numerical representation, then they are clearly inferior. For example, consider that encoding the number 7 will require seven consecutive pulses on a single spiking line. Encoding the number in binary will require one pulse on three parallel lines.
Binary encoding wins 7x in speed and 7/3=2.333x in power efficiency...
On the other hand, if we assume that we are able to encode information in the gaps between pulses, then things quickly change.
In the brain the relative timing/ordering of different neurons asynchronously activating (A before B, or B before A) is also used (spike-timing-dependent plasticity - STDP) as a learning signal to strengthen or weaken connection strengths, presumably to learn sequence prediction in this asynchronous environment.
STDP also doesn't imply that spikes or single neuron spike train inter-spike timings are necessary - an activation event with a strength and timestamp would seem to be enough to implement a digital dataflow design, although ultimately a custom analog design may be more efficient.
Also known as a serial interface. They are very successful: PCIe lane, SATA, USB.
Brain research showed that's happening, too. You'll see many models like this if you DuckDuckGo for "spiking" "temporal" "encoding" or subtitute "time" for temporal. You can further use "neural" "network" or "brain" focus it on sub-fields.
The brain is doing shit like this.
Shouldn’t one bold the better numbers?
Isn't that in essence very similar to Quantization Aware Training (QaT)?
But I understand that they simulate the spikes as integer events in the forward pass (as described here https://github.com/BICLab/Int2Spike) and calculate a continuous gradient based on high resolution weights for the backward pass.
This seems to be very similar to the straight-through-estimator (STE) approach that us usually used for quantization aware training. I may be wrong though.
https://en.wikipedia.org/wiki/MetaX
They have GPU manufacturers that nobody in the west has ever heard of.
/s
> Our architectural choices are closely aligned with principles observed in biological brains.
How? They point out three design choices: linear attention, MoE layers, and spike coding.
Apparently linear attention is brain-inspired because it can be viewed as a "simplified abstraction of dendritic dynamics with multi-branch morphology." Who knows what that means exactly [1]. They don't discuss it further. MoE layers apparently reflect "a principle of modular specialization." Fine, whatever.
Now, using a dozen attention variants + MoE is bog standard. The real novelty would be spike coding. Page 11 is dedicated to the different ways they could turn signals into spike trains, including such biologically-inspired mechanisms as using two's complement. However, they don't actually do spike coding in a time domain. In their implementation, "spike coding" apparently means to turn activations into integers. Section 3.3.3 claims that this lets us simulate an underlying spiking neural network, so we can validate the spiking approach without using special hardware. But if your SNN can be simulated faithfully on a GPU by turning things into integers, isn't that a bit of a depressing SNN?
Either I'm missing something, or this is just just dressing standard techniques with loads of meaningless jargon. Of course that’s a very popular way to operate in deep learning nowadays.
[1] Like, attention can draw from multiple tokens, sort of like how different spines of a dendrite can draw from multiple axons? Can’t make this stuff up.