In other words, I don't really care if it scales. I almost hope it doesn't.
What is your point?
Wouldn’t weights between models be completely different? And then there are architecture differences on top of that.
One of the more surprising things is that you can actually repeat layers to improve model performance, i.e., 1-1-2-2 instead of 1-2. That’s how you get models with higher parameter counts than the original.
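For anyone who wants the 1-1-2-2 idea spelled out, here is a toy sketch. The "layers" are plain functions standing in for transformer blocks, and `repeat_layers` and `run` are names made up for illustration, not any merging tool's API. In an actual merge the duplicates would presumably be stored as separate weight copies, which is where the extra parameters would come from.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]

def repeat_layers(layers: Sequence[Layer], times: int = 2) -> List[Layer]:
    """Duplicate each layer in place: [l1, l2] -> [l1, l1, l2, l2]."""
    return [layer for layer in layers for _ in range(times)]

def run(stack: Sequence[Layer], x: float) -> float:
    """Apply the layers in order, like a forward pass through a stack."""
    for layer in stack:
        x = layer(x)
    return x

# Toy stand-ins for transformer blocks (hypothetical, not real layers).
layer1: Layer = lambda x: x + 1.0
layer2: Layer = lambda x: x * 2.0

original = [layer1, layer2]                  # 1-2
expanded = repeat_layers(original, times=2)  # 1-1-2-2
```

The expanded stack has twice as many layer slots as the original, with each block applied twice in a row before moving on to the next.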
However, maybe this is not the case. I have a bit of a history of messing with residuals in neural networks, so seeing more work on it is good. Fast-training networks are, of course, a mild obsession of mine as well, and very useful to the field. Here's hoping it pans out as a motif; I'm curious to see where it goes.
If that is the case, then it may well be possible to fix some of the scaling issues more apparent with smaller transformer models (maybe not, though). This is at least some of the reasoning that I've been applying when developing hlb-gpt, for example. It's partially also why I think changing how we use nonlinearities within the network might impact scaling, due to some of the activation spikes used in more linear regions of the network to control network behavior in a way not originally intended.
Agreed that it does require a ton of resources though. But I do think that the problem can be solved on a smaller scale. If we don't have a cleanly logarithmic curve, then I think that something is deeply wrong with our base architecture. (However, of course, I may be missing something entirely here.)
I only glanced at the paper, but they don't seem to apply a softmax to the α_i for normalization?
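To make the question concrete, here is a rough sketch of the difference between using raw mixing weights and softmax-normalized ones. This is not the paper's code: `mix`, `alphas`, and the scalar "outputs" are placeholders I made up, and real activations would be tensors, not floats.

```python
import math
from typing import List, Sequence

def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mix(outputs: Sequence[float], alphas: Sequence[float],
        normalize: bool = True) -> float:
    """Weighted sum of earlier outputs (hypothetical helper).

    normalize=True  -> weights are softmax(alphas), a convex combination
    normalize=False -> weights are the raw alphas, unconstrained
    """
    weights = softmax(alphas) if normalize else list(alphas)
    return sum(w * o for w, o in zip(weights, outputs))
```

With normalization the result always stays between the smallest and largest input, while the raw version can scale the sum up or down freely or use negative weights. Whether that constraint helps or hurts here is exactly the open question.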
2. The difference seems to diminish with scale. Real-world transformers are obviously much larger and train on many more tokens.
3. A very significant part of training transformer models is the throughput and memory optimizations. I wonder how their model would work with fused kernels or specialized paged KV cache schemes. Or activation checkpointing, if run locally.
4. Indeed, they claim no memory impact, but their code shows that their experiments were conducted with a specially optimized version that requires all activations to reside in a single tensor at all times. Not sure this would work with 3D parallelism on multiple nodes, etc.
> This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.
I found this particularly charming.