P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.
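To make the "hard to estimate in a useful way" point concrete, here is a minimal sketch of the naive approach: estimate P(next_word|previous_word) by counting pairs. The toy corpus and the helper names `bigram_model` and `prob` are made up for illustration, not taken from any library. Anything unseen in the corpus gets probability zero, and conditioning on longer histories only makes the counts sparser.

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the cat ran".split()
model = bigram_model(corpus)

p_seen = prob(model, "the", "cat")    # "the" is followed by cat, mat, cat
p_unseen = prob(model, "the", "dog")  # never observed, so the estimate is zero
```

The zero for an unseen but perfectly plausible continuation is the problem: a useful model has to generalize beyond the exact contexts it has counted.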
There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text and similar data.
This is only partially true. I wouldn't say you could use *any* NN architecture for sequence-to-sequence prediction. You either have to model them as a potentially infinite sequence with an RNN of some sort (e.g. LSTM), or, depending on the sequence type, model them as a hierarchy of sub-sequences, using something like a multi-layered convolution or transformer.
The transformer is certainly well suited to current massively parallel hardware architectures, and this was also a large part of the motivation for the design.
While the transformer isn't the only way to do seq2seq with neural nets, I think the reason it is so successful is more than simply being scalable and well matched to the available training hardware. Other techniques just don't work as well. From the mechanistic interpretability work that has been done so far, it seems that learnt "induction heads", which rely on key-based attention and the layered architecture, are what give transformers their power.
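For anyone unfamiliar with what "key-based attention" buys you, here is a rough sketch of an induction-style lookup with a single attention query. Everything here is hand-written for illustration: the one-hot vectors, the scale factor of 8, and the assumption that each position's key already encodes the *previous* token (in a real model an earlier layer would have to learn that, and none of these vectors are learned weights). The sequence is "A B C A"; at the final A, the query matches the key of the position that came right after the earlier A, and the value read out is B.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical toy embeddings: scaled one-hot vectors for tokens A, B, C.
A, B, C = [8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0]
one_hot = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}

# Sequence "A B C A". Keys at positions 1..3 encode the PREVIOUS token,
# values encode the token AT that position.
keys = [A, B, C]                                    # previous tokens: A, B, C
values = [one_hot["B"], one_hot["C"], one_hot["A"]]  # current tokens: B, C, A

# The query is the current token (the final A): "what followed A last time?"
output, weights = attend(A, keys, values)
predicted = max(range(3), key=lambda i: output[i])  # index 1 corresponds to B
```

With these toy vectors nearly all of the attention weight lands on the first key, so the output is close to the one-hot vector for B. The lookup is content-addressed rather than position-addressed, which is the property an RNN or a plain MLP does not get for free.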
I'm trying my best to inform people who are interested in being informed, against an entire media ecosystem being played like a puppet-on-a-string by ad companies. The strategy of these companies is to exploit how easy it is to strap anthropomorphic interfaces over models of word frequencies and have everyone lose their minds.
Present the same models as a statistical dashboard, and few would be so adamant that their sci-fi fantasy is the reality.
Sure, we're agreeing. I'm just being less specific.
Isn't that essentially what mjburgess said in the parent post?
> LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data... The algorithm isnt doing anything other than aligning computation to hardware
If it were just a matter of doing that, we would be fine with a fully-connected MLP. And maybe that would work with orders of magnitude more data and compute than we currently throw at these models. But we are already pushing the cutting edge of those things to get useful results out of the specialized architecture.
Choosing the right NN architecture is like feature engineering: the exact details don't matter that much, but getting the right overall structure can be the difference between learning a working model and failing to learn a working model, from the same source data with the same information content. Clearly our choice of inductive bias matters, and the transformer architecture is clearly an improvement over other designs.
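A back-of-the-envelope sketch of the fully-connected vs. structured comparison, under loudly stated assumptions: a context of 1024 tokens and a model width of 512 (both illustrative numbers, not from any particular model), a single dense layer mapping the flattened context to an output of the same size, and a single attention layer with just its query, key, value and output projections. Biases, feed-forward blocks and embeddings are ignored, and the function names are hypothetical.

```python
def dense_over_context_params(n_tokens, d_model):
    """Weights in one dense layer mapping a flattened context to the same size."""
    width = n_tokens * d_model
    return width * width

def attention_params(d_model):
    """Weights in the Q, K, V and output projections of one attention layer."""
    return 4 * d_model * d_model

# Illustrative sizes (assumptions, not taken from a real model).
n_tokens, d_model = 1024, 512

dense = dense_over_context_params(n_tokens, d_model)  # (1024 * 512) ** 2
attn = attention_params(d_model)                      # 4 * 512 ** 2
ratio = dense // attn
```

The attention count does not depend on context length at all, because the same projections are shared across positions. That weight sharing is the inductive bias: the dense layer could in principle represent the same function, but it has vastly more parameters to fit from the same data.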
Surely you wouldn't argue that a CNN is "just" aligning computation to hardware, right? Transformers are clearly showing themselves as a reliably effective model architecture for text in the same way that CNNs are reliably effective for images.
It was an innovation, in the 80s, to map image structure to the weight structure that underpins CNNs. That isn't what made CNNs trainable, though. That was AlexNet, and just go read the paper: it's pretty upfront about how the NN architecture is designed to fit the GPU. That's the point of it.