0 points · nerdponx · 2y ago · 0 comments

I think you're downplaying the importance of the attention/transformer architecture here. If it were "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.

P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.
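
To make the difficulty concrete, here is a minimal sketch of the most naive estimator of P(next_word|previous_words): counting n-grams. The function names (`train_ngram`, `prob`) and the toy corpus are made up for illustration and are not from any particular library. It is only meant to show why raw counting breaks down, not how any real language model works.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def prob(counts, context, word):
    """Relative-frequency estimate of P(word | context)."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0  # context never observed: counting has nothing to say
    return seen[word] / sum(seen.values())

# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate".split()
bigram = train_ngram(corpus, n=2)
trigram = train_ngram(corpus, n=3)

p_cat = prob(bigram, ["the"], "cat")            # "the" is followed by cat, mat, cat
p_sat = prob(trigram, ["the", "cat"], "sat")    # "the cat" is followed by sat, ate
p_unseen = prob(bigram, ["dog"], "sat")         # "dog" never appears as a context
```

The number of possible contexts grows exponentially with context length, so almost every long context is unseen and gets probability zero. Any useful estimator has to generalize across contexts it has never observed, which is where the choice of architecture comes in.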

12 comments · 2 top-level

mjburgess · 2y ago · 7 in thread

Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text/etc.

HarHarVeryFunny · 2y ago

> Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

This is only partially true. I wouldn't say you could use *any* NN architecture for sequence-to-sequence prediction. You either have to model them as a potentially infinite sequence with an RNN of some sort (e.g. LSTM), or, depending on the sequence type, model them as a hierarchy of sub-sequences, using something like a multi-layered convolution or transformer.

The transformer is certainly well suited to current massively parallel hardware architectures, and this was also a large part of the motivation for the design.

While the transformer isn't the only way to do seq-2-seq with neural nets, I think the reason it is so successful is more than simply being scalable and well matched to the available training hardware. Other techniques just don't work as well. From the mechanistic interpretability work that has been done so far, it seems that learnt "induction heads", utilizing the key-based attention, and layered architecture, are what give transformers their power.
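
For readers who haven't seen the "dot-product similarity mechanism" and "key-based attention" being argued about, here is a minimal sketch of scaled dot-product attention in plain Python. It is a single head with no learned projections and no causal mask, and the vectors are hand-picked toy values, so treat it as an illustration of the mechanism rather than a transformer layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d)) weighted sum of values."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(key dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted mix of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights

# Toy inputs (assumed for illustration): three tokens, two-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]

outputs, weights = attention(queries, keys, values)
```

Each query's scores against all keys are independent of every other query's, so the whole thing reduces to a couple of matrix multiplications. That is the hardware-friendly part. The ability to put most of the weight on specific earlier tokens is the part the interpretability argument above is about.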

bruce343434 · 2y ago

I don't know why you seem to have such a bone to pick with transformers but imo it's still interesting to learn about it, and reading your dismissively toned drivel of "just" and "simply" makes me tired. You're barking up the wrong tree man, what are you on about.

mjburgess · 2y ago

No issue with transformers -- the entire field of statistical learning, decision trees to NNs, does the same thing... there's no mystery here. No person with any formal training in mathematical finance, applied statistics, hard experimental sciences on complex domains... etc. would be taken in here.

I'm trying my best to inform people who are interested in being informed, against an entire media ecosystem being played like a puppet-on-a-string by ad companies. The strategy of these companies is to exploit how easy it is to strap anthropomorphic interfaces over models of word frequencies and have everyone lose their minds.

Present the same models as a statistical dashboard, and few would be so adamant that their sci-fi fantasy is the reality.

kordlessagain · 2y ago

Somebody's judgment weights need to be updated to include emoji embeddings.

YetAnotherNick · 2y ago

No. This is blatantly false. The belief that recurrent models can't be scaled is untrue. People have recently trained Mamba with billions of parameters. The fundamental reason why transformers changed the field is that they are a lot more scalable context length wise, and LSTM, LRU, etc. don't come close.

HarHarVeryFunny · 2y ago

Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and what is lost in recurrent models where sequence history is munged into a single state.
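
A deliberately tiny toy can show what "munged into a single state" means. The recurrence below is an assumed scalar linear update with a fixed decay. It is not Mamba or an LSTM, whose states are much larger and input-dependent. It only illustrates that a fixed-size state is a lossy summary, whereas attention keeps every past token available for lookup.

```python
def recurrent_summary(xs, decay=0.5):
    """Fold a whole sequence into one number: h = decay * h + x."""
    h = 0.0
    for x in xs:
        h = decay * h + x
    return h

# Two different histories (toy values chosen so the arithmetic is exact).
history_a = [2.0, 0.0]
history_b = [0.0, 1.0]

state_a = recurrent_summary(history_a)
state_b = recurrent_summary(history_b)

# The fixed-size states collide, so nothing downstream can tell the
# histories apart. With attention, the full list of past tokens is kept,
# so the first token of each history is still directly readable.
same_state = state_a == state_b
first_tokens_differ = history_a[0] != history_b[0]
```

Real recurrent models use vector states and learned gates, so collisions are far less blunt than this, but the trade-off has the same shape: constant memory per step in exchange for no guaranteed access to any particular earlier token.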

mjburgess · 2y ago

> they are a lot more scalable context length wise

Sure, we're agreeing. I'm just being less specific.

JeremyNT · 2y ago · 3 in thread

> There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

Isn't that essentially what mjburgess said in the parent post?

> LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data... The algorithm isn't doing anything other than aligning computation to hardware

nerdponx (OP) · 2y ago

Not really, and no. Torch and CUDA align computation to hardware.

If it were just a matter of doing that, we would be fine with a fully-connected MLP. And maybe that would work with orders of magnitude more data and compute than we currently throw at these models. But we are already pushing the cutting edge of those things to get useful results out of the specialized architecture.

Choosing the right NN architecture is like feature engineering: the exact details don't matter that much, but getting the right overall structure can be the difference between learning a working model and failing to learn a working model, from the same source data with the same information content. Clearly our choice of inductive bias matters, and the transformer architecture is clearly an improvement over other designs.

Surely you wouldn't argue that a CNN is "just" aligning computation to hardware, right? Transformers are clearly showing themselves as a reliably effective model architecture for text in the same way that CNNs are reliably effective for images.
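
As a rough illustration of what the CNN's inductive bias buys, here is back-of-the-envelope parameter arithmetic comparing one convolutional layer with a fully-connected layer that maps the same input to an output of the same size. The image size and channel counts are assumed for illustration, and the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases for a square-kernel 2D convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_inputs, n_outputs):
    """Weights plus biases for a fully-connected layer."""
    return n_inputs * n_outputs + n_outputs

# Assumed toy setting: 32x32 RGB input, 16 output feature maps of the same size.
height = width = 32
in_channels, out_channels = 3, 16

conv = conv_params(3, in_channels, out_channels)
dense = dense_params(height * width * in_channels,
                     height * width * out_channels)

print(conv)   # 448
print(dense)  # 50348032
```

Both layers run fine on a GPU. The difference is that weight sharing and locality encode an assumption about images, and that assumption cuts the number of parameters to learn by about five orders of magnitude in this toy setting.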

rsfern · 2y ago

There’s some interesting work replacing scaled dot-product attention and position embeddings with fixed format MLPs [0], so I tend to lean towards thinking of classic transformers as having a reasonable enough inductive bias and the scalability to actually realize the amount of compute that’s needed.

0: https://arxiv.org/abs/2105.08050

mjburgess · 2y ago

Err... no. MLPs are fundamentally sequential algorithms (backprop weight updating). All major innovations in NN design have been to find ways of designing the architecture to fit GPU compute paradigms.

It was an innovation, in the 80s, to map image structure to the weight structure that underpins CNNs. That isn't what made CNNs trainable though... that was AlexNet, and just go read the paper... it's pretty upfront about how the NN architecture is designed to fit the GPU... that's the point of it.
