undefined | Better HN

0 pointshodgehog117mo ago0 comments

Working in the theory, I can say this is incredibly unlikely. At scale, once appropriately trained, all architectures begin to converge in performance.

It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.

0 comments

viraptor7mo ago

Do we really have the data on this? I mean, it does happen on a smaller scale, but where's the 300B version of RWKV? Where's hybrid symbolic/LLM? Where are other experiments? I only see larger companies doing relatively small tweaks to the standard transformers, where the context size still explodes the memory use - they're not even addressing that part.

hodgehog11OP7mo ago

True, we can't say for certain. But there is a lot of theoretical evidence too, as the leading theoretical models for neural scaling laws suggest finer properties of the architecture class play a very limited role in the exponent.

We know that transformers have the smallest constant in the neural scaling laws, so it seems irresponsible to scale another architecture class to extreme parameter sizes without a very good reason.

fdsjgfklsfd7mo ago

Do you mean "all variants of the same stacked transformer architecture converge in performance"? Or do you know of tests against some other architecture? The diffusion-based LLMs?

highfrequency7mo ago

Could you elaborate with a few more paragraphs? What do you mean by “working in the theory?”

hodgehog11OP7mo ago

People often talk in terms of performance curves or "neural scaling laws". Every model architecture class exhibits a very similar scaling exponent because the data and the training procedures are playing the dominant role (every theoretical model which replicates the scaling laws exhibit this property). There are some discrepancies across model architecture classes, but there are hard limits on this.

Theoretical models for neural scaling laws are still preliminary of course, but all of this seems to be supported by experiments at smaller scales.

j / k navigate · click thread line to collapse

0 comments

viraptor7mo ago

hodgehog11OP7mo ago

We know that transformers have the smallest constant in the neural scaling laws, so it seems irresponsible to scale another architecture class to extreme parameter sizes without a very good reason.

fdsjgfklsfd7mo ago

Do you mean "all variants of the same stacked transformer architecture converge in performance"? Or do you know of tests against some other architecture? The diffusion-based LLMs?

highfrequency7mo ago

Could you elaborate with a few more paragraphs? What do you mean by “working in the theory?”

hodgehog11OP7mo ago

Theoretical models for neural scaling laws are still preliminary of course, but all of this seems to be supported by experiments at smaller scales.

j / k navigate · click thread line to collapse