
embedding-shape · 9d ago

As mentioned, I've just finished the implementation and started playing around with it, and it seems to be doing about as well inside my own agent harness as similarly sized "traditional" LLMs. Of course, neither comes close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels too good to quickly ditch diffusion. Do you have more info on what those "can't be trained beyond low/mid size" issues are in practice today?



zozbot234 · 9d ago

The issues around training diffusion models are well known among researchers. They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally-sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.

embedding-shape (OP) · 9d ago

> They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself

I think people used to say the same about the 8B text-diffusion models, like LLaDA, when they came out. LLaDA2.0 seemingly claims a 100B total / 6.1B active MoE diffusion model (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but consensus has a way of changing over time. It might be a bit early to claim it's infeasible to scale them, especially considering that the final artifact is much more suitable for local usage.

famouswaffles · 9d ago

Difficulty of scaling is not the only issue. Nobody is going to be particularly invested in scaling an architecture that has:

- consistently lagged behind its auto-regressive counterparts in quality. Look at the DiffusionGemma benchmarks: pretty steep dropoffs, and the more difficult the benchmark, the worse the dropoff. That's not a good look, and it's not like it's some artifact of Google's release. Every dLLM is like this.

- inference benefits that are negated at scale. Auto-regressive models are still cheaper if you want to serve lots of users.

> "DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"

Put yourself in the shoes of all the labs, even open source ones. Why would you put much effort into this?

