::nods:: Though in the case of diffusion, "conditional on its own (eventual) output" is more transparent and explicit.
As an example of one place that might make a difference: suppose some external syntax restriction in the sampler enforces that the next character after a space is "{".
Your normal AR LLM doesn't know about this restriction and may pick the tokens leading up to the "{" in a way that is regrettable given that a "{" is coming. The diffusion model, OTOH, can avoid that error.
In the case where there isn't an artificial constraint on the sampler, this doesn't come up, because when it's outputting the earlier tokens the AR model knows, in some sense, about its own probability of outputting a "{" later on.
But in practice pretty much everyone engages in some amount of sampler twiddling, even if just cutting off low probability tokens.
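To make the "sampler twiddling" concrete, here's a toy sketch of a hard external constraint applied as a logit mask. Everything here is hypothetical (the vocab, the stand-in `fake_logits` model, the space-then-"{" rule); real constrained decoders work the same way in spirit, masking logits the model never sees coming.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer would have tens of thousands.
vocab = [" ", "{", "}", "foo", "bar"]

def fake_logits(context):
    # Stand-in for a real model: uniform logits over the vocab.
    return [0.0] * len(vocab)

def sample_constrained(context):
    logits = fake_logits(context)
    if context and context[-1] == " ":
        # External syntax rule: the token after a space must be "{".
        # The model's logits are computed without knowledge of this mask.
        logits = [l if tok == "{" else -math.inf
                  for l, tok in zip(logits, vocab)]
    # Greedy pick for determinism; a real sampler would softmax-sample.
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

print(sample_constrained(["foo", " "]))  # the mask forces "{"
```

The regrettable-prefix problem is exactly that `fake_logits` (i.e. the AR model) already committed to the earlier tokens before the mask fired.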
As far as the internal model being sufficient: clearly it is, or AR LLMs could hardly produce coherent English. But although it's sufficient, it may not be particularly training- or weight-efficient.
I don't really know how these diffusion text models are trained so I can't really speculate, but it does seem to me that getting to make multiple passes might let them get by with less circuit depth. I think of it this way: every AR step must expend effort predicting something about the next few steps in order to output something sensible now, and this has to be done over and over again even though it doesn't change.
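A back-of-the-envelope way to see the repeated-work point (toy accounting only, with made-up numbers, not a claim about any real architecture): if each AR step implicitly re-predicts a k-token lookahead on top of the one token it actually emits, the lookahead work is redone at every step.

```python
T = 100  # tokens actually emitted
k = 4    # hypothetical lookahead each AR step implicitly re-predicts

# Each step: 1 emitted token + a lookahead over the remaining positions.
total = sum(1 + min(k, T - 1 - t) for t in range(T))
emitted = T
redundant = total - emitted  # lookahead predictions made then discarded
print(redundant)  # 390
```

With these toy numbers the AR decoder makes 390 lookahead predictions that are thrown away and recomputed at the next step, versus the 100 tokens it actually outputs; a multi-pass model could in principle amortize that work across positions instead of redoing it per step.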