::nods:: Though in the case of diffusion, "conditional on its own (eventual) output" is more transparent and explicit.
As an example of one place that might make a difference: suppose some external syntax restriction in the sampler enforces that the next character after a space is "{".
Your normal AR LLM doesn't know about this restriction and may pick the tokens leading up to the "{" in a way that is regrettable given that a "{" is coming. The diffusion model, OTOH, can avoid that error.
In the case where there isn't an artificial constraint on the sampler, this doesn't come up, because when it's outputting the earlier tokens the AR model knows, in some sense, about its own probability of outputting a "{" later on.
But in practice pretty much everyone engages in some amount of sampler twiddling, even if just cutting off low probability tokens.
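To make the "sampler twiddling" concrete, here's a toy sketch of a hard external constraint applied as a logit mask. Everything here is hypothetical (the vocab, the stand-in `fake_logits` model, the space-then-"{" rule); real constrained decoders work the same way in spirit, masking logits the model never sees coming.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer would have tens of thousands.
vocab = [" ", "{", "}", "foo", "bar"]

def fake_logits(context):
    # Stand-in for a real model: uniform logits over the vocab.
    return [0.0] * len(vocab)

def sample_constrained(context):
    logits = fake_logits(context)
    if context and context[-1] == " ":
        # External syntax rule: the token after a space must be "{".
        # The model's logits are computed without knowledge of this mask.
        logits = [l if tok == "{" else -math.inf
                  for l, tok in zip(logits, vocab)]
    # Greedy pick for determinism; a real sampler would softmax-sample.
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

print(sample_constrained(["foo", " "]))  # the mask forces "{"
```

The regrettable-prefix problem is exactly that `fake_logits` (i.e. the AR model) already committed to the earlier tokens before the mask fired.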
As far as the internal model being sufficient: clearly it is, or AR LLMs could hardly produce coherent English. But although it's sufficient, it may not be particularly training- or weight-efficient.
I don't really know how these diffusion text models are trained so I can't really speculate, but it does seem to me that getting to make multiple passes might let them get by with less circuit depth. I think of it this way: every AR step must expend effort predicting something about the next few steps in order to output something sensible now, and this has to be done over and over again even though it doesn't change.
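A back-of-the-envelope way to see the repeated-work point (toy accounting only, with made-up numbers, not a claim about any real architecture): if each AR step implicitly re-predicts a k-token lookahead on top of the one token it actually emits, the lookahead work is redone at every step.

```python
T = 100  # tokens actually emitted
k = 4    # hypothetical lookahead each AR step implicitly re-predicts

# Each step: 1 emitted token + a lookahead over the remaining positions.
total = sum(1 + min(k, T - 1 - t) for t in range(T))
emitted = T
redundant = total - emitted  # lookahead predictions made then discarded
print(redundant)  # 390
```

With these toy numbers the AR decoder makes 390 lookahead predictions that are thrown away and recomputed at the next step, versus the 100 tokens it actually outputs; a multi-pass model could in principle amortize that work across positions instead of redoing it per step.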