Thought I'd share this for others interested. I've modified txt2img to save an image after each step. (actually quite easy as you can specify an img_callback to sampler)
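For anyone who wants to try the same thing, here's a rough sketch of what such a callback could look like. It assumes the sampler calls `img_callback(pred_x0, step)` with its current estimate and the step index; `make_step_saver` is a made-up name, and the loop at the bottom is only a stand-in for the real sampler so the snippet runs without a model.

```python
import os
import tempfile

import numpy as np


def make_step_saver(out_dir):
    """Return a callback that stores the sampler's current estimate at each step."""
    os.makedirs(out_dir, exist_ok=True)
    saved_paths = []

    def img_callback(pred_x0, step):
        # Assumption: pred_x0 is the sampler's current latent estimate.
        # In a real run you would decode it and write an image file instead;
        # saving the raw array keeps this sketch free of model dependencies.
        path = os.path.join(out_dir, f"step_{step:04d}.npy")
        np.save(path, np.asarray(pred_x0))
        saved_paths.append(path)

    return img_callback, saved_paths


# Stand-in for the sampling loop: call the callback once per step.
out_dir = tempfile.mkdtemp()
callback, saved_paths = make_step_saver(out_dir)
latent = np.zeros((4, 8, 8))
for step in range(5):
    latent = latent + 0.1  # placeholder for one denoising update
    callback(latent, step)
```

In the real script you'd pass the callback to the sampler call instead of driving it from a toy loop.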
Interestingly, both of these runs use the same seed and prompt, yet they yield different final images; the only difference is the number of DDIM sampling steps. I'd love to understand why if anyone has any idea.
Maybe it's doing something fancy with the total number of steps, beyond just stopping after the count is reached.
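One plausible explanation, sketched below: a DDIM run with N steps doesn't execute the first N steps of a longer run. It picks N timesteps spread across the whole training schedule, so the two runs take different-sized jumps from the same starting noise and can land in different places. The snippet assumes evenly spaced timesteps and 1000 training steps; `uniform_ddim_timesteps` is a hypothetical helper, not a library function.

```python
import numpy as np


def uniform_ddim_timesteps(num_ddim_steps, num_train_steps=1000):
    """Pick an evenly spaced subset of the training timesteps (assumed scheme)."""
    stride = num_train_steps // num_ddim_steps
    return np.arange(0, num_train_steps, stride) + 1


steps_10 = uniform_ddim_timesteps(10)
steps_50 = uniform_ddim_timesteps(50)
print(steps_10)      # stride of 100 between visited timesteps
print(steps_50[:5])  # stride of 20 between visited timesteps
```

The 10-step run visits only every fifth timestep of the 50-step run, so it isn't simply the longer run stopped early.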
I may try to understand both Stable Diffusion and Python well enough to do it, but if you've already solved it, that would be appreciated :)
You could definitely modify it to output at each step, but the output step takes a relatively long time, so it would slow down the process.
It's basically just fallen into a local minimum in the latent space and nothing will ever change, no matter how many steps you add.
The benefit of this kind of approach technically is that you can add a frame-to-frame diff as you're generating and stop early once you've hit a steady state, instead of having to pick an arbitrary number of steps.
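A minimal sketch of that stopping check, under some loud assumptions: `step_fn` stands in for one denoising update, the threshold and patience values are arbitrary, and the toy "denoiser" below just moves halfway toward a fixed target. A real sampler's step sizes depend on its schedule, so this only illustrates the diff-and-stop logic.

```python
import numpy as np


def run_until_steady(step_fn, x, max_steps=200, tol=1e-3, patience=3):
    """Iterate step_fn until the mean absolute frame-to-frame change stays below tol."""
    calm = 0  # number of consecutive steps with a small change
    for i in range(max_steps):
        new_x = step_fn(x)
        diff = float(np.mean(np.abs(new_x - x)))
        x = new_x
        calm = calm + 1 if diff < tol else 0
        if calm >= patience:
            return x, i + 1
    return x, max_steps


# Toy "denoiser" that moves halfway toward a fixed target each step.
target = np.ones((8, 8))
result, steps_used = run_until_steady(
    lambda x: x + 0.5 * (target - x), np.zeros((8, 8))
)
print(steps_used)
```

Requiring a few consecutive quiet steps (`patience`) guards against stopping on a single frame that happens to change very little.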
I would expect both the scheduler and prompt in-use to have a significant effect on this.
On these pictures. Depending on your prompt, more steps can be beneficial (say you're trying to make infinitely detailed fractal monster landscapes) or it might hurt, especially with DDIM, which seems to overfit a lot.
How Stable Diffusion works [1] as a whole is not really hard to comprehend at a high level, but you'll need some prereqs. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then made a really cool "deep variational" autoencoder that uses small noise-denoise steps but largely the same math (variational inference). They were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the thing by vastly reducing the amount of computation needed, operating in latent space instead (btw that's why the images in this HN post look so cool: the denoising is not in pixel space!).
[0] https://jalammar.github.io/illustrated-transformer/
[1] https://huggingface.co/blog/stable_diffusion
[2] https://arxiv.org/abs/1906.02691
If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.
Edit: I thought you were concerned about the model changing its decision; the model has a defined number of steps it can take, and this affects how much the diffusion process can shrink the vector from the U-Net (the neural network).
https://dsp.stackexchange.com/questions/1637/what-does-frequ...
(padded with 1 s of the final frame)
No way to prevent Imgur from reencoding... whatever
(I was experimenting with repeating words, which does seem to amplify the effect with each repeat for some keywords)
Seed is: 948574399
Caveat: While I believe there's nothing nefarious about this notebook... I am unaware whether or not there are security risks with random colab notebooks.
Would a little noise added around here make a subtly different result, widely diverge, or simply slow the refining process?