Stable Diffusion forming images from text: image snapshots at each step

(postimg.cc)

146 points · TheMiddleMan · 3y ago · 44 comments

38 comments · 11 top-level

TheMiddleMan (OP) · 3y ago · 8 in thread

Another gallery with 80 ddim steps: https://postimg.cc/gallery/b1kn7yd

Thought I'd share this for others interested. I've modified txt2img to save an image after each step (actually quite easy, as you can pass an img_callback to the sampler).
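A minimal sketch of the per-step callback idea, for anyone who wants to see its shape. Only the name `img_callback` comes from the comment above; `toy_sample`, `make_snapshot_callback`, and the 16-value "latent" are hypothetical stand-ins, not the actual txt2img or sampler code.

```python
import random

def make_snapshot_callback(store):
    """Return a callback that keeps a copy of the image estimate at every step."""
    def callback(image, step):
        store.append((step, list(image)))
    return callback

def toy_sample(num_steps, seed, img_callback=None):
    """Toy stand-in for a sampler: start from seeded noise, shrink it each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]  # hypothetical latent
    for step in range(num_steps):
        x = [0.5 * v for v in x]  # placeholder for one denoising step
        if img_callback is not None:
            img_callback(x, step)
    return x

snapshots = []
final = toy_sample(5, 948574399, make_snapshot_callback(snapshots))
print(len(snapshots))  # one snapshot per step
```

In a real run the callback would presumably decode the latent and write an image file instead of appending to a list.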

Interestingly, both of these runs use the same seed and prompt, yet they yield different final images; the only difference is the number of DDIM sampling steps. I'd love to understand why, if anyone has any idea.

sp332 · 3y ago

A couple of replies to https://news.ycombinator.com/item?id=32634807 suggest some sources of non-determinism.

TheMiddleMan (OP) · 3y ago

Interesting. I suppose GPUs could calculate things differently. I just checked: I can rerun both the 40- and 80-step runs, and the final images are bit-identical to the first runs. So at least in my scenario the same parameters are deterministic, but changing the number of DDIM sampling steps changes the result.

Maybe it's doing something fancy with the total number of steps, beyond just stopping after the count is reached.

beecafe · 3y ago

Each sampling step runs at a specific scale; fewer steps would skip some of the intermediate scales.

TheMiddleMan (OP) · 3y ago

Ah I see, it makes bigger leaps each step to try to get to the same end result in fewer total steps. That makes sense, assuming I have it right.
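To make the "bigger leaps" picture concrete, here is a sketch that assumes the sampler picks evenly spaced timesteps out of a 1000-step training schedule. Both the 1000 and the even spacing are assumptions, and `sampling_timesteps` is a hypothetical helper rather than the library's own function.

```python
def sampling_timesteps(num_sampling_steps, num_train_steps=1000):
    """Evenly spaced timesteps, listed from noisiest to cleanest."""
    steps = [(i * num_train_steps) // num_sampling_steps
             for i in range(num_sampling_steps)]
    return steps[::-1]

t40 = sampling_timesteps(40)
t80 = sampling_timesteps(80)
print(t40[:3])  # [975, 950, 925]
print(t80[:5])  # [987, 975, 962, 950, 937]
```

Under these assumptions the 40-step run jumps 25 timesteps at a time while the 80-step run jumps about 12, so the two runs follow different numerical paths from the same starting noise. That would explain why each run is repeatable on its own but the two do not end on the same image.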

mpaepper · 3y ago

Most GPUs are non-deterministic; I learned this the hard way doing deep learning on pathology data. This is for optimization purposes. In fact, you can set a flag in PyTorch / CUDA to disable this, which comes at the cost of performance.

riedel · 3y ago

Can you explain? How much does it actually affect results in extreme cases? The source of non-determinism doesn't seem to be the GPU itself but parallelism and dynamic allocation in the frameworks. (It also seems that some parts of PyTorch still raise a runtime error if you request a deterministic version.) Are there other, more performant deterministic DL frameworks?
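One concrete way parallelism can change results: floating-point addition is not associative, so a reduction that combines partial sums in a different order can give a different answer from identical inputs. The sketch below only illustrates that arithmetic effect on the CPU; it does not reproduce any particular GPU kernel, and the helper name is made up.

```python
def sum_in_order(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total

a = sum_in_order([1e16, 1.0, -1e16])   # (1e16 + 1.0) - 1e16
b = sum_in_order([1e16, -1e16, 1.0])   # (1e16 - 1e16) + 1.0
print(a, b)  # 0.0 1.0

c = (0.1 + 0.2) + 0.3
d = 0.1 + (0.2 + 0.3)
print(c == d)  # False
```

In PyTorch, the switch being referred to is probably `torch.use_deterministic_algorithms(True)`, which raises a `RuntimeError` when an operation has no deterministic implementation; that matches the runtime errors mentioned above.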

mgarciaisaia · 3y ago

Do you have a diff/patch of the change to do this?

I may try understanding both StableDiffusion and Python enough to do it, but if you already solved it - that'll be appreciated :)

gbear605 · 3y ago

You can set both the seed and the number of inference steps when running StableDiffusion locally (or in Google Colab). I assumed that they just set a seed and then generated the image at each inference step. With a decent GPU, it’s only going to take a few minutes.

You could definitely modify it to output at each step, but the output step takes a relatively long time, so it would slow down the process.

wokwokwok · 3y ago · 4 in thread

The take-away here is that steps above ~60 do nothing to the image, you're just burning gpu cycles doing nothing.

It's basically just fallen into a local minimum in the latent space and nothing will ever change, no matter how many steps you add.

The benefit of this kind of approach technically is that you can add a frame-to-frame diff as you're generating and stop early once you've hit a steady state, instead of having to pick an arbitrary number of steps.

beecafe · 3y ago

You don't even need to do a diff: the model itself actually predicts the diff (which is how it samples the image), so you could just stop once the model is predicting close to 0.
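A sketch of the stop-when-steady idea from the last two comments, using a toy update in place of a real denoising step. `run_until_steady`, `toy_step`, and the tolerance are all hypothetical; whether stopping early is safe with a real scheduler is not something this thread settles.

```python
def run_until_steady(step_fn, x, max_steps, tol):
    """Apply step_fn until the mean absolute update drops below tol.

    Returns the final state and the number of steps actually taken.
    """
    for step in range(1, max_steps + 1):
        new_x = step_fn(x)
        change = sum(abs(a - b) for a, b in zip(new_x, x)) / len(x)
        x = new_x
        if change < tol:
            return x, step
    return x, max_steps

# Toy "denoiser": each step moves 30% of the way towards a fixed target.
target = [0.2, -0.5, 0.9, 0.0]

def toy_step(x):
    return [v + 0.3 * (t - v) for v, t in zip(x, target)]

result, steps_used = run_until_steady(toy_step, [1.0, 1.0, 1.0, 1.0],
                                      max_steps=80, tol=1e-3)
print(steps_used)  # well short of the 80-step budget
```

Measuring the size of the update directly, as here, is the same quantity whether it comes from diffing two frames or from reading off the model's predicted change.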

stavros · 3y ago

Is there a parameter for this, to stop when subsequent iterations do nothing?

arecurrence · 3y ago

> The take-away here is that steps above ~60 do nothing to the image, you're just burning gpu cycles doing nothing.

I would expect both the scheduler and prompt in-use to have a significant effect on this.

Filligree · 3y ago

> The take-away here is that steps above ~60 do nothing to the image, you're just burning gpu cycles doing nothing.

On these pictures. Depending on your prompt, more steps can be beneficial—let's say you're trying to make infinitely detailed fractal monster landscapes—or it might hurt, especially with DDIM, which seems to overfit a lot.

alok-g · 3y ago · 4 in thread

Does someone have an easy explanation of how the text prompt is fed into the image?

Dzugaru · 3y ago

It is fed in using a fascinating mechanism called "cross-attention", which originated in the NN architecture called the Transformer, first used to achieve state-of-the-art translation. It works something like associative memory: the NN inside Stable Diffusion that generates the image (a UNet, working in latent space) "asks" the whole encoded prompt for data at almost each step, using a query vector Q that is matched against key vectors K to retrieve value vectors V [0].

How Stable Diffusion works as a whole [1] is not really hard to comprehend at a high level, but you'll need some prereqs. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then made a really cool sort of "deep variational" autoencoder that uses small noise-denoise steps but largely the same math (variational inference); they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the thing by vastly reducing the amount of computation needed, operating in latent space (btw that's why the images in this HN post look so cool: the denoising is not in pixel space!).

[0] https://jalammar.github.io/illustrated-transformer/

[1] https://huggingface.co/blog/stable_diffusion

[2] https://arxiv.org/abs/1906.02691

[3] https://arxiv.org/abs/2006.11239

[4] https://arxiv.org/abs/2112.10752
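A bare-bones sketch of the Q/K/V lookup described above, in plain Python. The vectors are made-up numbers, and the learned projection layers that produce Q, K, and V in a real model are left out; this only shows the matching-and-mixing step.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per image position; keys/values: one per prompt token."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How well does this image position's query match each prompt token's key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the tokens' value vectors according to those match weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two image positions "asking", three prompt tokens "answering".
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
attended = cross_attention(Q, K, V)
for row in attended:
    print([round(x, 3) for x in row])
```

Each image position ends up with a weighted blend of the prompt tokens' values, dominated by the token whose key best matches its query. So the text enters the image network through this lookup rather than as a plain extra input.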

alok-g · 3y ago

Thanks!

andsens · 3y ago

Uhm. You’re basically asking how the entire NN works. There is no easy explanation for that.

alok-g · 3y ago

I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are linked into the img-to-img network that is trying to reduce the noise. In other words, I'm missing how the process is 'conditioned upon' the text. (I lack an understanding of the same for conditional GANs as well.)

If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.

saurik · 3y ago · 3 in thread

I have a couple examples of such--though I only go through ~20 steps--in a talk I gave a couple days ago (one in the middle with a horse, and one at the very end generating a person).

https://twitter.com/saurik/status/1565728123705966592?s=21

eshack94 · 3y ago

Saurik, spotted in the wild. Off-topic, but thanks for all the work you've done over the years for the iOS jailbreak community. Hope all is well.

sogen · 3y ago

Saurik!

aaaaaaaaata · 3y ago

Wonder if he enjoys this.

GaggiX · 3y ago · 3 in thread

An even better example using Midjourney beta (so Stable Diffusion): https://www.reddit.com/r/deepdream/comments/ww7ubl/the_genes...

SV_BubbleTime · 3y ago

I don't really understand: what is the process or mechanism for when it is happy with something? At some point it “liked” the way the armor was coming together and refined it in small amounts only.

GaggiX · 3y ago

The neural network always tries to predict the final image, but the diffusion process takes the vector and shrinks it, then turns it into a distribution by adding Gaussian noise, so if the model makes a decision, it may not be the final one.

Edit: I thought you were concerned about the model changing its decision; the model has a defined number of steps it can take, and this affects how much the diffusion process can shrink the vector from the UNet (the neural network).
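A rough sketch of the "shrink, then add Gaussian noise" step, using the standard diffusion noising formula (scale by sqrt(alpha_bar), add noise scaled by sqrt(1 - alpha_bar)). The `renoise` helper, the four-value "image", and the alpha_bar values are illustrative assumptions, not taken from Stable Diffusion's actual schedule.

```python
import math
import random

def renoise(predicted_image, alpha_bar, rng):
    """Scale the predicted clean image down and add Gaussian noise.

    alpha_bar near 0 means almost pure noise; near 1 means almost clean.
    """
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in predicted_image]

rng = random.Random(0)
prediction = [1.0, -1.0, 0.5, 0.0]  # hypothetical model guess at the final image
early = renoise(prediction, alpha_bar=0.05, rng=rng)   # early step: mostly noise
late = renoise(prediction, alpha_bar=0.999, rng=rng)   # late step: nearly the guess
print(early)
print(late)
```

Early on, most of the guess is drowned out again, so the model can still change its mind; late in the run very little noise is added back, so only small refinements remain possible.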

ShamelessC · 3y ago

What is happening is that the noisier/earlier timesteps are responsible for low-frequency features, while the final timesteps are responsible for high-frequency features.

https://dsp.stackexchange.com/questions/1637/what-does-frequ...

gus_massa · 3y ago · 3 in thread

Can you make a video with it that shows how it improves?

mgdlbp · 3y ago

https://imgur.com/a/b7Bw7HB

(padded with 1 s of the final frame)

No way to prevent Imgur from reencoding... whatever

gus_massa · 3y ago

Nice!

Why does the border change in each frame?

dd36 · 3y ago

A GIF would be great

pontifier · 3y ago · 1 in thread

I'm astonished at the evolution of the image and reminded of a documentary I saw about Picasso (I think). He would paint the same painting again and again, tweaking it slightly each time until he was satisfied.

drcongo · 3y ago

I went to a Picasso exhibition in Madrid once where they had an entire, huge, long room filled with every sketch and painting he'd done in preparation for Guernica, plus Guernica itself of course. It was eye-opening to say the least, especially as so many of the prep-work pieces were not in his typical style. Some were in an absolutely beautifully detailed realist style, and at that point I'd never seen a Picasso that wasn't cubist so it really stuck with me.

carrolldunham · 3y ago · 1 in thread

the prompt would be informative

TheMiddleMan (OP) · 3y ago

Prompt is: monkey astronaut, bright, bright, bright, bright

(I was experimenting with repeating words, which does seem to amplify the effect with each repeat for some keywords)

Seed is: 948574399

hwers · 3y ago

Here’s another one in gif form (right image) https://twitter.com/johnowhitaker/status/1565710033463156739 These seem really useful to get intuition about im2im

arecurrence · 3y ago

A colab notebook shared on discord has some interesting exploration of this https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wF...

Caveat: While I believe there's nothing nefarious about this notebook... I am unaware whether or not there are security risks with random colab notebooks.

Lerc · 3y ago

I would be interested to see what varying levels of noise added to these intermediate steps produces. In this example it looks like it kind-of decided what it was drawing in the transition from steps 11 to 14 and then refined.

Would a little noise added around here make a subtly different result, widely diverge, or simply slow the refining process?
