
godelski 1y ago

Oh yeah, they're still alive, but you won't see them published as often, both because most people have switched to diffusion and because of the self-fulfilling prophecy of considering things dead. But if you look at any diffusion platform like Automatic1111, you'll find that GANs are a popular choice of upscaler[0,1]. So you use them together to try to benefit from each of their advantages.

Also, if you look at the top scores for FFHQ at the 256[2] or 1024[3] resolutions, you see GANs winning, and by a good amount. The best diffusion model is #4, and LDM (Stable Diffusion) is #25. Most diffusion research has avoided this dataset due to scale, but this is changing. Probably worth noting that StyleSAN is about a method, not an architecture. Also, the #2 on [2] looks to be from a smaller lab; they complain about limited compute and spend time arguing that they would perform better if they scaled. They do have some compelling evidence, given that their FFHQ result beats much bigger models. But they don't seem to have as much success on LSUN. They are also less successful at 1024, but again they claim limited compute, so it's hard to say. They don't appear to be published at a conference, so I guess they are in fact stuck.

  > What did you mean by your last sentence, what is the limiting compute factor for GANs?

Sorry, I meant the limiting compute factor for diffusion, i.e. why GANs are faster. I felt it was worth mentioning since I mentioned that Karras wrote custom CUDA kernels for StyleGAN, and this does have a significant impact on speed. In the appendix of StyleGAN2, at the end of B under "Performance optimizations", they mention that their kernels improve training time by 30% and memory footprint by 20%.

But the limiting factor between diffusion models and (typical) GANs is that GANs are typically formulated with just a decoder. Diffusion, on the other hand, has a full encoder-decoder network. This is even true for latent diffusion models (i.e. Stable Diffusion), which were specifically designed to tackle the compute challenges of a standard diffusion model. The backbone is a UNet (almost a VAE + residual connections), which has both an encoder and a decoder (there are ViT-based backbones, but these are still in the same parameter ballpark). So it is just a challenging architecture to reduce in size. There are clear benefits to doing so, but when it comes to practical applications you have to consider a wide variety of factors. I mean, think about the computational cost of generating a 256x256 image with SD: that's a few gigs on your GPU. You need prosumer hardware to get 1024, and I can tell you that on a 4080S images are not instantaneously generated lol. So you're not going to use that in a compute-constrained environment like gaming. On the other hand, I can generate 60 imgs/s on a 2080Ti with StyleGAN2 (haven't checked on my 4080S). There are things like ArtSpew that start getting closer, but the image quality is crap (this is being improved FWIW). But also PGAN is crazy fast...
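To make the decoder-only vs encoder-decoder point a bit more concrete, here is a toy cost model. Everything in it is an assumption for illustration: the width schedule, the one-conv-per-level layout, and the function names are made up and do not correspond to StyleGAN2 or Stable Diffusion. It only counts multiply-accumulates for a single forward pass, so treat it as a rough sketch of why the full-resolution layers dominate and why adding an encoder adds another set of them.

```python
# Toy cost model. All layer sizes are hypothetical, not taken from any real model.

def conv_macs(res, c_in, c_out, k=3):
    """Multiply-accumulates for one k x k conv on a res x res feature map."""
    return res * res * c_in * c_out * k * k


def channels(res, out_res, base_ch=64, max_ch=512):
    """Hypothetical width schedule: wider at low resolution, capped at max_ch."""
    return min(max_ch, base_ch * out_res // res)


def resolutions(out_res, start_res=4):
    """4, 8, 16, ... up to out_res (assumes out_res is start_res times a power of two)."""
    res, levels = start_res, []
    while res <= out_res:
        levels.append(res)
        res *= 2
    return levels


def decoder_macs(out_res):
    """Upsampling-only stack: one conv per resolution level plus a 1x1 to-RGB conv."""
    levels = resolutions(out_res)
    total, c_prev = 0, channels(levels[0], out_res)
    for res in levels:
        c = channels(res, out_res)
        total += conv_macs(res, c_prev, c)
        c_prev = c
    return total + conv_macs(out_res, c_prev, 3, k=1)


def encoder_macs(out_res):
    """Rough mirror of the decoder: 1x1 from-RGB conv, then one conv per level going down."""
    levels = resolutions(out_res)[::-1]
    c_prev = channels(out_res, out_res)
    total = conv_macs(out_res, 3, c_prev, k=1)
    for res in levels:
        c = channels(res, out_res)
        total += conv_macs(res, c_prev, c)
        c_prev = c
    return total


if __name__ == "__main__":
    for r in (256, 1024):
        dec = decoder_macs(r)
        enc_dec = encoder_macs(r) + dec
        print(f"{r}x{r}: decoder-only {dec / 1e9:.2f} GMACs, "
              f"encoder+decoder {enc_dec / 1e9:.2f} GMACs, "
              f"ratio {enc_dec / dec:.2f}")
```

The encoder here is close to a mirror of the decoder by construction, so the ratio it prints is baked into the toy and says nothing about real models. What it does show is that most of the cost sits in the highest-resolution levels, which is the part an encoder-decoder has to pay for on both sides.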

For more specifics I'm not sure how to accurately explain without getting into the math and a conversation about density estimation. But I don't think that is well suited for a HN conversation. This should be enough to point you in the right direction for that though.

[0] https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...

[1] https://medium.com/rendernet/using-hires-fix-to-upscale-your...

[2] https://paperswithcode.com/sota/image-generation-on-ffhq-256...

[3] https://paperswithcode.com/sota/image-generation-on-ffhq-102...

Two_hands 1y ago

> the self-fulfilling prophecy of considering things dead

This is quite sad. GANs are an amazing piece of tech and it doesn't seem like they are finished yet. The rule in ML is that it's never over for a method, so maybe someone somewhere will make GANs fashionable again. There are many things like this in ML though...

On the FFHQ point, are you saying currently GANs are better at benchmarks like FFHQ where the target is realistic looking images? Or better at representing the training data?

> Karras wrote custom cuda kernels for StyleGAN

I didn't know they wrote custom kernels. Perhaps for my StyleGAN post I can try Triton and write a custom kernel for the operations. However, I've never looked into this.

What does it mean to have a backbone? Does it just mean the underlying architecture used in the method? Also, on the decoder-only vs encoder-decoder point: taken that way, is it very difficult (almost impossible) for diffusion models to be more efficient than GANs?

Thanks for the detailed comment, you've given me a lot to think about.

godelski (OP) 1y ago

  > The rule in ML

There are definitely attempts to revive things (in the general sense, not just GANs), but most successes appear to come from large labs dedicating equal computational resources to the older models, and often by changing names. This can make things more confusing and make things appear to be changing faster, but once you can see this, you'll have an easier time keeping up (so being new, watch out for this). I'll give some examples that are easier to read[0-2] (i.e. don't need expert knowledge to understand the nuances).

As an insider (ML researcher), my complaint isn't so much that we have a large proportion of people chasing one specific avenue; it is that we gatekeep newer methods. I think you can see a similar effect on HN when new models are proposed. They are trashed for not beating existing models (this is true even beyond ML!). There will always be reasons to critique works, and I don't want to discourage criticism, but I do want to discourage dismissal. It hinders progress, because progress is made in small steps, not leaps and bounds. I think this can get confusing for someone entering the field (I'm sorry if I've misjudged, I'm inferring from the comments).

  > On the FFHQ point

This is an excellent question that unfortunately I don't know the answer to. I think you'll find this work helpful[3]; it has the largest human study. But despite its name, StyleNAT performed best specifically on FFHQ. What I would say is that there are good arguments to be made that diffusion models are better at representing a diversity of images (making them well suited for things like art generation), but theoretically GANs are approximating your full density distribution. There are some talks by Goodfellow discussing this, but I don't recall which ones.
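For reference, the theoretical claim about density traces back to the analysis in Goodfellow et al.'s original GAN paper. A sketch in that paper's notation (D is the discriminator, G the generator, p_g the distribution the generator induces; none of these symbols appear above):

```latex
% Minimax objective
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]

% Optimal discriminator for a fixed generator
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

% Generator's criterion at the optimal discriminator
C(G) = \max_D V(D, G)
     = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)
```

Since the Jensen-Shannon divergence is zero only when the two distributions match, the minimum is at p_g = p_data. That result assumes an idealized discriminator trained to optimality, so it describes what the objective targets, not what a trained network achieves in practice.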

  > I didn't know they wrote custom kernels

As you've probably found, the StyleGAN code is not the easiest to read lol. Since you're using PyTorch, you can find the kernels here[4]. I encourage you to look at these, especially if you've never seen CUDA code before, because the biggest takeaway will be how easy it is to add a custom kernel, and given the earlier comment you'll see the utility ^__^

I'm not going to discourage you from trying Triton, but I'll note that PyTorch's compile (torch.compile) goes a long way. Definitely _start_ there (see TensorRT).

  > What does it mean to have a backbone? Does it just mean the underlying architecture used in the method?

Exactly! So in the case of a diffusion model it is the UNet (the neural network part; specifically, this part estimates the parameters for the probability distribution. If this doesn't make sense now, it will later. If you are struggling to understand diffusion models after spending some time reading the papers, come back to this comment). You'll also find the term "backbone" used in application-based models such as in Semantic Segmentation, Object Detection, Pose Estimation, and much more. In those cases the backbones are typically pretrained, so recognize the choice of backbone as a hyper-parameter.

  > Also, on the decoder only vs encoder-decoder point

I'm going to say something frustrating. In short: yes. If we get a bit more nuanced: no. If we get really nuanced: yes. I know this isn't a great answer, but it can be really difficult to understand. On the surface, yes, because you need to encode the variable and your model needs to transform a space starting at R^N and ending in R^N, while a (I have to stress, colloquial[5]) GAN transforms an R^M space to R^N where M << N. With more nuance you can argue it is the backbone. But to be detailed, you'll find that there are fundamental factors placing computational bounds on the theoretical performance of these architectures. To get there you'll need to carefully study Goodfellow's original paper (and some follow-ups expanding on the analysis) as well as the "original" diffusion paper by Sohl-Dickstein[6] (quotes because this is debatable, but the claim has reasonable merit), and you should become familiar with Aapo Hyvärinen[7]. The last is by far the hardest part and the confusion is normal. I know quite a few well renowned and intelligent people who struggled (personally I went through the "this is hard", "this is easy", "ah fuck, I actually don't understand anything" cycle for a bit). But that's just a signal that you're progressing :)
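The surface-level R^N to R^N vs R^M to R^N comparison can be put into numbers with a few lines. This is only a sketch: the latent size of 512 is an assumption (it matches the default in the StyleGAN family, but any M << N makes the same point), and the function names are made up for illustration.

```python
from math import prod


def gan_mapping(image_shape, latent_dim=512):
    """Colloquial GAN generator: latent in R^M -> image in R^N.

    latent_dim=512 is an assumed default, not something fixed by the method.
    """
    return latent_dim, prod(image_shape)


def diffusion_mapping(image_shape):
    """Denoising network: noisy sample in R^N -> estimate in R^N."""
    n = prod(image_shape)
    return n, n


if __name__ == "__main__":
    for shape in [(3, 256, 256), (3, 1024, 1024)]:
        m, n = gan_mapping(shape)
        n_in, n_out = diffusion_mapping(shape)
        print(f"{shape}: GAN R^{m} -> R^{n} (N/M = {n // m}), "
              f"diffusion R^{n_in} -> R^{n_out}")
```

This only captures the "in short" part of the answer. It says nothing about the backbone or the theoretical bounds, which is where the nuance comes in.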

  > Thanks for the detailed comment, you've given me a lot to think about.

Great! I hope this can help provide direction (I read your other post). The best and worst thing about machine learning is that there's so much depth. It can be both intimidating and easy to miss. But if you're passionate about learning (as it appears) you'll find the knowledge gained is highly rewarding, if unfortunately hard to gain.

And I apologize for being so verbose. It is a bad habit.

[0] Diffusion Models Beat GANs on Image Synthesis (https://arxiv.org/abs/2105.05233) The two authors are rockstars. Given your blog post, I think you'll enjoy a lot of their work. Which includes diffusion.

[1] ResNet strikes back: An improved training procedure in timm (https://arxiv.org/abs/2110.00476) Again, all three are great people to follow. You won't see many papers by Wightman, but you'll see his work with (now) Hugging Face. Notably he's one of the most important players in ViTs.

[2] A ConvNet for the 2020s (https://arxiv.org/abs/2201.03545) and ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders (https://arxiv.org/abs/2301.00808)

[3] Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (https://arxiv.org/abs/2306.04675)

[4] https://github.com/NVlabs/stylegan2-ada-pytorch/tree/main/to...

[5] I'm sorry, I still have difficulties explaining this, especially simply. There's a lot of points here. But one is easy to understand and what I mentioned before: GAN is a training method, not an architecture. A bit more nuance will be found by reading this far underappreciated work: https://arxiv.org/abs/1912.03263. The last point I want to mention is to never forget that "generative" is a general term and these models are good for generating __data__. Images are data, but to think this is the only type of data a GAN (or any model. I literally mean any[3]) can generate is naive. All of this gets harder to explain and I don't have the skill to do so in a simple manner and am afraid it will just come off as a rant.

[6] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (https://arxiv.org/abs/1503.03585)

[7] "Score matching" and "Noise-contrastive estimation" will be the most beneficial https://www.cs.helsinki.fi/u/ahyvarin/papers/

Two_hands 1y ago

> because progress is made in small steps,

This seems easily forgotten by a large number of people. I try to remind myself to step back from the hype and explore the lesser travelled paths.

> I'll give some examples that are easier to read[0-2]

I need to read ResNet strikes back. It was one of the first networks I implemented and it is cool to see it still being worked on.

I'll check out [3]. I've wondered recently how you could get a GAN to generate things out of distribution but that still look like the training data, if that even makes sense.

> the StyleGAN code is not the easiest to read lol

Yup, even the official PGGAN code was quite hard to understand. I'll try out the PyTorch compile; I've heard a lot about it recently. I had thought TensorRT was for LLMs, but I suppose it's applicable in other areas too?

> so recognize this as a hyper-parameter

Okay that makes sense. I'll reread this after exploring Diffusion models too in the future.

> carefully study Goodfellow's original paper

This is something I have not done. My current workflow is just to understand how best to implement what is written. I think deep exploration is the next step, no matter how many "I know nothings" I will experience. This side of GANs I had not considered (the theoretical side looked interesting but very complex).

> I hope this can help provide direction

It certainly will, I imagine I'll come back to this comment many times. Thanks for taking the time to read my posts and provide so much material for further study.

> if unfortunately hard to gain

I agree it is rewarding and I hope I can purvey some of this knowledge in my blog for others too! That was why I started it, so much knowledge is locked away and hard to access or understand without some guidance.
