We still use GANs a lot. They're way faster than diffusion models. Good luck getting a diffusion model to do upscaling and denoising on a real-time video call. I'm sure we'll get there, but right now you can do this with a GAN on cheap consumer hardware. You don't need a 4080; DLSS was released with the 20 series cards. GANs are just naturally computationally cheaper, but yeah, they do have trade-offs. (How big those trade-offs are is arguable: ML goes through hype phases, everyone jumps ship from one thing to another, and few revisit. But when revisits happen, the older methods tend to be competitive. See "ResNet Strikes Back" for even CNNs vs ViTs. There's more nuance here, though.)
There is a reason your upscaling model is a GAN. Sure, diffusion can do this too. But why is everyone using ESRGAN? There's a reason for this.
Also, I think it is important to remember that GAN is really about a technique, not about generating images. You have a model generating things, and another model telling you whether something is a good output or not. LLM people... does this sound familiar?
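To make the "one model generates, another judges" idea concrete, here is a minimal sketch on a toy 1-D problem. Everything in it is an assumption chosen for illustration (the linear generator, the logistic discriminator, the names `train_toy_gan` and `sigmoid`, the learning rate, the target distribution); it is nowhere near a real image GAN, and a linear discriminator like this one can only push the generator's mean around, so don't expect clean convergence.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Generator:     g(z) = a * z + b           (hypothetical toy model)
    Discriminator: D(x) = sigmoid(w * x + c)  (hypothetical toy model)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters
    w, c = 0.1, 0.0   # discriminator parameters

    for _ in range(steps):
        x_real = rng.gauss(4.0, 0.5)  # "real" sample (assumed target)
        z = rng.gauss(0.0, 1.0)       # noise fed to the generator
        x_fake = a * z + b

        # Discriminator step: minimize -log D(x_real) - log(1 - D(x_fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimize the non-saturating loss -log D(g(z)),
        # using the freshly updated discriminator as the judge.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w  # d(loss) / d(x_fake)
        a -= lr * grad_x * z
        b -= lr * grad_x

    return a, b, w, c


if __name__ == "__main__":
    a, b, w, c = train_toy_gan()
    print(f"generator: a={a:.3f}, b={b:.3f}; discriminator: w={w:.3f}, c={c:.3f}")
```

The only signal the generator ever gets is the gradient flowing back through the discriminator's judgement, which is the part that should look familiar to anyone training a model against a learned scorer.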
To the author: I think it is worth pointing to Tero Karras's NVIDIA page. This group defined the status quo of GANs. You'll find that the vast majority of GAN research is built off of their research, and quite a large portion are literal forks. Though a fair amount of this is due to the great optimization they did, with custom CUDA kernels (this is not the limiting compute factor in diffusion). https://research.nvidia.com/person/tero-karras
As a technique I think it’s quite stunning, from an ML perspective. Hence why I’ve decided to write these blog posts. The GAN just has something about it that makes it riveting to work with.
I’ve realised that Tero Karras made major contributions; I came across PGGAN from StyleGAN2. What did you mean by your last sentence? What is the limiting compute factor for GANs?
They're still in use by literally every latent diffusion image generator: these typically target the latents of a GAN-trained decoder.
Same for audio: most audio models generate some representation that is converted to audio by a GAN-trained codec.
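For the image case, a rough sketch of why working in the decoder's latent space matters. It assumes the Stable Diffusion 1.x style autoencoder settings (8x spatial downsampling, 4 latent channels); the function names are hypothetical and other models use different factors.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent the diffusion model works on.

    The defaults are an assumption based on Stable Diffusion 1.x style
    autoencoders; they are parameters, not universal constants.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3,
                      downsample=8, latent_channels=4):
    """How many pixel values the decoder produces per latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(512, 512))       # latent the diffusion model denoises
    print(compression_ratio(512, 512))  # pixel values per latent value
```

The diffusion model only ever touches the small latent tensor; turning it back into pixels is left entirely to the decoder.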
Also, if you look at the top scores for FFHQ at the 256[2] or 1024[3] resolutions, you see GANs winning, and by a good amount. The best diffusion model is #4, and LDM (Stable Diffusion) is #25. Most diffusion research has avoided this dataset due to scale, but this is changing. Probably worth noting that StyleSAN is about a method, not an architecture. Also, the #2 on [2] looks to be from a smaller lab; they complain about limited compute and spend time arguing why they think they'd perform better if they scaled. They do have some compelling evidence, given that their FFHQ result beats much bigger models. But they don't seem to have as much success on LSUN. They are also less successful on 1024, but they again claim limited compute, so it's hard to say. They don't appear to be published in a conference, so I guess they are in fact stuck.
> What did you mean by your last sentence, what is the limiting compute factor for GANs?
Sorry, I meant the limiting compute factor for diffusion, i.e. why GANs are faster. I felt it was worth mentioning since I mentioned that Karras wrote custom CUDA kernels for StyleGAN, and this does have a significant impact on speed. In the appendix of StyleGAN2, at the end of B under "Performance optimizations", they mention that their kernels improve training time by 30% and memory footprint by 20%.
But the limiting factor between diffusion models and (typical) GANs is that GANs are typically formulated with just a decoder. Diffusion, on the other hand, has a full encoder-decoder network. This is even true for latent diffusion models (e.g. Stable Diffusion), which were specifically designed to tackle the compute challenges of a standard diffusion model. The backbone is a UNet (almost a VAE + residual connections), which has both an encoder and a decoder (there are ViT-based backbones, but these are still in the same parameter ballpark). So it is just a challenging architecture to reduce in size. There are clear benefits to doing so, but when it comes to practical applications you have to consider a wide variety of factors.
I mean, think about the computational cost of generating a 256x256 image with SD: that's a few gigs on your GPU. You need prosumer hardware to get 1024, and I can tell you that on a 4080S images are not generated instantaneously lol. So you're not going to use that in a compute-constrained environment like gaming. On the other hand, I can generate 60 imgs/s on a 2080Ti with StyleGAN2 (haven't checked on my 4080S). There are things like ArtSpew that start getting closer, but the image quality is crap (this is being improved FWIW). But also PGAN is crazy fast...
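The decoder-only versus encoder-decoder point above can be put in rough numbers with a back-of-the-envelope parameter count. This is only a sketch: the channel widths are made up, every layer is a plain 3x3 convolution, and it ignores skip connections, attention, and normalization, so it says nothing about the real StyleGAN2 or Stable Diffusion sizes.

```python
def conv_params(c_in, c_out, kernel=3):
    # Weights plus biases of one 2-D convolution layer.
    return kernel * kernel * c_in * c_out + c_out


def stack_params(widths, kernel=3):
    # Total parameters of a chain of convolutions with these channel widths.
    return sum(conv_params(a, b, kernel) for a, b in zip(widths, widths[1:]))


# Hypothetical channel widths, chosen only for illustration.
DECODER_WIDTHS = [512, 256, 128, 64, 3]
ENCODER_WIDTHS = DECODER_WIDTHS[::-1]


def decoder_only_params():
    return stack_params(DECODER_WIDTHS)


def encoder_decoder_params():
    return stack_params(ENCODER_WIDTHS) + stack_params(DECODER_WIDTHS)


if __name__ == "__main__":
    dec = decoder_only_params()
    enc_dec = encoder_decoder_params()
    print(f"decoder only:    {dec:,} parameters")
    print(f"encoder-decoder: {enc_dec:,} parameters ({enc_dec / dec:.2f}x)")
```

Even in this stripped-down form the mirrored encoder roughly doubles the parameter count, which is the flavor of the argument, not a measurement of it.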
For more specifics I'm not sure how to accurately explain without getting into the math and a conversation about density estimation. But I don't think that is well suited for a HN conversation. This should be enough to point you in the right direction for that though.
[0] https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...
[1] https://medium.com/rendernet/using-hires-fix-to-upscale-your...
[2] https://paperswithcode.com/sota/image-generation-on-ffhq-256...
[3] https://paperswithcode.com/sota/image-generation-on-ffhq-102...