FYI, when most people say "diffusion" they mean "latent diffusion" (Stable Diffusion is the best-known latent diffusion model, so the two terms often get used interchangeably). As for the role of GANs, it's more like what I described in my other comment. I wouldn't call them "part" of every (latent) diffusion model, but they are a common part of the pipeline for producing quality images (so I won't deny "part").
As for audio, the comment above is right: as it says, the GAN typically sits at the end stage of the model. You'll also find normalizing flows (NFs) commonly used in the middle of the model, where they give you interpretable control over the latent space. NFs are a commonly overlooked architecture, but if you get into Neural ODEs (NODEs), SDEs, Schrödinger bridges, etc., you'll find they're all in the same family of models. If you like math, you'll likely fall in love with these types of models.
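
To make the NF point a bit more concrete, here's a rough sketch of an affine coupling layer, the usual building block of RealNVP-style flows. This is only an illustration, not taken from any particular audio model: `toy_scale` and `toy_shift` are made-up stand-ins for what would normally be small neural nets, and the function names are my own.

```python
import math

def coupling_forward(x, scale_fn, shift_fn):
    """Affine coupling: pass the first half through, transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    s = scale_fn(x_a)  # log-scales, conditioned only on the untouched half
    t = shift_fn(x_a)  # shifts, conditioned only on the untouched half
    z_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    log_det = sum(s)   # log|det Jacobian| is just the sum of the log-scales
    return x_a + z_b, log_det

def coupling_inverse(z, scale_fn, shift_fn):
    """Exact inverse: s and t can be recomputed because z_a equals x_a."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    s = scale_fn(z_a)
    t = shift_fn(z_a)
    x_b = [(zb - ti) * math.exp(-si) for zb, si, ti in zip(z_b, s, t)]
    return z_a + x_b

# Hypothetical stand-ins for the conditioner networks (assumption, not a real model).
def toy_scale(a):
    return [math.tanh(v) for v in a]

def toy_shift(a):
    return [0.5 * v for v in a]

x = [0.3, -1.2, 2.0, 0.7]
z, log_det = coupling_forward(x, toy_scale, toy_shift)
x_rec = coupling_inverse(z, toy_scale, toy_shift)
```

The thing to notice is that the inverse is exact and the log-determinant is cheap, which is what lets you move a point in latent space and map it back to data without an approximate decoder. Roughly speaking, continuous normalizing flows replace a stack of layers like this with an ODE, which is where the connection to Neural ODEs comes from.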