VQGAN (being a "GAN") is already two networks: one generates images, and the other (the discriminator) is adversarial and judges whether the generator's output looks real; you train them both at once and they fight.
CLIP+VQGAN generation works a bit differently than "swapping in CLIP as the discriminator": both the VQGAN and CLIP are pretrained and frozen. You start from some latent code, decode it with the VQGAN into an image, score that image with CLIP against the text prompt, and then gradient-descend on the latent code (not the network weights) until the decoded image matches the prompt. So nothing is retrained; CLIP just steers an optimization over the VQGAN's latent space.
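To make that loop concrete, here's a toy sketch of latent optimization in numpy. The "decoder", "image encoder", and "text embedding" are all stand-in random linear maps and vectors, not the real VQGAN or CLIP; the point is only the shape of the loop: freeze both models and gradient-ascend on the latent so the decoded image's embedding lines up with the prompt's embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins (assumptions, not the real models):
D = rng.normal(size=(64, 16))   # "VQGAN decoder": latent (16) -> image (64)
E = rng.normal(size=(8, 64))    # "CLIP image encoder": image -> embedding (8)
t = rng.normal(size=8)          # "CLIP text embedding" of the prompt
t /= np.linalg.norm(t)

def cosine_sim(z):
    # Decode the latent to an "image", embed it, compare to the prompt.
    e = E @ (D @ z)
    return t @ e / np.linalg.norm(e)

def grad_sim(z):
    # Analytic gradient of cosine similarity w.r.t. z (t is unit-norm):
    # sim = t.e / |e|  =>  d(sim)/de = t/|e| - (t.e) e / |e|^3
    e = E @ (D @ z)
    n = np.linalg.norm(e)
    de = t / n - (t @ e) * e / n**3
    return (E @ D).T @ de       # chain rule back through the frozen maps

z = rng.normal(size=16)         # random starting latent
before = cosine_sim(z)
for _ in range(300):
    z += 0.05 * grad_sim(z)     # only z moves; both "networks" stay frozen
after = cosine_sim(z)
print(f"similarity: {before:.3f} -> {after:.3f}")
```

In the real pipeline the gradient comes from autodiff through the actual decoder and CLIP encoder (plus image augmentations and regularizers), but the structure is the same: the models are fixed, and only the latent is updated.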
GANs are a silly idea that shouldn't work but somehow do. There are some attempts to replace the idea: https://www.microsoft.com/en-us/research/blog/unlocking-new-...