I couldn’t quite tell from the announcement: is there still a separate TTS step, where GPT generates the tones/pitches to be used, or is it completely end-to-end, with GPT generating the output audio directly?
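For context, here’s roughly the distinction I’m asking about. Nothing below comes from the announcement; every function name is made up purely to illustrate the two designs:

```python
# Hypothetical sketch of the two designs; none of these names are OpenAI's.

def two_stage_pipeline(user_audio: bytes) -> bytes:
    """Separate stages: ASR -> text LLM -> TTS. Tone/pitch is decided by the
    TTS stage, possibly conditioned on hints the LLM emits."""
    text_in = speech_to_text(user_audio)       # stage 1: transcription
    text_out = llm_generate(text_in)           # stage 2: text-only LLM
    return tts_synthesize(text_out)            # stage 3: TTS/vocoder picks the voice

def end_to_end(user_audio: bytes) -> bytes:
    """Single model: audio is turned into discrete tokens (e.g. by a neural
    codec), the LLM autoregressively emits audio tokens, and a codec decoder
    turns them back into a waveform, so the model itself controls prosody."""
    audio_tokens_in = codec_encode(user_audio)                # discrete audio tokens
    audio_tokens_out = llm_generate_tokens(audio_tokens_in)   # same transformer, audio in/out
    return codec_decode(audio_tokens_out)                     # tokens -> waveform


# Dummy stand-ins so the sketch runs; a real system would use actual models.
def speech_to_text(audio):        return "hello"
def llm_generate(text):           return "hi there"
def tts_synthesize(text):         return b"\x00" * 16000
def codec_encode(audio):          return [1, 2, 3]
def llm_generate_tokens(tokens):  return tokens + [4, 5, 6]
def codec_decode(tokens):         return b"\x00" * (len(tokens) * 1000)

if __name__ == "__main__":
    print(len(two_stage_pipeline(b"...")), len(end_to_end(b"...")))
```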
Very exciting; I’d love to read more about how the image-generation architecture works. Is it still a diffusion model that has been integrated with a transformer somehow, or an entirely new architecture that isn’t diffusion-based?
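To make the question concrete, here is the contrast I have in mind between the two families. This is just a toy sketch under my own assumptions, not anything from the announcement, and all names/shapes are invented:

```python
# Hypothetical contrast between diffusion sampling and autoregressive image
# tokens; numbers and names are made up only to show the difference in loops.
import numpy as np

def diffusion_sample(steps: int = 50) -> np.ndarray:
    """Diffusion: start from noise and repeatedly denoise the whole image."""
    img = np.random.randn(64, 64, 3)              # pure noise
    for t in range(steps, 0, -1):
        predicted_noise = denoiser(img, t)        # U-Net or DiT-style model
        img = img - predicted_noise / steps       # crude update; real samplers differ
    return img

def autoregressive_sample(num_tokens: int = 256) -> list[int]:
    """Token-based: a transformer emits discrete image tokens one at a time,
    which a decoder (e.g. a VQ-style codebook) would turn into pixels."""
    tokens: list[int] = []
    for _ in range(num_tokens):
        tokens.append(next_image_token(tokens))   # same next-token loop as text
    return tokens

# Dummy stand-ins so the sketch runs.
def denoiser(img, t):          return np.random.randn(*img.shape) * 0.01
def next_image_token(tokens):  return (len(tokens) * 7) % 8192

if __name__ == "__main__":
    print(diffusion_sample().shape, len(autoregressive_sample()))
```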