I couldn’t quite tell from the announcement: is there still a separate TTS step, where GPT generates the tones/pitches to be used, or is it completely end-to-end, with GPT generating the output audio directly?
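For context, here’s roughly the distinction I’m asking about. Nothing below comes from the announcement; every function name is made up purely to illustrate the two designs:

```python
# Hypothetical sketch of the two designs; none of these names are OpenAI's.

def two_stage_pipeline(user_audio: bytes) -> bytes:
    """Separate stages: ASR -> text LLM -> TTS. Tone/pitch is decided by the
    TTS stage, possibly conditioned on hints the LLM emits."""
    text_in = speech_to_text(user_audio)       # stage 1: transcription
    text_out = llm_generate(text_in)           # stage 2: text-only LLM
    return tts_synthesize(text_out)            # stage 3: TTS/vocoder picks the voice

def end_to_end(user_audio: bytes) -> bytes:
    """Single model: audio is turned into discrete tokens (e.g. by a neural
    codec), the LLM autoregressively emits audio tokens, and a codec decoder
    turns them back into a waveform, so the model itself controls prosody."""
    audio_tokens_in = codec_encode(user_audio)                # discrete audio tokens
    audio_tokens_out = llm_generate_tokens(audio_tokens_in)   # same transformer, audio in/out
    return codec_decode(audio_tokens_out)                     # tokens -> waveform


# Dummy stand-ins so the sketch runs; a real system would use actual models.
def speech_to_text(audio):        return "hello"
def llm_generate(text):           return "hi there"
def tts_synthesize(text):         return b"\x00" * 16000
def codec_encode(audio):          return [1, 2, 3]
def llm_generate_tokens(tokens):  return tokens + [4, 5, 6]
def codec_decode(tokens):         return b"\x00" * (len(tokens) * 1000)

if __name__ == "__main__":
    print(len(two_stage_pipeline(b"...")), len(end_to_end(b"...")))
```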
Very exciting; I’d love to read more about how the image-generation architecture works. Is it still a diffusion model that has been integrated with a transformer somehow, or an entirely new architecture that isn’t diffusion-based?
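To make the question concrete, here is the contrast I have in mind between the two families. This is just a toy sketch under my own assumptions, not anything from the announcement, and all names/shapes are invented:

```python
# Hypothetical contrast between diffusion sampling and autoregressive image
# tokens; numbers and names are made up only to show the difference in loops.
import numpy as np

def diffusion_sample(steps: int = 50) -> np.ndarray:
    """Diffusion: start from noise and repeatedly denoise the whole image."""
    img = np.random.randn(64, 64, 3)              # pure noise
    for t in range(steps, 0, -1):
        predicted_noise = denoiser(img, t)        # U-Net or DiT-style model
        img = img - predicted_noise / steps       # crude update; real samplers differ
    return img

def autoregressive_sample(num_tokens: int = 256) -> list[int]:
    """Token-based: a transformer emits discrete image tokens one at a time,
    which a decoder (e.g. a VQ-style codebook) would turn into pixels."""
    tokens: list[int] = []
    for _ in range(num_tokens):
        tokens.append(next_image_token(tokens))   # same next-token loop as text
    return tokens

# Dummy stand-ins so the sketch runs.
def denoiser(img, t):          return np.random.randn(*img.shape) * 0.01
def next_image_token(tokens):  return (len(tokens) * 7) % 8192

if __name__ == "__main__":
    print(diffusion_sample().shape, len(autoregressive_sample()))
```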