0 points · christianqchung · 2y ago · 0 comments

Does anyone know how they're doing the audio part where Mark breathes too hard? Does his breathing get turned into all-caps text (AA EE OO) that GPT-4o then interprets as him breathing too hard, or is there something more going on?


8 comments · 3 top-level

modeless · 2y ago · 5 in thread

There is no text. The model ingests audio directly and also outputs audio directly.

reisse · 2y ago

So they retrained the whole model on audio datasets, and the tokens are now sounds rather than words or parts of words?

They trained on text and audio and images. The model accepts tokens of all three types. And it can directly output audio as well as text.

dclowd9901 · 2y ago

Is it a stretch to think this thing could accurately "talk" with animals?

Yes? Why would it be able to do that?

benlivengood · 2y ago

Not really a stretch in my mind. https://www.earthspecies.org/ and others are working on it already.

Jordan-117 · 2y ago

That's how it used to work, but my understanding is that this new model processes audio directly. To use a music analogy: the old version would have written sheet music and handed it to a synthesizer (the text-to-speech step), while the new one can create the raw waveform from scratch.
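To make the difference concrete, here is a toy sketch in Python. It is not how GPT-4o is implemented; `AudioFrame`, `breath_level`, and every function name are made up for illustration. It only shows why a transcribe-first pipeline would never see the breathing, while a model that takes the audio itself could.

```python
from dataclasses import dataclass


@dataclass
class AudioFrame:
    """Hypothetical stand-in for a slice of audio (not a real API)."""
    words: str           # spoken words in this slice, empty if none
    breath_level: float  # made-up non-verbal feature in the range 0..1


def transcribe(frames):
    """Speech-to-text step of a cascaded pipeline: keeps only the words."""
    return " ".join(f.words for f in frames if f.words)


def cascaded_input(frames):
    """What a text-only model receives in the older pipeline."""
    return {"text": transcribe(frames)}


def direct_input(frames):
    """What an audio-native model receives: every frame, cues included."""
    return {"frames": list(frames)}


def heavy_breathing_visible(model_input, threshold=0.7):
    """True if the input still carries a frame with loud breathing."""
    frames = model_input.get("frames", [])
    return any(f.breath_level > threshold for f in frames)


clip = [AudioFrame("", 0.9), AudioFrame("hello", 0.2), AudioFrame("", 0.95)]

print(cascaded_input(clip))                           # only the word survives
print(heavy_breathing_visible(cascaded_input(clip)))  # breathing was dropped
print(heavy_breathing_visible(direct_input(clip)))    # breathing is still there
```

In the cascaded case the breathing is gone before the model ever runs, so there is nothing for it to interpret, all-caps or otherwise.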

GalaxyNova · 2y ago

It can natively interpret voice now.
