undefined | Better HN

Skip to content

Top Best Ask Show New Jobs

0 pointsbigzyg33k8mo ago0 comments

advanced voice mode operates on audio tokens directly, it doesn't transcribe them into "text tokens" as an intermediate step like the original version of voice mode did.

0 comments

6 comments · 3 top-level

cubefox8mo ago· 2 in thread

But they behave just like models which use text tokens internally, which is also pointed out at the end of the above article.

bigzyg33kOP8mo ago

we don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the articles argument that this is due to tokenisation convincing.

cubefox8mo ago

They didn't say it's due to tokenization.

> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.

oezi8mo ago· 1 in thread

Absolutely correct! My simple test is if it can tell American and British English Tomato and Potato apart. So far it can't.

fragmede8mo ago

Which "it" are you referring to? There are models that can.

sbrother8mo ago

right, but either whatever audio tokenization it's doing doesn't seem to encode pitch, or there's ~nothing where pitch is relevant in the training set.

j / k navigate · click thread line to collapse