
0 points · sillysaurusx · 3y ago · 0 comments

AGI is closer to tokenization than you might think. I realized this recently when trying to do audio prediction.

There was recently a project called riffusion which generates spectrograms, then recovers audio from the spectrograms.

You might be tempted to apply this to predict speech. But speech isn’t like music. We’re communicating in language, using a sequence of tones. It’s why most speech codecs use linear predictive coding. Predicting the raw waveforms won’t get you anywhere, because a waveform predictor has no semantic understanding of language.
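For anyone unfamiliar with linear predictive coding, here is a minimal sketch of the idea: each sample is predicted as a weighted sum of the previous few samples. The function names and the hard-coded coefficients are assumptions for illustration (a real codec estimates the coefficients per frame); the two-tap coefficients used here are the exact recurrence for a pure sinusoid.

```python
import math

def sine(freq_hz, sample_rate, n):
    """Generate n samples of a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def lpc_predict(samples, coeffs):
    """Predict each sample from the previous len(coeffs) samples.

    coeffs[0] weights the most recent sample, coeffs[1] the one before, etc.
    """
    order = len(coeffs)
    preds = []
    for i in range(order, len(samples)):
        preds.append(sum(c * samples[i - 1 - k] for k, c in enumerate(coeffs)))
    return preds

sample_rate = 8000
freq = 440.0
x = sine(freq, sample_rate, 200)

# A pure sinusoid satisfies x[n] = 2*cos(w)*x[n-1] - x[n-2],
# so these two hand-picked coefficients predict it almost perfectly.
w = 2 * math.pi * freq / sample_rate
coeffs = [2 * math.cos(w), -1.0]

preds = lpc_predict(x, coeffs)
max_err = max(abs(p - t) for p, t in zip(preds, x[2:]))
print(f"max prediction error: {max_err:.2e}")
```

The predictor tracks the waveform sample by sample, which is good for compression, but nothing in it represents what is being said.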

So the next step up is to divide speech into a series of tones, and try to predict those sounds rather than raw waveforms.

Except… that’s literally tokenization. And there’s some evidence that this is precisely what our brains are doing.
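To make the "divide into tones, then predict the tones" step concrete, here is a toy sketch. Everything in it is an assumption for illustration: the hand-picked three-entry codebook, the zero-crossing pitch estimate, and the bigram predictor stand in for the learned units and much larger models a real system would use.

```python
import math
from collections import Counter, defaultdict

SAMPLE_RATE = 8000
FRAME = 400  # 50 ms frames at 8 kHz

# Hypothetical codebook: each token ID stands for one tone (in Hz).
CODEBOOK = [220.0, 330.0, 440.0]

def tone(freq_hz, n=FRAME):
    """One frame of a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def estimate_freq(frame):
    """Crude pitch estimate: count sign changes (two per cycle)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings * SAMPLE_RATE / (2 * len(frame))

def tokenize(waveform):
    """Map each frame to the ID of the nearest codebook tone."""
    tokens = []
    for start in range(0, len(waveform) - FRAME + 1, FRAME):
        f = estimate_freq(waveform[start:start + FRAME])
        tokens.append(min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - f)))
    return tokens

def train_bigram(tokens):
    """Count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Most frequent successor of prev seen in training."""
    return counts[prev].most_common(1)[0][0]

# A repeating three-tone "utterance".
melody = [220.0, 330.0, 440.0] * 4
waveform = [s for f in melody for s in tone(f)]

tokens = tokenize(waveform)
model = train_bigram(tokens)
print(tokens)
print(predict_next(model, 0))
```

Once the audio is a sequence of IDs, the prediction problem has the same shape as next-token prediction on text.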


2 comments · 2 top-level

espadrine · 3y ago

There is definitely something symbol-adjacent that needs to happen inside of the model; this is what I assume happens in the brain. But it is not purely symbolic.

For instance, consider voicing the end of a letter: “I will definitely not be stabbed in the bac…” (where the word "back" quickly devolves into a line that crosses through the rest of the letter). It goes from symbolic to contextual, implying that the author was stabbed midway through writing it, so the voicing must end with a yell of playful agony.

The same goes for calligraphic art, such as the Al Jazeera logo, which is intended to be understood as both a sequence of Arabic letters and a depiction of a fire. A model seeing this image for the first time needs to see it both ways at the same time.

But it’s true that we can’t just throw a transformer at the problem, train it from scratch with video inputs and audio outputs, coupled with a sporadic reward, and suddenly have it be able to solve scans of civil engineering exams. The brain can do it, but not silicon (yet). It is easier to take models that were trained with simpler losses (tokenized cross-entropy) on simpler problems (next-token prediction) and combine them. Not true AGI learning, but eventually it will fool people into believing it is.

bravura · 3y ago

Actually there's a whole new subfield called textless NLP doing just that: Learning language models from raw audio. https://ai.facebook.com/blog/textless-nlp-generating-express...
