I read the article, and I confess some of the modeling parts were above my comprehension. But I would like to add that, speaking as an audio engineer, the "key question" you describe is already solved, just not yet applied to transformer models (?).
An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, even whole words quite fluently. And with tools like Melodyne, which already quantize audio semantically, they can identify (and manipulate) pitch and formants as well: turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs. down-speak, for example).
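To make "formants" concrete: roughly speaking, a vowel's identity lives in the frequencies of its first couple of resonances (F1/F2), and the textbook way to find them is LPC root-finding. Here's a minimal sketch in Python (numpy/scipy), assuming the standard autocorrelation method; this is NOT what Melodyne actually does under the hood (that's proprietary), and `estimate_formants` is just a name I made up:

```python
import numpy as np
from scipy.signal import lfilter

def estimate_formants(frame, sr, order=None):
    """Estimate formants of one voiced frame via classic LPC root-finding.

    A rough sketch of the textbook method, not Melodyne's algorithm.
    `frame` is a 1-D numpy array of samples, `sr` the sample rate in Hz.
    """
    if order is None:
        order = 2 + sr // 1000  # common rule of thumb for LPC model order
    frame = lfilter([1.0, -0.63], [1.0], frame)  # pre-emphasis: lift highs
    frame = frame * np.hamming(len(frame))       # taper the analysis window
    # Autocorrelation + Yule-Walker equations -> LPC coefficients a_k
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Resonances are the poles of the all-pole filter 1 / A(z)
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[np.imag(poles) > 0]                # one per conjugate pair
    freqs = np.angle(poles) * sr / (2 * np.pi)       # pole angle  -> Hz
    bands = -np.log(np.abs(poles)) * sr / np.pi      # pole radius -> bandwidth
    # Narrow resonances above ~90 Hz are the formant candidates, low to high
    return sorted(f for f, b in zip(freqs, bands) if f > 90 and b < 400)

# e.g. on a 30-50 ms voiced slice of a recording (hypothetical variables):
# frame = y[start:start + int(0.03 * sr)]
# print(estimate_formants(frame, sr))
```

Run that on an /o/ and the second formant should land somewhere around 800-1000 Hz; on an /e/ it's closer to 1800-2200 Hz (ballpark figures, they vary a lot by speaker). Formant shifting in the Melodyne sense is then, at least in principle, "move the poles and resynthesize."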
I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through).