I read the article, and I confess some of the modeling parts were above my comprehension. But I would like to add that, speaking as an audio engineer, the "key question" you describe is already solved, just not yet applied to transformer models (?).
An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, even whole words quite fluently. And with tools like Melodyne, which already quantize audio semantically, they can identify (and manipulate) pitch and formants as well: turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs. down-speak, for example).
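To make "formants" concrete: roughly speaking, a vowel's identity lives in the frequencies of its first couple of resonances (F1/F2), and the textbook way to find them is LPC root-finding. Here's a minimal sketch in Python (numpy/scipy), assuming the standard autocorrelation method; this is NOT what Melodyne actually does under the hood (that's proprietary), and `estimate_formants` is just a name I made up:

```python
import numpy as np
from scipy.signal import lfilter

def estimate_formants(frame, sr, order=None):
    """Estimate formants of one voiced frame via classic LPC root-finding.

    A rough sketch of the textbook method, not Melodyne's algorithm.
    `frame` is a 1-D numpy array of samples, `sr` the sample rate in Hz.
    """
    if order is None:
        order = 2 + sr // 1000  # common rule of thumb for LPC model order
    frame = lfilter([1.0, -0.63], [1.0], frame)  # pre-emphasis: lift highs
    frame = frame * np.hamming(len(frame))       # taper the analysis window
    # Autocorrelation + Yule-Walker equations -> LPC coefficients a_k
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Resonances are the poles of the all-pole filter 1 / A(z)
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[np.imag(poles) > 0]                # one per conjugate pair
    freqs = np.angle(poles) * sr / (2 * np.pi)       # pole angle  -> Hz
    bands = -np.log(np.abs(poles)) * sr / np.pi      # pole radius -> bandwidth
    # Narrow resonances above ~90 Hz are the formant candidates, low to high
    return sorted(f for f, b in zip(freqs, bands) if f > 90 and b < 400)

# e.g. on a 30-50 ms voiced slice of a recording (hypothetical variables):
# frame = y[start:start + int(0.03 * sr)]
# print(estimate_formants(frame, sr))
```

Run that on an /o/ and the second formant should land somewhere around 800-1000 Hz; on an /e/ it's closer to 1800-2200 Hz (ballpark figures, they vary a lot by speaker). Formant shifting in the Melodyne sense is then, at least in principle, "move the poles and resynthesize."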
I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through).