Adding audio data as tokens, in and of itself, would dramatically increase training size, cost, and time for very little benefit. Neural networks also generally function less effectively with highly correlated inputs, which I can only assume is still an issue for LLMs, and folding audio into the training mix would introduce rather large-scale correlations in the inputs.
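To put very rough numbers on the size point (every rate below is my own assumption for illustration, not a published figure):

    # Back-of-envelope: sequence length for one minute of raw audio tokens vs its transcript.
    # All rates are assumed round numbers, purely illustrative.
    AUDIO_TOKENS_PER_SEC = 75     # assumed rate for a neural audio codec
    WORDS_PER_MIN = 150           # typical conversational speaking rate
    TEXT_TOKENS_PER_WORD = 1.3    # common rule of thumb for BPE tokenizers

    audio_tokens = AUDIO_TOKENS_PER_SEC * 60                          # 4500
    text_tokens = (WORDS_PER_MIN / 60) * TEXT_TOKENS_PER_WORD * 60    # ~195
    print(f"{audio_tokens} audio tokens vs {text_tokens:.0f} text tokens "
          f"({audio_tokens / text_tokens:.0f}x) for the same minute of speech")

Even with generous assumptions you're looking at well over an order of magnitude more tokens to represent the same content.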
I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably doing something similar: taking an extremely limited number of 'screenshots', running typical image captioning on them with another layer, and feeding that in as text. So the demo, to me, seems most likely to be 3 separate 'plugins' operating in unison - text to speech, speech to text, and image to text.
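A minimal sketch of the kind of glued-together pipeline I'm picturing (every function here is a made-up stub standing in for an off-the-shelf model; none of it is the actual system's API):

    import random

    def speech_to_text(audio: bytes) -> str:           # plugin 1: STT (stub)
        return "what am I looking at right now?"

    def caption_image(frame: bytes) -> str:            # plugin 3: image -> text (stub)
        return "a person at a desk holding a phone up to the camera"

    def llm_generate(prompt: str) -> str:              # ordinary text-only LLM call (stub)
        return "It looks like you're sitting at a desk, holding your phone up."

    def text_to_speech(text: str) -> bytes:            # plugin 2: TTS (stub)
        return text.encode()

    def respond(audio_in: bytes, video_frames: list[bytes]) -> bytes:
        transcript = speech_to_text(audio_in)
        # Only a handful of 'screenshots' ever get captioned, not the whole stream.
        shots = random.sample(video_frames, min(3, len(video_frames)))
        captions = [caption_image(f) for f in shots]
        prompt = "Scene: " + "; ".join(captions) + "\nUser: " + transcript
        return text_to_speech(llm_generate(prompt))

    print(respond(b"...", [b"frame1", b"frame2", b"frame3", b"frame4"]))

Nothing in that loop requires the LLM itself to understand audio or pixels; it only ever sees text.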
The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note that in basically all of the videos, the speakers have to repeatedly cut the LLM off as it starts speaking at conversationally inappropriate moments. In the main video, which is just an extremely superficial interaction, the speaker made sure to be speaking constantly, pausing only once to take a breath that I noticed. He also struggled with the timing of his own responses, as the LLM still seems attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).
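If that's right, the eager interjecting doesn't need anything fancier than a silence timer. Something like the following (just a guess at the logic; the threshold and helpers are invented for illustration, not anything measured from the demo):

    import time

    SILENCE_TRIGGER_S = 0.4      # assumed pause length before the model barges in

    def has_voice(chunk: bytes) -> bool:
        # Stand-in voice-activity check; a real one would look at audio energy or a VAD model.
        return max(chunk, default=0) > 30

    def conversation_loop(mic, speaker, generate_reply):
        buffered = []
        last_voice = time.monotonic()
        while True:
            chunk = mic.read()                                   # small audio chunk
            if has_voice(chunk):
                buffered.append(chunk)
                last_voice = time.monotonic()
            elif buffered and time.monotonic() - last_voice > SILENCE_TRIGGER_S:
                # Any lull, even a breath, immediately triggers a spoken reply.
                speaker.play(generate_reply(b"".join(buffered)))
                buffered.clear()

A fixed, aggressive trigger like that would explain both the snappy-seeming turn-taking and the constant interruptions at the wrong moments.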