I guess different approaches could be applicable for client to server vs server to client.
For client to server you want low latency, don't care about pauses introduced by communication (the model doesn't care), and could certainly tolerate a fallback to lower-bandwidth text only (local STT) or more heavily compressed voice.
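A rough sketch of what that uplink fallback decision might look like; the thresholds, mode names, and the `choose_uplink` function are all made up for illustration:

```python
def choose_uplink(bandwidth_kbps: float, loss_pct: float) -> str:
    """Pick a client->server audio mode from measured link quality.

    Thresholds are hypothetical; the point is the ordering:
    normal voice -> heavily compressed voice -> local STT text.
    """
    if bandwidth_kbps >= 24 and loss_pct < 5:
        return "opus_24k"        # normal compressed voice
    if bandwidth_kbps >= 8:
        return "opus_8k"         # heavily compressed voice
    return "local_stt_text"      # run STT on-device, send text only
```

Since the model doesn't care about uplink pauses, the client can even re-evaluate this per utterance without hurting the conversation.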
For server to client it needs to be high-quality voice without pauses, but as the parent was suggesting you could potentially hide response latency (whether due to server load or communication degradation) with a human-like conversational "trick": make some sound before the brain is engaged and a response is generated. "That's absolutely right! ..." would be a tad annoying, but "Hmm..." might be OK, especially if not done all the time, just as a locally initiated conversational filler when the server is slow to respond.