Google provides renditions in two codec pairs for most videos: VP9 + Opus and H.264 + AAC. As long as the audio-only manifest keeps you within the same codec family of renditions, this should work just fine for VOD content (minus, as you said, the player not showing anything until the next video+audio segment is fetched). Live content should also be able to switch from source to audio-only with no re-fetch, but live transcoded renditions won't be in line with either rendition, so that'll cause a burp.
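
To make the "same codec family" point concrete, here is a rough sketch of how a player might pick the audio-only rendition that matches whatever it is currently playing. Everything in it is an assumption for illustration: the `Rendition` class, the `pick_audio_only` helper, the family labels, and the codec-string prefixes are hypothetical and are not taken from any real player or from Google's manifests.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from codec-string prefixes to the two codec pairs
# mentioned above. The prefixes are assumed, commonly seen identifiers.
FAMILY_BY_PREFIX = {
    "vp09": "vp9+opus",
    "vp9": "vp9+opus",
    "opus": "vp9+opus",
    "avc1": "h264+aac",
    "mp4a": "h264+aac",
}


@dataclass
class Rendition:
    name: str
    codec: str  # e.g. "avc1.4d401f", "vp09.00.10.08", "opus", "mp4a.40.2"
    audio_only: bool = False


def codec_family(codec: str) -> Optional[str]:
    """Return the codec pair a codec string belongs to, or None if unknown."""
    prefix = codec.lower().split(".", 1)[0]
    return FAMILY_BY_PREFIX.get(prefix)


def pick_audio_only(current: Rendition, renditions: List[Rendition]) -> Optional[Rendition]:
    """Pick the first audio-only rendition in the same codec family as `current`."""
    family = codec_family(current.codec)
    if family is None:
        return None
    for rendition in renditions:
        if rendition.audio_only and codec_family(rendition.codec) == family:
            return rendition
    return None


if __name__ == "__main__":
    available = [
        Rendition("audio-opus", "opus", audio_only=True),
        Rendition("audio-aac", "mp4a.40.2", audio_only=True),
    ]
    playing = Rendition("720p-h264", "avc1.4d401f")
    choice = pick_audio_only(playing, available)
    print(choice.name if choice else "no matching audio-only rendition")
```

Returning `None` when no match exists leaves it to the caller to decide whether to stay on the current rendition or accept a codec switch.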