You just said the models pretty much all work the same way, then you said doing what I described won't help. I'm confused. Apple and Google both offer real-time, on-device transcription these days, so something clearly works. And if the models already all do this, then running it 30x as often isn't a problem anyway, since, again, people are used to that.
I doubt people run live transcription on their phones for long stretches very often, so the battery impact is largely irrelevant. Besides, the model should ideally be running (mostly) on a low-power, high-performance inference accelerator anyway, which many SoCs include these days.
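For what it's worth, here's a minimal sketch of the kind of thing I mean, assuming Apple's Speech framework (iOS 13+); the locale and the print handler are just placeholders, and Android has a roughly equivalent path via SpeechRecognizer with RecognizerIntent.EXTRA_PREFER_OFFLINE:

```swift
import Speech

// Sketch: streaming, on-device transcription via Apple's Speech framework.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true           // live partial results as you speak

if recognizer.supportsOnDeviceRecognition {
    request.requiresOnDeviceRecognition = true      // keep inference on the phone (ideally on the low-power accelerator)
}

let task = recognizer.recognitionTask(with: request) { result, error in
    if let result = result {
        print(result.bestTranscription.formattedString)  // transcript updated roughly in real time
    }
}
// Audio buffers from an AVAudioEngine tap get fed in via request.append(buffer).
```

The point being that this is a one-flag request against a stock platform API, not some exotic setup.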