A lot of the models we currently use seem to do the same thing. The model transcribes a "best effort" interpretation in real time, and as you continue speaking you'll see it go back and make corrections. I'm sure you can feed the model the first X seconds you have, followed by (30-X) seconds of silence, and it will do real-time transcription just fine... it would be weird if this broke anything. Then, as more speech comes in, you keep getting a better transcription of the first 30 seconds, and once you pass 30 seconds you switch to a 30-second sliding window.
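Roughly what I have in mind, as a minimal sketch rather than anything tested against a real model. The 16 kHz sample rate is my assumption, and `build_window`, `stream_transcripts`, and the `transcribe` callable are hypothetical names standing in for whatever the actual model API looks like:

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sample rate in Hz, not taken from any specific model
WINDOW_SECONDS = 30    # fixed input length the model expects
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def build_window(audio: np.ndarray) -> np.ndarray:
    """Return exactly WINDOW_SAMPLES samples to feed to the model."""
    if len(audio) < WINDOW_SAMPLES:
        # Fewer than 30 s so far: speech first, then (30-X) s of silence (zeros).
        return np.pad(audio, (0, WINDOW_SAMPLES - len(audio)))
    # 30 s or more: sliding window over the most recent 30 s.
    return audio[-WINDOW_SAMPLES:]


def stream_transcripts(chunks, transcribe):
    """Yield a fresh best-effort transcript after each incoming audio chunk.

    `transcribe` is a hypothetical callable: 30 s of samples in, text out.
    """
    audio = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        # Keep at most the last 30 s so the buffer never grows without bound.
        audio = np.concatenate([audio, chunk])[-WINDOW_SAMPLES:]
        yield transcribe(build_window(audio))
```

Each yielded transcript replaces the previous one, which is where the "go back and make corrections" behaviour comes from: earlier words get re-decoded with more context every time a chunk arrives.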
Maybe I'm missing something, but I don't see the problem here.