That's entirely an implementation limitation from humans. There's no reason to believe a reasoning model could NOT be trained to stream multimodal input and perform a burst of reasoning on each step, interjecting when it feels appropriate.
We simply haven't.