Because currently, models output a stream of tokens directly, and those tokens are the performance and billing unit. Better models can do a better job of producing reasonable output, but there is a limit to what can be done "on the fly".
Some models, like OpenAI's o1, have started employing internal "thinking" tokens, which may or may not be equivalent to performing multiple passes with the same or different models, but the effect is similar.
One way to look at it is that if you want better results, you have to put more computational resources into thinking. Also, just as with humans, a team effort yields more well-rounded results, because you combine the strengths and offset the weaknesses of the different team members.
You can technically wrap all of this into a single black box and have it converse with you as if it were one single entity that internally uses multiple models to think, cross-check, and so on.
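A minimal sketch of what such a black box could look like: a wrapper that queries several models and cross-checks their answers by majority vote. The model functions here are hypothetical stand-ins, not any real API; the point is only the shape of the ensemble.

```python
# Hypothetical "black box" wrapping several models behind one interface.
# Each model_* function is a stand-in for a call to a real model back-end.
from collections import Counter

def model_a(prompt: str) -> str:
    return "42"   # stand-in: one model's answer

def model_b(prompt: str) -> str:
    return "42"   # stand-in: a second, different model

def model_c(prompt: str) -> str:
    return "41"   # stand-in: a third model that happens to disagree

def ensemble_answer(prompt: str, models=(model_a, model_b, model_c)) -> str:
    # Query every model, then keep the most common answer:
    # the ensemble offsets individual weaknesses by cross-checking.
    answers = [m(prompt) for m in models]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(ensemble_answer("What is 6 * 7?"))  # majority vote picks "42"
```

From the outside this looks like one entity answering one prompt, even though it "thought" three times internally.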
The output is likely not going to arrive in real time, though, and real-time conversation has until now been a very important feature.
In the future we may, on one hand, relax the real-time constraint and accept that for some tasks accuracy matters more than immediate results.
Or we may eventually have faster machines, or cleverer algorithms, that can "think" more in shorter amounts of time.
(Or a combination of the two.)