Of course there are more advanced use cases and other ways to go about this, but that would go a very long way.
Also, FWIW, NVIDIA Triton Inference Server can be even more performant than vLLM and supports dynamic batching, quantization, paging, KV caching, and so on. It can also load multiple models today, whether they're LLMs, ONNX models, or whatever else, across all of the available backends.
It's significantly more complex to deploy, but I wanted to mention it because it can load multiple models concurrently in an efficient and performant manner.
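
To give a rough idea of what "multiple models" looks like in practice, here is a minimal sketch that lays out a Triton model repository with two models, each with dynamic batching turned on in its `config.pbtxt`. The model names (`text_encoder`, `image_classifier`), batch sizes, and queue delay are made up for illustration, and the actual model files (e.g. `model.onnx` in each `1/` version directory) are not created here, so treat this as a starting point rather than a working deployment.

```python
import tempfile
from pathlib import Path
from typing import List

# Minimal Triton model config. The values filled in below are illustrative
# assumptions, not tuned recommendations.
CONFIG_TEMPLATE = """\
name: "{name}"
backend: "{backend}"
max_batch_size: {max_batch_size}
dynamic_batching {{
  max_queue_delay_microseconds: {queue_delay_us}
}}
"""


def add_model(repo: Path, name: str, backend: str,
              max_batch_size: int = 8, queue_delay_us: int = 100) -> Path:
    """Create <repo>/<name>/config.pbtxt and an empty version directory <repo>/<name>/1/."""
    model_dir = repo / name
    # Triton expects numbered version subdirectories; the model file would go in here.
    (model_dir / "1").mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(CONFIG_TEMPLATE.format(
        name=name,
        backend=backend,
        max_batch_size=max_batch_size,
        queue_delay_us=queue_delay_us,
    ))
    return config_path


def build_repo(repo: Path) -> List[str]:
    """Add two hypothetical ONNX models to the repository and return their names."""
    add_model(repo, "text_encoder", "onnxruntime")
    add_model(repo, "image_classifier", "onnxruntime", max_batch_size=16)
    return sorted(p.name for p in repo.iterdir() if p.is_dir())


if __name__ == "__main__":
    repo = Path(tempfile.mkdtemp(prefix="model_repository_"))
    print(repo, build_repo(repo))
```

Once the model files are in place, pointing the server at the directory with `tritonserver --model-repository=<path>` should load both models side by side, each with its own batching settings.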