If you're using an efficient inference engine like vLLM, you're also adding compilation into the mix, and not all of that is fully cached yet.
If that kind of latency isn't acceptable to you, you have to keep the models loaded.
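Here's a minimal sketch of what "keep the model loaded" looks like in practice with vLLM's offline API: pay the load and compile cost once at process start, then reuse the resident engine for every request. The model name and sampling settings are placeholders, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Load (and compile) once, at process startup -- not per request.
# Model name is a placeholder; swap in whatever you're actually serving.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

def handle(prompt: str) -> str:
    # Every call reuses the already-loaded weights and any cached
    # compilation artifacts, so there's no cold-start penalty.
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

The catch, of course, is that "keep it loaded" means the GPU memory is spoken for around the clock, whether or not anyone is sending requests.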
This (along with batching, which only pays off when there are enough concurrent requests to amortize the cost over) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.