Who is this for?
I understand the ease of use, but in my experience as someone who deploys this stuff, using Ollama (an easy-to-use wrapper for llama.cpp) in production is a very bad idea.
I understand it's "highly scalable" thanks to the tooling, but at the end of the day, on a resource-utilization basis, vLLM, HF TGI, etc. are going to walk all over llama.cpp, which IMO is completely the wrong tool for the job.
vLLM and HF TGI are containerized and run very well with nothing other than a HuggingFace model name as an argument/environment variable.
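For what it's worth, once a vLLM container is up, talking to it is just an OpenAI-style HTTP call. Here is a minimal sketch, assuming the server listens on localhost:8000 (vLLM's default port) and was started with the model name shown. The host, port, and model name are placeholders for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Assumed values: the host/port and model name are placeholders for illustration.
req = build_completion_request(
    "http://localhost:8000", "mistralai/Mistral-7B-Instruct-v0.2", "Hello"
)
# To actually send it against a running server: urllib.request.urlopen(req)
```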
In these days of GPU shortages and very high costs, and with CPU inference (the only advantage of llama.cpp) being unacceptably slow, using vLLM or similar cuts hosting costs in half if not more, while providing more management tools, higher TPS, lower time to first token, etc.
When your hardware or cloud hosting costs are multiples higher using this vs. these real serving frameworks, a little extra ease of use on the frontend, combined with the inability to really compete on performance, makes this approach a tough proposition all around.
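To make the cost argument concrete: at a fixed hourly GPU price, cost per token is inversely proportional to sustained throughput, so doubling TPS halves the bill. A back-of-the-envelope sketch, where the hourly price and both throughput figures are made-up numbers for illustration, not benchmarks of any framework:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures for illustration only, not measured results.
GPU_COST_PER_HOUR = 2.00
baseline = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=500)
doubled = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=1000)
print(f"baseline: ${baseline:.2f}/M tokens, doubled throughput: ${doubled:.2f}/M tokens")
```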
Another advantage of Ollama is that it can easily run locally, and so can the wasm plugin. That serves the goal of a local development environment that uses dreamland.
That's great feedback. I was thinking about fixing the concurrency issue myself, but creating a vLLM wasm plugin is a better idea. The user code won't need to change as long as the plugin exports the same wasm host module.
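Roughly, the idea is that user code depends only on the exported interface, so the backend behind it can be swapped. A Python analogy of that, not actual plugin code: `LLMBackend`, `generate`, and both backend classes are hypothetical names, and the backends just return placeholder strings instead of calling a real server.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Hypothetical interface standing in for the wasm host module's exports."""

    def generate(self, prompt: str) -> str: ...


class OllamaBackend:
    def generate(self, prompt: str) -> str:
        # Placeholder: a real plugin would call into Ollama here.
        return f"[ollama] {prompt}"


class VLLMBackend:
    def generate(self, prompt: str) -> str:
        # Placeholder: a real plugin would call into vLLM here.
        return f"[vllm] {prompt}"


def user_code(llm: LLMBackend) -> str:
    """User code depends only on the interface, not on the backend behind it."""
    return llm.generate("hello")
```

Swapping `OllamaBackend()` for `VLLMBackend()` changes nothing in `user_code`, which is the property the plugin approach relies on.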
Of course there are other really advanced use cases and alternative ways to go about this, but that would go a very long way.
Also, FWIW, NVIDIA Triton Inference Server is even more performant than vLLM and supports dynamic batching, quantization, paging, KV cache, blah blah blah, in addition to being able to load multiple models today, whether they be LLMs, ONNX, or whatever, across all of the available backends.
It's significantly more complex in terms of deployment, but I wanted to mention it for its ability to load multiple models concurrently in an efficient and performant manner.
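For anyone unfamiliar with dynamic batching, the gist is that the server holds incoming requests briefly and runs them through the model together. A toy simulation of that idea follows. It is not Triton's scheduler, and `max_batch_size` and `max_delay_s` are hypothetical parameter names chosen for this sketch.

```python
from typing import List, Tuple


def dynamic_batches(arrivals: List[Tuple[float, str]], max_batch_size: int,
                    max_delay_s: float) -> List[List[str]]:
    """Group (arrival_time, request_id) pairs, sorted by time, into batches.

    A batch is dispatched when it is full, or when the next request arrives
    after the oldest queued request has already waited longer than max_delay_s.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0
    for arrival_time, request_id in arrivals:
        if current and (len(current) >= max_batch_size
                        or arrival_time - window_start > max_delay_s):
            batches.append(current)
            current = []
        if not current:
            # The first request in a new batch starts the delay window.
            window_start = arrival_time
        current.append(request_id)
    if current:
        batches.append(current)
    return batches


# Made-up arrival times in seconds, for illustration only.
requests = [(0.000, "a"), (0.002, "b"), (0.004, "c"), (0.050, "d"), (0.051, "e")]
print(dynamic_batches(requests, max_batch_size=2, max_delay_s=0.010))
```

Requests that arrive close together share a batch, while a straggler that shows up after the delay window starts a new one.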