
kkielhofner 2y ago

I'm with you. I think pairing your project with an optimized production inference serving framework on the deployment side could give it the best of all worlds.

Of course, there are other, more advanced use cases and alternative ways to go about this, but that would go a very long way.

Also, FWIW, NVIDIA Triton Inference Server is even more performant than vLLM and supports dynamic batching, quantization, paging, KV caching, and more. It can also load multiple models today, whether LLMs, ONNX models, or anything else, across all of its available backends.

It's significantly more complex to deploy, but I wanted to mention it because it can load multiple models concurrently in an efficient and performant manner.
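
To make "dynamic batching" a bit more concrete, here is a toy Python sketch of the general idea: individual requests are queued and the model is called once per group rather than once per request. This is not Triton's API or implementation; `DynamicBatcher`, `fake_model`, and `max_batch_size` are made-up names for illustration, and a real server would also bound how long a request may wait in the queue.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class DynamicBatcher:
    """Toy batcher (hypothetical): queues requests, runs the model per group."""

    model_fn: Callable[[Sequence[str]], List[str]]  # assumed batched model call
    max_batch_size: int = 4
    _queue: List[str] = field(default_factory=list)
    batches_run: int = 0

    def submit(self, request: str) -> None:
        # Requests arrive one at a time and are only queued here.
        self._queue.append(request)

    def flush(self) -> List[str]:
        """Drain the queue, calling the model once per batch."""
        results: List[str] = []
        while self._queue:
            batch = self._queue[: self.max_batch_size]
            self._queue = self._queue[self.max_batch_size:]
            results.extend(self.model_fn(batch))
            self.batches_run += 1
        return results


def fake_model(batch: Sequence[str]) -> List[str]:
    # Stand-in for a real model: one call handles the whole batch.
    return [text.upper() for text in batch]


batcher = DynamicBatcher(model_fn=fake_model, max_batch_size=4)
for i in range(10):
    batcher.submit(f"req-{i}")
outputs = batcher.flush()
```

With ten queued requests and a batch size of four, the stand-in model is invoked three times instead of ten, which is where the throughput gain comes from on real hardware.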


1 comment

dsamy 2y ago

I took a look at NVIDIA Triton Inference Server, and it might be a good option for production, especially since it has a C++ API.

Amazing feedback, thanks!
