However, when looking at inference optimisations, there are arguably more frameworks than there are applications. Most of them are abstractions over an inference engine like vLLM. But that means vLLM is now unintentionally gatekeeping what people can do with LLMs in production. How did we end up here?
To avoid introducing yet-another (TM) inference framework, I can't help but wonder why nobody has figured out how to run TensorRT-LLM on K8s. Or why NVIDIA hasn't realised that if you build what is potentially the most feature-rich LLM inference implementation, yet it doesn't get picked up nearly as much as you'd expect, you might need to work on the ease of implementation. Somebody please enable me to use TensorRT-LLM on k8s and call me.
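For what it's worth, the pieces do seem to exist: NVIDIA ships Triton Inference Server images with a TensorRT-LLM backend, and Kubernetes doesn't care what's inside the container. Below is a rough sketch of what I mean, using the Python kubernetes client, of deploying such an image as a Deployment. The image tag, PVC name, and model-repository path are assumptions for illustration, not a tested recipe, and the sketch deliberately glosses over the hard part: the model repository has to already contain TensorRT-LLM engines compiled for the exact GPU you schedule onto.

```python
from kubernetes import client, config

# Assumes kubeconfig access to a cluster with the NVIDIA device plugin installed.
config.load_kube_config()

# Triton container running the TensorRT-LLM backend; the tag is an assumption,
# pick whatever trtllm image matches your engines.
container = client.V1Container(
    name="trtllm",
    image="nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3",
    command=["tritonserver"],
    args=["--model-repository=/models"],
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    volume_mounts=[client.V1VolumeMount(name="models", mount_path="/models")],
)

# The PVC (hypothetical name) must hold pre-built TensorRT-LLM engines plus
# Triton model configs -- that offline compile step is the real friction.
pod_spec = client.V1PodSpec(
    containers=[container],
    volumes=[
        client.V1Volume(
            name="models",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="trtllm-models"
            ),
        )
    ],
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="trtllm-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "trtllm-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "trtllm-server"}),
            spec=pod_spec,
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Which is of course exactly where the ease-of-implementation complaint bites: building, versioning, and shipping those per-GPU engines is the workflow nobody has wrapped up nicely yet.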