
0 points · simonw · 1y ago · 0 comments

Last I looked vLLM didn't work on a Mac.


2 comments · 1 top-level

mitjam · 1y ago · 1 in thread

AFAIK vLLM is built for concurrent serving, using batched inference to get higher throughput; it isn't aimed at single-user inference. I doubt its throughput beats Ollama's when it only handles a single prompt at a time. Update: this is a good intro to continuous batching in LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
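To make the batching point concrete, here is a toy sketch that only counts decode steps. It is not vLLM's scheduler or API: the function names, the `batch_size` parameter, and the example lengths are all made up for illustration, and it assumes every active sequence emits exactly one token per step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until the slowest member finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled right away."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: short and long generations mixed together.
mixed = [2, 8, 2, 8, 2, 8, 2, 8]
print("static:    ", static_batching_steps(mixed, batch_size=4))
print("continuous:", continuous_batching_steps(mixed, batch_size=4))

# A single prompt: there is nothing to batch with.
print("single, static:    ", static_batching_steps([8], batch_size=4))
print("single, continuous:", continuous_batching_steps([8], batch_size=4))
```

In this model the continuous variant needs fewer steps for the mixed queue, while both counts are identical for a single prompt. It says nothing about how fast each individual step runs, which is a separate question.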

Der_Einzige · 1y ago

It is much faster than Ollama on single prompts. 3x is not unheard of.
