HEAD of Ollama with Q8_0 vs vLLM with BF16 and FP8.
BF16 was predictably bad. I was surprised FP8 performed so poorly, but I may not have things tuned well. New at this.
┌─────────┬───────────┬──────────┬───────────┐
│         │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
├─────────┼───────────┼──────────┼───────────┤
│ Tok/sec │ 13-17     │ 11-19    │ 32        │
├─────────┼───────────┼──────────┼───────────┤
│ Memory  │ ~62 GB    │ ~28 GB   │ ~32 GB    │
└─────────┴───────────┴──────────┴───────────┘
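For anyone wondering where a tok/sec number like this comes from: Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so the rate is just `eval_count / eval_duration * 1e9`. A minimal sketch, with the sample numbers made up for illustration (in the ballpark of the Q8 run):

```python
# Compute tokens/sec from an Ollama /api/generate response.
# In practice the dict comes from something like:
#   resp = requests.post("http://localhost:11434/api/generate",
#       json={"model": "<your-model>", "prompt": "...", "stream": False}).json()

def tokens_per_sec(resp: dict) -> float:
    # eval_duration is in nanoseconds, so scale back to seconds.
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Made-up example: 256 tokens generated in 8 seconds.
sample = {"eval_count": 256, "eval_duration": 8_000_000_000}
print(round(tokens_per_sec(sample), 1))  # 32.0
```

vLLM reports throughput differently (per-request metrics in its logs and the OpenAI-compatible usage fields), so the comparison above is rough rather than apples-to-apples.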
Most importantly, it actually worked nicely in opencode, which I couldn't get Nemotron to do.