Author here. A bit more context: by day I'm a systems engineer building AI networking infrastructure, so I kept ending up in conversations about inference tricks I couldn't quite wrap my head around.
For example, when someone mentioned vLLM's paged attention, I knew about virtual memory paging, but had no idea anyone had applied the same idea to KV cache allocation on GPUs.
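To make the analogy concrete, here's a toy sketch of the idea, not vLLM's actual code: the KV cache is carved into fixed-size blocks, and each sequence keeps a per-sequence "page table" mapping logical block indices to physical blocks, so its KV memory doesn't have to be contiguous. Names like `PagedKVCache`, `BLOCK_SIZE`, and the method signatures below are mine, purely for illustration.

```python
BLOCK_SIZE = 16  # tokens per block, analogous to a page size

class PagedKVCache:
    """Toy model of a paged KV cache allocator (illustrative, not vLLM's implementation)."""

    def __init__(self, num_physical_blocks: int):
        # Free physical block indices, like free page frames in one big preallocated buffer.
        self.free_blocks = list(range(num_physical_blocks))
        # Per-sequence "page table": logical block index -> physical block index.
        self.block_tables: dict[int, list[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:
            # Sequence is new or its last block is full: map a fresh physical block on demand.
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        # Return the sequence's blocks to the free list, like freeing pages when a process exits.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

The payoff is the same as with virtual memory: because blocks are allocated on demand instead of reserving one contiguous max-length buffer per request, fragmentation drops and many more sequences fit in the same GPU memory.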
GitHub link to the project: https://github.com/Anirudh171202/WhiteLotus