Author here. A bit more context: by day I'm a systems engineer building AI networking infrastructure, so I kept ending up in conversations about inference tricks I couldn't quite wrap my head around.
For example, when someone mentioned vLLM's paged attention, I knew about virtual memory paging, but had no idea anyone had applied the same idea to KV cache allocation on GPUs.
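To make the analogy concrete, here's a toy sketch of the idea, not vLLM's actual code: the KV cache is carved into fixed-size blocks, and each sequence keeps a per-sequence "page table" mapping logical block indices to physical blocks, so its KV memory doesn't have to be contiguous. Names like `PagedKVCache`, `BLOCK_SIZE`, and the method signatures below are mine, purely for illustration.

```python
BLOCK_SIZE = 16  # tokens per block, analogous to a page size

class PagedKVCache:
    """Toy model of a paged KV cache allocator (illustrative, not vLLM's implementation)."""

    def __init__(self, num_physical_blocks: int):
        # Free physical block indices, like free page frames in one big preallocated buffer.
        self.free_blocks = list(range(num_physical_blocks))
        # Per-sequence "page table": logical block index -> physical block index.
        self.block_tables: dict[int, list[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:
            # Sequence is new or its last block is full: map a fresh physical block on demand.
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        # Return the sequence's blocks to the free list, like freeing pages when a process exits.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

The payoff is the same as with virtual memory: because blocks are allocated on demand instead of reserving one contiguous max-length buffer per request, fragmentation drops and many more sequences fit in the same GPU memory.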
GitHub link to the project: https://github.com/Anirudh171202/WhiteLotus