The other part of the paper, handling the sliding KV cache, compares favourably with prefix caching, sure, but we moved away from prefix caching for serving a while ago, with paged attention (which should really have been called paged KV cache, but oh well) offering a lot of interesting improvements in that area, including supporting parallel decoding extremely well.
And I do not care enough to compare the streaming cache with the paged attention cache directly: first because it's work they should have done, not me, and second because silently dropping tokens confuses and frustrates users enough that it puts me off investigating further.
But it can still be useful. Imagine a chat conversation between Assistant and User. Assume that the input for the next assistant response is just the past conversation turns (cut off to fit the context window).
So for turn 1 the input is:
User: (user turn 1)
For turn 2 the input is:
User: (user turn 1)
Assistant: (assistant turn 1)
User: (user turn 2)
Etc.
Now, what this allows you to do is reuse the attention computed from the previous turns (since the prefix is the same).
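To make the reuse concrete, here is a minimal sketch (mine, not from the paper or any serving library) of how you would work out which part of the new input is already covered by the cache. The tokens are made-up strings standing in for token ids, and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy "tokenized" inputs for turn 1 and turn 2 (illustrative only).
turn1 = ["User:", "hi"]
turn2 = turn1 + ["Assistant:", "hello", "User:", "how", "are", "you"]

reuse = shared_prefix_len(turn1, turn2)  # cache entries that stay valid
to_compute = turn2[reuse:]               # only these need a forward pass
print(reuse, to_compute)
```

Since turn 2 starts with the whole of turn 1, everything cached for turn 1 stays valid and only the new tokens have to be processed.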
In practice, people often have a system prompt before the conversation history, which (as far as I can tell) makes this technique not applicable (the input prefix will change as soon as the conversation history is long enough that we need to start dropping the oldest turns, otherwise the system prompt would get ignored).
In such a case, what you could do is cache at least the system prompt. This is also possible with https://github.com/OpenNMT/CTranslate2/blob/2203ad5c8baf878a...
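A toy illustration of why only the system prompt survives in the cache once truncation kicks in. This is a sketch under my own assumptions: `build_input` and `common_prefix_len` are hypothetical helpers, the budget is counted in made-up tokens, and none of this is the CTranslate2 API.

```python
def common_prefix_len(a, b):
    """Count how many leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_input(system, turns, budget):
    """Keep the system prompt; drop the oldest turns until the input fits."""
    kept = list(turns)
    while kept and len(system) + sum(len(t) for t in kept) > budget:
        kept.pop(0)
    return system + [tok for turn in kept for tok in turn]

system = ["SYS", "be", "brief"]
turns = [["U1", "a", "b"], ["A1", "c", "d"], ["U2", "e", "f"]]

before = build_input(system, turns[:2], budget=9)  # everything still fits
after = build_input(system, turns, budget=9)       # oldest turn gets dropped

# Only the system prompt is still a shared prefix after truncation.
print(common_prefix_len(before, after), len(system))
```

With a large enough budget nothing is dropped and the whole previous input remains a valid prefix; as soon as the oldest turn goes, the match stops right after the system prompt.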
"Their method cleverly exploits the LLMs' tendency to use initial tokens as "attention sinks" to anchor the distribution of attention scores. By caching initial tokens alongside recent ones, StreamingLLM restored perplexity and achieved up to 22x faster decoding than prior techniques." [1]
"We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more." [2]
"we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment." [2]
"StreamingLLM achieves an impressive speedup, reaching up to 22.2× per token. Despite its reduced latency, StreamingLLM sustains a memory footprint consistent with the re-computation baseline." [2]
[1] https://notes.aimodels.fyi/llm-infinite-context-window-strea...
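For anyone who wants the cache policy from those quotes in code form, here is a rough sketch of "initial tokens alongside recent ones". It only shows which positions stay in the KV cache; `kept_positions`, `n_sink` and `window` are names and values I picked for illustration, and it ignores how positional information is handled for the kept entries.

```python
def kept_positions(seq_len, n_sink=4, window=8):
    """Return the token positions kept in the cache:
    the first n_sink tokens (attention sinks) plus the most recent window."""
    if seq_len <= n_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

# After 20 tokens, the middle of the sequence has been evicted.
print(kept_positions(20))
```

The cache size is capped at `n_sink + window` entries no matter how long the stream gets, which is also why tokens in the middle are silently dropped.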
Would be interesting to see an application for this where you can have a more fluid conversation, with the ability to interrupt each other mid-sentence. I suppose this would require retraining or finetuning on transcribed natural vocal conversations between two people. It would probably also require a different structure than the current chat-based methods.