Efficient streaming language models with attention sinks

(github.com)

421 points · guywithabowtie · 2y ago · 65 comments

doctoboggan2y ago· 9 in thread

How do any of these sliding window techniques handle instructions that are unexpected and only show up at the end? For example, imagine feeding a book to the model with the last sentence being the instruction “return the count of the letter m in the previous input”. A human would handle this by first letting out an exasperated sigh, then restarting the reading while counting. An LLM has no ability to loop back and re-read the input. (Ignore LLM issues with character counting for this example.) It seems like to solve this problem for real the LLM needs to be able to loop and jump arbitrarily, but I’m sure that would introduce a whole new host of issues and possibly require a new architecture altogether.

namibj2y ago

On a similar note, I can't wait for LLMs to digest _all_ the research papers readable enough for them and accessible, "take notes" in an index-suitable format/structure, and then act similar to a human who'd done that over an obviously more limited corpus: respond to questions by translating them into relevant key words, looking them up, _skimming the contents again,_ and finding relevant information. Might not be useful, and thus necessitate further visits to the index/library.

With the needed preprocessing, a LLM that can "go and do some research to adequately respond" could be extremely powerful.

We've spent the last ~10 millennia improving knowledge management technology to scale beyond the capacity/time of individual brains. Let the language model use actual research on this and pre-digest, not just Bing search. No need for its short-term memory to remember which piece of code did what; just tag it when reading and rely on scalable shared indexing of tags.

Though the more I think about it, the more it sounds like normal LLM pretraining with the knowledge index being the giant chunk of LLM weights.

IanCal2y ago

One option would be similar to function calling, give the llm an output it can make that changes how the context is parsed. That's a layer on top rather than changing how the llm itself works.

omneity2y ago

Does an LLM need to loop back to re-read its input, even in a regular (read: non-sliding) context window?

Maybe I'm misunderstanding, but doesn't the hidden state solve the "lookup" problem in this case? In the sense that the LLM needs to ingest your entire input anyway before answering, then whether your instruction is at the front or at the end carries little impact besides on attention.

doctoboggan2y ago

It's my understanding that in regular non-sliding window context models the llm is able to pay attention to any part of the input when generating the output. The attention head is essentially able to jump back and forward to any point in its context window. This is what differentiates the attention mechanism from other models that use token proximity as a proxy for relevance.

tornato72y ago

Is it so hard to ask the user to put instructions at the beginning? Claude 100K asks users to put instructions at the end.

Or you just use a quick model to check if there are instructions at the end and bring them to the beginning.

Tostino2y ago

The fact that people are still treating it like entirely raw text input is insane to me. If you have a document, have a separate input for the user to paste/upload data, and then another for the user's instruction.

That allows you to do things like chunk the document while leaving the rest of their instruction alone, or do a sliding window of just the document while your instruction stays static.

alex_duf2y ago

The example seems like a weird edge case. I don't even know if current models are capable of this in a short input.

doctoboggan2y ago

Ignore the specific example of counting characters, I was just quickly coming up with a situation where the instruction is at the end of the input. Here is a better example:

Input the full text of a novel, then ask for a minor detail (e.g. the color of a car that is briefly mentioned in the middle of the book). Again, a human can do this by flipping back to the relevant section, but LLMs have no mechanism for this when using a sliding window attention scheme.

If the full input can fit in the context window then any LLM today would be able to extract the color of the car.

refulgentis2y ago

I agree, even just tokenization screws you here, I'm 95% sure. I.e. the raw input isn't letters but one of 100K integers that represent some set of letters.

That being said, that's probably a naive take, since we're seeing them do so much. And I bet we could get it to count correctly with at least some short input, and given infinite runs it's probably trivial (i.e. for N characters, split into N inputs, and for each one ask "say true if it is an M, false otherwise").

iandanforth2y ago· 5 in thread

My somewhat facetious take is that LLMs are trying really hard to reinvent RNNs and would do so if we just gave them the tools to do so.

obblekk2y ago

RNNs are the correct solution, but infeasibly expensive to run.

A different way to think about it is Transformer models are trying to predict which part of the RNN network is "worth" keeping given a resource constraint.

Transformers use a simple heuristic today (and this result makes the heuristic better). Just like many NP complete problems, there might be approximations that are not perfectly correct but still useful. Transformers prove that is the case for neural networks.

tkellogg2y ago

One such project is RWKV[1]. On the open source leaderboard it lived in the middle of the board for a while, so it really is a legit approach, it's just not hot.

[1]: https://huggingface.co/blog/rwkv

swyx2y ago

side note - do you think the open source leaderboard is a fair representation of the diversity of OSS models?

anon2912y ago

I think many people believe you. The main advantage of transformers over RNNs is training parallelization. RNNs are hard because training suffers from vanishing gradients and also because it's hard to get full utilization (needs large batches to get good utilization).

The existence of models like RWKV indicates that there is potentially a future in training like a transformer but inferring like an RNN.

Nevermark2y ago

Yes, indeedy.

Many things learned over the last three decades with smaller (the current terminology is "extremely tiny"! :) neural networks are being revisited for these large models.

cs7022y ago· 3 in thread

On a first quick pass, this looks so good that I'm wondering if it's too good to be true!

But the work looks to be of decent quality and the technique is remarkably straightforward:

The idea is to apply attention over the first token and a sliding context window, ignoring everything in-between, in each layer.

By implication, each layer must be gradually shifting relevant information forward in the sequence, enabling the top layer's ending sliding attention window to see it.

The only caveat I can think of is that the sliding windows won't be able to shift all important information forward when the span of all sliding windows isn't sufficient to span the entire sequence -- for example, when model depth × window length < sequence length, if all windows have the same length.

3abiton2y ago

Can't wait for the github repo adaptation of the method!

Nevermark2y ago

The end of the sequence could be padded with constant "neutral" values?

cs7022y ago

Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on, for a total of 50 trailing tokens (plus the initial token) of maximum trailing context in the top layer.

It's an inherent limitation of this approach.
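The arithmetic in this exchange can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not code from the StreamingLLM repository: the function names are hypothetical, and it assumes every layer uses the same window length and that the window counts the current token. Under that assumption the bound comes out slightly tighter than the layers × window figure quoted above (41 rather than 50 for the 10-layer, 5-token example), but the conclusion is the same: much less than the 100-token sequence is reachable.

```python
def max_trailing_context(num_layers: int, window: int) -> int:
    """Upper bound on how many trailing tokens can influence the last position.

    Assumption: each layer attends to the `window` most recent positions,
    including the current one. The initial sink token is not counted.
    """
    # Each layer reaches (window - 1) positions further back than the layer below.
    return num_layers * (window - 1) + 1


def covers_sequence(num_layers: int, window: int, seq_len: int) -> bool:
    """True if the stacked windows could, in principle, span the whole sequence."""
    return max_trailing_context(num_layers, window) >= seq_len


reach = max_trailing_context(num_layers=10, window=5)
print(reach, covers_sequence(10, 5, 100))  # prints: 41 False
```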

guywithabowtieOP2y ago· 3 in thread

We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.

stavros2y ago

Sorry, what does "up to 4 million tokens and more" mean? It seems like a contradiction.

catskul22y ago

Not really a contradiction so much as redundant/poorly worded. Should have said, "at least 4 million tokens".

jamesblonde2y ago

Here's a reference describing what a context window for LLMs is:

https://www.hopsworks.ai/dictionary/context-window-for-llms

bluecoconut2y ago· 2 in thread

I think people are misreading this work and assuming it is equivalent to full dense attention. It is just saying it's an efficiency gain over sliding window re-computation: instead of computing the L^2 cost over and over (T times), you can re-use a cache and maintain perplexity. I don't think they are claiming that this allows for attending to content that was far away.

They tested by concatenating and measuring -> `Q A Q A Q A Q A...`, not by doing `Q Q Q Q A A A A...`

They also measure perplexity, showing that it produces "readable text" (coherent, locally viable); not that it is "extracting anything" from the big-triangle-gap of no-attention.

I think this would fail if given a book and asked to write the first word of every paragraph. Or, given a book, to write a one-sentence summary of each chapter. I might be wrong, because they didn't test tasks like this, but I'd be very, very surprised.
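The efficiency point at the top of this comment can be made concrete with a rough count. This is only a sketch under stated assumptions: it counts attention scores for a single head in a single layer, it assumes the window is already full, and the function names are made up for illustration. It says nothing about what the model can recall, only how much work each strategy does.

```python
def scores_with_recomputation(num_tokens: int, window: int) -> int:
    """Attention scores computed when the window is rebuilt for every new token."""
    # One causal pass over `window` positions costs window * (window + 1) / 2 scores.
    return num_tokens * (window * (window + 1) // 2)


def scores_with_cache(num_tokens: int, window: int) -> int:
    """Attention scores computed when cached keys/values are re-used."""
    # Each new token is scored only against the cached window.
    return num_tokens * window


T, L = 1000, 1024
ratio = scores_with_recomputation(T, L) / scores_with_cache(T, L)
print(ratio)  # 512.5, i.e. (L + 1) / 2
```

The ratio grows linearly with the window length, which is the L^2-versus-L gap described above. Real speedups depend on everything else the model computes.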

bluecoconut2y ago

EDIT: the authors have updated the readme to add a clarified FAQ section that directly addresses this: https://github.com/mit-han-lab/streaming-llm#faq

Just tested it: this definitely doesn't seem to be giving enhanced context length. It does run quickly though. I can confirm it was using about 35 GB of an A100's RAM and pinned the usage for the entire duration.

I ran through by getting a book from Project Gutenberg, splitting it into paragraphs, and feeding them in paragraph by paragraph (asking it to say "okay" after each paragraph), then at the end asked some questions. It entirely hallucinated its answers. (Also note: in the ~10 min of playing with this, I couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond in English...)

https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...

fpgaminer2y ago

Correct, but to be fair to readers (like me) the use of the term "infinite-length inputs" is misleading.

Still, really interesting work. The most salient bit is the discovery shown in Figure 2, summarized as:

> (1) The attention maps in the first two layers (layers 0 and 1) exhibit the "local" pattern, with recent tokens receiving more attention. (2) Beyond the bottom two layers, the model heavily attends to the initial token across all layers and heads.

> surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task, as visualized in Figure 2. We term these tokens “attention sinks". Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.

StreamingLLM is basically a "hack" that fixes this odd behavior when we go around butchering the LLM's attention window.

This actually isn't the first time cracks have been shown in the usage of softmax and it makes me wonder if a different function might be better if we want context-length flexible LLMs.

huevosabio2y ago· 2 in thread

This seems to be largely enabled by the observation that Softmax has to add up to one. From a quick glance [1], the model tends to use the first token as a placeholder for cases when you don't need to attend to any of the prior tokens.

The first time I read about this issue, that Softmax is somewhat flawed, was in a HN post by Evan Miller [2] where he observes that forcing attention heads to allocate all attention to prior tokens is wrong, and we should allow them to "not attend" by adding one to the softmax denominator.

I love that they found a way to capitalize on this observation without having to retrain models. However, I wonder what the models would look like if they followed Evan's suggestion!

[1] Their description of attention sinks:

```

To understand the failure of window attention, we find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task, as visualized in Figure 2. We term these tokens “attention sinks". Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.

```

[2] https://news.ycombinator.com/item?id=36851494
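For readers who want to see the difference, here is a small sketch of standard softmax next to the "add one to the denominator" variant described above. It is plain Python for illustration, not taken from the paper or from the linked post; the function names and the example scores are assumptions.

```python
import math


def softmax(scores):
    """Standard softmax: the outputs always sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_plus_one(scores):
    """exp(s_i) / (1 + sum_j exp(s_j)): the outputs may sum to less than one."""
    # Shift by m >= 0 for numerical stability; the extra 1 becomes exp(-m).
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]


# A query with no strong match: every score is low.
scores = [-4.0, -5.0, -6.0]
print(sum(softmax(scores)))           # forced to sum to one
print(sum(softmax_plus_one(scores)))  # free to stay close to zero
```

With standard softmax the head has to put that weight somewhere, which is the pressure the quoted passage blames for attention sinks. With the extra term in the denominator the head can effectively attend to nothing.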

huevosabio2y ago

Actually, seems like they did try the suggestion out, basically by training a model with a dedicated sink token with all zeros.

The verdict seems to be that you still end up with other initial tokens being used as sinks, so it is better to have a dedicated sink token.

fpgaminer2y ago

That was the first time I'd read about it on HN, but as pointed out on that HN post it wasn't the first time Softmax + 1 was proposed. And, AFAIK, it has never resulted in better performance in practice. Maybe Softmax + 1 works better for fiddling with the attention window after training, but I don't know if anyone has tested that at scale.

Van_Chopiszt2y ago· 2 in thread

The authors just uploaded a FAQ section, which may clarify some of the confusion: https://github.com/mit-han-lab/streaming-llm/blob/main/READM...

bluecoconut2y ago

Nice update. I think the key question they added that clarifies a lot is #3 (quoted below)

    Can I input an extensive text, like a book, into StreamingLLM for summarization?

    While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful. As emphasized earlier, we neither expand the LLMs' context window nor enhance their long-term memory. StreamingLLM's strength lies in generating fluent text from recent tokens without needing a cache refresh.

antupis2y ago

So instead of chunks of tokens, we can input a stream of tokens and then at some point say "LLM, take the wheel". So it is very nice but not revolutionary.

WhatsName2y ago· 2 in thread

So I can let llama2 summarize books now or are there any non-obvious caveats to this approach?

Tostino2y ago

If you want to do that, I have a model trained specifically on a dataset of building recursive summaries. Some of my training documents are 40-50k tokens.

https://huggingface.co/Tostino/Inkbot-13B-8k-0.2

Just chunk your document up and pass in the prior summary along with the current chunk of text. You can mention that it is chunk X of Y if you want (which often helps with how it starts the summary).
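The loop described above might look roughly like this. It is a minimal sketch, not the Inkbot prompt format: `summarize` is a hypothetical callable standing in for whatever model call you use, the prompt wording is invented, and chunking by character count is a simplification (a real setup would chunk by tokens).

```python
from typing import Callable, List


def chunk_text(text: str, chunk_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def rolling_summary(
    text: str,
    summarize: Callable[[str], str],  # hypothetical: prompt in, summary out
    chunk_chars: int = 4000,
) -> str:
    """Summarize a long text by carrying the prior summary into each chunk."""
    chunks = chunk_text(text, chunk_chars)
    summary = ""
    for idx, chunk in enumerate(chunks, start=1):
        prompt = (
            f"This is chunk {idx} of {len(chunks)}.\n"
            f"Summary so far:\n{summary}\n\n"
            f"New text:\n{chunk}\n\n"
            "Update the summary to include the new text."
        )
        # The updated summary replaces the old one and is fed to the next chunk.
        summary = summarize(prompt)
    return summary
```

Only the running summary and one chunk are in the prompt at any time, so the document length is bounded by patience rather than by the context window.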

Sharlin2y ago

No. This does nothing to the context length itself which is still a sliding window.

ilovefood2y ago· 2 in thread

This is working relatively well, the code is really worth a read. If you run it locally, consider the open PR and install sentencepiece as well. It's been generating text for the past 10 minutes now :D

Some of the instructions are ignored though, so I'd be careful there. One instruction is to rewrite the previous response by "starting every sentence with the letter A", which is a bit hit or miss right now.

guywithabowtieOP2y ago

How is the content quality?

ilovefood2y ago

It's okay I have to say. I just ran out of memory on my 4090, so I had to retry on an A100. Here's an extract: https://pastebin.com/pzLfCFWt

I think something might be off with the example. Can't wait for this stuff to work on llama.cpp. Going to try it with mistral & stable lm now, thankfully tomorrow is a holiday in Germany :)

smeeth2y ago· 1 in thread

Adding attention cache memory is an extremely interesting solution to this problem.

If anyone is curious, there was another paper [0] that came out a few days ago that made a related observation in Vision Transformers. Transformer models appear to pick tokens to store global information in - they need tokens to "think". You can eke out some performance improvements (and cool explanation images) by providing the model with specific tokens for this purpose.

[0] https://arxiv.org/pdf/2309.16588.pdf

Nevermark2y ago

It would be an interesting place to add additional units to an already trained model, to continue training and get better performance, or for fine-tuning.

For tuning, keep the original model parameters fixed, and only let the model adjust parameters to and from new "tuning" cache units.

This would allow different tuning unit sets to be swapped in, or even used together. Foul language avoidance units + specific terminology units + be concise units, etc.

Mix and match tuned unit sets, like super prompts.

If the number of new parameters is low enough, higher order optimization (requiring higher memory) might be a possibility for very fast and effective tuning.

And maybe grow the sequence length, and number of units, during training. A few units for short sequences. Then increase training sequence length, add more units, continue training, and so on.

Perhaps some kind of performance or gradient analysis could govern cache expansion, so an arbitrary schedule is not required.

Filligree2y ago· 1 in thread

Okay, what's the downside this time?

a_wild_dandan2y ago

Allegedly not “efficiency or performance”, though I’m skeptical. Will dig into this later and update my comment (if I remember).

13years2y ago· 1 in thread

So can it now understand and write complete applications?

Jeff_Brown2y ago

It seems hard to imagine, if its training has been on small chunks of text, that the model has a way of understanding a large codebase.

But this stuff keeps on surprising me.

choeger2y ago· 1 in thread

Did anyone ever attempt a recursive architecture?

So you take the first window or logical separation (chapter, paragraph) and let the model summarize it into one or two sentences. Then you repeat that with the next window (and that derived sentence as context) and create a new logical separation out of a fixed number of sentences. Rinse and repeat until the result fits into your window.

I have a hunch that this is somewhat how the brain works when reading.

brrrrrm2y ago

Hierarchical attention gets you close https://arxiv.org/abs/2210.05529

kridsdale32y ago· 1 in thread

What if my "Favorite LLM" is GPT4? I don't want to use Llama or anything like that. Does this GitHub code let me use the OpenAI API and run the new memory technique on top of that?

doctoboggan2y ago

No, it does not

foota2y ago

I could be wrong, but I'm not sure this is about what people seem to think it is, e.g., letting LLMs reference content past the trained length

I think it may just be about the performance of the model with longer texts (on the things still within the context window?). It sounds like they're arguing that the model is essentially learning to stick some baggage in the attention to the initial tokens of the text, and breaks when that isn't within the window anymore, for reasons I'm not sure I understand (after all, isn't text in the middle just as good as text at the start for non-instruction inputs?).

__rito__2y ago

Relevant: the eponymous Professor Han at MIT is teaching a TinyML course that is open to the public. See:

- https://news.ycombinator.com/item?id=37620507

- https://efficientml.ai

refulgentis2y ago

This looks fantastic. Also answers the relevancy of the "off-by-one" softmax*

My naive question is...does it work? But that sounds dismissive. At length:

It shows that the model can't respond after a certain length versus a proposed model that does continue to respond.

But can a model that continues to respond retrieve information far "in the past"?

The demo video is too low-level, at least to my brain. It shows one model stops responding but the proposed one continues.

I spent about 5 minutes going frame by frame to see if the proposed model attempts to have to "recall" information from further back, but it looks like no.

Perfection here isn't necessary or even possible AFAIK, i.e. I don't expect it to recall page 1 100% accurately at page 1000. But can it recall _anything_ from it, even if it ignores it?

The great thing about this era and work is we can check. But I hope someone has it up in a HuggingFace space before I figure out how to run it myself. :P

I'm leaning no, based on the sliding window thing. It sounds like there's 4 fixed tokens, then the last context size - 4 tokens, that's it

* at the time, two camps: one, it's some random person saying it and there's prior art on implementations that do the off-by-one. Two, you'd be surprised how much little things go unnoticed by large groups, and do matter.
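The cache layout guessed at above (a few fixed initial tokens plus the most recent ones) can be written down as a position filter. This is a sketch of that description only, not the repository's implementation; the function name is hypothetical and the default of 4 sink tokens comes from the comment.

```python
from typing import List


def kept_positions(seq_len: int, cache_size: int, num_sinks: int = 4) -> List[int]:
    """Token positions still in the cache after seq_len tokens have been seen.

    Keeps the first num_sinks positions plus the most recent
    (cache_size - num_sinks) positions; everything in between is evicted.
    """
    if cache_size <= num_sinks:
        raise ValueError("cache_size must be larger than num_sinks")
    if seq_len <= cache_size:
        # Nothing has been evicted yet.
        return list(range(seq_len))
    recent = cache_size - num_sinks
    return list(range(num_sinks)) + list(range(seq_len - recent, seq_len))


print(kept_positions(seq_len=20, cache_size=8))  # [0, 1, 2, 3, 16, 17, 18, 19]
```

Positions 4 through 15 are simply gone in this example, which is why recalling "page 1 at page 1000" looks unlikely under this scheme.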

dheera2y ago

I feel like information theory prevents full information retention for unlimited context lengths and finite compute, but I don't know if we are at information theory limits to invoke this argument. Or rather, I don't know how to make a good analysis of (bits of context information) per (bits of model parameters).

torginus2y ago

Is it just me, or does every approach basically boil down to not wanting to pay the full quadratic cost over the context (usually by selecting which tokens to pay attention to, or using some computationally cheaper substitute for each token)?

I feel like all these approaches are kind of equivalent to a fully dense attention matrix over a smaller context, but carefully curating what goes into the context, also known to us humans as summarizing each bit of text, or (perhaps less efficiently) going through a textbook with a highlighter.

My intuition is that the winning approach will be a small(ish), let's say 8k, context with an efficient summarization and dynamic information retrieval scheme.

idiotsecant2y ago

This is a big claim, curious to see what the caveats are.

Trapais2y ago

Looks like Longformer to me. They just renamed "global attention" to "attention sink" and removed the silly parts (dilated attention) and the BERT parts ([CLS] saw all N tokens; there is no need for BOS to see all tokens).

regularfry2y ago

Anyone got a gut feel as to whether you could use this to transform Whisper into a better streaming model? It's a bit of a hack using it that way at the moment.

heavyarms2y ago

Having only read the abstract, I'm probably way off the mark here, but my first thought was: LLM + LSTM.
