StreamingLLM: tiny tweak to KV LRU improves long conversations

(news.mit.edu)

91 points · lucasluitjes · 2y ago · 8 comments

TrueDuality · 2y ago · 2 in thread

There was a really interesting post a while ago about adjusting the softmax function so that attention heads are allowed to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and run tests on this, but boy, time gets away from you...
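
For anyone who wants to poke at it, here is a minimal sketch of the off-by-one softmax from that post, in plain Python. The function names softmax and softmax1 are just illustrative choices, not from any library. The only change is an extra 1 in the denominator, so the weights can sum to less than 1 when no score is large.

```python
import math

def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    """Off-by-one softmax: exp(x_i) / (1 + sum_j exp(x_j)).

    Shifting by m = max(max(scores), 0) keeps exp() from overflowing;
    after the shift, the extra 1 in the denominator becomes exp(-m).
    """
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

# When every score is very negative, a head using softmax1 can "abstain".
weak = [-8.0, -9.0, -10.0]
print(sum(softmax(weak)))   # standard softmax still spends all its attention
print(sum(softmax1(weak)))  # off-by-one softmax spends almost none
```

With strongly positive scores the two behave almost identically, since the extra 1 is then negligible next to the exponentials.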

zorgmonkey · 2y ago

Feel free to mess with it. His tweak to softmax was actually supported by PyTorch before the article was written, but it's off by default. Maybe it needs to be more widely used though; after all, good ideas are often independently discovered multiple times. Details are in this tweet https://twitter.com/SamuelMullr/status/1683582347793530884 or, if you don't like Twitter, the option is add_zero_attn for PyTorch's MultiheadAttention.
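
A rough sanity check of why that option lines up with the article's tweak: appending an all-zero key and an all-zero value adds one logit equal to 0 (so exp(0) = 1 lands in the denominator), and the zero value contributes nothing to the output. The sketch below is plain Python with made-up helper names (attend, attend_softmax1), a single query, a single head, and no scaling. It is not PyTorch's actual code path; the add_zero_attn argument here only mirrors the name of the PyTorch option.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def attend(query, keys, values, add_zero_attn=False):
    """Dot-product attention for one query, using the ordinary softmax."""
    if add_zero_attn:
        # Assumption for illustration: one extra all-zero key/value pair.
        keys = keys + [[0.0] * len(query)]
        values = values + [[0.0] * len(values[0])]
    logits = [dot(query, k) for k in keys]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return weighted_sum([e / total for e in exps], values)

def attend_softmax1(query, keys, values):
    """Same attention, but with an extra 1 in the softmax denominator."""
    logits = [dot(query, k) for k in keys]
    denom = 1.0 + sum(math.exp(l) for l in logits)
    return weighted_sum([math.exp(l) / denom for l in logits], values)

query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

a = attend(query, keys, values, add_zero_attn=True)
b = attend_softmax1(query, keys, values)
print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))
```

The two outputs agree up to floating-point error, which is the sense in which a zero key/value slot gives the head a way to attend to "nothing".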

magicalhippo · 2y ago

Interesting! HN discussion of it here: https://news.ycombinator.com/item?id=36851494

gremlinsinc · 2y ago · 2 in thread

I wonder if it could make sense to have breakaway bots, where at 10k tokens a new one launches with the first 2k tokens, the last 1k tokens, and a table of contents, so that when you go back to something you're handed off to the model where that data is more strongly reinforced, or something like that. Sort of like mixture of experts, but each one is only an expert on an individual snippet of a long conversational thread.
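
To make the handoff idea concrete, here is a tiny sketch of just the context selection, assuming the conversation is already a flat list of tokens. The function name handoff_context and its parameters are hypothetical, the 10k/2k/1k numbers are the ones from this comment, and the table of contents part is left out entirely.

```python
def handoff_context(tokens, limit=10_000, head=2_000, tail=1_000):
    """Return the seed context for a fresh bot, or None if no handoff is due.

    Keeps the first `head` tokens and the last `tail` tokens once the
    conversation has reached `limit` tokens.
    """
    if head + tail > limit:
        raise ValueError("head + tail must fit within limit")
    if len(tokens) < limit:
        return None  # still under the limit, keep using the current bot
    # Slice from an explicit index so that tail=0 keeps nothing from the end.
    return tokens[:head] + tokens[len(tokens) - tail:]

conversation = list(range(10_000))  # stand-in for real token ids
seed = handoff_context(conversation)
print(len(seed), seed[0], seed[-1])
```

Keeping the opening tokens plus a recent window is loosely similar in shape to what the linked article describes, though there it happens inside the KV cache of a single model rather than across separate bots.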

kgeist · 2y ago

Here they simply used different models for different turns and apparently it gave more "engaging" results:

https://arxiv.org/abs/2401.02994

joshspankit · 2y ago

You’re right: A lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same “neuronal neighborhood” as the previous conversation.

popinman322 · 2y ago

Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932

Translationaut · 2y ago

This seems to work only because large GPTs have redundant, undercomplex attention. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128
