visarga · 2y ago

Vanishing gradients were an issue for non-residual deep networks and vanilla RNNs. The long-context memory issues, however, are along the sequence dimension, not network depth.

The problem could be some kind of instability in attention as it scales above 10k tokens. A recent paper suggests the attention mechanism needs a default value (a "sink"), and that its absence produces instability.

https://arxiv.org/abs/2309.17453
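
To make the "default value" idea concrete, here is a minimal sketch. It is not the linked paper's implementation, only a toy illustration of why softmax attention might want a sink: standard softmax weights must sum to 1, so a query that matches nothing still has to put all its mass on some token. The function names, the zero sink logit, and the scores are all assumptions made up for this example.

```python
import math

def softmax(scores):
    """Standard softmax: the returned weights always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_sink(scores, sink_logit=0.0):
    """Softmax over the scores plus one extra "sink" slot (hypothetical helper).

    Returns (weights_for_real_tokens, sink_weight). The sink absorbs
    probability mass when no real token scores highly.
    """
    weights = softmax(list(scores) + [sink_logit])
    return weights[:-1], weights[-1]

# Made-up attention scores for a query that matches none of the keys well.
scores = [-4.0, -5.0, -4.5]

plain = softmax(scores)
real, sink = softmax_with_sink(scores)

print("plain softmax, mass on real tokens:", sum(plain))
print("with sink, mass on real tokens:", sum(real))
print("with sink, mass on the sink slot:", sink)
```

Without the sink, the three weak matches are forced to share all of the attention mass; with it, most of the mass lands on the sink slot and the real tokens are nearly ignored.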

Another paper says the middle of the context is lossy, while the beginning and end are better attended.

kristopolous · 2y ago

That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?

visarga (OP) · 2y ago

Just reading a couple of papers every day, the most interesting ones, and following Reddit and Twitter to see what people are talking about. I'm also directly interested in long-context LLMs for a work-related task.

I have also been dabbling with neural nets (pre-transformer), especially LSTMs, which have a "residual" connection, the one I was mentioning. That makes gradients better behaved. Schmidhuber tech.
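
For reference, a sketch of the standard LSTM cell-state update, which is the additive path being called a "residual" connection here. The symbols are the usual textbook ones and do not come from this thread: f_t is the forget gate, i_t the input gate, \tilde{c}_t the candidate state, and \odot the elementwise product. The Jacobian is only approximate, since it ignores how the gates themselves depend on earlier state through h_{t-1}.

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
\qquad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
```

When f_t stays close to 1, the gradient passes back through many time steps along this path without shrinking the way it does through the repeated tanh and weight-matrix products of a vanilla RNN.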

totoglazer · 2y ago

Not to denigrate the person you’re responding to, but to add some context: that paper has already gotten a decent amount of attention. It's probably one of the more notable in the literature over the last month. Plus, compared to the past year, everything is slow now.

sandkoan · 2y ago

For anyone who's curious, the paper in question is "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172).
