I see this claim everywhere online, and it's often taught this way, so I don't blame folks for repeating it, but I suspect it's promulgated by folks who don't train LSTMs on long contexts.
LSTMs do add something like a "skip-connection" (before that term was a thing), the additive cell-state update, which helps deal with the catastrophic vanishing gradients you get right from the jump with e.g. Jordan RNNs.
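To make that concrete, here's a minimal sketch of one step of the standard LSTM cell update (not any particular library's kernel; the shapes and weight layout are just illustrative), showing where that additive, skip-like path lives:

    # Minimal sketch of one LSTM cell step; weights/shapes are illustrative.
    import torch

    def lstm_cell(x, h_prev, c_prev, W, U, b):
        # gates: input (i), forget (f), candidate (g), output (o)
        i, f, g, o = (x @ W + h_prev @ U + b).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        # The additive cell-state update: the gradient flowing through c_prev
        # is only scaled by the forget gate f, not squashed through a tanh at
        # every step. That's the "skip"-like path.
        c = f * c_prev + i * g
        h = o * torch.tanh(c)
        return h, c

The catch is that the forget gate still multiplies the cell state at every step, so over enough timesteps the gradient along that path can still shrink.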
However (!), while this stops us from seeing vanishing gradients after e.g. 10s or 100s of time-steps, once you get into multiple 1000s of tokens the wheels start falling off. I saw this in my own research: training on amino acid sequences of length ~3,000 led to a huge amount of instability. It was only after tokenizing the amino acid sequences (which was uncommon at the time), which got us down to ~1,500 timesteps on average, that we started seeing stable losses during training. Check out the ablation at [0].
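If you want to see the effect for yourself, a rough diagnostic along these lines (the sizes here are made up, not the ones from the paper) shows the gradient reaching the first timestep shrinking as sequences get longer:

    # Feed random sequences of increasing length through an LSTM and check
    # how much gradient reaches the first timestep. Sizes are illustrative.
    import torch

    lstm = torch.nn.LSTM(input_size=64, hidden_size=256, batch_first=True)
    for T in (100, 500, 1500, 3000):
        x = torch.randn(1, T, 64, requires_grad=True)
        out, _ = lstm(x)
        out[:, -1].sum().backward()           # "loss" at the final timestep
        print(T, x.grad[:, 0].norm().item())  # gradient reaching timestep 0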
You can think of ResNets by analogy. ResNets didn't "solve" vanishing gradients either; there's still a practical limit on network depth. But they went a long way towards dealing with the problem.
EDIT: I wanted to add that while I was trying to troubleshoot this for myself, it was super hard to find evidence online of why I was seeing instability. Everything I could find pertaining to "vanishing gradients" and LSTMs was blog posts and pre-prints that just merrily repeated "LSTMs solve the problem of vanishing gradients". That made it hard for me, a junior PhD student at the time, to suss out the fact that LSTMs do demonstrably and reliably suffer from vanishing gradients at longer contexts.
[0] https://academic.oup.com/bioinformatics/article/38/16/3958/6...