1. I'm not clear on the point of this paper.
There are a lot of buzzwords and an extremely diverse set of references. The heart of the paper seems to be a comparison between Long Short-Term Memory (LSTM) recurrent nets and their NTM nets. But they don't expose the networks to very long sequences, or to sequences broken by arbitrarily long delays, which are what LSTM nets are particularly good at. They seem to make the jump from "LSTM nets are theoretically Turing-complete" to "LSTM nets are a good benchmark for any computational task."
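For concreteness, the kind of long-delay benchmark I have in mind looks something like this — a hypothetical data generator (names and parameters are my own, not anything from the paper): present a short bit pattern, then an arbitrarily long stretch of blank inputs, and require the net to reproduce the pattern afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

def delayed_copy_example(seq_len=5, max_delay=50, width=4):
    """One training example for a copy-after-delay task:
    the net sees a short random bit pattern, then `delay` steps
    of blank input, and must emit the pattern at the end."""
    pattern = rng.integers(0, 2, size=(seq_len, width)).astype(float)
    delay = rng.integers(1, max_delay + 1)
    blanks = np.zeros((delay, width))
    inputs = np.concatenate([pattern, blanks])  # pattern, then silence
    targets = pattern                           # recall after the delay
    return inputs, targets

x, y = delayed_copy_example()
```

Crank `max_delay` up and a plain RNN forgets; LSTM's gated cell is built to bridge exactly that gap, which is why its absence from the experiments stands out.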
2. The number of training examples seems huge
For many of the tasks they trained over hundreds of thousands of sequences. That seems like very, very slow learning. If I'm meant to interpret these results as a network learning a computational rule (copying, sorting, etc.), is it really that impressive if it takes 200k examples before it gets it right? (Not sarcasm, I really don't know.)
Re: number of training examples, I'm taking the chart on pg. 11 to mean the number of training examples shown. Based on that, it looks like the NTM is learning a lot faster than the LSTM. As far as I can tell, it's getting near-zero loss about 20,000 examples in? Whether learning with 20k examples is impressive depends on the domain; personally, I think it's comparatively impressive.
Re: cherry-picking of tasks to highlight the perceived strengths of the NTM, fair enough. Although this is one I'll be playing around with a bit to find out where that starts and stops...
Any thoughts on how this compares to the approach of hierarchical temporal memories (HTMs)?
- Re: "buzzwords...references": I don't see any buzzwords; in fact, the word "deep" doesn't even appear in the text. As for the references, a typical conference paper cites a bunch of related papers written by people who might be reviewing it. This paper, on the other hand, cites some seminal work from other fields, which is more interesting and enriching for most readers.
- Re: the point of the paper: how to design a learning computer that can access a long-term memory store of large capacity, and which can be optimized by gradient descent. (I.e., everything is differentiable.)
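A minimal sketch of what "everything is differentiable" means for memory access: instead of indexing one discrete slot, a content-based read blends all memory rows with softmax attention weights, so gradients flow through the whole operation. (This is illustrative only; the actual NTM addressing also includes location-based shifting and sharpening.)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta=1.0):
    """Soft, content-based read: every memory row contributes,
    weighted by its cosine similarity to the query key, so the
    read is differentiable w.r.t. both the key and the memory."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sims)  # attention weights over rows (sum to 1)
    return w @ memory         # blended read vector

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
key = np.array([1.0, 0.1])
r = content_read(memory, key, beta=5.0)
# r lands close to the first row, but every row contributes a little
```

Raising `beta` sharpens the weighting toward a hard, one-slot lookup; keeping it finite is what keeps the "tape" trainable end to end.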
- Re: the number of training examples being huge: training neural networks often takes a huge number of iterations, and the problems considered in the paper are numerically challenging, so the iteration count is not surprising. Also, just like the regular Turing machine, the "neural Turing machine" isn't the most efficient architecture, but it's conceptually the simplest one that has the desired properties.
I haven't read this paper fully yet, but it seems to be an attempt at simplifying RNNs by replacing some of the magic internal state, which tends to make them hard to reason about, with a more direct memory architecture.
Also, it's somewhat hard to train a network so that it can remember much information. (I'm not sure if much was done to measure this, but my gut feeling is that an RNN layer with about 500 neurons can be trained with the standard methods to effectively store maybe 10 bits of information.) The problem is that training can easily become very unstable.
The LSTM cell is already somewhat better in this regard. But of course, this is still finite memory, and you cannot have much more than 500-1000 LSTM cells in one layer, because training then becomes too computationally expensive. (You could introduce a bottleneck projection layer, as Google did recently, and then get it to maybe 2000 LSTM cells.) Maybe count one LSTM cell as one bit of information (again, this number is very much out of thin air).
This is much less memory than your PC has (which of course is also finite, but it's so huge that you can count it as infinite, and as powerful as a Turing machine).