But there was nothing new about architecture: essentially it is the choice of multidimensional function one tries to fit. In physics we have been fitting functions for hundreds of years. If you look at an experimental plot of, say, interference, you might decide to fit a sinusoid plus a constant background, etc. Fitting the right kind of function obviously matters, but we didn't know about algorithmic differentiation for hundreds of years (and it surely would have been welcome back then: even performed by hand, it beats trial-and-error gradients).
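To make the sinusoid example concrete, here is a minimal sketch in plain Python. Everything in it is my own illustration: the parameter names (A, w, phi, c), the synthetic data and the starting guess are assumptions, and the gradient is the chain rule written out by hand rather than a real automatic differentiation tool. It compares that gradient with the "trial and error" kind, where you nudge each parameter and re-evaluate the loss.

```python
import math

def model(params, x):
    # Sinusoid plus constant background: A*sin(w*x + phi) + c
    A, w, phi, c = params
    return A * math.sin(w * x + phi) + c

def loss(params, xs, ys):
    # Mean squared error between model and data
    return sum((model(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_by_hand(params, xs, ys):
    # Chain rule applied by hand: one pass over the data yields all four partials
    A, w, phi, c = params
    g = [0.0, 0.0, 0.0, 0.0]
    n = len(xs)
    for x, y in zip(xs, ys):
        s = math.sin(w * x + phi)
        co = math.cos(w * x + phi)
        r = 2.0 * (A * s + c - y) / n  # d(loss)/d(prediction) for this point
        g[0] += r * s                  # d/dA
        g[1] += r * A * co * x         # d/dw
        g[2] += r * A * co             # d/dphi
        g[3] += r                      # d/dc
    return g

def grad_by_probing(params, xs, ys, h=1e-6):
    # Central finite differences: two extra loss evaluations per parameter
    g = []
    for i in range(len(params)):
        up = list(params)
        dn = list(params)
        up[i] += h
        dn[i] -= h
        g.append((loss(up, xs, ys) - loss(dn, xs, ys)) / (2.0 * h))
    return g

# Hypothetical "experiment": a clean sinusoid on a constant background
xs = [0.1 * i for i in range(100)]
ys = [1.5 * math.sin(2.0 * x + 0.3) + 0.7 for x in xs]

guess = [1.0, 1.8, 0.0, 0.0]
exact = grad_by_hand(guess, xs, ys)
approx = grad_by_probing(guess, xs, ys)
print(exact)
print(approx)
```

Both give the same numbers up to rounding, but probing needs two full loss evaluations per parameter, while the chain-rule version gets every partial from one pass over the data. That gap in cost is what grows with the number of parameters.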
That reverse-mode (RM) automatic differentiation is simple is easy to say in hindsight!
I don't think a richer diversity of functions is a bad idea, but it's already being used: softmax, exponents, sums, squares, ... Why not perform gradient descent over a differentiable family of functions that encompasses these?
It's really disingenuous to pretend RM AD was so very simple and then watch approvingly as someone throws it out the window and reverts to ... genetic programming? You want to let the computer find the best functions? Fine, but then give the computer a superfunction that, for certain values of an extra parameter, differentiably reaches the functions you want to be considered.
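A rough sketch of what I mean by a superfunction, assuming the Box-Cox power family (x**lam - 1)/lam as a stand-in: lam = 1 gives x - 1, lam = 2 gives (x**2 - 1)/2, and lam -> 0 gives log(x). The family choice, the helper names (family, dfamily_dlam, fit_lam), the data grid, learning rate and step count are all my own assumptions for illustration. The derivative with respect to lam is again written by hand, and plain gradient descent on lam then picks out which member of the family generated the data.

```python
import math

def family(x, lam):
    # Box-Cox style family: (x**lam - 1) / lam, with log(x) as the lam -> 0 limit
    u = math.log(x)
    if abs(lam) < 1e-4:
        return u + lam * u * u / 2.0       # first-order series around lam = 0
    return math.expm1(lam * u) / lam       # expm1(lam*u) equals x**lam - 1

def dfamily_dlam(x, lam):
    # Partial derivative of family(x, lam) with respect to the extra parameter lam
    u = math.log(x)
    if abs(lam) < 1e-4:
        return u * u / 2.0 + lam * u ** 3 / 3.0
    e = math.expm1(lam * u)                # x**lam - 1
    return ((e + 1.0) * u * lam - e) / (lam * lam)

def fit_lam(xs, ys, lam=1.0, lr=0.2, steps=2000):
    # Gradient descent on the mean squared error, over lam only
    n = len(xs)
    for _ in range(steps):
        g = sum(2.0 * (family(x, lam) - y) * dfamily_dlam(x, lam)
                for x, y in zip(xs, ys)) / n
        lam -= lr * g
    return lam

xs = [0.5 + 0.05 * i for i in range(51)]   # positive inputs only

# Data secretly generated by log(x): descent should drive lam towards 0
lam_log = fit_lam(xs, [math.log(x) for x in xs])

# Data secretly generated by (x**2 - 1)/2: descent should drive lam towards 2
lam_sq = fit_lam(xs, [(x * x - 1.0) / 2.0 for x in xs])

print(lam_log, lam_sq)
```

No search over a discrete menu of functions is involved: the choice between "log" and "square" becomes one more continuous parameter that the gradient can move.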
Most of the architectures ... end up looking suspiciously like plain old statistical physics! We repeatedly witness how yet another introductory statistical physics expression turns out to perform well on very general sets of tasks (it really comes across as if everything should be treated like a dumb mole of water, and we never tried before because we simply refused to believe it could be that simple).