On a superficial level it seems like it:
1. Generalizes deep learning to an optimization function on decomposable input, and
2. Reduces the number of parameters required to learn the input by exploiting the structure of the input, thereby making learning more efficient.
Is that correct? Is it completely off? What am I missing? Is there any more meat to the article than this?
Could someone who has upvoted this (and ideally understands the topic well) provide a different explanation of the concept? It would be great if I could see a real world example (even a relatively trivial one) represented in both the traditional matrix computation form and the sexy new differentiable form.
When I mean work well, I mean a family of models which are generally easy to describe as well as be generally easy to train. Nearly all neural network models fit this bill as a they can be described with a loss function. A loss function being something which scores how accurately the current NN performs with a particular set of parameters. This loss function is then usually differentiable, and thanks to autodifferentiation tools, you can obtain the derivative of this function with respect to some parameters in this function. With this derivative, you can go ahead and run SGD to find the optimal parameters. We have now both empirical and theoretical evidence that SGD works fairly well for optimizing functions, particularly the high-dimensional ones neural networks describe.
So, the insight is we need a name to describe all these models which are differentiable and nice to train because of that. So now we can say what really ties CNNs, MLPs, RNNs, etc is not some biological metaphor, but that these are all expressible as some loss function we can find the minimum of using SGD.
Suppose you had a flexible model in whose parameter space you could search. A naive algorithmic approach would be to treat this as a brute force search problem. Maybe you can be slightly more clever algorithmically (better search algorithms, heuristics, etc)
But can you somehow be really clever and exploit the fact that the parameter space is continuous, with a smooth cost function? That's what gradient descent basically is. Lots of bells and whistles on top, to exploit extra problem structure.
Now, let's go back to our family of models. They just need to be transparent to gradient information -- then you could use simple ideas at the heart of adaptive control theory. Suddenly, the fact that autodifferentiation techniques can differentiate arbitrary computational algorithms (sibling comment explains how compositionality helps this) sounds mouth-watering! That's what makes a lot of people excited -- your model family just expanded to programs performing arbitrary computations on continuous parameters (including loops!) -- not just some silly function approximator like a natural network (exaggerating to make my point)
If you have a model and somecparameter space on it your trying to optimise then you can’t do much other than search the whole parameter space or approximate such a search intelligently.
However, if your model is differeniable with respect to its parameters then it means you’ve got gradient ... and if you’ve got gradient then you can cut out a lot of searching because you can use gradient ascent and other such techniques .
So basically: differentiable means can be differentiated which means has gradient which means search less.
The rest of the article basically says that’d be nice because there is a way to think about differentiability modularly and therefore there may be a route to Lego things together from across fields to make uber models which still optimise well but use specialist knowledge.
The parameter space stuff just means that less knowledge means more data and more parameters which is bad.
No. It has nothing to do with diffeomorphisms (which are necessarily between spaces of the same dimension), but with piecewise differentiable functions.
It sounds like you're just describing an evolution of functional programming; where each area of computation in the program is contextually generalizable, because its set of inputs and outputs is smooth and differentiable.
One followup question: are you sure the maps need to be (or are generally intended to be) isomorphic? That strikes me as very limiting. I follow your point about compositions of diffeomorphisms being diffeomorphisms themselves, but do you need that invertibility to make this paradigm work?
No “coding” required.
https://github.com/thunguyenphuoc/RenderNet https://papers.nips.cc/paper/8014-rendernet-a-deep-convoluti...
That's just bs. If you just read the ray-tracer article it clearly states that f(scene parameters) -> image is not differentiable -- simply put because of sharp edges of objects being rendered. Also the word `diffeomorphism` makes no sense at all in this context.
Yes. It has nothing to do with deep learning, and nothing to do with optimization (although it is useful for both). It is just a way to write your functions so that they are easy to differentiate.
I am familiar where Nueral Networks and Convolutional Networks have done well especially around image processing etc.
But I can’t imagine where having differential code would help unless it is just tying multiple neural networks together in a continuous chain of differentiation.
For most programming tasks, I can’t imagine how differentiation would be possible or beneficial.
Is there a possibility that one could start with a series of unit tests and partial results and through gradient descent actually arrive at additional passing test cases? Most of the time in my experience, passing additional test cases like this requires significantly more complex structures that would not be found via differentiation.
Many physics problems involve trying to find a function which minimises something -- energy, entropy, action. Or the state which makes the difference between two things zero. Sometimes adjusting many parameters slowly down the gradient is a good way to find these.
In bayesian statistics, the basic problem is to sample from a distribution, which you know only indirectly, by some kind of monte-carlo method. But the space to sample can be enormous. If I understand right, advanced ways of doing this exploit the gradients (of functions defining the distribution) to try to choose samples efficiently.
People have hacked tensorflow to do all sorts of things which its creators didn't intend. Or written tools specialised for another particular domain (like Stan). I guess the excitement is that instead of re-inventing the wheel in each domain, maybe this can be pushed down to become a language feature which everyone above uses.
Right now these things are done either: 1) Unstructured end-2-end learning, where a deep neural network has to discover the laws of physics unaided by structure. The models are massively over-parametrized, require a lot of data to train and vulnerable to adverse inputs. 2) As independent systems, optimized separately. If the first model is complex, the model that follows must usually reflect this complexity. A simpler global solution might exist, but cannot be found.
It is however super early days. Right now there are hints that going in this direction might be fruitful, but as far as I know, not many concrete wins or a lot of practice. That will take some years still.
The difficulties of course lie in implementing such a framework, verifying whether your program is truly differentiable, and so on. The author of this blog post has written a working prototype framework in Julia called Zygote.jl. In short, it achieves DP through metaprogramming, applying source code transformation techniques at compile time, and it has already been quite successful.
The short answer is that CAS systems work with a "programming language" that doesn't have these features and is therefore a bit too limited for the kinds of models we're interested in.
Sometimes, the symbolic derivatives will be faster to compute, but most often they will be slower than using dual numbers for example. It is just a different tool to solve the problem. Most often, it is easier to use: you do not need a computer algebra system, you just change the type of your numbers and call it a day.
If you're doing numerical work and don't need explicit derivative expressions, automatic differentiation offers a method that is both fast and convenient.
You will generate expressions, and evaluating those expressions leads to the execution of exactly the same elementary operations as with forward mode AD. But with the overhead of needing the Computer Algebra System, and storing the expressions. (Ok, in some cases the CAS is maybe able to simplify the expressions a bit.)
This paragraph from the paper was helpful for me:
Differentiable programming is of joint interest to the machine learning and programming language communities. As deep learning models becomes more and more sophisticated, researchers have noticed that building blocks into a large neural network model is similar to using functions, and that some powerful neural network patterns are analogous to higher-order functions in functional programming [Fong et al. 2017; Olah 2015]. This is also thanks to the development of modern deep learning frameworks which make defining neural networks “very much like a regular program” [Abadi et al. 2017; LeCun 2018].
You can achieve the same effect with classical AD by doing what PyTorch does in its default mode where the dynamic recursion and control flow is normal Python code and large tensor ops are asynchronously scheduled on the target device and an AD tape for that particular dynamic run is constructed each time. But the viability of PyTorch's approach is domain dependent: there are relatively few but large tensor ops, and while host-device synchronization is possible (for making the dynamic control flow dependent on intermediate device-computed results) it's also very costly. Although the first of these would be slightly less critical if the host language executed faster.
But, if you're not trying to achieve the kind of decoupling between host and device which PyTorch and similar frameworks want, then I don't see any particular reason you can't just use normal forward-mode or reverse-mode AD as an alternate scalar execution mode for existing code, which just needs a different interpreter/library, not a different intermediate language. For the purposes of differentiation, the branching and control flow operators don't exist, anyway. E.g. if you represent branching explicitly in your graph/tape as TensorFlow does, the subgradient of if(a, b, c) is just the subgradient of whichever of b or c is selected. The conditional variable a only serves as a selector; if you compute the subgradient with respect to the selector, the answer is always 0 (except for the knife-edge case where it's technically not defined but is treated as 0).
Also, here are two interesting examples of differentiation through physical systems for classification:
https://arxiv.org/pdf/1808.08412.pdf
https://innovate.ee.ucla.edu/wp-content/uploads/2018/07/2018...
The article is correct that "Differentiable Programming" seems to be a rebranding effort that I believe just helped automatic differentiation work from the machine learning world get published in Programming Languages journals. I wouldn't read too much into it.