What Is Differentiable Programming? (opens in new tab)

(fluxml.ai)

133 pointsone-more-minute7y ago41 comments

41 comments

36 comments · 8 top-level

throwawaymath7y ago· 12 in thread

I have to be honest, I don't think this is a good explanation. I don't know what differential programming is, but I'm fairly sure I have the mathematical background to understand it. But I didn't come away from this article with any confidence that I'm following along.

On a superficial level it seems like it:

1. Generalizes deep learning to an optimization function on decomposable input, and

2. Reduces the number of parameters required to learn the input by exploiting the structure of the input, thereby making learning more efficient.

Is that correct? Is it completely off? What am I missing? Is there any more meat to the article than this?

Could someone who has upvoted this (and ideally understands the topic well) provide a different explanation of the concept? It would be great if I could see a real world example (even a relatively trivial one) represented in both the traditional matrix computation form and the sexy new differentiable form.

cf7y ago

Let me try to give a high-level explanation. As we have had success with using neural networks for different tasks, people began to ask what are the properties that make neural networks work so well.

When I mean work well, I mean a family of models which are generally easy to describe as well as be generally easy to train. Nearly all neural network models fit this bill as a they can be described with a loss function. A loss function being something which scores how accurately the current NN performs with a particular set of parameters. This loss function is then usually differentiable, and thanks to autodifferentiation tools, you can obtain the derivative of this function with respect to some parameters in this function. With this derivative, you can go ahead and run SGD to find the optimal parameters. We have now both empirical and theoretical evidence that SGD works fairly well for optimizing functions, particularly the high-dimensional ones neural networks describe.

So, the insight is we need a name to describe all these models which are differentiable and nice to train because of that. So now we can say what really ties CNNs, MLPs, RNNs, etc is not some biological metaphor, but that these are all expressible as some loss function we can find the minimum of using SGD.

ssivark7y ago

Supervised learning is a classic inverse problem -- you know the desired outcome, but not how to get there. (think control systems)

Suppose you had a flexible model in whose parameter space you could search. A naive algorithmic approach would be to treat this as a brute force search problem. Maybe you can be slightly more clever algorithmically (better search algorithms, heuristics, etc)

But can you somehow be really clever and exploit the fact that the parameter space is continuous, with a smooth cost function? That's what gradient descent basically is. Lots of bells and whistles on top, to exploit extra problem structure.

Now, let's go back to our family of models. They just need to be transparent to gradient information -- then you could use simple ideas at the heart of adaptive control theory. Suddenly, the fact that autodifferentiation techniques can differentiate arbitrary computational algorithms (sibling comment explains how compositionality helps this) sounds mouth-watering! That's what makes a lot of people excited -- your model family just expanded to programs performing arbitrary computations on continuous parameters (including loops!) -- not just some silly function approximator like a natural network (exaggerating to make my point)

whatshisface7y ago

Last I checked, differential programming involved exploding matrix sizes at 2^n in the number of memory bytes that the differentiable program could access. Is the situation any better now? If not it seems kind of like training a function approximator map the present cycle's memory and register state to the i+1th state.

1 more reply

0db532a07y ago

How does auto-differentiation apply to loop indices? Is this really any more computationally powerful than techniques used to solve integer linear programmes?

usgroup7y ago

in that regard it’s mostly in the first few paragraphs but it’s subtley stated...

If you have a model and somecparameter space on it your trying to optimise then you can’t do much other than search the whole parameter space or approximate such a search intelligently.

However, if your model is differeniable with respect to its parameters then it means you’ve got gradient ... and if you’ve got gradient then you can cut out a lot of searching because you can use gradient ascent and other such techniques .

So basically: differentiable means can be differentiated which means has gradient which means search less.

The rest of the article basically says that’d be nice because there is a way to think about differentiability modularly and therefore there may be a route to Lego things together from across fields to make uber models which still optimise well but use specialist knowledge.

The parameter space stuff just means that less knowledge means more data and more parameters which is bad.

taktoa7y ago

The basic idea is to have an expressive programming language where all constructs are differentiable. Since the composition of diffeomorphisms is a diffeomorphism, large programs (like ray-tracers) will be differentiable as a result.

knappa7y ago

I don't think you mean to use a term as strong as diffeomorphism. A diffeomorphism is differentiable, sure, but it is also invertable with a differentiable inverse. They have a lot of properties that machine learning honestly does not want, like preservation of dimension.

enriquto7y ago

> The basic idea is to have an expressive programming language where all constructs are differentiable. Since the composition of diffeomorphisms is a diffeomorphism, (...)

No. It has nothing to do with diffeomorphisms (which are necessarily between spaces of the same dimension), but with piecewise differentiable functions.

1 more reply

throwawaymath7y ago

Wow, I wish you'd written the article. Thank you. I can only imagine what you'd be able to explain if you had as much space as the author...

It sounds like you're just describing an evolution of functional programming; where each area of computation in the program is contextually generalizable, because its set of inputs and outputs is smooth and differentiable.

One followup question: are you sure the maps need to be (or are generally intended to be) isomorphic? That strikes me as very limiting. I follow your point about compositions of diffeomorphisms being diffeomorphisms themselves, but do you need that invertibility to make this paradigm work?

1 more reply

sabalaba7y ago

For a concrete example, our recent paper RenderNet learns end to end differentiable ray tracing operations like ambient occlusion and normal maps which can be used in phong shading.

No “coding” required.

https://github.com/thunguyenphuoc/RenderNet https://papers.nips.cc/paper/8014-rendernet-a-deep-convoluti...

stabbles7y ago

> Since the composition of diffeomorphisms is a diffeomorphism, large programs (like ray-tracers) will be differentiable as a result.

That's just bs. If you just read the ray-tracer article it clearly states that f(scene parameters) -> image is not differentiable -- simply put because of sharp edges of objects being rendered. Also the word `diffeomorphism` makes no sense at all in this context.

enriquto7y ago

> Could someone (...) provide a different explanation of the concept?

Yes. It has nothing to do with deep learning, and nothing to do with optimization (although it is useful for both). It is just a way to write your functions so that they are easy to differentiate.

ricksharp7y ago· 5 in thread

Could someone list some practical examples where Differential Programming would be useful?

I am familiar where Nueral Networks and Convolutional Networks have done well especially around image processing etc.

But I can’t imagine where having differential code would help unless it is just tying multiple neural networks together in a continuous chain of differentiation.

For most programming tasks, I can’t imagine how differentiation would be possible or beneficial.

Is there a possibility that one could start with a series of unit tests and partial results and through gradient descent actually arrive at additional passing test cases? Most of the time in my experience, passing additional test cases like this requires significantly more complex structures that would not be found via differentiation.

improbable227y ago

There are examples quite far from neural networks, the ones I can think of are broadly optimisation problems:

Many physics problems involve trying to find a function which minimises something -- energy, entropy, action. Or the state which makes the difference between two things zero. Sometimes adjusting many parameters slowly down the gradient is a good way to find these.

In bayesian statistics, the basic problem is to sample from a distribution, which you know only indirectly, by some kind of monte-carlo method. But the space to sample can be enormous. If I understand right, advanced ways of doing this exploit the gradients (of functions defining the distribution) to try to choose samples efficiently.

People have hacked tensorflow to do all sorts of things which its creators didn't intend. Or written tools specialised for another particular domain (like Stan). I guess the excitement is that instead of re-inventing the wheel in each domain, maybe this can be pushed down to become a language feature which everyone above uses.

jononor7y ago

For instance in robotics, the inverse kinematics could be formulated using the well-known equations, and combined with a CNN for vision, maybe do reinforcement learning across the whole thing. Or in audio classification, could use standard filterbanks construction for creating a time-frequency (spectrogram) representation, then combined with a CNN for classification.

Right now these things are done either: 1) Unstructured end-2-end learning, where a deep neural network has to discover the laws of physics unaided by structure. The models are massively over-parametrized, require a lot of data to train and vulnerable to adverse inputs. 2) As independent systems, optimized separately. If the first model is complex, the model that follows must usually reflect this complexity. A simpler global solution might exist, but cannot be found.

It is however super early days. Right now there are hints that going in this direction might be fruitful, but as far as I know, not many concrete wins or a lot of practice. That will take some years still.

brewmarche7y ago

In quantitative finance automatic differentiation will give you sensitivities (which tell you how the value changes when interest rates and underlying prices and other factors change—very important for risk and hedging) for your products in a Monte Carlo simulation. And even better, you will get it for all times in the simulation.

jondeuce7y ago

Differential programming is a generalization of deep learning, so all the same practical examples from deep learning apply. The difference, however, is that in DP you can write your "models" simply as functions. This means than in an ideal DP framework, the code you write needs no special types and no special syntax. This would be particularly useful if you're using some other library's code which was not written with DP/DL in mind at all (think ODE solvers, optimization routines, etc.); that code can be added to your "model" (again, just a regular function), and it will just work!

The difficulties of course lie in implementing such a framework, verifying whether your program is truly differentiable, and so on. The author of this blog post has written a working prototype framework in Julia called Zygote.jl. In short, it achieves DP through metaprogramming, applying source code transformation techniques at compile time, and it has already been quite successful.

sdenton47y ago

The thought on top of my head is the giant set of default parameters that sit on top of pretty much every program, set by intuition by the original coders and then never touched again. Define a loss function, and you can start to optimize those parameters "for free".

ktpsns7y ago· 5 in thread

Something I don't understand about Automatic Differentiation is: Why not use a Computer Algebra System instead for generating derivatives of given functions?

one-more-minuteOP7y ago

That's essentially what a source-to-source AD does, just with support for the extra features that show up in programming languages. For example, handling variable bindings gets you the typical Wengert list, and handling function calls gets you the Pearlmutter and Siskind style backpropagator (I wrote a bit about the relationships at [0]).

The short answer is that CAS systems work with a "programming language" that doesn't have these features and is therefore a bit too limited for the kinds of models we're interested in.

[0] https://github.com/MikeInnes/diff-zoo

enriquto7y ago

You can still do that, most of the times. But if your function has many conditionals the final expression may get cumbersome.

Sometimes, the symbolic derivatives will be faster to compute, but most often they will be slower than using dual numbers for example. It is just a different tool to solve the problem. Most often, it is easier to use: you do not need a computer algebra system, you just change the type of your numbers and call it a day.

wenc7y ago

CASes generate symbolic derivative expressions. This is typically a fairly expensive operation, and can be slow if the number of expressions are large or if the expressions are complicated. You have to pre-generate these expressions before you can do any work (paying an upfront cost). CAS'es also use a very general symbolic engine, which entails a performance penalty.

If you're doing numerical work and don't need explicit derivative expressions, automatic differentiation offers a method that is both fast and convenient.

sampo7y ago

> Why not use a Computer Algebra System instead for generating derivatives of given functions?

You will generate expressions, and evaluating those expressions leads to the execution of exactly the same elementary operations as with forward mode AD. But with the overhead of needing the Computer Algebra System, and storing the expressions. (Ok, in some cases the CAS is maybe able to simplify the expressions a bit.)

salty_biscuits7y ago

An analytic form of the derivative might not be able to be found by computer algebra or expressible in a compact form.

ricksharp7y ago· 3 in thread

I found this paper that helps answer the question: https://arxiv.org/pdf/1803.10228.pdf

JadeNB7y ago

When possible, one should link to the abstract page, if for no other reason than that it makes version tracking easier: Want, Wu, Essertel, Decker, and Rompf - Demystifying differentiable programming: shift/reset the penultimate backpropagator (https://arxiv.org/abs/1803.10228), a title which is entirely too cute for its own good.

ricksharp7y ago

Thanks Jade, didn’t realize I grabbed the wrong link.

This paragraph from the paper was helpful for me:

Differentiable programming is of joint interest to the machine learning and programming language communities. As deep learning models becomes more and more sophisticated, researchers have noticed that building blocks into a large neural network model is similar to using functions, and that some powerful neural network patterns are analogous to higher-order functions in functional programming [Fong et al. 2017; Olah 2015]. This is also thanks to the development of modern deep learning frameworks which make defining neural networks “very much like a regular program” [Abadi et al. 2017; LeCun 2018].

ilaksh7y ago

They have some interesting software: https://feiwang3311.github.io/Lantern/

hnuser3557y ago· 2 in thread

What the hell is that supposed to mean and how it’s different from automatic differentiation

ddragon7y ago

I think it's just the realization that the execution graph in machine learning models at this point are not really different from any programming language AST, which means there is potential in exploring the intersection between writing programs and writing machine learning models with the AD tools.

hnuser3557y ago

Sorry for curse words I’m very confused

1 more reply

tanilama7y ago· 1 in thread

Is Differentiable Programming just a rebranding of tape-based auto differentiation?

psykotic7y ago

The only serious difference between classical AD and newer efforts like this is that they have an intermediate language ("what goes on the tape") with recursion and control flow operators which ideally lets them implement front-ends for normal programming languages.

You can achieve the same effect with classical AD by doing what PyTorch does in its default mode where the dynamic recursion and control flow is normal Python code and large tensor ops are asynchronously scheduled on the target device and an AD tape for that particular dynamic run is constructed each time. But the viability of PyTorch's approach is domain dependent: there are relatively few but large tensor ops, and while host-device synchronization is possible (for making the dynamic control flow dependent on intermediate device-computed results) it's also very costly. Although the first of these would be slightly less critical if the host language executed faster.

But, if you're not trying to achieve the kind of decoupling between host and device which PyTorch and similar frameworks want, then I don't see any particular reason you can't just use normal forward-mode or reverse-mode AD as an alternate scalar execution mode for existing code, which just needs a different interpreter/library, not a different intermediate language. For the purposes of differentiation, the branching and control flow operators don't exist, anyway. E.g. if you represent branching explicitly in your graph/tape as TensorFlow does, the subgradient of if(a, b, c) is just the subgradient of whichever of b or c is selected. The conditional variable a only serves as a selector; if you compute the subgradient with respect to the selector, the answer is always 0 (except for the knife-edge case where it's technically not defined but is treated as 0).

damip7y ago

Differentiable does not mean easy to optimize. One could imagine implementing sha-256 using differentiable operators, and yet the system as a whole would not be optimizable at all. It would be interesting to have compilers that optimize the "optimizability" of differentiable programs tho...

Also, here are two interesting examples of differentiation through physical systems for classification:

https://arxiv.org/pdf/1808.08412.pdf

https://innovate.ee.ucla.edu/wp-content/uploads/2018/07/2018...

xitrium7y ago

Be cautious using this article to try to learn anything. Differentiable programming is not actually related to deep learning; it's another word for automatic differentiation, a technique that is very important in deep learning implementations but useful for a variety of other tasks where having gradients available for arbitrary functions is useful.

The article is correct that "Differentiable Programming" seems to be a rebranding effort that I believe just helped automatic differentiation work from the machine learning world get published in Programming Languages journals. I wouldn't read too much into it.

j / k navigate · click thread line to collapse

41 comments

36 comments · 8 top-level

throwawaymath7y ago· 12 in thread

On a superficial level it seems like it:

1. Generalizes deep learning to an optimization function on decomposable input, and

2. Reduces the number of parameters required to learn the input by exploiting the structure of the input, thereby making learning more efficient.

Is that correct? Is it completely off? What am I missing? Is there any more meat to the article than this?

cf7y ago

Let me try to give a high-level explanation. As we have had success with using neural networks for different tasks, people began to ask what are the properties that make neural networks work so well.

ssivark7y ago

Supervised learning is a classic inverse problem -- you know the desired outcome, but not how to get there. (think control systems)

whatshisface7y ago

1 more reply

0db532a07y ago

How does auto-differentiation apply to loop indices? Is this really any more computationally powerful than techniques used to solve integer linear programmes?

usgroup7y ago

in that regard it’s mostly in the first few paragraphs but it’s subtley stated...

If you have a model and somecparameter space on it your trying to optimise then you can’t do much other than search the whole parameter space or approximate such a search intelligently.

So basically: differentiable means can be differentiated which means has gradient which means search less.

The parameter space stuff just means that less knowledge means more data and more parameters which is bad.

taktoa7y ago

knappa7y ago

enriquto7y ago

> The basic idea is to have an expressive programming language where all constructs are differentiable. Since the composition of diffeomorphisms is a diffeomorphism, (...)

No. It has nothing to do with diffeomorphisms (which are necessarily between spaces of the same dimension), but with piecewise differentiable functions.

1 more reply

throwawaymath7y ago

Wow, I wish you'd written the article. Thank you. I can only imagine what you'd be able to explain if you had as much space as the author...

1 more reply

sabalaba7y ago

For a concrete example, our recent paper RenderNet learns end to end differentiable ray tracing operations like ambient occlusion and normal maps which can be used in phong shading.

No “coding” required.

https://github.com/thunguyenphuoc/RenderNet https://papers.nips.cc/paper/8014-rendernet-a-deep-convoluti...

stabbles7y ago

> Since the composition of diffeomorphisms is a diffeomorphism, large programs (like ray-tracers) will be differentiable as a result.

enriquto7y ago

> Could someone (...) provide a different explanation of the concept?

Yes. It has nothing to do with deep learning, and nothing to do with optimization (although it is useful for both). It is just a way to write your functions so that they are easy to differentiate.

ricksharp7y ago· 5 in thread

Could someone list some practical examples where Differential Programming would be useful?

I am familiar where Nueral Networks and Convolutional Networks have done well especially around image processing etc.

But I can’t imagine where having differential code would help unless it is just tying multiple neural networks together in a continuous chain of differentiation.

For most programming tasks, I can’t imagine how differentiation would be possible or beneficial.

improbable227y ago

There are examples quite far from neural networks, the ones I can think of are broadly optimisation problems:

jononor7y ago

brewmarche7y ago

jondeuce7y ago

sdenton47y ago

ktpsns7y ago· 5 in thread

Something I don't understand about Automatic Differentiation is: Why not use a Computer Algebra System instead for generating derivatives of given functions?

one-more-minuteOP7y ago

The short answer is that CAS systems work with a "programming language" that doesn't have these features and is therefore a bit too limited for the kinds of models we're interested in.

[0] https://github.com/MikeInnes/diff-zoo

enriquto7y ago

You can still do that, most of the times. But if your function has many conditionals the final expression may get cumbersome.

wenc7y ago

If you're doing numerical work and don't need explicit derivative expressions, automatic differentiation offers a method that is both fast and convenient.

sampo7y ago

> Why not use a Computer Algebra System instead for generating derivatives of given functions?

salty_biscuits7y ago

An analytic form of the derivative might not be able to be found by computer algebra or expressible in a compact form.

ricksharp7y ago· 3 in thread

I found this paper that helps answer the question: https://arxiv.org/pdf/1803.10228.pdf

JadeNB7y ago

ricksharp7y ago

Thanks Jade, didn’t realize I grabbed the wrong link.

This paragraph from the paper was helpful for me:

ilaksh7y ago

They have some interesting software: https://feiwang3311.github.io/Lantern/

hnuser3557y ago· 2 in thread

What the hell is that supposed to mean and how it’s different from automatic differentiation

ddragon7y ago

hnuser3557y ago

Sorry for curse words I’m very confused

1 more reply

tanilama7y ago· 1 in thread

Is Differentiable Programming just a rebranding of tape-based auto differentiation?

psykotic7y ago

damip7y ago

Also, here are two interesting examples of differentiation through physical systems for classification:

https://arxiv.org/pdf/1808.08412.pdf

https://innovate.ee.ucla.edu/wp-content/uploads/2018/07/2018...

xitrium7y ago

j / k navigate · click thread line to collapse