Why would you take the derivative of a piece of code?
Having access to something that resembles the derivative of your code lets you use optimization algorithms that converge faster (Newton's method or gradient descent vs. bisection methods), and you don't have to specify the derivatives by hand. That gets really bothersome when you have to specify a Hessian matrix (second-derivative information, including all the mixed terms).
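To make the "converges faster" point concrete, here is a rough sketch that finds the root of a made-up function, f(x) = x^2 - 2, once with bisection and once with Newton's method. The function, the starting points and the tolerance are all assumptions picked for illustration; the derivative `df` is written by hand here, and it is exactly the piece AD would supply for you.

```python
def f(x):
    return x * x - 2.0

def df(x):
    # Hand-written derivative; with AD you would not have to write this.
    return 2.0 * x

def bisect(lo, hi, tol=1e-12):
    """Halve the bracket [lo, hi] until it is narrower than tol."""
    steps = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid   # sign change is in the left half
        else:
            lo = mid   # sign change is in the right half
        steps += 1
    return 0.5 * (lo + hi), steps

def newton(x, tol=1e-12):
    """Newton iteration: uses the derivative to jump towards the root."""
    steps = 0
    while abs(f(x)) > tol:
        x -= f(x) / df(x)
        steps += 1
    return x, steps

b_root, b_steps = bisect(1.0, 2.0)
n_root, n_steps = newton(2.0)
print(b_steps, n_steps)
```

On this toy problem Newton needs far fewer iterations than bisection, which is the practical payoff of having the derivative available.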
That is, I can see how the derivative can help optimise a simulation, but I'm not clear on how using an auto derivative really helps.
Edit: note that I also see a difference in being able to get the derivative of a function, versus writing in a sense where any function can be moved to its derivative.
We've got a model, implemented in code. Since it's code, it can be differentiated - not sure how that works with branches, I guess that's the math. :) This is generated through a set of input parameters.
We've got an error function, representing the difference between the model and reality.
If we differentiate the error function, we can choose which set of parameter mutations are heading in the right direction to then generate a new model? We check each close point and find the max benefit?
However, if everything is taking the parameters as input, is the derivative of the error function only generated once?
Is it saying that the derivative of the error function is independent of the parameters, so it doesn't matter what the model is, they all have the same error function, and that error function can be found by generating a single model?
In machine learning, you create a loss function that takes in model parameters and returns how accurate the model is. Then, you can optimize the function inputs for accuracy.
Automatic differentiation makes it much easier to code different types of models.
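A minimal sketch of that loop, assuming a hypothetical one-parameter model y = w * x, made-up data and a squared-error loss. The derivative is not computed once and reused: it is a function of the parameter, and it gets re-evaluated at the current parameter value on every step. `dloss_dw` is hand-written here; that is the part AD would generate.

```python
# Hypothetical toy data, generated by y = 3 * x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

def loss(w):
    # Squared error between the model's predictions and the data.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def dloss_dw(w):
    # Derivative of the loss with respect to the parameter w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def fit(w=0.0, lr=0.01, steps=200):
    for _ in range(steps):
        # The gradient depends on w, so it is evaluated afresh each step.
        w -= lr * dloss_dw(w)
    return w

w_fit = fit()
print(w_fit, loss(w_fit))
```

There is no checking of neighbouring points: the sign and size of the derivative at the current w already say which way to move and roughly how far.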
More generally, if you approximate a smooth function f with a truncated Taylor series of length n around c, the error behaves as O((x-c)^(n+1)), and the error of the k-th derivative of that function, auto-diffed, will be of order O((x-c)^(n+1-k)).
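One way to see where that order comes from, reading "length n" as a Taylor polynomial of degree n (an assumption on my part) and writing T_n for it:

```latex
T_n(x) = \sum_{j=0}^{n} \frac{f^{(j)}(c)}{j!}\,(x-c)^{j},
\qquad
f(x) - T_n(x) = O\big((x-c)^{n+1}\big).

\frac{d^{k}}{dx^{k}} T_n(x)
  = \sum_{j=k}^{n} \frac{f^{(j)}(c)}{(j-k)!}\,(x-c)^{j-k}
  = \sum_{i=0}^{n-k} \frac{\big(f^{(k)}\big)^{(i)}(c)}{i!}\,(x-c)^{i}.
```

The right-hand side is the degree n-k Taylor polynomial of f^{(k)} around c, so f^{(k)}(x) - T_n^{(k)}(x) = O((x-c)^{n+1-k}): each differentiation costs one order of accuracy.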
Maybe it does? I don't know.
This article seems to misunderstand AD.
Automatic differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value. Automatic differentiation basically follows the steps you'd follow for symbolic differentiation, but substitutes a value for the symbolic expansion and so avoids the explosion of symbols that symbolic differentiation involves. But it's literally the same sequence of calculations. The one way symbolic differentiation might help is if you symbolically differentiated and then rearranged terms to avoid truncation error, but that's a bit different.
The article seems to calculate sin(x) in a lossy fashion and then attribute the error to AD. That's not how it works.
[I can go through the steps if anyone's doubtful]
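For anyone who does want the steps, here is a minimal forward-mode sketch using dual numbers. The `Dual` class and the example function `f` are hypothetical and written for this comment, not taken from any library; only + and * are supported. Each operator applies the usual differentiation rule (sum rule, product rule) to numbers instead of symbols.

```python
class Dual:
    """A value carried together with its derivative (forward-mode AD)."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def _lift(other):
        # Plain numbers are constants: derivative 0.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def f(x):
    return x * x * x + 2 * x

def f_prime_symbolic(x):
    # What you get by differentiating f by hand: 3x^2 + 2.
    return 3 * x * x + 2

out = f(Dual(1.5, 1.0))   # seed derivative dx/dx = 1
print(out.val, out.der, f_prime_symbolic(1.5))
```

The derivative that comes out matches the hand-derived expression at that point, without ever building the symbolic expression.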
Yes. The article even says that.
_"The AD system is (as you might have surmised) not incurring truncation errors. It is giving us exactly what we asked for, which is the derivative of my_sin. my_sin is a polynomial. The derivative of the polynomial is: [article lists the hand-derived derivative]"_
The reason symbolic might help is that symbolic AD is often used in languages that don't represent things with numbers, but with lazy expressions. I will clarify that. (Edit: I have updated that bit and I think it is clearer. Thanks.)
> The article seems to calculate sin(x) in a lossy fashion and then attribute the error to AD. That's not how it works.
The important bit is that the accuracy lost from an accurate derivative of a lossy approximation is greater than the accuracy lost from a lossy approximation to an accurate derivative. Is there a bit I should clarify more about that? I tried to emphasize that at the end, before the bit mentioning symbolic.
I mean, Griewank saying "Algorithmic differentiation does not incur truncation error" does deserve the caveat "unless the underlying system has truncation errors, which it often does". But you can give that caveat without bending the stick the other way.
> The important bit is that the accuracy lost from an accurate derivative of a lossy approximation is greater than the accuracy lost from a lossy approximation to an accurate derivative.
I understand you get inaccuracy from approximating sin, but I don't know what you're contrasting this to. I think real AD libraries deal with primitives and with combinations of primitives, so for such a library the derivative of sin(x) would be "symbolically" calculated as cos(x), since sin is a primitive (essentially, all the "standard functions" have to be primitives, and you create things from those. I doubt any library would apply AD to its approximation of a given function).
I haven't used such libraries but I am in the process of writing an AD subsystem for my own little language.
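Roughly what treating sin as a primitive looks like, as a sketch only. The `Dual` class and this `sin` wrapper are hypothetical names made up for illustration, not any real library's API. The point is that the rule returns cos at the same argument instead of differentiating through whatever approximation sin uses internally.

```python
import math

class Dual:
    """A value carried together with its derivative."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

def sin(x):
    if isinstance(x, Dual):
        # Primitive rule: d/dx sin(u) = cos(u) * u'.
        # The library's own sin approximation is never differentiated.
        return Dual(math.sin(x.val), math.cos(x.val) * x.der)
    return math.sin(x)

out = sin(Dual(0.5, 1.0))   # seed derivative dx/dx = 1
print(out.val, out.der)
```

With a rule like this the derivative is as accurate as the platform's cos, regardless of how sin is implemented underneath.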
If you use a symbolic technique over numeric data without knowing what you're doing, I feel sorry for you.
(numeric: specifically the inclusion of floating-point.)
https://github.com/JuliaDiff/ChainRules.jl is used by (almost all) automatic differentiation engines and provides an extensive list of such rules.
If the example had used sin|cos, the autodiff implementations in Julia would have called the native cos|-sin and not incurred such a "truncation error". However, the post illustrates the idea in a good way.
Good post oxinabox
ChainRules is going to be used by everything. Right now it is used by 3 ADs and a PR is open for a 4th, plus one thing that hasn't been released yet.
For the context of anyone who doesn't know, I am the lead maintainer of ChainRules.jl (and the author of this blog post).
But do these truncation errors cause any real world heartaches? Does anyone do n-derivations of any approximate function for large n?
More of a problem is first derivatives where you need very high accuracy. That doesn't occur in ML but does occur sometimes in scientific computing. (It doesn't occur in ML because we basically just shake the thing about a bit anyway.)
Now, AD avoids truncation error by hard-coding the basic rules of differentiation and applying these exact rules to the input. Thus the article sets up the rules for +-*/ as opposed to defining numerical differentiation as the usual limit. There is no magic; AD works because it doesn't use an approximation.
So if instead you use an approximation, like a Taylor series, for a function, and differentiate that, you don't get the derivative of the function, you get the derivative of the truncated series you wrote. The same would be true for any function. This does not feel surprising.
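A quick numeric check of that, assuming a 3-term Taylor polynomial as a stand-in for the article's my_sin (the article's actual implementation may differ). `poly_deriv` is the exact term-by-term derivative of the polynomial, which is what AD of the polynomial would compute.

```python
import math

# Hypothetical stand-in for my_sin: x - x^3/6 + x^5/120.
# SIN_COEFFS[i] is the coefficient of x**i.
SIN_COEFFS = [0.0, 1.0, 0.0, -1.0 / 6.0, 0.0, 1.0 / 120.0]

def poly_eval(coeffs, x):
    # Horner evaluation of the polynomial at x.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def poly_deriv(coeffs):
    # Exact derivative of the polynomial: d/dx (c * x**i) = i * c * x**(i-1).
    return [i * c for i, c in enumerate(coeffs)][1:]

x = 0.5
err_sin = abs(poly_eval(SIN_COEFFS, x) - math.sin(x))
err_cos = abs(poly_eval(poly_deriv(SIN_COEFFS), x) - math.cos(x))
print(err_sin, err_cos)
```

At this point the differentiated polynomial is further from cos(x) than the polynomial is from sin(x), even though the differentiation step itself is exact.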
So I can only assume that the article is really intended to just give a roundabout explanation of how AD works, rather than to uncover some revelation, which is effectively a tautology, as the article itself points out.
So overall, valuable, but also a strange way of framing it IMO.
I assume it is some kind of IEEE math thing, given that IEEE allows `(a+b)+c != a + (b + c)`, but where is it occurring exactly?
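For reference, the non-associativity itself is easy to reproduce with double-precision floats. This sketch only shows the rounding effect mentioned in the question; it does not show where, or whether, it occurs in the article's example.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounds after a + b, then again after adding c
right = a + (b + c)   # rounds after b + c, then again after adding a

print(left, right, left == right)
```

The two results differ in the last bit, because each intermediate sum is rounded to the nearest representable double.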