Automatic differentiation does incur truncation errors (kinda) (opens in new tab)

(oxinabox.net)

83 pointsoxinabox5y ago42 comments

42 comments

28 comments · 9 top-level

jpollock5y ago· 8 in thread

I'm seeing a lot of discussions around Automatic Differentiation, but I don't understand the purpose.

Why would you take the derivative of a piece of code?

I have used Automatic Differention used extensively to "optimize" simulations. This is problem is know under many names such parameter estimation, local sensitivity analysis, optimal control ...

Having access to something the resembles the derivative of your code allows you to use optimization algorithms that converge faster (Newton or Gradient descent vs bisection methods) and you don't have to specify the derivatives by hand (which gets really bothersome when you have to specify a hessian matrix (2nd derivative information including all the mixed terms)).

taeric5y ago

Would be cool to see what your code looked like. I have the same question as the start of the thread.

That is, I can see how the derivative can help optimise a simulation, but I'm not clear on how using an auto derivative really helps.

Edit: note that I also see a difference in being able to get the derivative of a function, versus writing in a sense where any function can be moved to its derivative.

1 more reply

srl5y ago

The most frequent application is machine learning. If my piece of code is implementing some function that I want to minimize, then taking the derivative (w.r.t. some vector of parameters) tells me in which direction I need to change the parameters in order to reduce the function: "gradient descent".

jpollock5y ago

So, we've got "reality", as represented by data.

We've got a model, implemented in code. Since it's code can be differentiated - not sure how that works with branches, I guess that's the math. :) This is generated through a set of input parameters.

We've got an error function, representing the difference between the model and reality.

If we differentiate the error function, we can choose which set of parameter mutations are heading in the right direction to then generate a new model? We check each close point and find the max benefit?

However, if everything is taking the parameters as input, is the derivative of the error function only generated once?

Is it saying that the derivative of the error function is independent of the parameters, so it doesn't matter what the model is, they all have the same error function, and that error function can be found by generating a single model?

1 more reply

dnautics5y ago

Note that the example here is forward differentiation and most machine learning these days is backward propagation

2 more replies

jmeyer2k5y ago

To expand, it tells you how you can change certain variables of a function to change the output of a function.

In machine learning, you create a loss function that takes in modal parameters and returns how accurate the model is. Then, you can optimize the function inputs for accuracy.

Automatic differentiation makes it much easier to code different types of models.

eigenspace5y ago

You want to take the derivative of a piece of code so that you don't have to either hand-write the derivative yourself, or rely on finite differencing which can be slower in many circumstances and is almost universally less accurate.

sesuximo5y ago

It’s a tool to help solve inverse problems.

stabbles5y ago· 3 in thread

An obvious example here is when you expand e^x with its Taylor series using `n` terms. The derivative of that length-n Taylor expansion is identical but only `n-1` terms long, so you lose precision.

More generally if you approximate a smooth function f with a truncated taylor series of length n around c, the error behaves as O((x−c)^(n+1)), and the error of the k'th derivative of that function auto-diffed will be of order O((x-c)^(n+1-k)).

joe_the_user5y ago

That's true but this sort of problem would normally not happen with a standard AD library or subsystem, whatever language you're using. Such systems don't approximate functions then differentiate their approximations.

stabbles5y ago

Well, all functions that use finite precision arithmetic deal with approximations. Errors of 1/2 ulp like trig functions are rather the exception. And machine learning is keen on using low precision floating point type, so errors are relatively large.

taeric5y ago

I think the example is supposed to be indicative of larger functions that users define into the system. That is, that this was shown with sin/cos was simply an example. Right?

1 more reply

not2b5y ago· 2 in thread

You could use a lazy representation of the Taylor series that will give you as many terms as you ask for and symbolically differentiate that. Then you'll gate an accurate automatic differentiation. When you go to evaluate your approximation you'll get errors at that point, but you'll correctly get that the second derivative of sin(x) is exactly -sin(x).

oxinaboxOP5y ago

Correct, which is mentioned at the end of the post. I am not sure how well it generalizes, e.g. to multivariate functions that are defines as solutions of some thing like a differential equation.

Maybe it does? I don't know.

joe_the_user5y ago

You could do this but I'm pretty sure that real AD just deals all the common primitives directly. The derivative of sin(x) is cos(x), etc. AD just postpones the chain rule calculations that would be involved in something sin(x + x^2)^2, so the chain-rule happens at run time instead of a symbol expansion at compile time.

joe_the_user5y ago· 2 in thread

Article: "Bonus: will symbolic differentiation save me? Probably, but probably not in an interesting way."

This articles seems to misunderstand AD.

Automatic Differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value. Automatic differentiating is basically following the steps you'd follow for symbolic differentiation but substituting a value for the symbolic expansion and so avoiding the explosion of symbols that symbolic differentiation involves. But it's literally the same sequence of calculations. The one way symbolic differentiation might help is if you symbolically differentiated and then rearranged terms to avoid truncation error but that's a bit different.

The article seems to calculate sin(x) in a lossy fashion and then attribute to the error to AD. That's not how it works.

[I can go through the steps if anyone's doubtful]

oxinaboxOP5y ago

> Automatic Differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value.

Yes. The article even says that.

_"The AD system is (as you might have surmised) not incurring truncation errors. It is giving us exactly what we asked for, which is the derivative of my_sin. my_sin is a polynomial. The derivative of the polynomial is: [article lists the hand-derived derivative]"_

The reason symbolic might help is symbolic AD is often used in languages that don't represent things with numbers, but with lazy expressions. I will clarify that. (edit, I have updates that bit and I think it is clearer. Thanks)

> The article seems to calculate sin(x) in a lossy fashion and then attribute to the error to AD. That's not how it works.

The important bit is that the accurasy lost from a accurate derviative of a lossy approximation is greater than accurasy lost from a lossy approximation to an accurate derivative. Is there a bit I should clarify more about that? I tried to emphisize that at the end before the bit mentioning symbolic.

joe_the_user5y ago

I would suggest saying front and center, more prominently, that AD incurs the truncation errors that the equivalent symbolic derivative would incur. Since you're talking about AD, you should be giving people a good picture of what it is (since it's not that commonly understood). 'Cause you're kind of suggesting otherwise even if somewhere you're eventually saying this.

I mean, Griewank saying "Algorithmic differentiation does not incur truncation error" does deserve the caveat "unless the underlying system has truncation errors, which it often does". But you can give that caveat without bending the stick the other way.

The important bit is that the accurasy lost from a accurate derviative of a lossy approximation is greater than accurasy lost from a lossy approximation to an accurate derivative.

I understand you get inaccuracy from approximating the sin but I don't know what you're contrasting this to. I think real AD libraries deal with primitives and with combinations of primitive so for such a library, the derivative of sin(x) would be "symbolically" calculated as cos(x) since sin is primitive (essentially, all the "standard functions" have to be primitives and create things from that. I doubt any library would apply AD to it's approximation of a given function).

I haven't used such libraries but I am in the process of writing an AD subsystem for my own little language.

yudlejoza5y ago· 2 in thread

AD is a symbolic technique.

If you use a symbolic technique over numeric data without knowing what you're doing, I feel sorry for you.

(numeric: specifically the inclusion of floating-point.)

snicker75y ago

This is not true. Automatic differentiation and symbolic differentiation are very distinct.

yudlejoza5y ago

I never used the term 'symbolic differentiation'

- SD manipulates symbols provided to it as math formulas.

- AD manipulates symbols provided to it as program code.

freemint5y ago· 1 in thread

The Julia ecosystem provides has a library that includes the differentiation rules hinted at at the end.

https://github.com/JuliaDiff/ChainRules.jl is used by (almost all) automatic differentiation engines and provides an extensive list of such rules.

If the example used sin|cos the auto diff implementations in Julia would have called native cos|-sin and not encurred such a "truncation error". However the post illustrates the idea in a good way.

Good post oxinabox

oxinaboxOP5y ago

> https://github.com/JuliaDiff/ChainRules.jl is used by (almost all) automatic differentiation engines and provides an extensive list of such rules.

ChainRules is going to be used by everything. Right now it is used by 3 AD and a PR is open for a 4th, plus one thing that hasn't been released yet.

For context of anyone who doesn't know, I am the lead maintainer of ChainRules.jl, (and the auther of this blog post)

londons_explore5y ago· 1 in thread

Interesting...

But do these truncation errors cause any real world heartaches? Does anyone do n-derivations of any approximate function for large n?

oxinaboxOP5y ago

Occationally, there are a few places where you want arbitary high derivatives. But in general there are specialized solutions for those (though very few things implement them. E.g. Jax's JETs and TaylorSeries.jl in julia implement taylor-mode AD. But those also need custom primatives)

More of a problem is first derivatives, but that you need very high accuracy. Which doesn't occur in ML but does occur sometimes in scientific computing. (Doesn't occur in ML cos we just basically shake the thing about a bit anyway)

iujjkfjdkkdkf5y ago

First I want to say this article is actually a great intro to AD and shows off a taste of how well suited Julia is to writing this kind of thing. Mike Innes, linked in the article has a great tutorial on AD that made me realize how awesome Julia is as well, almost too smart :)

Now, AD avoids truncation error by hard coding the basic rules of differentiation, and applying these exact rules to the input. Thus the article setting up the rules for +-*/ as opposed to defining numerical differentiation as the usual limit. There is no magic, AD works because it doesnt use an approximation.

So if instead you use an approximation, like a Taylor series, for a function, and differentiate that, you dont get the derivative of the functions, you get the derivative of the truncated series you wrote. Same would be true for any function. This does not feel surprising.

So I can only assume that the article is really intended to just be about giving a round about explanation of how AD works, rather than uncovering some revelation, which is effectively a tautology, as the article itself points out.

So overall, valueable, but also a strange way of framing it IMO

oxinaboxOP5y ago

anyone one know why of the 7 AD's tried (8 including the one implemented at the start) there are the different answers? 0.4999999963909431 vs 0.4999999963909432 vs 0.4999999963909433

I assume if is some kind of IEEE math thing. given that IEEE allow `(a+b)+c != a + (b + c)` but where is it occuring exactly?

j / k navigate · click thread line to collapse

42 comments

28 comments · 9 top-level

jpollock5y ago· 8 in thread

I'm seeing a lot of discussions around Automatic Differentiation, but I don't understand the purpose.

Why would you take the derivative of a piece of code?

freemint5y ago

I have used Automatic Differention used extensively to "optimize" simulations. This is problem is know under many names such parameter estimation, local sensitivity analysis, optimal control ...

taeric5y ago

Would be cool to see what your code looked like. I have the same question as the start of the thread.

That is, I can see how the derivative can help optimise a simulation, but I'm not clear on how using an auto derivative really helps.

Edit: note that I also see a difference in being able to get the derivative of a function, versus writing in a sense where any function can be moved to its derivative.

1 more reply

srl5y ago

jpollock5y ago

So, we've got "reality", as represented by data.

We've got a model, implemented in code. Since it's code can be differentiated - not sure how that works with branches, I guess that's the math. :) This is generated through a set of input parameters.

We've got an error function, representing the difference between the model and reality.

However, if everything is taking the parameters as input, is the derivative of the error function only generated once?

1 more reply

dnautics5y ago

Note that the example here is forward differentiation and most machine learning these days is backward propagation

2 more replies

jmeyer2k5y ago

To expand, it tells you how you can change certain variables of a function to change the output of a function.

In machine learning, you create a loss function that takes in modal parameters and returns how accurate the model is. Then, you can optimize the function inputs for accuracy.

Automatic differentiation makes it much easier to code different types of models.

eigenspace5y ago

sesuximo5y ago

It’s a tool to help solve inverse problems.

stabbles5y ago· 3 in thread

An obvious example here is when you expand e^x with its Taylor series using `n` terms. The derivative of that length-n Taylor expansion is identical but only `n-1` terms long, so you lose precision.

joe_the_user5y ago

stabbles5y ago

taeric5y ago

I think the example is supposed to be indicative of larger functions that users define into the system. That is, that this was shown with sin/cos was simply an example. Right?

1 more reply

not2b5y ago· 2 in thread

oxinaboxOP5y ago

Correct, which is mentioned at the end of the post. I am not sure how well it generalizes, e.g. to multivariate functions that are defines as solutions of some thing like a differential equation.

Maybe it does? I don't know.

joe_the_user5y ago

joe_the_user5y ago· 2 in thread

Article: "Bonus: will symbolic differentiation save me? Probably, but probably not in an interesting way."

This articles seems to misunderstand AD.

The article seems to calculate sin(x) in a lossy fashion and then attribute to the error to AD. That's not how it works.

[I can go through the steps if anyone's doubtful]

oxinaboxOP5y ago

> Automatic Differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value.

Yes. The article even says that.

> The article seems to calculate sin(x) in a lossy fashion and then attribute to the error to AD. That's not how it works.

joe_the_user5y ago

The important bit is that the accurasy lost from a accurate derviative of a lossy approximation is greater than accurasy lost from a lossy approximation to an accurate derivative.

I haven't used such libraries but I am in the process of writing an AD subsystem for my own little language.

yudlejoza5y ago· 2 in thread

AD is a symbolic technique.

If you use a symbolic technique over numeric data without knowing what you're doing, I feel sorry for you.

(numeric: specifically the inclusion of floating-point.)

snicker75y ago

This is not true. Automatic differentiation and symbolic differentiation are very distinct.

yudlejoza5y ago

I never used the term 'symbolic differentiation'

- SD manipulates symbols provided to it as math formulas.

- AD manipulates symbols provided to it as program code.

freemint5y ago· 1 in thread

The Julia ecosystem provides has a library that includes the differentiation rules hinted at at the end.

https://github.com/JuliaDiff/ChainRules.jl is used by (almost all) automatic differentiation engines and provides an extensive list of such rules.

If the example used sin|cos the auto diff implementations in Julia would have called native cos|-sin and not encurred such a "truncation error". However the post illustrates the idea in a good way.

Good post oxinabox

oxinaboxOP5y ago

> https://github.com/JuliaDiff/ChainRules.jl is used by (almost all) automatic differentiation engines and provides an extensive list of such rules.

ChainRules is going to be used by everything. Right now it is used by 3 AD and a PR is open for a 4th, plus one thing that hasn't been released yet.

For context of anyone who doesn't know, I am the lead maintainer of ChainRules.jl, (and the auther of this blog post)

londons_explore5y ago· 1 in thread

Interesting...

But do these truncation errors cause any real world heartaches? Does anyone do n-derivations of any approximate function for large n?

oxinaboxOP5y ago

iujjkfjdkkdkf5y ago

So overall, valueable, but also a strange way of framing it IMO

oxinaboxOP5y ago

anyone one know why of the 7 AD's tried (8 including the one implemented at the start) there are the different answers? 0.4999999963909431 vs 0.4999999963909432 vs 0.4999999963909433

I assume if is some kind of IEEE math thing. given that IEEE allow `(a+b)+c != a + (b + c)` but where is it occuring exactly?

j / k navigate · click thread line to collapse