Exact numeric nth derivatives (opens in new tab)

(jliszka.github.io)

164 pointsjliszka12y ago47 comments

47 comments

38 comments · 15 top-level

backprojection12y ago· 11 in thread

I think it's worth noting that the problem with numerical differentiation, fundamentally, is that differentiation is an unbounded operator. In finite-differences, (the more obvious approach), you assume that your data are samples of some, general, function.

The problem then, is that that general functions have no (essential) bandlimit [1]. Remember that differentiation acts as a multiplication by a monomial, in the frequency domain [2]. Non-constant polynomials always eventually blow up away from 0, so in differentiating, you're multiplying a function by something that blows up, in the frequency domain. This means that, in the result, higher frequencies are going to dominate over lower frequencies, at a polynomial rate.

Let me be clear, the problem with numerical differentiation is not just that rounding errors accumulate, it's that differentiation is fundamentally unstable, and not something you want to apply to real-world data.

It depends very much on what your application is, however, I think generally a better approach to AD is to redefine your differentiation, by composing it with a low-pass filter. If designed properly, your low-pass filter will 'go to zero' faster (in the frequency-domain) than any polynomial, thus making this new operator bounded, and hence numerically more stable. It's not a panacea, but it begins to address the fundamental problem.

One example of such a filter is Gamma(n+1, n x^2)/Factorial[n], where Gamma is the incomplete gamma function [3].

In Python:

scipy.special.gammaincc(n+1,nx2) or mpmath.gammainc(n+1,nx2, regularized=True)

To see why this is a nice choice, notice item 2 in [4]. This filter is simply the product of exp(- x^2) (the Gaussian) multiplied by the first n-terms of the Taylor series of exp(+ x^2), (1/ the-Gaussian). Since this series converges unconditionally everywhere, as n-> +infinity, this filter converges to 1 for a fixed x (as you increase n), however, since it's still a gaussian times a polynomial, it always converges to 0 as you increase x, but fix n.

This is my area of research, so if anyone's interested I can give more details.

[1] https://en.wikipedia.org/wiki/Band-limit [2] https://en.wikipedia.org/wiki/Fourier_transform#Analysis_of_... [3] https://en.wikipedia.org/wiki/Incomplete_gamma_function [4] https://en.wikipedia.org/wiki/Incomplete_gamma_function#Spec...

jules12y ago

Read the article, it's not about computing derivatives of real world data (using finite differences or whatever), it's about exact derivatives of rational functions specified by a computer program. While what you wrote is interesting, none of it applies to this article.

acjohnson5512y ago

^ Totally this. It's not meant for differentiating a function known only by its points. It's meant for evaluating nth-order derivatives of functions you can already evaluate at arbitrary values. The functions don't even have to be rational. As long as the function can be calculated by a computer by some method, so can its derivatives.

This comes in handy in all sorts of things where you might have designed this fancy kernel function to use in some process where you need to be able to calculate the value of the function and also of some derivatives--backpropagation for example [1].

As I was told by someone in the field, at one point, people used to generate machine learning publications simply by finding functions that required fancy mathematical tricks to find closed-form derivatives of chosen functions so that they would be usable in learning algorithms. But in many cases, this work is unnecessary if you use automatic differentiation.

It's a really cool concept, applicable in specific situations. If you need to know the derivative of a function that's not fully specified, you need numerical differentiation [2]. If you need a closed-form expression for your derivative function, that's when you need symbolic differentiation [3].

[1] http://en.wikipedia.org/wiki/Backpropagation [2] http://en.wikipedia.org/wiki/Numerical_differentiation [3] http://en.wikipedia.org/wiki/Symbolic_differentiation

1 more reply

jordigh12y ago

I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

Edit: Reminds me of the list on page 3 here: http://arxiv.org/pdf/math/9404236v1

backprojection12y ago

> I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

This more or less depends on what kind of mathematician you're talking to. A mathematician working on Frames and signals almost certainly uses that language. But someone working in PDEs probably doesn't, even if the theory intersects a lot. It's probably a sign that there's nothing wrong with mathematicians and engineers sharing concepts :).

crntaylor12y ago

This is interesting. I am generally very skeptical about taking derivatives of functions computed from real-world data for exactly this reason. What I normally end up doing is applying some form of kernel smoothing (e.g. nearest K points with a Gaussian kernel) to approximate the function, and then compute derivatives of that instead.

Would you recommend another way?

backprojection12y ago

This is pretty much exactly what I'm saying. So you're convolving with a gaussian kernel. That's the exact same thing as multiplying, in the frequency domain, by another gaussian. (convolution theorem, and the FT of a gaussian is a gaussian). The Gamma function filter goes one step further, it throws in a polynomial to make that gaussian 'flatter', which is more suitable for a low-pass filter.

Try the script at the end of this comment. Here's the output http://imgur.com/5948cW6

Notice that the 0-th one is just a gaussian. If you're interested, these things have traditionally been called HDAFs, I have a library to do this kind of thing https://bitbucket.org/nmaxwell/hdaf it may need more documentation. I don't mind if anyone provides constructive feedback on it.

The functionality you'd be interested in is hdaf.hdaf_periodic_samples, - you can convolve with that to low-pass filter, and use hertz_to_sigma to get the sigma parameter (from cutoff frequency).

  import numpy, scipy.special
  import matplotlib.pyplot as plt

  x=numpy.linspace(-2,5,1000)
  for n in [0,1,4,10]:
      s=1 if n==0 else n
      h=scipy.special.gammaincc(n+1,s*x**2)
      plt.plot(x, h)
  plt.savefig("filters")

EDIT: formatting.

irremediable12y ago

Thanks for the insight! It's interesting to read a mathematician's more formal take on this -- I'm from an engineering background, and I've often seen the problem in control theory, where we approximate signals using observers and then differentiate the observed signal.

Edit: I should clarify that it's cool to read a mathematician's insight into the problem as seen by engineers, written in a way that engineers can understand.

short_circut12y ago

Low pass filters are absolute miracles when working with numerical derivatives. I have had instances where that was the only reason I ever got anything meaningful out of them. It really comes down to the fact that the numerical representation of a function is in general not truly analytic. It is has kinks and bends all over the place.

n00b10112y ago

Do you have any references for the approach that you recommended? You're saying that differentiation (not even just numerical differentiation) is "not something you want to apply to real-world data" ... This is surprising since differentiation obviously is used in the real-world in some very critical applications.

mrcactu512y ago

I believe you, but what applications do you have in mind?

backprojection12y ago

[EDIT] As others have said below, this comment, while maybe useful, doesn't really apply to the article (or AD), since AD is about differentiating functions, not data.

jordigh12y ago· 6 in thread

Am I missing something or is this begging the question?

For any function that is not a combination of polynomials, you need to have its Taylor expansion up to the desired order of derivatives, so you can't just take an "arbitrary" function and use this method to compute its derivative in exact arithmetic.

So for anything other than polynomials, you just reword the problem of finding exact derivatives to finding exact Taylor series, and in order to find Taylor series in most cases, you have to differentiate or express your function in terms of the Taylor series of known functions.

Edit: Indeed, take the only non-polynomial example here, a rational function (division by a polynomial). In order to make this work, you have to know the geometric series expansion of 1/(1-x). For each function that you want to differentiate this way, you have to keep adding more such pre-computed Taylor expansions.

crntaylor12y ago

The typical approach is to provide overloaded versions of primitive functions (generally addition, multiplication, subtraction, division, powers, trig and hyperbolic trig functions, exponential and logarithm) for which you explicitly tell the program what the first N terms in the Taylor series are.

The magic is that you also tell it how to compute Taylor series of function compositions, if the Taylor series of the functions being composed are already known - then any arbitrary function composed out the primitive functions can have its Taylor series computed automatically!

For your example, the function 1/(1-x) is the composition of

    x -> -x
    x -> 1+x
    x -> 1/x

and so its Taylor series is already known as long as you have already defined negation, addition and reciprocation.

jordigh12y ago

So instead of implementing a full CAS that knows linearity of differentiation along with the product and chain rules, you implement something that for a little less computation can give you exact derivatives at any given point. Am I understanding this correctly?

1 more reply

jliszkaOP12y ago

You're right, it is begging the question, for any simple function, you basically have to tell it what all the derivatives are for it to work.

The power of this technique is that it handles the product rule and chain rule for you, so composition and multiplication of functions work automatically without extra work.

For the case of 1/(1-x), you only need to tell it the derivatives (wrt x) of

f(x, a) = x * a

g(x, a) = x + a

h(x) = 1/x

Then it automatically knows how to compute the derivatives of h(g(f(x, -1), 1)).

kvb12y ago

No, really you don't need Taylor series at all; see for example Conal Elliott's "Beautiful Differentiation"[1], which shows how to handle arbitrary derivatives of multidimensional functions and never uses Taylor series for anything; all you need to know is that d/dx sin(x) = cos(x), etc. Just as a warning, Conal's paper is quite dense and is probably not a good introduction.

[1] http://conal.net/papers/beautiful-differentiation/

jordigh12y ago

But knowing that sin' = cos is again begging the question, just rephrasing it back.

I think I get it, though. You can get away with simplifying intermediate steps of a differentiation computation if you only care about the value of the function at a single point.

timmclean12y ago

I am not a mathematician, but I don't think this is actually begging the question.

> you can't just take an arbitrary function and use this method

Actually, you can, AFAIK. The relation f(x + E) = f(x) + f'(x)E still holds.

If we try this method with a rational function:

    f(x) = (x - 1)/(x - 2)

    f(x + E)
     = (x + E - 1) / (x + E - 2)
     = (x + E - 1)(x - E - 2) / ((x + E - 2)(x - E - 2))
     = (x^2 - x E - 2 x + x E - E^2 - 2 E - x + E + 2) / (x^2 - x E - 2 x + x E - E^2 - 2 E - 2 x + 2 E + 4)
     = (x^2 - 3 x + 2 - E) / (x^2 - 4 x + 4)
     = (x^2 - 3 x + 2) / (x^2 - 4 x + 4) - (1 / (x^2 - 4 x + 4)) E
     = (x - 1) / (x - 2) - (1 / (x - 2)^2) E

    f'(x) = -1 / (x - 2)^2

since

    f(x + E) = f(x) + f'(x) E

Edit to add: the key idea here being that no knowledge of differentiation is needed, just tricks for manipulating expressions involving x and E until they are in normal form.

gpsarakis12y ago· 2 in thread

Nice analysis. Hope you don't mind me adding that by omitting terms of the Taylor series you do have some loss of precision, however small. Also, solving linear equation systems may even introduce instability as the following must be preserved: http://en.wikipedia.org/wiki/Diagonally_dominant_matrix

crntaylor12y ago

The point is that when you're considering the Taylor series for a dual number argument, you don't lose any precision, because higher powers of the "imaginary" part of the dual number vanish. The example he gives is

    f(a + be) = f(a) + f'(a)be + 0.5 * f''(a) b^2 e^2 + O(e^3)
              = f(a) + f'(a)be

because e^n=0 for all n>1. This isn't an approximation - it's an exact relationship for dual numbers!

You will lose some precision by using floating point numbers instead of an arbitrary-precision real number type, but this is a limitation of the machine you're working on. The method is exact.

jordigh12y ago

No, the coefficients of the Taylor series are the exact derivatives, assuming the actual arithmetic were exact (it's not, because IEEE 754). There's no loss of precision there.

crntaylor12y ago· 1 in thread

Very neat. Presumably there is a more efficient method for implementing Nth order automatic differentiation than encoding the dual numbers as NxN matrices, though? To multiply the matrices takes O(N^3) time, whereas by exploiting their known structure I think you should be able to do it in O(N^2) time. Am I wrong?

jliszkaOP12y ago

You're right, there's only O(n) information in these particular n x n matrices, so you can multiply them in O(n^2).

The code provided memoizes cell values using the key (r - c), so only O(n^2) work is required to read off the top row of the matrix. I'm planning to make that more explicit in a follow-up post.

dhammack12y ago· 1 in thread

There's an interesting python library [1] which implements AD as well as has some neat features like automatic compilation to optimized C. It's developed by the AI lab at the University of Montreal, and is pretty popular in deep learning circles. I've found it to be a huge time saver to not worry whether you screwed up your gradient calculations when doing exploratory research!

[1] http://deeplearning.net/software/theano/

lightcatcher12y ago

My personal favorite feature of Theano is the automatic compilation to CUDA code (which would get about 8-15x speed-up over the optimized C code for the deep learning research I was doing).

Bill_Dimm12y ago· 1 in thread

There seems to be a typo at the beginning of the "Implementing dual numbers" section. It says:

  The number a+bd can be encoded as...

Should be:

  The number a+b*epsilon can be encoded as...

jliszkaOP12y ago

Thanks! fixing...

svantana12y ago· 1 in thread

Is it just me or this article pretty naive? The headline's use of the word "exact" would imply integer arithmetic only, but the computations are done with floating point. So basically (s)he is trading one rounding error for another, which seems to be small-ish in some particular cases. What about discontinuities? And why forward derivatives only? I hope noone will use this for any application that actually relies on exact derivatives.

Leszek12y ago

They're exact in the sense that they give the same value as if you had calculated the value of the analytic derivative. This is different from numerical differentiation, which approximates the derivative with finite differences.

tel12y ago

See also the ad package [1] for Haskell which has a number of interesting features of this vein.

[1] http://hackage.haskell.org/package/ad

fdej12y ago

This seems to be essentially the same thing as "power series arithmetic" (first-order "dual arithmetic" is equivalent to arithmetic in the ring of formal power series modulo x^2, but you can make that x^n).

Encoding power series as matrices is sometimes convenient for theoretical analysis (or, as here, educational purposes), but it's not very efficient. The space and time complexities with matrices are O(n^2) and O(n^3), versus O(n) and O(n^2) (or even O(n log n) using FFT) using the straightforward polynomial representation (in which working with hundreds of thousands of derivatives is feasible). In fact some of my current research focuses on doing this efficiently with huge-precision numbers, and with transcendental functions involved.

BoppreH12y ago

I couldn't believe it would work, so I made a toy implementation in Python using simple operator overloading: https://github.com/boppreh/derivative

All values tried so far agree with Wolfram Alpha, so color me surprised and happy for learning something new.

fusiongyro12y ago

Very neat article. This is essentially calculus with infinitesimals (also called "nonstandard analysis") implemented on the machine. If you like the approach, a more general and rigorous investigation can be had by reading H. Jerome Keisler's book Elementary Calculus, which is freely available online in the 2nd edition here:

  http://www.math.wisc.edu/~keisler/calc.html

The third edition is now in print. I've been studying calculus with it off-and-on for a while and I find the approach very intuitive, though Spivak's Calculus is probably a better book, the "standard analysis" is a little less intuitive (and now, evidently, harder to teach a machine).

mrcactu512y ago

congratulations, you have implmented the Zariski tangent space using nilpotent matrices. welcome to the beautiful theory of algebraic geometry and schemes.

http://math.stanford.edu/~vakil/0708-216/216class21.pdf

This really does fall in the ream of algebraic geometry since this method only works for rational functions - as he implemented it.

To numerically compute sin(x + ε) you need the Taylor series.

Pitarou12y ago

How does this technique compare to a computer implementation of the kinds of techniques we learnt in High School? Is it easier to implement? More efficient? Are there some situations where it isn't appropriate?

tomrod12y ago

This read nicely until I got to the code block. Does anyone else see this as yellow and gray (with syntax highlighting) coloration--such that it's virtually impossible to read?

Bahamut12y ago

As a former mathematician, the second sentence is an abomination:

"...but to give you an overview, the idea is that you introduce an algebraic symbol ϵ such that ϵ≠0 but ϵ^2=0"

j / k navigate · click thread line to collapse

47 comments

38 comments · 15 top-level

backprojection12y ago· 11 in thread

One example of such a filter is Gamma(n+1, n x^2)/Factorial[n], where Gamma is the incomplete gamma function [3].

In Python:

scipy.special.gammaincc(n+1,nx2) or mpmath.gammainc(n+1,nx2, regularized=True)

This is my area of research, so if anyone's interested I can give more details.

jules12y ago

acjohnson5512y ago

[1] http://en.wikipedia.org/wiki/Backpropagation [2] http://en.wikipedia.org/wiki/Numerical_differentiation [3] http://en.wikipedia.org/wiki/Symbolic_differentiation

1 more reply

jordigh12y ago

Edit: Reminds me of the list on page 3 here: http://arxiv.org/pdf/math/9404236v1

backprojection12y ago

crntaylor12y ago

Would you recommend another way?

backprojection12y ago

Try the script at the end of this comment. Here's the output http://imgur.com/5948cW6

The functionality you'd be interested in is hdaf.hdaf_periodic_samples, - you can convolve with that to low-pass filter, and use hertz_to_sigma to get the sigma parameter (from cutoff frequency).

  import numpy, scipy.special
  import matplotlib.pyplot as plt

  x=numpy.linspace(-2,5,1000)
  for n in [0,1,4,10]:
      s=1 if n==0 else n
      h=scipy.special.gammaincc(n+1,s*x**2)
      plt.plot(x, h)
  plt.savefig("filters")

EDIT: formatting.

irremediable12y ago

Edit: I should clarify that it's cool to read a mathematician's insight into the problem as seen by engineers, written in a way that engineers can understand.

short_circut12y ago

n00b10112y ago

mrcactu512y ago

I believe you, but what applications do you have in mind?

backprojection12y ago

[EDIT] As others have said below, this comment, while maybe useful, doesn't really apply to the article (or AD), since AD is about differentiating functions, not data.

jordigh12y ago· 6 in thread

Am I missing something or is this begging the question?

crntaylor12y ago

For your example, the function 1/(1-x) is the composition of

    x -> -x
    x -> 1+x
    x -> 1/x

and so its Taylor series is already known as long as you have already defined negation, addition and reciprocation.

jordigh12y ago

1 more reply

jliszkaOP12y ago

You're right, it is begging the question, for any simple function, you basically have to tell it what all the derivatives are for it to work.

The power of this technique is that it handles the product rule and chain rule for you, so composition and multiplication of functions work automatically without extra work.

For the case of 1/(1-x), you only need to tell it the derivatives (wrt x) of

f(x, a) = x * a

g(x, a) = x + a

h(x) = 1/x

Then it automatically knows how to compute the derivatives of h(g(f(x, -1), 1)).

kvb12y ago

[1] http://conal.net/papers/beautiful-differentiation/

jordigh12y ago

But knowing that sin' = cos is again begging the question, just rephrasing it back.

I think I get it, though. You can get away with simplifying intermediate steps of a differentiation computation if you only care about the value of the function at a single point.

timmclean12y ago

I am not a mathematician, but I don't think this is actually begging the question.

> you can't just take an arbitrary function and use this method

Actually, you can, AFAIK. The relation f(x + E) = f(x) + f'(x)E still holds.

If we try this method with a rational function:

    f(x) = (x - 1)/(x - 2)

    f(x + E)
     = (x + E - 1) / (x + E - 2)
     = (x + E - 1)(x - E - 2) / ((x + E - 2)(x - E - 2))
     = (x^2 - x E - 2 x + x E - E^2 - 2 E - x + E + 2) / (x^2 - x E - 2 x + x E - E^2 - 2 E - 2 x + 2 E + 4)
     = (x^2 - 3 x + 2 - E) / (x^2 - 4 x + 4)
     = (x^2 - 3 x + 2) / (x^2 - 4 x + 4) - (1 / (x^2 - 4 x + 4)) E
     = (x - 1) / (x - 2) - (1 / (x - 2)^2) E

    f'(x) = -1 / (x - 2)^2

since

    f(x + E) = f(x) + f'(x) E

Edit to add: the key idea here being that no knowledge of differentiation is needed, just tricks for manipulating expressions involving x and E until they are in normal form.

gpsarakis12y ago· 2 in thread

crntaylor12y ago

    f(a + be) = f(a) + f'(a)be + 0.5 * f''(a) b^2 e^2 + O(e^3)
              = f(a) + f'(a)be

because e^n=0 for all n>1. This isn't an approximation - it's an exact relationship for dual numbers!

You will lose some precision by using floating point numbers instead of an arbitrary-precision real number type, but this is a limitation of the machine you're working on. The method is exact.

jordigh12y ago

No, the coefficients of the Taylor series are the exact derivatives, assuming the actual arithmetic were exact (it's not, because IEEE 754). There's no loss of precision there.

crntaylor12y ago· 1 in thread

jliszkaOP12y ago

You're right, there's only O(n) information in these particular n x n matrices, so you can multiply them in O(n^2).

The code provided memoizes cell values using the key (r - c), so only O(n^2) work is required to read off the top row of the matrix. I'm planning to make that more explicit in a follow-up post.

dhammack12y ago· 1 in thread

[1] http://deeplearning.net/software/theano/

lightcatcher12y ago

My personal favorite feature of Theano is the automatic compilation to CUDA code (which would get about 8-15x speed-up over the optimized C code for the deep learning research I was doing).

Bill_Dimm12y ago· 1 in thread

There seems to be a typo at the beginning of the "Implementing dual numbers" section. It says:

  The number a+bd can be encoded as...

Should be:

  The number a+b*epsilon can be encoded as...

jliszkaOP12y ago

Thanks! fixing...

svantana12y ago· 1 in thread

Leszek12y ago

tel12y ago

See also the ad package [1] for Haskell which has a number of interesting features of this vein.

[1] http://hackage.haskell.org/package/ad

fdej12y ago

BoppreH12y ago

I couldn't believe it would work, so I made a toy implementation in Python using simple operator overloading: https://github.com/boppreh/derivative

All values tried so far agree with Wolfram Alpha, so color me surprised and happy for learning something new.

fusiongyro12y ago

  http://www.math.wisc.edu/~keisler/calc.html

mrcactu512y ago

congratulations, you have implmented the Zariski tangent space using nilpotent matrices. welcome to the beautiful theory of algebraic geometry and schemes.

http://math.stanford.edu/~vakil/0708-216/216class21.pdf

This really does fall in the ream of algebraic geometry since this method only works for rational functions - as he implemented it.

To numerically compute sin(x + ε) you need the Taylor series.

Pitarou12y ago

tomrod12y ago

This read nicely until I got to the code block. Does anyone else see this as yellow and gray (with syntax highlighting) coloration--such that it's virtually impossible to read?

Bahamut12y ago

As a former mathematician, the second sentence is an abomination:

"...but to give you an overview, the idea is that you introduce an algebraic symbol ϵ such that ϵ≠0 but ϵ^2=0"

j / k navigate · click thread line to collapse