The Julia world has done a lot of experimentation with AD and is converging on some really cool things, so if you're interested in this field it's definitely worth a look.
[1]: https://github.com/FluxML/Flux.jl/blob/master/src/tracker/Tr...
"You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own." http://fluxml.ai/Flux.jl/stable/
edit: and the automatic differentiation works on them too!
It uses a special sort of invented numbers which square to zero even though they are not themselves zero.
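A minimal sketch of such dual numbers in Julia (my own toy implementation, not the one a real package like ForwardDiff.jl uses):

```julia
# Toy dual number: carries a value and a derivative. The arithmetic
# follows from (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0.
struct Dual
    val::Float64   # the value f(x)
    der::Float64   # the derivative f'(x)
end

Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.der + y.der)
Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.val * y.der + x.der * y.val)
Base.:+(x::Dual, a::Number) = Dual(x.val + a, x.der)
Base.:*(a::Number, x::Dual) = Dual(a * x.val, a * x.der)

# Differentiate f(x) = 3x^2 + 2x + 1 at x = 2 by seeding der = 1:
f(x) = 3x * x + 2x + 1
d = f(Dual(2.0, 1.0))
# d.val == 17.0 (f(2)) and d.der == 14.0 (f'(2) = 6*2 + 2)
```

Evaluating `f` on a `Dual` computes the value and the derivative in a single pass, which is exactly forward-mode AD.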
Here is one of the many tutorials on automatic differentiation: https://pizzaseminar.speicherleck.de/automatic-differentiati...
Fundamentally, automatic differentiation is the methodical application of the chain rule. Forward mode results from the rules for directional derivatives; reverse mode results from the total derivative. The reason we have a reverse pass in reverse mode can be seen from this perspective. The directional derivative is `(f o g)'(x) dx = f'(g(x)) g'(x) dx`. Note that we compute `g(x)` before `f(g(x))`, and the derivative `g'(x)` before `f'(g(x))`, so we can compute the derivatives as we compute the answer. If we want the gradient, which results from the total derivative, we have `grad (f o g)(x) = g'(x)* grad f(g(x))`. Although we still compute `g(x)` before `f(g(x))` during our computation, the gradient requires computing `grad f(g(x))` before applying the adjoint operator `g'(x)*`. We do the evaluations on a first pass, caching extra values, and then compute the gradient on a reverse pass, because we need the adjoints of the total derivatives in the reverse order.
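That forward-pass-then-reverse-pass structure can be made concrete with a tiny sketch: each primitive returns its value together with a "pullback" closure implementing the adjoint, which is cached during the forward pass and applied in reverse. (All names here are hypothetical, not from any actual AD package.)

```julia
# Each primitive returns (value, pullback). The pullback applies the
# adjoint of the primitive's derivative to an incoming cotangent.
square(x) = (x^2, dy -> 2x * dy)           # y = x^2,   pullback: dy -> 2x*dy
mysin(x)  = (sin(x), dy -> cos(x) * dy)    # y = sin x, pullback: dy -> cos(x)*dy

# Gradient of f o g at x: evaluate inside-out (caching pullbacks),
# then pull back outside-in, seeding the reverse pass with 1.0.
function grad_compose(f, g, x)
    y, pull_g = g(x)              # forward: g(x), cache g's pullback
    z, pull_f = f(y)              # forward: f(g(x)), cache f's pullback
    return pull_g(pull_f(1.0))    # reverse: g'(x)* ( f'(g(x))* 1 )
end

# d/dx sin(x^2) = cos(x^2) * 2x; at x = 1.0 this is 2*cos(1)
g1 = grad_compose(mysin, square, 1.0)
```

The pullbacks run in the opposite order from the evaluations, which is the reverse pass described above.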
Or, at least that's my bias in how to derive things.
Besides efficiency concerns (about which I don't have a clue), a disadvantage of the dual-number point of view is that, to my knowledge, it can only be used to derive the forward mode of automatic differentiation. Still, I take pleasure in appreciating the slightly mystic aura of the dual numbers. :-)
As magical as the chain rule of differentiation.
Is this really used in practice?
It seems to me that most AD frameworks used for deep learning implement, for every primitive function, a backward function that returns the Jacobian (or Jacobian-vector product), and then chain those backward functions.
From this it should be clear why machine learning uses reverse-mode.
On the other hand, forward-mode is better for e.g. calculating the tangent of a high-dimensional curve (i.e. R -> R^n).
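For instance (a toy sketch; a real package like ForwardDiff.jl does this properly), one forward pass with a dual number yields the tangent of every component of an `R -> R^n` curve at once, whereas reverse mode would need one backward pass per output:

```julia
# Minimal dual number with just the overloads this example needs.
struct Dual
    val::Float64
    der::Float64
end
Base.sin(d::Dual) = Dual(sin(d.val), cos(d.val) * d.der)
Base.cos(d::Dual) = Dual(cos(d.val), -sin(d.val) * d.der)
Base.:*(a::Number, d::Dual) = Dual(a * d.val, a * d.der)

# Helix t -> (cos t, sin t, 2t); seed der = 1 to differentiate w.r.t. t.
curve(t) = [cos(t), sin(t), 2t]
p = curve(Dual(0.0, 1.0))        # one forward pass through all components
tangent = [d.der for d in p]     # tangent vector at t = 0: [0, 1, 2]
```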
Though I have never seen AD frameworks used in production contexts for neural networks/backpropagation. As you say, the code for this seems to be mostly hand-rolled. Please take this negative statement with a grain of salt; I don't actually work in machine learning.
https://www.infoworld.com/article/3284380/data-science/what-...
Close to C speed in a dynamic language? Seems pretty great on paper. Is this generally the case?
function test1(n)
    x = 0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1
        end
    end
    return x
end
It isn't type stable because x starts out as an Int64 but then changes to a Float64 in the middle of the loop. This code compiles to 78 instructions. The following code is type stable:
function test2(n)
    x = 0.0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1.0
        end
    end
    return x
end
This code compiles to 14 instructions.

julia> @btime test1(10^5)
  147.427 μs (0 allocations: 0 bytes)
99999.1

julia> @btime test2(10^5)
  88.472 μs (0 allocations: 0 bytes)
99999.1
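One way to see the instability without counting instructions (a sketch; in the REPL, `@code_warntype test1(10^5)` shows the same thing interactively) is to ask what return type the compiler infers for each version:

```julia
function test1(n)
    x = 0
    for i = 1:n
        x = (i == 10) ? x + 0.1 : x + 1
    end
    return x
end

function test2(n)
    x = 0.0
    for i = 1:n
        x = (i == 10) ? x + 0.1 : x + 1.0
    end
    return x
end

# test1's result can be Int64 or Float64, so inference produces a Union;
# test2 is concretely Float64.
Base.return_types(test1, (Int,))   # a Union of Float64 and Int64
Base.return_types(test2, (Int,))   # [Float64]
```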
In earlier versions of Julia, the penalty here would be orders of magnitude worse. The performance is awesome, and the REPL is super great as well. There are libraries that I'm itching to try out as soon as I've got a relevant project (Flux ML being at the top of that list).
There have been a lot of situations using it where I've just gone "this is everything I needed/wanted": the performance, the language features and API design, etc.
Plus there's a lot of very interesting things going on with the language and ecosystem, I definitely recommend trying it out.
function foo(x)
    # do something
end
and you call foo(10) and foo("some string"), then the compiler will create specialized methods foo(x::Int) and foo(x::String). Then there is no need to track the dynamic type of x inside these functions. The catch is that the first time you run a function there is often a noticeable compile time, but the compiled code is cached after that.
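A quick way to watch the specialization happen (a sketch; `double` is my own example, and `@code_typed` shows the specialized code in the REPL):

```julia
# One generic definition...
double(x) = x * 2

# ...but the compiler emits one specialization per concrete argument type.
double(10)     # first Int call: compiles double(::Int64)
double(2.5)    # first Float64 call: compiles double(::Float64)

# Each specialization is inferred to a concrete return type, so the
# body never needs to track x's type at run time.
Base.return_types(double, (Int,))       # [Int64]
Base.return_types(double, (Float64,))   # [Float64]
```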
other problems with Julia include a somewhat immature/unfinished set of libraries, in part because the language was constantly changing underneath people.
but now that 1.0 has been released, the language will be stable for a long time and you can expect that to improve quickly.
good language! it gets a lot of hype on HN but that’s because it is actually very nice.
That isn't true. Closures, higher order functions and fused broadcast array expressions are all very fast except in some corner cases.