The Julia world has done a lot of experimentation with AD and is converging on some really cool things, so if you're interested in this field it's definitely worth a look.
[1]: https://github.com/FluxML/Flux.jl/blob/master/src/tracker/Tr...
"You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own." http://fluxml.ai/Flux.jl/stable/
edit: and the automatic differentiation works on them too!
It uses a special sort of invented numbers which square to zero even though they are not themselves zero.
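A minimal sketch of such dual numbers in Julia (my own toy implementation, not the one a real package like ForwardDiff.jl uses):

```julia
# Toy dual number: carries a value and a derivative. The arithmetic
# follows from (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0.
struct Dual
    val::Float64   # the value f(x)
    der::Float64   # the derivative f'(x)
end

Base.:+(x::Dual, y::Dual) = Dual(x.val + y.val, x.der + y.der)
Base.:*(x::Dual, y::Dual) = Dual(x.val * y.val, x.val * y.der + x.der * y.val)
Base.:+(x::Dual, a::Number) = Dual(x.val + a, x.der)
Base.:*(a::Number, x::Dual) = Dual(a * x.val, a * x.der)

# Differentiate f(x) = 3x^2 + 2x + 1 at x = 2 by seeding der = 1:
f(x) = 3x * x + 2x + 1
d = f(Dual(2.0, 1.0))
# d.val == 17.0 (f(2)) and d.der == 14.0 (f'(2) = 6*2 + 2)
```

Evaluating `f` on a `Dual` computes the value and the derivative in a single pass, which is exactly forward-mode AD.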
Here is one of the many tutorials on automatic differentiation: https://pizzaseminar.speicherleck.de/automatic-differentiati...
Fundamentally, automatic differentiation is the methodical application of the chain rule. Forward mode results from the rules for directional derivatives; reverse mode results from the total derivative. The reason we have a reverse pass in reverse mode can be seen from this perspective. The directional derivative is `(f o g)'(x) dx = f'(g(x)) g'(x) dx`. Note that we compute `g(x)` before `f(g(x))`, and the derivative `g'(x)` before `f'(g(x))`, so we can compute the derivatives as we compute the answer. If we want the gradient, which results from the total derivative, we have `grad (f o g)(x) = g'(x)* grad f(g(x))`. Although we still compute `g(x)` before `f(g(x))` during our computation, the gradient requires computing `grad f(g(x))` before applying the adjoint operator `g'(x)*`. We do the evaluations on a first pass, caching extra values, and then compute the gradient on a reverse pass, because we need the adjoints of the total derivatives in the reverse order.
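That forward-pass-then-reverse-pass structure can be made concrete with a tiny sketch: each primitive returns its value together with a "pullback" closure implementing the adjoint, which is cached during the forward pass and applied in reverse. (All names here are hypothetical, not from any actual AD package.)

```julia
# Each primitive returns (value, pullback). The pullback applies the
# adjoint of the primitive's derivative to an incoming cotangent.
square(x) = (x^2, dy -> 2x * dy)           # y = x^2,   pullback: dy -> 2x*dy
mysin(x)  = (sin(x), dy -> cos(x) * dy)    # y = sin x, pullback: dy -> cos(x)*dy

# Gradient of f o g at x: evaluate inside-out (caching pullbacks),
# then pull back outside-in, seeding the reverse pass with 1.0.
function grad_compose(f, g, x)
    y, pull_g = g(x)              # forward: g(x), cache g's pullback
    z, pull_f = f(y)              # forward: f(g(x)), cache f's pullback
    return pull_g(pull_f(1.0))    # reverse: g'(x)* ( f'(g(x))* 1 )
end

# d/dx sin(x^2) = cos(x^2) * 2x; at x = 1.0 this is 2*cos(1)
g1 = grad_compose(mysin, square, 1.0)
```

The pullbacks run in the opposite order from the evaluations, which is the reverse pass described above.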
Or, at least that's my bias in how to derive things.
Besides efficiency concerns (about which I don't have a clue), a disadvantage of the dual-number point of view is that, to my knowledge, it can only be used to derive the forward mode of automatic differentiation. Still, I take pleasure in appreciating the slightly mystic aura of the dual numbers. :-)
As magical as the chain rule of differentiation.
Is this really used in practice?
It seems to me that most AD frameworks used for deep learning implement, for every primitive function, a backward function that returns the Jacobian (or Jacobian-vector product), and then chain those backward functions.
From this it should be clear why machine learning uses reverse-mode.
On the other hand, forward-mode is better for e.g. calculating the tangent of a high-dimensional curve (i.e. R -> R^n).
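For instance (a toy sketch; a real package like ForwardDiff.jl does this properly), one forward pass with a dual number yields the tangent of every component of an `R -> R^n` curve at once, whereas reverse mode would need one backward pass per output:

```julia
# Minimal dual number with just the overloads this example needs.
struct Dual
    val::Float64
    der::Float64
end
Base.sin(d::Dual) = Dual(sin(d.val), cos(d.val) * d.der)
Base.cos(d::Dual) = Dual(cos(d.val), -sin(d.val) * d.der)
Base.:*(a::Number, d::Dual) = Dual(a * d.val, a * d.der)

# Helix t -> (cos t, sin t, 2t); seed der = 1 to differentiate w.r.t. t.
curve(t) = [cos(t), sin(t), 2t]
p = curve(Dual(0.0, 1.0))        # one forward pass through all components
tangent = [d.der for d in p]     # tangent vector at t = 0: [0, 1, 2]
```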
Though I have never seen AD frameworks used in production contexts for neural networks/backpropagation. As you say, the code for this seems to be mostly hand-rolled. Please take this negative statement with a grain of salt; I don't actually work in machine learning.
https://www.infoworld.com/article/3284380/data-science/what-...
Close to C speed in a dynamic language? Seems pretty great on paper. Is this generally the case?
function test1(n)
    x = 0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1
        end
    end
    return x
end
It isn't type stable because x starts out as an Int64 but then changes to a Float64 in the middle of the loop. This code compiles to 78 instructions. The following code is type stable:
function test2(n)
    x = 0.0
    for i = 1:n
        if i == 10
            x = x + 0.1
        else
            x = x + 1.0
        end
    end
    return x
end
This code compiles to 14 instructions.

julia> @btime test1(10^5)
  147.427 μs (0 allocations: 0 bytes)
99999.1

julia> @btime test2(10^5)
  88.472 μs (0 allocations: 0 bytes)
99999.1
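One way to see the instability without counting instructions (a sketch; in the REPL, `@code_warntype test1(10^5)` shows the same thing interactively) is to ask what return type the compiler infers for each version:

```julia
function test1(n)
    x = 0
    for i = 1:n
        x = (i == 10) ? x + 0.1 : x + 1
    end
    return x
end

function test2(n)
    x = 0.0
    for i = 1:n
        x = (i == 10) ? x + 0.1 : x + 1.0
    end
    return x
end

# test1's result can be Int64 or Float64, so inference produces a Union;
# test2 is concretely Float64.
Base.return_types(test1, (Int,))   # a Union of Float64 and Int64
Base.return_types(test2, (Int,))   # [Float64]
```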
In earlier versions of Julia, the penalty here would be orders of magnitude worse. The performance is awesome, and the REPL is super great as well. There are libraries that I'm itching to try out as soon as I've got a relevant project (Flux ML being at the top of that list).
There have been a lot of situations using it where I've just gone "this is everything I needed/wanted": the performance, the language features and API design, etc.
Plus there's a lot of very interesting things going on with the language and ecosystem, I definitely recommend trying it out.
function foo(x)
    # do something
end
and you call foo(10) and foo("some string"), then the compiler will create specialized methods foo(x::Int) and foo(x::String). Then there is no need to track the dynamic type of x inside these functions. The catch is that the first time you run a function there is often a noticeable compile time, but the compiled code is cached after that.
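A quick way to watch the specialization happen (a sketch; `double` is my own example, and `@code_typed` shows the specialized code in the REPL):

```julia
# One generic definition...
double(x) = x * 2

# ...but the compiler emits one specialization per concrete argument type.
double(10)     # first Int call: compiles double(::Int64)
double(2.5)    # first Float64 call: compiles double(::Float64)

# Each specialization is inferred to a concrete return type, so the
# body never needs to track x's type at run time.
Base.return_types(double, (Int,))       # [Int64]
Base.return_types(double, (Float64,))   # [Float64]
```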
other problems with Julia include a somewhat immature/unfinished set of libraries, in part because the language was constantly changing underneath people.
but now that 1.0 has been released, the language will be stable for a long time and you can expect that to improve quickly.
good language! it gets a lot of hype on HN but that’s because it is actually very nice.
That isn't true. Closures, higher order functions and fused broadcast array expressions are all very fast except in some corner cases.