On a related note, some of you might also be interested in a Stanford CS class I'll be co-teaching with my adviser next quarter. It will be focused primarily on Convolutional Networks (but a lot of the learning machinery is generic):
CS231n: Convolutional Neural Networks for Visual Recognition http://vision.stanford.edu/teaching/cs231n/
I hope to make a lot of the materials/code freely available so everyone can follow along, and I will continue my work on this guide in parallel whenever I can squeeze in time. (And I'd be happy to hear any feedback on the content/style/presentation)
1) In "Becoming a Backprop Ninja", dx is never declared. Is it a variable that was set in a previous pass, or is it some value like 1 that depends on its function? I understand how to derive dx1 and dx2 from dx.
2) In the SVM, can "pull" be other values than -1, 0, 1? It seems like it might affect how rapidly the SVM learns.
3) It takes 300 iterations to train the binary classifier, could you add a brief paragraph about why this isn't 30 or 3000, and what has the greatest effect on iterations?
4) In 2D SVM, what is the α function? Are w0^2 and w1^2 an arbitrary way to regularize (like would |w0| + |w1| work as well?) It made me think of least squares fitting or RMS power, so I wondered what the reasoning was behind it.
Many more questions (and a sense of irony because I have learned NNs in the past and have since forgotten) so I am trying to give my brain some breadcrumbs.
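On question 1, my understanding (not verified against the guide's actual code) is that the gradient on the final output of the circuit is set to 1.0 by convention, and every downstream dx is derived from it via the chain rule. A minimal sketch, with variable names that are mine rather than the guide's:

```javascript
// forward pass: f(x, y) = x * y
var x = -2, y = 3;
var out = x * y; // -6

// backward pass: the gradient on the final output is initialized
// to 1.0 by convention (d out / d out = 1); everything further
// down follows from it by the chain rule
var dout = 1.0;
var dx = y * dout; // local gradient of * w.r.t. x is y
var dy = x * dout; // local gradient of * w.r.t. y is x
```

If this gate fed into a larger circuit, dout would instead be the gradient flowing in from above, and dx/dy would be derived from it exactly as here.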
I think your explanations are great, but you seem to imply that the gradient 3 for f(x,y) = xy with x = -2 and y = 3, derived w.r.t. x, is somehow the "best" gradient (you call it the correct or actual gradient), but I don't understand why this is better than any other positive number.
Wouldn't we optimize faster if we tug with a force of 6 instead of 3?
x = -2, y = 3,
dx = 3, dy = -2
step = 0.0001
f(x,y) = -6
f(x + dx*step, y + dy*step) = -5.998...
// Now if we double the gradient
f(x + 2*dx*step, y + 2*dy*step) = -5.997...

The analytical gradient is actually a general tool used in multivariable calculus. It's vector-valued: if you take the gradient of a function of three variables, f(x, y, z), you get a 3-vector back. Vectors are defined by two characteristics, a direction and a magnitude. The gradient's direction is the direction of greatest increase, and its magnitude is the instantaneous rate of change.
The gradient is being put to work here in order to optimize a function using a process called gradient ascent. Intuitively it makes sense that in order to optimize your function you'd want to "move" in the direction of greatest increase at every step. However, the size of the step that you take is tricky. As you point out, in this case, we can increase the objective function value more if we double the step size. However, you're not actually doubling /the gradient/.
If you look at the expression that you wrote you should see:
x + 2 * dx * step, y + 2 * dy * step.
What you've done is double the step multiplier, not the gradient itself (<dx, dy>). This means the optimization process jumps further at each step. However, the step size OP chose is somewhat arbitrary to begin with, so it's not clear why any particular choice would be better or worse. The reason step size matters is that if it's too large, your optimization process might overshoot an optimum or do something similarly bad; there are pathological functions, like the Rosenbrock function [1], which are notoriously hard to optimize with gradient ascent. In practice, you'll often choose your step size based on a few different tests, or vary it as the optimization progresses.

In this particular instance, the surface you're trying to optimize is simple enough that basically any step value will do the trick. It may take a different number of steps to reach the global maximum, but most reasonable step sizes will get there.
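The effect of scaling the step multiplier can be checked directly. A quick sketch reproducing the numbers from the example above (the gradient of x*y at (-2, 3) is <3, -2>):

```javascript
function f(x, y) { return x * y; }

var x = -2, y = 3;
var dx = 3, dy = -2;   // gradient of x*y at (-2, 3)
var step = 0.0001;

var f0 = f(x, y);                                  // -6
var f1 = f(x + dx * step, y + dy * step);          // ~ -5.9987
var f2 = f(x + 2 * dx * step, y + 2 * dy * step);  // ~ -5.9974
// doubling the step multiplier just moves further along the same
// direction <dx, dy>; the gradient itself is unchanged
```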
The simplicity of the example is deceiving you, in the sense that in a more complex circuit (than just x * y), it will no longer be the case that simply marching along the direction of the gradient in huge steps will necessarily lead to a good outcome.
All we can do is work very locally. You are at some specific point (x,y)=(-2,3). The gradient is telling you that the best direction (in this 2-D space) to go along if you're interested in increasing your function is in the direction (3, -2), which is the gradient at that point.
Now if you remember there's a crucial parameter step_size, and the update will take the form (in vector notation):
(x,y) += step_size * (dx, dy)
all that the gradient is saying is that if step_size is infinitesimally small, your function will increase, and that this direction gives the fastest increase. If you try any other direction with an infinitesimally small step_size, you will do worse. However, there are no guarantees whatsoever about what happens when the step_size is larger. In practice we use step sizes of 0.001 or so (small numbers), and it works only because the functions we deal with are relatively smooth and well-behaved.
However, right after we take the small step we have to re-evaluate the gradient, because the direction could suddenly be shifted. Doing gradient ascent is sometimes compared to walking blind on a hill and trying to get to the top: you can sense the steepness of the hill at your feet (the gradient), and you make steps accordingly. But if you make too large a step and you're not careful, there could actually have been a large drop.
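That step-then-re-evaluate loop can be sketched in a few lines on the x * y example (the step size and iteration count here are arbitrary choices of mine, not anything from the guide):

```javascript
function f(x, y) { return x * y; }

var x = -2, y = 3;
var step_size = 0.001;

for (var i = 0; i < 100; i++) {
  // re-evaluate the gradient of x*y at the current point
  var dx = y;
  var dy = x;
  // take a small step in the direction of steepest increase
  x += step_size * dx;
  y += step_size * dy;
}
// f(x, y) has increased from its starting value of -6
```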
TLDR: x * y is a very simple example; this certainly wouldn't be the case in a more complicated circuit.
Interesting choice of language, maybe you could add little run buttons to the code snippets to try all the examples in the browser :-)
I'm a huge fan of interactive demos (see e.g. my ConvNetJS demos: http://cs.stanford.edu/people/karpathy/convnetjs/ ) because they help develop intuitions that solidify the material. And yes, that's partly why I chose Javascript; I'm definitely planning to add little demos of the gradients flowing around the circuit and so on. Coming soon! :)
How do I get the math typesetting working?
Just a minor nit: s/peaked/piqued/
-A
I started this online free book "Neural Networks and Deep Learning" (http://neuralnetworksanddeeplearning.com/). I think it has a pretty good explanation and illustration.
My understanding after reading this guide, is that a neural network is essentially just a formula for guessing the ideal parameter(s) of another formula, and for successively refining those parameters as more training data is passed in. I already knew this "in theory" before, but now I think most of the magic and mystery has been brushed off. Thanks karpathy!
However seeing the code and then the related equation is teaching me mathematics that I couldn't grok in school 20 years ago.
This is (for me) a much better way to explain it!
Thank you for sharing.
I think of these things like lectures in college. Assuming that a) you have all the context up to that point, b) you grok every step, and c) it doesn't run beyond the length of your attention span - call it an hour - then you can learn astonishing things in a matter of a few sessions. But if you miss a step, and don't raise your hand because you think your question is stupid, then you might as well pack up and go home.
I find that even though the time element of articles is removed, I still trip over certain sections because the author was too lazy to anticipate my thinking process, which forces me to search other sites for the answers, and before you know it the impact of the writing has been completely diluted.
Andrew Ng's Stanford course has been a godsend in laying out the mathematics of Machine Learning. It would be a good next step for anybody who was intrigued by this article.
http://www.greenteapress.com/thinkbayes/ http://www.greenteapress.com/index.html
My naive question: are the gates always so trivial, or are they usually black boxes with an unknown function underneath (instead of clearly defined *, +, etc.)?
In other words, do we usually/always know what stands behind f(x,y)?
Otherwise, if f(x,y) = xy, then obviously you can hardcode a heuristic that just cranks up max x and max y and gets max f(x,y).
That is, the question is why we should tweak the input slightly and not go all out.
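Even if f really were a black box, you could still estimate its gradient by perturbing each input a tiny amount and watching the output, which also hints at why going all out isn't safe: the estimate is only valid in the immediate neighborhood of the current point. A sketch (the function name is mine, and it assumes f is something you can only call, not inspect):

```javascript
// numerical gradient of a black-box f at (x, y)
function numericalGradient(f, x, y) {
  var h = 1e-5; // tiny perturbation
  var dx = (f(x + h, y) - f(x, y)) / h;
  var dy = (f(x, y + h) - f(x, y)) / h;
  return { dx: dx, dy: dy };
}

// works even when we do happen to know f, e.g. f(x, y) = x * y
var g = numericalGradient(function (x, y) { return x * y; }, -2, 3);
// g.dx ~ 3, g.dy ~ -2: matches the analytic gradient at (-2, 3),
// but says nothing about f far away from that point
```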
Feel free to drop me a line if you'd like to discuss any further. I'd be interested to hear what you're working on.
It pretty much is, though.
Can anyone define 'perturbations' for me?