Edit: Found an answer: https://www.quora.com/In-optimization-why-is-Newtons-method-...
The geometric answer is that gradient descent approximates the shape of the function around your current point by a plane tangent at that point, whereas Newton's method approximates it with a quadratic "bowl" using the local curvature. In gradient descent you have to take short steps down the path, because you know your approximation cannot hold everywhere (you will not always be able to lower your error by moving in the same direction, if you assume there is some minimum error). In Newton's method you assume that your bowl approximation is correct and go straight to the bottom of that bowl. If your space really does look like a bowl, you will get to the bottom super quickly.
The ELI5 answer: you are a ship captain looking for the lowest point in a part of the ocean. You cannot afford to send a submersible down, so your approach is to send an anchor down to check the depth and take a panoramic photo a few feet around the anchor. After analyzing the results of each anchor drop you can move your ship anywhere and try again. In gradient descent you look for the lowest point on the edge of the camera's vision and move in that direction. You can move as far as you'd like, but you will not go too far because you know the drop will not go on forever. In Newton's method you will look at the curvature in the pano and extrapolate it out assuming it is a "bowl" shape. For example maybe your anchor is on an underwater J-shaped cliff getting less steep as it goes north, so you map it out as a U shape that starts going up again if you go a mile north. So, you move your ship a mile north and try again there. In gradient descent you would have moved maybe 100 feet north then checked again. Newton's method can intuitively get you to the low point in very few anchor readings, but crunching the numbers of the bowl can be very time-consuming (this becomes relevant in higher dimensions). In this analogy, the long and lat coordinates in the ocean are your two parameters to be optimized, and the height of the ocean floor is the error in your optimization. You want to find the parameters that give you the lowest error.
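The two update rules described above can be compared on an exact quadratic "bowl" (the matrix and numbers below are my own illustration, not from the thread). On a true quadratic, one Newton step lands exactly on the minimum, while a gradient step only moves a short way downhill:

```python
import numpy as np

# Hypothetical quadratic bowl f(x) = 0.5 x^T A x - b^T x, minimised at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic

x0 = np.array([5.0, 5.0])

# Gradient descent: a short step along the negative gradient.
gd_step = x0 - 0.1 * grad(x0)

# Newton: jump straight to the bottom of the fitted bowl.
newton_step = x0 - np.linalg.solve(hessian(x0), grad(x0))

x_star = np.linalg.solve(A, b)  # the true minimiser
print(np.allclose(newton_step, x_star))  # True: one Newton step suffices
```

On a non-quadratic surface the bowl is only a local approximation, so Newton's method needs several such steps, as in the anchor analogy.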
http://thelaziestprogrammer.com/sharrington/math-of-machine-...
This looks at the algorithm the same way that the OP does (mathematically, visually, programmatically), only it applies it to a simpler Linear Regression model.
I would like to see some mention of the fact that the division by the gradient is a meaningless, purely formal motivation for the correct step (inverting the Hessian) that follows.
f(x) =~ f(a) + f'(a)(x-a)
We want f(x) = 0 so, 0 = f(a) + f'(a)(x-a)
Multiply both sides by the inverse of f'(a): 0 = f'(a)^-1 f(a) + x-a
So: x = a - f'(a)^-1 f(a)
This is the update equation for Newton's method, where a is the current iterate and x is the next iterate. If f is a multidimensional function f : R^n -> R^n, then the derivative f'(a) is the Jacobian, and inversion becomes matrix inversion.
When we use Newton's method for minimisation of a function g we solve g'(x) = 0, so we pick f(x) = g'(x). Since the formula above contains f', we get a second derivative. The second derivative in multiple dimensions is the Hessian.
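Putting that derivation into code: a minimal sketch of Newton's method for minimisation, iterating x <- x - H^{-1} g'(x). The test function g(x, y) = x^2 + y^2 + x^4 (minimum at the origin) is my own choice, not from the thread:

```python
import numpy as np

# g(x, y) = x^2 + y^2 + x^4, a convex function minimised at (0, 0).
def grad(p):
    x, y = p
    return np.array([2*x + 4*x**3, 2*y])

def hess(p):
    x, _ = p
    return np.array([[2 + 12*x**2, 0.0],
                     [0.0,         2.0]])

p = np.array([1.0, 1.0])
for _ in range(10):
    # Solve H s = grad(p) instead of forming H^{-1} explicitly.
    p = p - np.linalg.solve(hess(p), grad(p))

print(p)  # very close to [0, 0]
```

Note that the "division by the derivative" from the 1-D formula has become a linear solve against the Hessian, which is the point being made above.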
If you aren't using LaTeX for formatting, then think of it as partial^2 of l. FWIW: I just found this[2], which (using the physics package) would make it even simpler to typeset.
[1] https://en.wikipedia.org/wiki/Hessian_matrix
[2] https://tex.stackexchange.com/questions/225523/how-to-write-...
As to the motivation for the correct step: can you point me to a resource that explains this? Not sure I follow...
You write an equation involving division by the gradient. This is an illegal operation (one cannot divide by a vector), and your final recipe doesn't do it. As far as I can tell, you are writing down the incorrect, illegally-vector-inverting formula as motivation for the correct formula involving the (inverse of the) Hessian. All I am suggesting is that you say explicitly something like "Of course, this formula as written is not literally correct; one cannot actually divide by a vector. The correct procedure is explained below."
(Incidentally, speaking of inverses, another poster (https://news.ycombinator.com/item?id=14881265) has mentioned that it may be a bit confusing to speak of the inverse of a matrix rather than the reciprocal, since (as I interpret that other poster's point) the reciprocal of a matrix is just its inverse. I might prefer to say something like "We write $H_{\ell(\theta)}^{-1}\nabla\ell(\theta)$ rather than $\frac{\nabla\ell(\theta)}{H_\ell(\theta)}$ to emphasise that we are inverting a matrix, not a scalar, so that the order of multiplication matters.")
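The distinction being discussed can be made concrete with a small numpy check (the matrix below is my own example): the matrix inverse is not the elementwise reciprocal, and in practice one solves a linear system rather than forming H^{-1} at all:

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
g = np.array([1.0, 3.0])

# The matrix inverse is NOT the elementwise reciprocal 1/H.
print(np.allclose(np.linalg.inv(H), 1.0 / H))  # False

# H^{-1} g, computed the numerically preferred way (no explicit inverse):
step = np.linalg.solve(H, g)
print(np.allclose(H @ step, g))  # True: step really solves H s = g
```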
In retrospect, I could have better represented the data with two overlaid histograms, but this (somewhat) captures the intent of showing that "more expensive houses tend to have more than 2 bathrooms".
Maybe a linked post that dives deeper into the 5 steps of Newton's Method? Would that be more approachable?
Gradient Descent explored similarly to this post: http://thelaziestprogrammer.com/sharrington/math-of-machine-...
The difference between the two: https://www.quora.com/In-optimization-why-is-Newtons-method-...
That said, I should include facts related to convergence, and maybe even speed compared to SGD.
As to the reciprocal -> inverse generalization, do you have any resources you could point me towards to better understand this?
Additionally, a concrete answer to "Why would following the tangent repeatedly be a good idea?" has been hard to come by for me. I am able to visualize this, but if you have resources that explain this well please share.
Newton’s method boils down to replacing your function by a first-order approximation. For a differentiable function, that is a good approximation in a small neighbourhood(!) (essentially by definition), and the zero of the model function will be very close to the zero of the original function (if it lies in that neighbourhood).
PS: I did not expect the poster and author to be the same person, otherwise I would’ve phrased my criticism differently. A Show HN would have helped.
PPS: basically the whole reciprocal/inverse confusion only arises because you start the multidimensional case from your iteration formula. If you go back to its derivation, and start again from there, you can avoid that.
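The "follow the tangent repeatedly" idea asked about above can be sketched in one dimension, where the update is just x <- x - f(x)/f'(x). The example function f(x) = x^2 - 2 (root sqrt(2)) is my own choice:

```python
import math

def f(x):
    return x*x - 2.0

def fprime(x):
    return 2.0*x

x = 1.0
for _ in range(6):
    # Replace f by its tangent line at x and jump to the tangent's zero.
    x = x - f(x) / fprime(x)

print(abs(x - math.sqrt(2)) < 1e-12)  # True: convergence is very fast
```

Each tangent's zero is a much better guess than the current iterate, which is why so few steps (or anchor drops, in the ship analogy) are needed.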