Can anyone take a real world example of human behavior and show me how it relates to how these techniques predict humans will behave?
I love the field but feel like there is a temptation to take giant leaps not supported by other observations.
A neuron is either activated or not, and each of its many inputs can be either excitatory (encourages activation) or inhibitory (discourages activation). McCulloch and Pitts formalized this as a weighted sum of the inputs that is then thresholded to 0 or 1. They showed some basic theoretical results from that model, which gave it some credibility as an account of how intelligence can arise from neurons. Essentially they said behavior can be described as a classifier.
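To make the thresholded unit concrete, here is a minimal sketch. The name `mp_neuron`, the weights, and the thresholds are made up for illustration and are not taken from the McCulloch and Pitts paper.

```python
def mp_neuron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs reaches the threshold, else return 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Positive weights play the role of excitatory inputs, negative weights of inhibitory ones.
# Two excitatory inputs with threshold 2 behave like logical AND.
and_outputs = [mp_neuron([a, b], [1, 1], threshold=2) for a in (0, 1) for b in (0, 1)]

# An inhibitory second input (weight -1) can veto the first: "a AND NOT b".
veto_outputs = [mp_neuron([a, b], [1, -1], threshold=1) for a in (0, 1) for b in (0, 1)]

print(and_outputs)   # inputs in the order (0,0), (0,1), (1,0), (1,1)
print(veto_outputs)
```

Each unit is already a tiny binary classifier over its inputs, which is the sense in which the theory treats behavior as classification.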
AFAIK, they didn't go much into how the weights were actually learned. Different strategies were tried, but we ultimately started to soften the threshold function into the logistic function (to make the network differentiable) and solve for the weights by gradient descent.
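A rough sketch of what softening the threshold buys you: with a logistic output the error becomes differentiable, so the weights can be nudged downhill. Everything here (the OR data set, `train_logistic_unit`, the learning rate, the epoch count) is an assumption chosen for the example, not something from the historical papers.

```python
import math

def sigmoid(z):
    """Logistic function: a smooth stand-in for the hard 0/1 threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_unit(data, lr=0.5, epochs=2000):
    """Fit one logistic unit with two inputs by stochastic gradient descent on cross-entropy."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - target  # gradient of the cross-entropy loss w.r.t. the pre-activation
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b

# Hypothetical training set: logical OR, which is linearly separable.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_logistic_unit(OR_DATA)
predictions = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in OR_DATA]
print(predictions)
```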
Modern Deep Learning makes the additional assumption that neurons in the same layer are not interconnected. This assumption, along with the fact that we're just dealing with weighted sums, allows us to describe networks in matrix form, compute the gradients with backprop, and simulate them efficiently on the GPU. The assumption is more practical than biological.
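For what the matrix form looks like in practice, here is a small sketch in plain Python. The 2-3-1 layout and every number in it are arbitrary assumptions; a real framework would do the same arithmetic as one matrix multiply per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    """Compute sigmoid(W x + b) for one layer, with W stored as a list of rows.

    Each unit reads only the previous layer's outputs x, never its siblings,
    which is why the whole layer collapses into a single matrix-vector product.
    """
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

# Hypothetical 2-3-1 network with arbitrary weights.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]

hidden = layer_forward(W1, b1, [1.0, 2.0])
output = layer_forward(W2, b2, hidden)
print(hidden, output)
```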
> show me how it relates to how [...] humans will behave?
[This page][1] attempts to connect the dots between the McCulloch and Pitts model, the resulting classifiers, and behavior. Essentially, the theory was that neurons can be formalized into classifiers, and behavior is just the output of these classifiers. I don't know too much about modern neuroscience, but given the amazing results we are seeing these days in vision, language, and planning, I'd say the central ideas of the theory are still credible.
[1]: http://www.mind.ilstu.edu/curriculum/modOverview.php?modGUI=...
Calling those chained regressions similar to the brain is about as correct as saying that a 3-year-old's drawing of a car is similar to a real Tesla...
https://en.wikipedia.org/wiki/Hilbert%27s_thirteenth_problem
Kolmogorov authored a paper titled "On Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of Smaller Number of Variables," that basically solved this in 1961. This led to a nice back and forth series of papers between Kolmogorov and Arnold, but the one that becomes more important is Kolmogorov's paper, "On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition," in 1963. What this paper proves is that any continuous function defined on the n-dimensional unit cube can be represented by the superposition of 2n one dimensional continuous functions:
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...
Now, the problem with this theorem is that it doesn't say how to find these magical 2n functions. However, in 1989 Cybenko published the paper, "Approximation by Superpositions of a Sigmoidal Function," which both extends and weakens the above result. Basically, he loses the 2n bound, but gives a way to construct these functions by using a linear projection inside of a superposition of sigmoids. This led to the universal approximation theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theore...
and I would contend the underpinnings for the modern neural net models. Now, is there any biology in there? No. It's a long series of function approximation papers. That said, I don't know the authors involved or what inspired them to write these papers. However, given that we have a documented history of dry function approximation papers that give us the mathematical power that we need to begin to justify these models, I tend to feel that the biological connections are oversold.
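To give a feel for the functional form involved, a sum of the shape G(x) = Σ α_j · σ(w_j · x + θ_j), here is a hand-built sketch. The bump construction, the steepness `k`, and the interval endpoints are my own illustrative choices and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bump(x, left=0.25, right=0.75, k=50.0):
    """Difference of two steep sigmoids: close to 1 inside [left, right], close to 0 outside.

    This is a two-term instance of sum_j alpha_j * sigmoid(w_j * x + theta_j)
    with alpha = (+1, -1), w = (k, k), theta = (-k * left, -k * right).
    """
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

# Sample the bump on [0, 1]; scaled and shifted copies of such bumps can be
# added together to trace out a one-dimensional continuous function piece by piece.
samples = [(x / 10.0, round(bump(x / 10.0), 4)) for x in range(11)]
print(samples)
```

Nothing biological is needed to write this down, which is rather the point of the argument above.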
y = X β + ε ...and a few assumptions give you... β* = (X^t X)^-1 X^t y
I might be missing something in the blog post.
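For a single predictor plus an intercept, the closed-form least-squares solution β* = (X^t X)^-1 X^t y reduces to a 2x2 system that can be solved by hand. The sketch below does exactly that; `ols_line` is a hypothetical helper, and the three points are borrowed from the blog example discussed elsewhere in this thread.

```python
def ols_line(points):
    """Solve the normal equations (X^t X) beta = X^t y where each row of X is [1, x]."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sxx = sum(x * x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    # X^t X = [[n, sx], [sx, sxx]] and X^t y = [sy, sxy]; invert the 2x2 matrix directly.
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

intercept, slope = ols_line([(3, 6), (6, 9), (12, 18)])
print(intercept, slope)
```

No iteration is involved: the answer comes out of one linear solve, which is what makes the closed form attractive when the number of predictors is small.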
inverting the p x p matrix X^t X is ~O(p^3), on top of ~O(np^2) to form it
gradient descent is ~O(np) per iteration, where p is the number of predictors and n is the number of observations (X is an n x p matrix).
for lasso, the L1 penalty is not differentiable everywhere (it has a kink at zero), so coordinate descent is used.
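A minimal sketch of coordinate descent for the lasso, assuming the objective (1/2)·Σ(y_i − x_i·β)² + λ·Σ|β_j| with no intercept. The function names and the toy data are hypothetical; the soft-thresholding step is what handles the non-differentiable point at zero.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimise 0.5 * sum((y - X beta)^2) + lam * sum(|beta_j|), one coordinate at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that ignores coordinate j.
            rho = 0.0
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Toy data with orthogonal columns, so the result can be checked by hand.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [3.0, 0.2, 3.0, 0.2]
beta = lasso_coordinate_descent(X, y, lam=1.0)
print(beta)
```

The weakly supported second coefficient is set exactly to zero rather than merely made small, which is the behavior a plain gradient step cannot produce.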
Has anyone managed to land a decent deep learning job without formal CS/machine learning training? How did you approach it?
I felt like they were too math heavy. However, I'm struggling with how to learn deep learning.
These statements are in contention. You will never really understand machine learning without learning a fair bit of the math. I do think a lot can be done about the presentation of the material, and certainly don't think much of credentialism.
Honestly, in your shoes I would look for a position where you can learn from people internally, rather than try and qualify yourself first. Even if you do a bunch of online learning and toy problems, you are going to flail about if you don't have a strong mentor in your first position.
What related/supportive skills do you have to bring to a group that is doing ML?
edit: I should add that you don't really have to understand much these days to integrate (some) ML into a system, but you aren't going to get very far into modeling or understanding issues without some background. You can only get so far with black boxes.
I have around 8 years of professional software experience (C++/C#) and have fiddled around with some rudimentary machine learning for work, like linear regression, k-means clustering, etc. I have a decent idea of how/why they work, but have fallen flat on my face when learning the theory behind more complicated algorithms, e.g. Hessians from Andrew Ng's class. In my experience, many classes tend to focus on a ground up approach. With higher level frameworks like Keras, how necessary is this?
It's one thing to know the math and theory to design, train, and tune the algorithm your company needs. But implementing it in production, at scale? That's not the same person.
Ideally, you have Person/Team A, who designs but knows enough about implementation to keep that in mind during their process, and Person/Team B who implements it into the software but knows enough about the design to make it work.
In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from?
Later in that same example, there is an "Error = 4^2 + (-1)^2 + 6^2". Where did those values come from?
Later, there's another form: "Error = x^5 - 2x^3 -2" What about these?
There seem to be magic formulae everywhere, with no real explanation in the article about where they came from. Without that, I have no way of actually understanding this.
Am I missing something fundamental here?
Many of the deep learning courses assume "high school math", but my school must have skipped matrices, so I've been watching Khan Academy videos.
Are there any good posts / books on walking through the math of deep learning from a true beginner's perspective?
If the first example had been kept, then the second would have been "Error = (6 - (2·3 + 1))² + (9 - (2·6 + 1))² + (18 - (2·12 + 1))² = (-1)² + (-4)² + (-7)² = 66", which is what compute_error_for_line_given_points evaluates to.
The third would have been "Error = (6 - (m·3 + b))² + (9 - (m·6 + b))² + (18 - (m·12 + b))² = 3·b² + 42·b·m - 66·b + 189·m² - 576·m + 441" and its derivative would have to be taken in two directions, giving "dError/dm = 42·b + 378·m - 576" and "dError/db = 6·b + 42·m - 66". Visualizing that slope would require a 3D plot.
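The arithmetic above can be checked with a few lines of Python. This is only a sketch: `squared_error` and `gradient` are hypothetical names, and the error here is the plain sum of squared residuals with no averaging, to match the numbers in this comment.

```python
POINTS = [(3, 6), (6, 9), (12, 18)]

def squared_error(m, b, points=POINTS):
    """Sum of squared vertical distances between each point and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

def gradient(m, b, points=POINTS):
    """Partial derivatives of squared_error with respect to m and b."""
    d_m = sum(-2 * x * (y - (m * x + b)) for x, y in points)
    d_b = sum(-2 * (y - (m * x + b)) for x, y in points)
    return d_m, d_b

def gradient_from_formula(m, b):
    """The expanded derivatives quoted above, valid for these three points only."""
    return 42 * b + 378 * m - 576, 6 * b + 42 * m - 66

error_at_line = squared_error(m=2, b=1)
d_m, d_b = gradient(m=2, b=1)
print(error_at_line, d_m, d_b)
```

Both derivatives are positive at m = 2, b = 1, so a gradient descent step from that line would decrease both the slope and the intercept.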
>In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from?
It's an example. The first two arguments define a line y = 2x + 1, the pairs are (x,y) points being used to compute the error.
"To play with this, let’s assume that the error function is Error=x^5−2x^3−2"
This is just an example of a function used as exposition to talk about derivatives.
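As a sketch of how that expository function behaves under gradient descent: its derivative is 5x^4 − 6x^2, which is zero at x = sqrt(6/5). The starting point, learning rate, and step count below are my own assumptions, and note that this is only a local minimum, since the function keeps falling as x goes to minus infinity.

```python
def f(x):
    return x ** 5 - 2 * x ** 3 - 2

def f_prime(x):
    """Derivative of f, obtained term by term with the power rule."""
    return 5 * x ** 4 - 6 * x ** 2

def descend(x, lr=0.01, steps=1000):
    """Repeatedly step against the slope; the step size shrinks as the slope flattens out."""
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

x_min = descend(2.0)
print(x_min, f(x_min))
```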
It isn't even an error function though. An error function has to be a function of at least two variables.
On a side note: I lost all respect for Andrew Ng after the Baidu cheating scandal. https://www.nytimes.com/2015/06/04/technology/computer-scien...
I felt he got off too easily on that, without any apology or even a public statement, especially considering he is a former academic. (He also silently deleted his Google+ posts.) Imagine if something like that had happened at a Google research team - I am pretty sure Jeff Dean or Peter Norvig would have stepped down.
I don't understand why you'd want to cheat for a competition like this? I get it, people cheat all the time, but the field of machine learning is built on a foundation of open and shared research, and trust.
[The author of the paper mentioned in the parent comment is Jürgen Schmidhuber - inventor of LSTMs and a very colorful character in neural land. The NYTimes did a nice profile on him a while back: https://www.nytimes.com/2016/11/27/technology/artificial-int... HN Discussion: https://news.ycombinator.com/item?id=13066646]
Andrew is now working at Tesla. I believe this is his course:
And then the idea that numerical optimization accounting for the slope was novel. How does he think that mathematicians calculated for the preceding centuries?
Linear regression springs fully formed in the 1950s and '60s? What happened to Fisher and Student and Pearson and all the rest?
Where's Hopfield? Where's Potts? Where's an awareness of the history of mathematics in general?
<ducks because="had to get that one out there"/>
Edit: There we go with the downvotes. I knew that deep learning guys can't stand this claim (but it's true, as the post itself goes to show at great length... :-))