I still haven’t forgiven CGP Grey for changing the title of his 2017 ML video to “How AIs, like ChatGPT, learn”. The video is about genetic algorithms and has nothing to do with ChatGPT (or with anything else in modern AI).
I think reading this description after watching the video might make the process more memorable. My guess is that if I had my daughter read this first, it wouldn't do much to make the video easier to parse, but reading this real-world example after watching could help solidify the concept.
Disclaimer: I don't know a lot about AI/ML, so it's possible that I am 100% wrong here!
This one? https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
The footnote video was an alright explanation of backprop. If that had been part of the main video, it would have been reasonable.
I really like his history/geography videos, but anything technical leaves a lot to be desired. And don't get me started on Humans Need Not Apply.
Well now you have to tell us. :) Many of the concrete examples in that video are exaggerated and/or misunderstood but the general question it asks - what to do when automation makes many people unemployable through no fault of their own - seems valid.
This is, I think, the case. But I don't begrudge them too much, YouTube is cutthroat.
Knowledge is hard to attain...
I think this book is a shining example of that philosophy: https://www.buildyourownlisp.com/. In the book, you implement an extremely bare-bones version of lisp, but it has been invaluable in my career. I found I was able to understand nuanced language features much more quickly because I have a clear model of how programming languages are decomposed into their components.
I agree, but that's a good guide to building a technical understanding of a complex subject, not a sufficient-in-itself tool set for considering questions in that complex subject.
In particular, people combining some "non-technical summary" of quantum mechanics / Newtonian gravity / genetic engineering / etc. with their personal common sense are a constant annoyance to me whenever such topics come up here.
Side question: is there an entry-level build-your-own language model or GAN type learning tool out there as well?
One of the most interesting things about this topic is that the fundamental concepts and implementations are all really simple. It's the fact that it actually works that's mind-boggling. In any case, the above is not a months-long affair - more like one week of dedicated work.
I do think it feels inaccessible partly because we transition from discrete concepts easily digestible by CS grads into some complicated math with very terse mathematical notation, yet the math might not be as hard if presented in a way that doesn't scare away programmers.
I think it's actually worse than that - somebody who doesn't know actually realizes that he doesn't know, but somebody who _thinks_ he understands through analogies and metaphors will confidently come to the incorrect conclusion and then argue with somebody who actually does understand the topic - often managing to convince innocent bystanders because his reasoning is easier to grasp and the so-called expert seems to be getting more and more flustered (as he tries to explain why the analogy is actually correct, but oversimplified).
On the one hand they're jargon used as short hand to technical concepts understood well by domain experts. And the concision they afford can lead to deeper understanding as they transcend their composite or adapted meanings and become base terminology in and of themselves (I think of e.g. Latin in English legal terminology. "Habeas corpus" has a literal meaning when translated, but the understood jargon has a deeper, and more specific meaning). At that point, they are powerful because of the precision of meaning and concision of expression they afford.
On the other hand, they lift intuitive terminology from a base language that is understood in vaguer terms by a broader audience. And this creates invisible disconnects because the abstraction created by these terms leaks like a sieve unless you know the precise semantics and have the model to use them.
By translating a discourse into a higher metaphoric level, we increase precision and efficiency amongst mutual understanders, but at the same time we increase the level of ambiguity, the number of possible interpretations, and the availability of terms familiar to (and thus, handles to grab on to) non-understanders. That latter situation allows non-understanders to string together what sound superficially like well-formed thoughts using jargon terms, but based on the base-language semantics, without the deeper knowledge required to tell whether a given utterance scans or not.
That's how I've been trying to wrap my head around it at least. I hope it doesn't sound like moralizing or condescension, I don't mean it to. I know I'm "guilty" of trying to manipulate metaphoric models that I don't actually understand, based on the lay-semantics of their jargon.
This description more closely describes reinforcement learning, rather than gradient based optimization.
In fact, the entire metaphor of a confused individual being slapped or rewarded without understanding what's going on doesn't really make sense when considering gradient optimization, because the gradient wrt the loss function tells the network exactly how to change its behavior to improve its performance.
This last point is incredibly important to understand correctly, since it contains one of the biggest assumptions about network behavior: that the optimal solution, or at least a good-enough-for-our-concerns solution, can be found by slowly taking small steps in the right direction.
Neural networks are great at refining their beliefs but have a difficult time radically changing them. A better analogy might be trying to very slowly convince your uncle that climate change is real, and not a liberal conspiracy.
edit: it also does a poor job of explaining layers; its description reads much more like how ensemble methods work (lots of little classifiers voting) than how deep networks work.
But, for now it's the best tool we have, so we just have to hope that we get close enough, or just empirically run lots of times to find the best local minimum we can. Incidentally, that actually is more like a brute-force approach, but at the ensemble level, which is quite different from what the article means.
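The "small steps" idea is easy to make concrete (a toy 1D loss of my own, nothing from the article):

```python
# Toy gradient descent on the loss f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# Each step moves a little in the direction that reduces the loss.
def gradient_descent(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)  # derivative of the loss at the current point
        x -= lr * grad      # a small step "in the right direction"
    return x

x_final = gradient_descent(x0=-10.0)  # converges toward the minimum at x = 3
```

For this convex toy loss the small steps always reach the optimum; the point of the comment above is that real networks give no such guarantee, only a local minimum.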
It does an excellent job of taking you from 0 to a decent understanding without dumbing down the content or abusing analogies.
1. "gradient descent" is like tuning a guitar by ear and listening to the beat frequencies ("loss") and then decreasing these by tuning a string up or down.
2. the best I can come up with for "backpropagation" is to imagine a clever device that can tirelessly optimize a Rube Goldberg machine for you but as a science, not an art.
Also reminds me of Ted Chiang warning us that we will torture innumerable AI entities long before we start having real conversations about treating them with compassion.
> what the reward / punishment system really equates to
Nothing, at least as far as neural network training goes. This is an extremely poor analogy for how neural networks learn.
If you've ever done any kind of physical training and have had a trainer slightly adjust the position of your limbs until whatever activity you're doing feels better, that's a much closer analogy. You're gently searching the space of possible correct positions, guided by an algorithm (your trainer) that knows how to move you towards a more correct solution.
There's nothing analogous to a "reward" or "punishment" when neural networks are learning.
Well, deep reinforcement learning does have rewards.
Well, in the article, it says the punishment was a slap. On the other hand, he just says "she gives you a wonderful reward"... so you're left to use your imagination there.
It seems like these sort of structures just regularly arise to help regulate the flow of information through a system.
The similarity of corporations and neural nets is pretty much only that both are information processing systems. An operating system or missile guidance system is far more like a corporation than a neural network.
Neural networks have no memory and generally don't seek particular goals, they simply recognize, predict and generate similar instances.
Different ways you can interact with that data structure can however provide meaning and store information in the weights etc.
> Neural networks have no memory and generally don't seek particular goals, they simply recognize, predict and generate similar instances.
Sounds exactly like every corporation I've ever worked in.

The memory isn't digital bits like we think of now, though, but abstractions in higher dimensions.
It depends what you mean by advanced math. There is a lot of math that only really comes into play because of the high dimensionality! For example math related to tensor wrangling, low rank approximations, spectral theory, harmonic theory, matrix calculus derivatives, universality principles, and other concepts that could be interesting or bewildering or horrifying depending how you react to it. Of course some of it is only linear algebra of the 'just high school math' kind but that's not how I would normally describe it. If you look at the math in the proofs in the appendices of the more technical AI papers on arxiv there is often some weird stuff in there, not just matrix multiply and softmax.
There are many aspects to learning systems that we still don't have any kind of grasp on, and will take more than a little advanced math (statistics/probability theory, transport theory, topology, etc.) to understand as a community.
Dunning-Kruger is probably more common in spaces like this one, where people carry social capital for being able to "spin up quickly". But the true meta-skill of upskilling is turning unknown unknowns (UU) into known unknowns (KU), and then into known knowns (KK). It's not enough to just jump from UU to KK through osmosis by reading blog posts on a news aggregator, because there will still be a huge space of unknowns not covered by that approach.
Sure, I can explain deep learning using the analogy of mixing paint colors.
When you mix different colors of paint, you can create new colors that are a combination of the original colors. Similarly, in deep learning, you can combine different types of data, such as images, text, and audio, to create a more comprehensive understanding of the information.
In a neural network, which is the basic building block of deep learning, each layer is like a different color of paint. The input layer represents the initial data, which is like a blank canvas. As the data passes through each layer, the network extracts increasingly complex features, similar to how mixing different colors creates more complex hues.
Each layer in a neural network can be thought of as a different color that is mixed with the previous layer's color. As the colors mix, the resulting color becomes more complex and nuanced, just as the data becomes more sophisticated and detailed as it passes through the layers of the neural network.
At the end of the process, the output layer represents the final result, which is like a finished painting. The deep learning algorithm can then use this final output to make predictions or classifications based on the input data.
So, in essence, deep learning is like mixing paint colors to create new and more complex hues, but instead of colors, the mixtures are made up of data and the neural network layers are the different colors being mixed.
Initialization: The weights and biases of the neural network are like the starting colors of the paint mixture. Just like how you need to start with some base colors to mix new ones, the neural network needs some initial weights and biases to begin the training process.
Forward Pass: During the forward pass, the input data is fed into the neural network, and the network produces an output. This output is like the mixed color of paint that results from combining different colors.
Backward Pass: During the backward pass, the error between the predicted output and the actual output is calculated using a loss function. This error is like a measure of how close the mixed color is to the desired color. Just like how you would adjust the mixture of colors to get the desired hue, the neural network adjusts the weights and biases of the neurons to minimize the error between the predicted output and the actual output.
Repeat: Steps 2 and 3 are repeated many times with different input data points, and the weights and biases of the neural network are updated after each iteration. This process is like adding more colors to the mixture and adjusting the amounts until the desired color is achieved.
Validation: After several epochs, the neural network is evaluated using a validation set of data to determine if it's overfitting or underfitting the training data. This step is like checking the mixture of colors to see if it matches the desired color under different lighting conditions.
Testing: Once the neural network has been trained and evaluated, it can be used to make predictions on new data. This step is like using the paint mixture to create a painting.
In summary, just like mixing paint colors to achieve a desired hue, the neural network adjusts its weights and biases to minimize the error between the predicted output and the actual output until it accurately predicts the output for new data.
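Stripping away the paint metaphor, steps 1-4 can be sketched as a minimal training loop (a single linear "neuron" fit by batch gradient descent on made-up data; illustrative only):

```python
import random

# 1. Initialization: random starting weight and bias (the "base colors").
random.seed(0)
w, b = random.random(), random.random()

# Noiseless training data for the target function y = 2x + 1.
data = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

lr = 0.1
for epoch in range(200):              # 4. Repeat
    grad_w = grad_b = 0.0
    for x, y_true in data:
        y_pred = w * x + b            # 2. Forward pass
        err = y_pred - y_true         # 3. Backward pass: how far off,
        grad_w += err * x / len(data) #    and in which direction,
        grad_b += err / len(data)     #    for each parameter
    w -= lr * grad_w                  # nudge the "mixture" toward
    b -= lr * grad_b                  # the desired output
```

After training, `w` and `b` land near 2 and 1; the validation/testing steps would then check the fit on data the loop never saw.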
It's not that it leaves out details; it's that the article's metaphors are not actually correct with regard to the way deep learning works.
This post mostly confuses both reinforcement learning and ensemble models with deep learning. If you know only "enough to be dangerous", then this post will steer your intuition in the wrong direction.
This needs expanding: it's the sort of thing that's easy for a programmer to say, but few non-programmers (or non-mathematically-trained people) are going to see that an RGB value has 3 parts, and so a collection of RGB values could be sliced into 3 sheets.
The RGB color model and representation of images in it is already technical. Anyone who knows what it means also wouldn't need to be told the following quip:
>Also note that computers see things as multi-dimensional tables of data. They don't look at a "picture" - they see 3 spreadsheets of numbers representing the RGB values of the picture.
...which is the only time RGB is mentioned in the article.
That's before we get to the part that "multidimensional" here is extraneous, and doesn't even match the typical usage (where RGBA is stored as a single 32-bit value). Everything is a tape of 1's and 0's, "multidimensionality" comes from interpretation of data.
The dimension of image data is still 2: each pixel is a sample of a 2D projection of a 3D world, and is related to other pixels in a way that's different from, say, letters in a line of text, or voxels (letters don't have a well-defined "up" neighbor; voxels have more well-defined neighbors than pixels do).
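Both representations mentioned in this thread are easy to show side by side (a made-up 2x2 image of my own; the "dimensionality" really is a matter of interpretation):

```python
# The "three spreadsheets" view: one 2x2 sheet of numbers per channel.
red   = [[255, 0], [0, 0]]
green = [[0, 255], [0, 0]]
blue  = [[0, 0], [255, 0]]

# The other common layout: a single 2D table whose entries are (R, G, B) triples.
pixels = [[(red[r][c], green[r][c], blue[r][c]) for c in range(2)]
          for r in range(2)]
```

Same bytes either way; only the slicing differs.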
(also, popular consciousness forgets that technically the Chinese Room argument is only arguing against the much narrower, and now philosophically unfashionable, "Hard AI" stance as it was held in the 70s)
Searle stood behind his argument in the 70s, and in every decade since then too.
The main failure is that most people fundamentally don't believe they are mechanistic. If one believes in dualism, then it is easy to attribute various mental states to that dualism, and of course a computer neural network cannot experience qualia like humans do.
I don't believe in a soul, and thus believe that a computer neural network, probably not today's models but a future one that is large enough and has the right recurrent topology, will be able to have qualia similar to what humans and animals experience.
Cute.
The algorithm/workers are not really “looking” at a picture of a cat, they are analysing and looking for patterns in the data that defines the picture of the cat.
End of knowledge.
Maybe I should ask ChatGPT?
https://www.youtube.com/watch?v=kCc8FmEb1nY
(I haven't gotten around to watching it yet)
You actually should, it spits out a pretty good explanation (sometimes).
1 - We want to model data, representative of some system, through functions.
2 - Virtually any function can be expressed by an n-th order polynomial.
3 - We wish to learn the parameters, the coefficients, of such polynomials.
4 - Neural networks allow us to brute-force test candidate values of such parameters (finding optimal candidate parameters such that the error between expected and actual values of our dataset is minimized)
Whereas prior methods (e.g. PCA) could only model linear relationships, neural networks allowed us to begin modeling non-linear ones.
y = a + bx + cx^2 + dx^3 + ...
The resulting model is linear, even though there are powers of x in your formula, because x and y are known from the data. They're not what you're solving for; you're solving for the unknown coefficients (a, b, c, d...). This gives you a linear system of equations in those unknown coefficients, which can be solved using standard linear least squares methods.

So fitting polynomials is easy. The problem is that it's not that useful. Deep learning has to solve much harder problems to get to a useful model.
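A sketch of that point, assuming numpy and made-up noiseless data: the fit is linear in the coefficients even though the features are powers of x.

```python
import numpy as np

# Data generated from y = 1 + 2x + 3x^2 (no noise, for illustration).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1 + 2 * x + 3 * x ** 2

# The design matrix has columns 1, x, x^2: the model is linear in the
# unknown coefficients (a, b, c), even though x appears squared.
A = np.stack([np.ones_like(x), x, x ** 2], axis=1)
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # recovers (1, 2, 3)
```

One matrix solve, no iterative training needed - which is exactly why this problem is easy compared to deep learning.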
As far as my understanding goes, you can represent practically any function as layers of linear transformations followed by non-linear functions (e.g. `ReLU(x) = max(0, x)`). It's this sprinkling of non-linearity that allows the networks to be able to model complex functions.
However, from my perspective, the secret sauce is (1) composability and (2) differentiability. These enable the backpropagation process (which is just "the chain rule" from calculus) and this is what allows these massive mathematical expressions to learn parameters (weights and biases) that perform well.
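One way to see what the non-linearity buys (a hand-picked toy network of my own, not a trained one): two ReLU units compute absolute value, which no purely linear map can.

```python
def relu(v):
    # ReLU(x) = max(0, x), the non-linearity mentioned above
    return max(0.0, v)

def tiny_net(x):
    # Hidden layer: two units with weights +1 and -1, no bias.
    h1 = relu(1.0 * x)   # fires only for positive inputs
    h2 = relu(-1.0 * x)  # fires only for negative inputs
    # Output layer: sum the two units -> |x|, a non-linear function.
    return h1 + h2
```

Composing many such layers, with learned rather than hand-picked weights, is what lets networks carve out much more complex functions.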
- It requires some mathematical understanding so will exclude some part of the non-technical audience
- It is the incorrect analogy. Non-linearities in neural networks have nothing to do with polynomials. In fact, polynomial regression is a type of linear regression, and for the most part, it sucks.
Also, as someone mentioned, all the “serious” alternative ML methods prior to the deep learning revolution allow modeling non linearities (even if just through modification of linear regressions, like polynomial regression).
But there are many things that are not functions. Like circles. And they tend to crop up a lot in the real world, no pun intended.
Probably what you are trying to say is that "a circle is not the image of a function", but that is also not true. You're assuming that since in Cartesian coordinates you can solve for y = +/- sqrt(R^2 - x^2), the fact that y is multi-valued means it's not a function. That's what they teach in high school pre-calculus, anyway.
But for example, we can associate the points on a circle with the image of the function e^{i theta}. Or equivalently, with the R^2-valued function f(theta) = (cos(theta), sin(theta)).
Prior methods also allowed modelling of non-linear relationships, e.g. Random Forests.