https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
The part I find most interesting is his proposal that neural networks largely work by “hitching a ride” on fundamental computational complexity, in practice sort of searching around the space of functions representable by an architecture for something that works. And, to the extent this is true, that puts explainability at fundamental odds with the highest value / most dense / best deep learning outputs — if they are easily “explainable” by inspection, then they are likely not using all of the complexity available to them.
I think this is a pretty profound idea, and it sounds right to me — it seems like a rich theoretical area for next-gen information theory. Essentially: are there (soft/hard) bounds on certain kinds of explainability/inspectability?
FWIW, there’s a reasonably long history of mathematicians constructing their own ontologies and concepts and then people taking like 50 or 100 years to unpack and understand them and figure out what they add. I think of Wolfram’s cellular automata like this, possibly really profound, time will tell, and unusual in that he has the wealth and platform and interest in boosting the idea while he’s alive.
The ML research community generally agrees that the key to generalization is finding the shortest "program" that explains the data (Occam's Razor / the MDL principle). But directly searching for these minimal programs (architecture space, feature space, training space, etc.) is exceptionally difficult, so we end up approximating the search with something like GPR or circuit search guided by backprop.
This shortest-program idea is related to Kolmogorov complexity (which arises out of classical information theory) — i.e. the length of the most concise program that generates a given string (because if you're not operating on the shortest program, there is looseness/overfitting!). In ML, the training data is the string and the learned model is the program. We want the most compact model that still captures the underlying patterns.
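A toy version of the two-part MDL idea — pick the model that minimizes bits(model) + bits(residuals). The bit costs below are illustrative assumptions (32 bits per coefficient, a Gaussian code for residuals), not a rigorous coding scheme:

```python
import numpy as np

# Quadratic signal plus a little noise; MDL should prefer a low-degree fit.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 - x + rng.normal(0, 0.05, x.size)

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    model_bits = 32 * (degree + 1)  # assumed cost per coefficient
    # Gaussian code length for the residuals (up to an additive constant)
    data_bits = 0.5 * x.size * np.log2(max(residuals.var(), 1e-12))
    return model_bits + data_bits

best = min(range(8), key=description_length)
```

Degree 7 drives the residuals smaller, but the extra coefficients cost more bits than they save — the compact model wins.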
While (D)NNs have been super successful, their reliance on approximations suggests there's plenty of room for improvement in terms of inductive bias and more program-like representations. I think approaches that combine the flexibility of neural nets with the structured nature of symbolic representations will lead to more efficient and performant learning systems. It seems like a rich area to just "try stuff" in.
Leslie Valiant touches on some of the same ideas in his book "Probably Approximately Correct", which tries to nail down some of the computational phenomena associated with the emergent properties of reality (it's heady stuff).
If you look at what a biological neural network is actually trying to optimize for, you might be able to answer The Bitter Lesson more adeptly.
Latency is a caveat, not a feature. Simulating a biologically-plausible amount of real-time delay is almost certainly wasteful.
Leaky charge carriers are another caveat. In a computer simulation, you need never leak any charge (i.e. information) if you so desire. This would presumably make the simulation more efficient.
Inhibitory neurology exists to preserve stability of the network within the constraints of biology. In a simulation, resources are still constrained but you could use heuristics outside biology to eliminate the fundamental need for this extra complexity. For example, halting the network after a limit of spiking activity is met.
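A minimal sketch of what that heuristic could look like: a toy leaky integrate-and-fire loop with no inhibitory machinery, halted once a global spike budget is hit. All parameters here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, threshold, leak, budget, max_steps = 50, 1.0, 0.9, 500, 1000
W = rng.normal(0, 0.2, (n, n))   # recurrent weights (no inhibitory design)
v = np.zeros(n)                  # membrane potentials
total_spikes = steps = 0
while total_spikes < budget and steps < max_steps:
    v = leak * v + rng.normal(0, 0.3, n)  # leaky integration + input noise
    fired = v >= threshold
    v[fired] = 0.0                         # reset neurons that fired
    v += W @ fired                         # propagate this step's spikes
    total_spikes += int(fired.sum())
    steps += 1
```

The budget check replaces the stabilizing role of inhibition: runaway excitation just exhausts the budget and the run ends.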
Learning rules like STDP may exist because a population member's learned experiences cannot survive across generations. If you have the ability to copy exact learned experiences from prior generations into new ones (i.e. cloning the candidates in memory), this learning rule may be more of a confusing distraction than a benefit.
Wolfram has a hammer and sees everything as a nail. But it's a really interesting hammer.
Could you define explainability in this context?
That’s… why we’re here?
That’s… why we’re here? How else could we characterise what any learning algorithm does?
Everyone is a complex mixture of both.
My dad loved reading and sharing technical subjects with me and is probably part of the reason why I enjoy a good career today.
He also cheated on my mom for 30 years, which we didn't discover until the last 3 years of his life. We didn't have much money growing up. He probably took her out to dinner with money we didn't have.
It's perfectly normal to both love and hate parts of someone, but not reject them as a whole.
That being said, I’m enjoying this. I often experiment with neural networks in a similar fashion and like to see people’s work like this.
Is this similar to the lottery ticket hypothesis?
Also the visualizations are beautiful and a nice way to demonstrate the "universal approximation theorem"
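A quick sketch in that spirit (my own toy example, not from the post): a single hidden layer of tanh units, with random fixed hidden weights and only the output weights fit by least squares, is already enough to approximate a smooth 1-D function closely:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-np.pi, np.pi, 200)[:, None]  # inputs, one column
W = rng.normal(size=(1, 100))                 # random fixed hidden weights
b = rng.normal(size=(1, 100))                 # random fixed hidden biases
H = np.tanh(x @ W + b)                        # hidden activations (200 x 100)

# Fit only the output layer by least squares against sin(x).
w, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
max_err = np.max(np.abs(H @ w - np.sin(x)))
```

The theorem itself is about what's representable with enough units, not about this training procedure, but the picture is the same: blobs of simple nonlinearities summing into an arbitrary curve.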
It feels like a religious talk.
The presentation consists of chunks of hard-to-digest, profound-sounding text followed by a supposedly informative picture with lots of blobs, then the whole pattern is repeated over and over.
But it never gets to the point. There is never an outcome, never a summary. It is always some sort of pattern of blobs that supposedly explains everything ... except nothing useful is ever communicated. You are supposed to "see" how the blobs are "everything" ... a new kind of Science.
He cannot predict anything; he cannot forecast anything. All he does is use Mathematica to generate multiplots of symmetric little blobs and then suggest that those blobs somehow explain something that currently exists.
I find these Wolfram blogs a massive waste of time.
They are boring to the extreme.
It is a given from Church–Turing that some automata will be equivalent to some Turing machines, and while that is a profound result, the specific details of the equivalence aren't super important — unless, perhaps, it becomes super fast and efficient to run the automata instead of a von Neumann architecture.
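To make the equivalence concrete: one step of an elementary cellular automaton fits in a few lines, and Rule 110 is the specific rule with a Turing-completeness proof (Cook/Wolfram):

```python
# One update step of an elementary cellular automaton on a ring of cells.
# The 8-bit rule number encodes the new cell value for each of the 8
# possible (left, center, right) neighborhoods.
def step(cells, rule=110):
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# A lone 1 grows leftward under Rule 110:
# [0,0,0,1,0,0,0] -> [0,0,1,1,0,0,0]
```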
I often explain boring things with diagrams consisting of boxes and arrows, some times with different colours.
I think this is novel. (I've seen BNNs: https://arxiv.org/pdf/1601.06071 — that approach makes things continuous for training, but if inference is sufficiently fast and you have an effective mechanism for permutation, training could be faster with the discrete approach.)
I am curious what other folks (especially researchers) think. The takes on Wolfram are not always uniformly positive but this is interesting (I think!)
A good take-away from the Wolfram writeup is that you can do machine learning on any pile of atoms you've got lying around, so you might as well do it on whatever you've got the best tooling for - right now this is silicon doing fixed-point linear algebra operations, by a long shot.
It depends on what your constraint is! If you're memory-constrained (or don't have a GPU), a bunch of 1-bit atoms with operations that are very fast on a CPU might be better.
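For a flavor of why 1-bit operations are CPU-friendly: a dot product of ±1 vectors packed into machine words reduces to XNOR plus popcount. A toy sketch, with Python ints standing in for words (encoding: bit 1 means +1, bit 0 means -1):

```python
def bin_dot(a_bits, b_bits, n):
    # XNOR marks positions where the two vectors agree; mask to n bits.
    agree = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(agree).count("1")  # popcount
    return 2 * matches - n           # agreements minus disagreements

# (+1, -1, +1, +1) . (+1, +1, -1, +1) = 1 - 1 - 1 + 1 = 0
```

One word-sized XNOR + popcount replaces 64 multiply-accumulates, which is the whole appeal of binarized nets on plain CPUs.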
I haven't thought very deeply about whether it's provably faster to do gradient descent on 32 bits vs 8, but it probably always is. What's the next step to speed up training?
https://en.wikipedia.org/wiki/Tsetlin_machine
They are discrete, individually interpretable, and can be configured into complicated architectures.
These guys are trying to make chips for ML using Tsetlin machines...
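For anyone unfamiliar, the building block is tiny: a two-action Tsetlin automaton is just a bounded counter whose state is directly readable as a confidence. A toy sketch (my own, not the chip designs or the full Tsetlin machine):

```python
class TsetlinAutomaton:
    """Two-action learning automaton with 2*n states: states 1..n choose
    action 0, states n+1..2n choose action 1. Reward pushes the state
    deeper into the current half; penalty pushes it toward the boundary,
    eventually flipping the chosen action."""

    def __init__(self, n=3):
        self.n = n
        self.state = n  # start at the boundary, on the action-0 side

    def action(self):
        return 0 if self.state <= self.n else 1

    def reward(self):
        if self.action() == 0:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalize(self):
        self.state += 1 if self.action() == 0 else -1
```

The interpretability is literal: the integer state tells you which action the automaton favors and how strongly.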
>There’s no overarching theory to it in itself; it’s just a reflection of the resources that were out there. Or, in the case of machine learning, one can expect that what one sees will be to a large extent a reflection of the raw characteristics of computational irreducibility
Strikes me as a very reductive and defeatist take that flies in the face of the grand agenda Wolfram sets forth.
It would have been much more productive to chisel away at it to figure out something rather than expecting the Theory to be unveiled in full at once.
For instance, what I learn from the kind of playing around that Wolfram does in the article is: neural nets are but one way to achieve learning and intellectual performance, and even within that there are a myriad of different ways to do it. Most importantly, there is a breadth-vs-depth trade-off: neural nets, being very broad/versatile, are not quite the best at going deep/specialised; you need a different solution for that (e.g. even a good old instruction set architecture might be the right thing in many cases). This is essentially why ChatGPT ended up needing Python tooling to reliably calculate 2+2.
This is untrue. ChatGPT very reliably calculates 2+2 without invoking any tooling.
"tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought."
It hurts one's pride to realize that the specialized thing one does isn't quite as special as previously thought.
If they were writing essays, I would suggest that it wouldn’t be so ridiculously easy to pick out the obviously AI articles everywhere.
> a standard result from calculus gives us a vastly more efficient procedure that in effect “maximally reuses” parts of the computation that have already been done.
This partially explains why gradient descent became mainstream.
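Concretely, the "standard result" is the chain rule organized as reverse-mode automatic differentiation: cache every forward value, then one backward sweep reuses them all to get every partial derivative at once, instead of re-evaluating the function once per parameter. A micrograd-style toy (my own sketch, not the article's code):

```python
class Var:
    # Scalar node in a computation graph; parents holds
    # (parent, d_output/d_parent) pairs recorded during the forward pass.
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        # Topologically order the graph, then sweep backward once,
        # reusing every value the forward pass already computed.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

x, y = Var(3.0), Var(4.0)
z = x * y + x      # dz/dx = y + 1, dz/dy = x
z.backward()
```

The forward pass costs one function evaluation; the backward pass costs about one more, regardless of how many inputs there are — that is the "maximal reuse".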
The acrobatics that Wolfram can do with the code and his analysis is awesome, and doing the same without the homoiconicity and metaprogramming makes my poor brain shudder.
Do note, Wolfram Language is homoiconic, and I think I remember reading that it supports Fexprs. It has some really neat properties, and it's a real shame that it's not Open Source and more widely used.
----------
Almost all the code and its display is some form of meta-programming. Stephen Wolfram is literally brute-forcing/fuzzing all combinations of "code".
- Permuting all the different rules/functions in a given scope
- Evolutionarily adapting/modifying them
- Graphing and analyzing those structures
- Producing the HTML for display
I get that "normal machine learning" is also permuting different programs. But it's more special when you are using the same language for the whole stack. There is a canyon that you have to cross without homoiconicity. (Granted, I don't know exactly how Wolfram generated and analyzed everything here, but I have used his language before, and I see the hallmarks of it.) I can't really copy and paste an example for you, because plaintext struggles. Here is an excerpt with some fanciness in it:
And as an example, here are the results of the forward and backward methods for the problem of learning the function f[x] = <graph of the function> , for the “breakthrough” configurations that we showed above:
You might see just a small .png interspersed in the plain text, but the language and runtime themselves have deep support for interacting with graphics like this. The only other systems I see that can juggle the same computations/patterns around like this are pure object-oriented systems like Smalltalk/Pharo. You necessarily need first-class functions to come even close to the capability, but as soon as you want to start messing with the rules themselves, you need some sort of term rewriting, Lisp macro, or fexpr (or something similar).
Don't get me wrong, you can do it all "by hand" (with compiler or interpreter help): you can generate the strings or opcodes for a processor or use reflection libraries, generate the graphs, and use some HTML-generator library to stitch it all together. But in the case of this article, you can clearly see that he has direct command over the contents of these computations in his Wolfram Language compared to other systems, because it's injected right into his prose. The outcome here can look like JupyterLab or other notebooks. But in homoiconic languages there is a lot more "first-class citizenry" than you get with notebooks. The notebook format is just something that can "pop out" of certain workflows.
If you try to do this with C++ templates, Python attribute hacking, Java bytecode magic... like... you can, but it's too hard and confusing, so most people don't do it. People just end up creating specific DSLs or libraries for different forms of media/computations, with templating smeared on top. Export to a renderer and call it a day -> remember to have fun designing a tight feedback loop here. /s
Nothing is composable, and it makes for very brittle systems as soon as you want to inject some part of a computation into another area of the system. It's way, way overspecified.
Taking the importance of homoiconicity further: when I read this article I just start extrapolating, moving past XOR or "rule 12", and applying these techniques to symbolic-logic systems, like the Tsetlin machine referenced in another part of this thread: https://en.wikipedia.org/wiki/Tsetlin_machine
Or using something like miniKanren: https://en.wikipedia.org/wiki/MiniKanren
It seems to me that training AI on these kinds of systems will give it far more capability in producing useful code that is compatible with our systems because, for starters, you have to dedicate fewer neuronal connections to parsing a grammar that is fundamentally broken and ad hoc. But I think there are far deeper reasons than just this.
----------
I think it's so hard to express this idea because it's like trying to explain why having arms and legs is better than not. It's applied to every part of the process of getting from point A to point B.
Also, as an addendum, I'm not 100% sure homoiconicity is "required" per se. I suppose any structured and reversible form of "upleveling" or "downleveling" logic that remains accessible from all layers of the system would work. Even good ol' Lisp macros have hygiene problems, which can be solved, e.g., by Racket's syntax-parse.