I'm going to have to spend more time digesting the article, but one thing that jumps out at me, and maybe it's answered in the article and I don't understand it, is the role of time. Generally in physics, you're talking about a quantity being conserved over time, and I'm not sure what plays the role of time when you're talking about conserved quantities in machine learning -- is it conserved over training iterations or over inference layers, or what?
edit: now that I've read it again, I just saw that they described it in the second paragraph.
I'm now wondering whether, in something like Sora that can do a kind of physical modeling, there's some conserved quantity in the neural network that is _directly analogous_ to conserved quantities in physics -- whether there is, for example, something that represents momentum and operates exactly as momentum as it progresses through the layers.
I think the analogue in machine learning is conservation over changes in the training data. After all, the point of machine learning is to find general models that describe the training data given, and minimize the loss function. Assuming that a useful model can be trained, the whole point is that it generalizes to new, unseen instances with minimal losses, i.e. the model remains invariant under shifts in the instances seen.
The more interesting part to me is what this says about philosophy of physics. Noether's Theorem can be restated as "The laws of physics are invariant under X transformation", where X is the gauge symmetry associated with the conservation law. But maybe this is simply a consequence of how we do physics. After all, the point of science is to produce generalized laws from empirical observations. It's trivially easy to find a real-world situation where conservation of energy does not hold (any system with friction, which is basically all of them), but the math gets very messy if you try to actually model the real data, so we rely on approximations that are close enough most of the time. And if many people take empirical measurements at many different points in space, and time, and orientations, you get generalized laws that hold regardless of where/when/who takes the measurement.
Machine learning could be viewed as doing science on empirically measurable social quantities. It won't always be accurate, as individual machine-learning fails show. But it's accurate enough that it can provide useful models for civilization-scale quantities.
That's not what I meant.
When you talk about "conservation of angular momentum", the symmetry is invariance over rotation, but the angular momentum is conserved _over time_.
Conservation of energy absolutely still holds, but entropy is not conserved so the process is irreversible. If your model doesn't include heat, then discrete energy won't be conserved in a process that produces heat, but that's your modeling choice, not a statement about physics. It is common to model such processes using a dissipation potential.
I know that I can remember that momentum is paired with translation simply because there's both angular momentum and linear momentum, and in space you have translation and rotation, so for time, energy is the only one that's left over. But I'm not looking for a trick to remember it; I'm looking for the fundamental reason, as well as how to tell what will be paired with some other, new invariance.
So they are not approximations, but are just terribly difficult calculations, no?
Maybe I'm misunderstanding your point, but this should be true regardless of our philosophy of physics, correct?
So in this case, we're explicitly defining the set of desired invariances.
I found the thinking of William Sidis to be a particularly thought-provoking perspective on Noether's benchmark work. In The Animate and the Inanimate he posits--at a high level--that life is a "reversal of the second law of thermodynamics"; not that the 2nd law is a physical symmetry, but a mental one in an existence where energy reversibly flows between positive and negative states.
Indeed, when considering machine learning, I think it's quite interesting to consider how the organizing of information/knowledge done during training in some real way mirrors the energy-creating information interred in the mind of Maxwell's demon.
When taking into account the possible transitive benefits of knowledge organized via machine learning, and its attendant oracle through application, it's easy to see a world where this results in a net entropy loss, the creation of a previously non-existent energy gradient.
In my mind this has interesting implications for Fermi's paradox as it seems to imply the inevitability of the organization of information. Taken further into my own personal dogma, I think it's inevitable that we create--what we would consider--a sentient being as I believe this is the cycle of our own origin in the larger evolutionary timeline.
Life temporarily displaces entropy, locally.
Life wins battles, chaos wins the war.
>Indeed, when considering machine learning, I think it's quite interesting to consider how the organizing of information/knowledge done during training in some real way mirrors the energy-creating information interred in the mind of Maxwell's demon.
This is our human bias favoring the common myth that ever-expanding complexity is an "inevitable" result of the passage of time; refer to Stephen Jay Gould's "Full House: The Spread of Excellence from Plato to Darwin"[0] for the only palatable refutation modern evolutionists can offer.
>When taking into account the possible transitive benefits of knowledge organized via machine learning, and its attendant oracle through application, it's easy to see a world where this results in a net entropy loss, the creation of a previously non-existent energy gradient.
Because it is. Randomness combined with a sieve, like a generator and a discriminator, like the primordial protein soup and our own existence as a selector, like chaos and order themselves, MAY - but DOES NOT have to - lead to temporary, localized areas of complexity, that we call 'life'.
This "energy gradient" you speak of is literally gravity pulling baryonic matter forward through spacetime. All work requires a temperature gradient - Hawking's musings on the second law of thermodynamics and your own intuition can reason why.
>In my mind this has interesting implications for Fermi's paradox as it seems to imply the inevitability of the organization of information. Taken further into my own personal dogma, I think it's inevitable that we create--what we would consider--a sentient being as I believe this is the cycle of our own origin in the larger evolutionary timeline.
Over cosmological time spans, it is a near-mathematical certainty that we will either reach the universe's Omega point[1] of "our" own accord, or perish by our own hands, by our own creation's, or by our own sons'.
[0]: https://www.amazon.com/Full-House-Spread-Excellence-Darwin/d...
This gives a vector with dimensions equal to however many directions you can translate a layer in and which is conserved over all (convolutional) layers.
Regarding the role of time, the idea of a purely conserved quantity is that it is conserved under the conditions of the system (that's why the article frequently references Newton's First Law), so they're generally held "for all time that these symmetries exist in the system".
Specifically on time: the invariant for systems that exhibit continuous time symmetries (i.e. you move a little bit forward or backward in time and the system looks exactly the same) is energy.
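For anyone who wants the reason rather than the mnemonic, here is the standard textbook calculation in a few lines. It assumes a Lagrangian L(q, \dot q, t) and that the Euler-Lagrange equations hold; the notation is mine, not the article's.

```latex
\begin{aligned}
E &= \sum_i \dot q_i \,\frac{\partial L}{\partial \dot q_i} - L, \\
\frac{dE}{dt}
  &= \sum_i \ddot q_i \frac{\partial L}{\partial \dot q_i}
   + \sum_i \dot q_i \frac{d}{dt}\frac{\partial L}{\partial \dot q_i}
   - \sum_i \frac{\partial L}{\partial q_i}\,\dot q_i
   - \sum_i \frac{\partial L}{\partial \dot q_i}\,\ddot q_i
   - \frac{\partial L}{\partial t} \\
  &= \sum_i \dot q_i \left( \frac{d}{dt}\frac{\partial L}{\partial \dot q_i}
   - \frac{\partial L}{\partial q_i} \right) - \frac{\partial L}{\partial t}
   = -\frac{\partial L}{\partial t}.
\end{aligned}
```

The bracket vanishes by Euler-Lagrange, so if L has no explicit time dependence (the system looks the same after a time shift), E is constant along trajectories. The same pattern with a spatial shift in place of the time shift gives momentum.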
imagine a spring at rest (not moving)
strike the spring, it's now oscillating
the system now contains energy like a battery
what is energy? it's stored work potential
the battery is storing the energy, which can then be taken out at some future time
the spring is transporting the energy through time
in fact how do we measure time? with clocks. What's a clock? It's an oscillator. The energized spring is the clock. When system energy is zero, what is time even? There's no baseline against which to measure change when nothing is changing
Then, if your dynamical system is symmetrical under these transformations you can construct a quantity whose derivative wrt s is zero.
Isn't the model attempting to conserve information during training? And isn't information a physical quantity?
My first thought on reading that was that if there was it would be interesting to see if there was some way it tied into the concept of us living in a simulation, i.e. we're all living in a complex ML network simulation.
It's basically the same way you could use light to solve a maze, just flood the exit with light and walk in the direction which is brightest. Works better for mirror mazes.
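A toy version of the light trick, in case it helps: "flood" a grid maze from the exit with a breadth-first search, then walk from the start toward whichever neighbor is "brightest" (closest to the exit). The maze layout and the function names are made up for illustration, and this is only a sketch of the analogy, not whatever method the parent was describing.

```python
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def flood_from_exit(grid, exit_cell):
    """BFS outward from the exit; a smaller distance means 'brighter'."""
    rows, cols = len(grid), len(grid[0])
    dist = {exit_cell: 0}
    queue = deque([exit_cell])
    while queue:
        r, c = queue.popleft()
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def walk_to_exit(grid, start, exit_cell):
    """Follow the brightness gradient; returns None if the exit is unreachable."""
    dist = flood_from_exit(grid, exit_cell)
    if start not in dist:
        return None
    path = [start]
    while path[-1] != exit_cell:
        r, c = path[-1]
        lit = [(r + dr, c + dc) for dr, dc in STEPS if (r + dr, c + dc) in dist]
        path.append(min(lit, key=dist.get))  # step to the brightest neighbor
    return path

# Hypothetical maze: '#' is a wall, S is at (0, 0), E is at (2, 3).
MAZE = [
    "S.#.",
    ".##.",
    "...E",
]
path = walk_to_exit(MAZE, (0, 0), (2, 3))
```

The walk never needs to backtrack, because every lit cell other than the exit has at least one neighbor that is exactly one step brighter.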
https://www.microsoft.com/en-us/research/uploads/prod/2023/0...
Physics contains a lot of 'machinery' for solving for low energy states.
I think the following sentence in the article is wrong: "Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum."
I think the correct statement is: "Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be translation, rotation, and time shifting."
I think translation leads to conservation of momentum, rotation leads to conservation of angular momentum, and time shifting leads to conservation of energy (potential+kinetic). It's been a few decades since I saw the proof, so I might be wrong.
In that sentence I was only talking about the translations and rotations of the plane as a group of invariances for the action of the two-body problem. This group is generated by one-parameter subgroups producing vertical translation, horizontal translation, and rotation about a particular point. Those are the "three degrees of freedom" I was counting.
You're right about the correspondence from symmetries to conservation laws in general.
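If it helps to see the rotation case numerically, here is a minimal sketch: a particle in a rotationally symmetric (central) potential, integrated with velocity Verlet. The initial conditions, the unit choice GM = 1, and the function names are my own assumptions. Angular momentum stays fixed over time to within rounding error, while energy is only approximately conserved by the integrator.

```python
import math

def accel(x, y):
    """Inverse-square attraction toward the origin (GM = 1, an assumed unit choice)."""
    r3 = (x * x + y * y) ** 1.5
    return -x / r3, -y / r3

def simulate(steps=10_000, dt=1e-3):
    # Hypothetical initial conditions giving a bound, mildly eccentric orbit.
    x, y, vx, vy = 1.0, 0.0, 0.0, 0.8
    ax, ay = accel(x, y)
    history = []
    for _ in range(steps):
        # Velocity Verlet: half kick, drift, half kick.
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        ang_mom = x * vy - y * vx  # z-component of r x v (unit mass)
        energy = 0.5 * (vx * vx + vy * vy) - 1.0 / math.hypot(x, y)
        history.append((ang_mom, energy))
    return history

history = simulate()
ang_drift = max(abs(l - history[0][0]) for l, _ in history)
energy_drift = max(abs(e - history[0][1]) for _, e in history)
```

The reason the angular momentum is so clean here is that each kick is parallel to the position vector and each drift is parallel to the velocity, so neither changes r x v. That is the discrete shadow of the rotational symmetry.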
More generically, in 3 dimensions, invariance under 3 translations, 3 rotations, and 1 time shift would provide conservation of 3 momenta, 3 angular momenta, and 1 energy.
Besides figuring out a good way of dealing with reference frames, the only trick I'd pass on is to use CSS variables to change colors and sizes (line widths, arrow dimensions, etc.) interactively. It definitely helps to tighten the feedback loop on those decisions.
I've been using Emmy from the ClojureScript ecosystem, which works pretty well, but has a few quirks.
Softmax gives rise to translation symmetry, batch normalization to scale symmetry, homogeneous activations to rescale symmetry. Each of those induces its own learning invariants through training.
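The softmax case is easy to check by hand. Here is a small sketch, assuming a toy model where the only trainable parameters are a bias vector added to fixed logits (the data and names are made up). Shifting every bias by the same constant leaves the loss unchanged, so the gradient components sum to zero and sum(bias) never moves under plain gradient descent.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def grad_bias(bias, features, labels):
    """Gradient of the mean cross-entropy w.r.t. a bias added to the logits."""
    grad = [0.0] * len(bias)
    for a, y in zip(features, labels):
        p = softmax([ai + bi for ai, bi in zip(a, bias)])
        for k in range(len(bias)):
            grad[k] += (p[k] - (1.0 if k == y else 0.0)) / len(labels)
    return grad

def train(steps=200, lr=0.5):
    # Made-up pre-bias logits and labels, purely for illustration.
    features = [[2.0, -1.0, 0.5], [0.1, 0.3, -0.7], [-1.2, 0.4, 1.1]]
    labels = [0, 1, 2]
    bias = [0.3, -0.2, 0.5]
    sums = [sum(bias)]
    for _ in range(steps):
        g = grad_bias(bias, features, labels)
        bias = [b - lr * gk for b, gk in zip(bias, g)]
        sums.append(sum(bias))  # candidate conserved quantity
    return bias, sums

bias, sums = train()
max_drift = max(abs(s - sums[0]) for s in sums)
```

Because this particular invariant is linear in the parameters, it survives discrete steps exactly (up to rounding). The scale-symmetry invariants are quadratic, so as far as I understand they only hold exactly in the continuous-time gradient flow limit.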
By the way, maybe I'm being too much of a math snob, but I'd argue Kunin's result is only superficially similar to Noether's theorem. (In the paper they call it a "striking similarity"!) Geometrically, what they're saying is that, if a loss function is invariant under a non-zero vector field, then the trajectory of gradient descent will be tangent to the codimension-1 distribution of vectors perpendicular to the vector field. If that distribution is integrable (in the sense of the Frobenius theorem), then any of its integrals is conserved under gradient descent. That's a very different geometric picture from Noether's theorem. For example, Noether's theorem gives a direct mapping from invariances to conserved quantities, whereas they need a special integrability condition to hold. But yes, it is a nice result, certainly worth keeping in mind when thinking about your gradient flows. :)
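To spell that picture out in symbols (my notation, not the paper's): write \psi_\alpha for the one-parameter family of transformations, v for the vector field generating it, and Q for a candidate conserved quantity whose gradient is a multiple f of v.

```latex
\begin{aligned}
L(\psi_\alpha(\theta)) = L(\theta)\ \ \forall \alpha
  &\;\Longrightarrow\;
  \langle \nabla L(\theta),\, v(\theta) \rangle = 0,
  \qquad v(\theta) = \left.\frac{d}{d\alpha}\psi_\alpha(\theta)\right|_{\alpha=0}, \\
\dot\theta = -\nabla L(\theta),\quad \nabla Q = f\,v
  &\;\Longrightarrow\;
  \frac{dQ}{dt} = \langle \nabla Q,\, \dot\theta \rangle
  = -f\,\langle v,\, \nabla L \rangle = 0 .
\end{aligned}
```

The existence of a Q with \nabla Q = f v is exactly the integrability condition. For scale symmetry, \psi_\alpha(\theta) = e^{\alpha}\theta gives v(\theta) = \theta, and Q = \tfrac12\lVert\theta\rVert^2 works with f = 1.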
By the way, you might be interested in [1], which also studies gradient descent from the point of view of mechanics and seems to really use Noether-like results.
[1] Tanaka, Hidenori, and Daniel Kunin. “Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks.” In Advances in Neural Information Processing Systems, 34:25646–60. Curran Associates, Inc., 2021. https://papers.nips.cc/paper/2021/hash/d76d8deea9c19cc9aaf22....
The main problem I see with it is that most of the time you don't want the optimum for your objective function, as that frequently results in overfitting. This leads to things like early stopping being typical.
And yes, that's quite true. When parameter gradients don't quite vanish, then the equation
<g_x, d x / d eps> = <g_y, d y / d eps>
becomes
<g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d eps>
where g_theta is the gradient with respect to theta.
In defense of my hypothesis that interesting approximate conservation laws exist in practice, I'd argue that maybe parameter gradients at early stopping are small enough that the last term is pretty small compared to the first two.
On the other hand, stepping back, the condition that our network parameters are approximately stationary for a loss function feels pretty... shallow. My impression of deep learning is that an optimized model _cannot_ be understood as just "some solution to an optimization problem," but is more like a sample from a Boltzmann distribution which happens to concentrate a lot of its probability mass around _certain_ minimizers of an energy. So, if we can prove something that is true for neural networks simply because they're "near stationary points", we probably aren't saying anything very fundamental about deep learning.
It also makes me think about the surprising success of highly quantized models (see, for example, the recent paper on ternary networks, where the only valid numbers are 0, 1, and -1.)
Artificial Neural Networks were originally conceived as an approximation to an analog, continuous system, where floating-point numbers are stand-ins for reals. This is related to the ability to back-prop because real functions are generally differentiable. But if it turns out that we can closely approximate the same behavior with a small, discrete set of integers, it makes the whole edifice feel more like some sort of Cellular Automaton with reversible rules, rather than a set of functions over the reals.
Finally (sorry for the rabbit-holing) - how does this relate to our brains? Note that real neurons "fire" -- that is, they generate a discrete event when their internal configuration reaches a triggering state.
Lots to chew on...
The key insight is that a (finite) discrete, reversible system will always eventually cycle back to its original state. This fact has very interesting follow-on implications for the concept of entropy and the Second Law. If it is guaranteed that a system will return to a prior state, how can it also be true that entropy (disorder) always increases?
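A quick way to convince yourself of the recurrence claim: take any bijection on a finite set and iterate it. The sketch below uses Arnold's cat map on an n x n grid as the reversible rule; that choice, the grid size, and the function names are mine, picked only because the map is a well-known invertible example.

```python
def cat_map(state, n):
    """Arnold's cat map mod n. The matrix [[2, 1], [1, 1]] has determinant 1,
    so the map is a bijection on the n x n grid, i.e. it is reversible."""
    x, y = state
    return ((2 * x + y) % n, (x + y) % n)

def cat_map_inverse(state, n):
    """Exact inverse of cat_map, using the inverse matrix [[1, -1], [-1, 2]]."""
    x, y = state
    return ((x - y) % n, (-x + 2 * y) % n)

def recurrence_time(step, start, max_steps=1_000_000):
    """Number of applications of `step` needed to return to `start`."""
    state, count = step(start), 1
    while state != start:
        state, count = step(state), count + 1
        if count > max_steps:
            raise RuntimeError("no return within max_steps")
    return count

n = 101                # hypothetical grid size
start = (3, 7)         # hypothetical initial state
period = recurrence_time(lambda s: cat_map(s, n), start)
```

The loop has to terminate: a bijection on a finite set splits the states into disjoint cycles, so the orbit of `start` cannot merge into some other loop without passing through `start` again. The return time is bounded by the number of states, which for realistic systems is astronomically large.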
cgadski: what did you use to make it?
In the beginning, I used kognise's water.css [1], so most of the smart decisions (background/text color, margins, line spacing I think) probably come from there. Since then it's been some amount of little adjustments. The font is by Jean François Porchez, called Le Monde Livre Classic [2].
I draft in Obsidian [3] and build the site with a couple python scripts and KaTeX.
[1] https://watercss.kognise.dev/
[2] https://typofonderie.com/fr/fonts/le-monde-livre-classic
<!-- this blog is proudly generated by, like, GNU make -->
Another approach might be to take an information theoretic view with the infinite-width finite-entropy nets.
How do you insert rules that aren't learned into what weights are learned?
https://en.wikipedia.org/wiki/Physics-informed_neural_networ... https://www.youtube.com/watch?v=JoFW2uSd3Uo
Like in ANN backprop, the gradient descent algorithm can use a momentum to overcome getting stuck in local minima. This was heuristically physical when I learned it.. perhaps it's been developed since. Maybe only allowing a "real" energy to the momentum would then align it with an ability to do work calculation. Might also help with ensemble/monte carlo methods, to maintain an energy account across the ensemble.
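One way to make the "real energy" idea concrete is to read momentum as a discretized damped particle and track loss plus kinetic energy as a mechanical energy. This is only a sketch under my own assumptions (a 1-D quadratic loss, a hypothetical friction coefficient and step size), not an established optimizer.

```python
def heavy_ball(grad, loss, x0, dt=0.01, friction=0.5, steps=5000):
    """Damped particle rolling on the loss surface (semi-implicit Euler).

    In terms of u = dt * v this is the usual momentum update
    u <- beta * u - lr * grad(x), x <- x + u,
    with beta = 1 - friction * dt and lr = dt**2.
    """
    x, v = x0, 0.0
    energies = [loss(x)]  # the particle starts at rest, so energy = loss
    for _ in range(steps):
        v = (1.0 - friction * dt) * v - dt * grad(x)
        x = x + dt * v
        energies.append(loss(x) + 0.5 * v * v)  # potential + kinetic
    return x, energies

def quadratic_loss(x):
    return 0.5 * x * x

def quadratic_grad(x):
    return x

x_final, energies = heavy_ball(quadratic_grad, quadratic_loss, x0=2.0)
```

With friction set to zero the energy is approximately conserved and the particle never settles. The friction term is what drains the energy account, so the total dissipated is just energies[0] - energies[-1]. Whether that bookkeeping is useful across an ensemble is a separate question.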
A neural network is a type of machine that solves nonlinear optimization problems, and the principle of least action is also a nonlinear optimization problem that nature solves by some kind of natural law.
This is the one thing that ChatGPT mentioned which surprised me the most and which I had not previously considered.
> Eigenvalues of the Hamiltonian in quantum mechanics correspond to energy states. In neural networks, the eigenvalues (principal components) of certain matrices, like the weight matrices in certain layers, can provide information about the dominant features or patterns. The notion of states or dominant features might be loosely analogous between the two domains.
I am skeptical that any conserved quantity besides energy would have a corresponding conserved quantity in ML, and the Reynolds operator will likely be relevant for understanding any correspondence like this.
iirc the Reynolds operator plays an important role in Noether's theorem, and it involves an averaging operation similar to what is described in the linked article.
Abstract: Progress in machine learning (ML) stems from a combination of data availability, computational resources, and an appropriate encoding of inductive biases. Useful biases often exploit symmetries in the prediction problem, such as convolutional networks relying on translation equivariance. Automatically discovering these useful symmetries holds the potential to greatly improve the performance of ML systems, but still remains a challenge. In this work, we focus on sequential prediction problems and take inspiration from Noether's theorem to reduce the problem of finding inductive biases to meta-learning useful conserved quantities. We propose Noether Networks: a new type of architecture where a meta-learned conservation loss is optimized inside the prediction function. We show, theoretically and experimentally, that Noether Networks improve prediction quality, providing a general framework for discovering inductive biases in sequential problems.