I don't really know what more needs to be said here.
It might be nice to compare fairly complete models on a well defined domain, but I can't see it as a general guiding principle. It would get theorizing stuck in a local minimum.
Causal models give counter-factual predictions for existence claims (eg., that a planet exists because the orbits of two other planets don't follow the causal model).
Science, in most cases, prefers models with poor "engineering predictions" (ie., point estimates of observables) because they have vastly superior explanatory power.
In most cases it would be a catastrophe for a scientific model to be making good estimates of observables, because we know a priori that observables aren't fully determined by the model (eg., just consider that F=GMm/r^2 basically didn't apply to most observations of the solar system when it was formulated by Newton; nor really does it much today).
Explanatory power is not a property of compression, nor association, nor "prediction" in this engineering sense. Consider here that a lossless model of the solar system would never have yielded Newton's law of gravitation (since most of the objects in the solar system are unknown).
This entire project is just, "what if science were like ML?" -- an interesting question only because of how vast the gap is, and how absurd the suggestion.
Scientific models aren't necessarily "causal" to begin with. They are functions that give predictions about measurements. It is these predictions that are tested against data, not the function's confabulated "causal" justifications.
People learn from data not adhering to predictions. The difference between the data and the model function's predictions can be compressed, if it is not random. This compression might then show up as modularization in the justifications, which in turn gets interpreted as causal relationships.
Nope! No theory of heat is a compression of thermometer readings; no theory of gravity, of orbits; no theory of atoms, of spectra. No theories compress measurements!
Such a thing is pure superstition. Heat is not the motion of thermometer fluid.
> not the function's confabulated "causal" justifications.
Nope!
We construct an experiment by counter-factual analysis of its causal semantics; we do not simply test whether observable quantities match prior data. Arbitrary associative models match arbitrary amounts of prior data. This is the opposite of science.
We test scientific models by creating new experiments; it isn't "the data" which matters here, but that the experiment is designed to test the causal assumptions of the model.
If the experiment doesn't: control causes, identify novel measures with potential causes, etc. then any data collected is useless.
This is why you need, you know: randomised controlled trials, microscopes, satellites, ... etc.
"Data" in the ML sense does not matter. This is pure superstitious pseudoscience. Science is a process of creating data under experimental conditions designed to be counter-factual tests of theories. Science is about the data generating process (reality), not our measurements of it.
I don't really know what more needs to be said there.
Eg., to say "increasingly energetic motion of molecules leads to increasingly hot water" is an interpretation of a very wide class of equations.
It posits the existence of molecules (a scientific discovery), water, energy, motion, heat, etc. and it provides a means of creating equations and measures tied to each of these terms.
Science is the production of those interpretations. There is no bare "data" which tells you how reality is.
Science isn't "magic trick engineering", it's Explanation. "Compressing tables of data" is something they do in the pseudosciences -- as you've seen, none of it is reproducible: "IQ" is just a compression of survey quizzes. Do you really think it exists?
Do you think you can just compress survey results and claim to have an explanatory model of the most complex system in the entire universe? (a person, society, and their joint interaction) etc.
ML is a temple to pseudoscience, permitted only because the situations it's used in are engineered and low-risk. The whole thing is a dumb trick. You cannot build models of the world from associations in data: that is called superstition.
Recent events in ML make me feel about 2/3 vindicated of the claims made in the book. Based on the book's ideas, I began training LLMs based on large corpora in the early 2010s, well before it was "cool". I figured out that LLMs could scale to giga-parameter complexity without overfitting, and that the concepts developed under this training would be reusable for other tasks (I called this the Reusability Hypothesis, to emphasize that it was deeply non-obvious; other terms like "self-supervision" are more common in the literature).
I missed on two related points. Technically, I did not think DNNs would scale up forever; I thought that they would hit some barrier, and the engineers would not be able to debug the problem because of the black-box nature of DNNs. Philosophically, I wanted this work to resemble classical empirical science in that the humans involved should achieve a high degree of knowledge relating to the material. In the case of LLMs, I wanted researchers (including myself) to develop understanding of key concepts in linguistics such as syntax, semantics, morphology, etc.
This style of research actually worked! I built a statistical parser without using any labelled training data! And I did learn a ton about syntax by building these models. One nice insight was that the PCFG is a bad formalism for grammar; I wrote about this here:
https://ozoraresearch.wordpress.com/2017/03/17/chuckling-a-b...
Obviously, I fell into the "Bitter Lesson" trap described by Rich Sutton. The DNNs can scale up, and can improve their understanding much faster than a group of human researchers can.
One funny memory is that in 2013 I went to CVPR and told a bunch of CV researchers that they should give up on modeling P(L|I) - label given image - and just model P(I) instead - the probability of an image. They weren't too happy to hear that. I'm not sure that approach has yet taken over the CV world, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.
In hindsight, I regret the emphasis I placed on the keyword "compression". To me, compression is a nice and rigorous way to compare models, with a built-in Occam's principle. But "compression" means many different things to different people. The important idea is that we're modeling very large unlabelled datasets, using the most natural objective metric in this setting.
edit: I used the wrong name in reference to the Bitter Lesson idea, here is the essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Yeah, iGPT was the writing on the wall there, but CLIP gave cheap non-generative modeling a new lease on life. Contrastive learning sucks in many ways, but it's substantially cheaper: compare the cost of training a CLIP to the cost of training a DALL-E 1. (CLIP itself was originally generative, doing the obvious generation of caption & image separately, but they found it was like 8x cheaper to go full contrastive.) So, everyone flocked into that to avoid paying the Bitter Lesson. However, people increasingly run into the limits of contrastive learning (eg. about half the examples you'll see of DALL-E 2 or Midjourney or SD failing on a prompt are probably due solely to the use of contrastive embeddings) and compute/resources keep piling up, so we'll get to generative-everything in images eventually.
I was led to your book by recent research in self-supervised learning by LeCun et al. [1] [2]. Since reading your book, I have been digging into the work by Rissanen [3], Grunwald [4], and Hinton [5], among many others. I'm trying to build up my knowledge so that I can apply it to TinyML [6] (e.g. running a neural network on a microcontroller with 256 KB of RAM). In a TinyML context, power usage must be low and labeled data is non-existent. I have a vague intuition of how MDL can be used to guide the engineering constraints of TinyML, and I'm hoping to formalize this in my research.
Dan, if you know of any papers or research groups that would be related to this area, I'd love to read more about it.
[1] https://ai.meta.com/blog/self-supervised-learning-the-dark-m...
[2] https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
[3] https://link.springer.com/book/10.1007/978-0-387-68812-1
[4] https://mitpress.mit.edu/9780262529631/the-minimum-descripti...
[5] https://www.researchgate.net/publication/5920308_To_recogniz...
Noise certainly affects the compression rate. But you are not concerned with the absolute compression rate, you are only concerned with the relative rate achieved by two theories A and B. Both theories will be negatively impacted to the same degree by the noise, so the comparison still works to select which theory is better.
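To make the relative-rate point concrete, here is a rough sketch rather than anything from the book. The setup is my own assumption: data drawn from a straight line plus Gaussian noise, theory A is "the value is constant", theory B is "the value is linear in x", and code length is approximated by a Gaussian two-part code. The function names, the quantization precision, and the (k/2) log2 n parameter cost are illustrative choices.

```python
import math
import random


def code_length_bits(residuals, n_params, precision=0.01):
    """Approximate two-part code length (in bits) for a fitted model.

    Assumption: residuals are encoded with a Gaussian whose variance is
    fitted to them, quantized to `precision`; each model parameter costs
    (1/2) * log2(n) bits. The shared variance parameter is not counted,
    since it adds the same cost to both theories.
    """
    n = len(residuals)
    sigma2 = max(sum(r * r for r in residuals) / n, 1e-12)
    # Negative log-likelihood at the maximum-likelihood variance, in nats.
    nll_nats = 0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    data_bits = nll_nats / math.log(2) - n * math.log2(precision)
    model_bits = 0.5 * n_params * math.log2(n)
    return data_bits + model_bits


def residuals_constant(xs, ys):
    """Theory A: y is a constant (its mean)."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]


def residuals_linear(xs, ys):
    """Theory B: y is a straight line in x (ordinary least squares)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def compare(noise_sd, n=500, seed=0):
    """Return (bits for theory A, bits for theory B) at a given noise level."""
    rng = random.Random(seed)
    xs = [i / n for i in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0, noise_sd) for x in xs]
    bits_a = code_length_bits(residuals_constant(xs, ys), n_params=1)
    bits_b = code_length_bits(residuals_linear(xs, ys), n_params=2)
    return bits_a, bits_b


if __name__ == "__main__":
    for sd in (0.1, 0.5, 1.0):
        a, b = compare(sd)
        print(f"noise_sd={sd}: A={a:.0f} bits, B={b:.0f} bits, B wins: {b < a}")
```

In this toy setup more noise makes both code lengths longer, but the linear theory still needs fewer bits at each noise level tried. The margin between the two does narrow as the noise grows, so "same degree" is best read as "the ranking survives", at least until the noise swamps the signal.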