Notes on a New Philosophy of Empirical Science (2011)

(arxiv.org)

38 points · boberoni · 2y ago · 30 comments

15 comments · 3 top-level

mjburgess · 2y ago · 6 in thread

Scientific models are causal. Compression is a condition on association.

I don't really know what more needs to be said here.

tgv · 2y ago

It's an idea about how to judge models. A model's predictive capabilities are modelled (indeed) as compression, like "how many bits do you need to set up your model and correct its output".

It might be nice to compare fairly complete models on a well-defined domain, but I can't see it as a general guiding principle. It would get theorizing stuck in a local minimum.
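To make the bit-counting concrete, here is a minimal sketch of a two-part code length: the bits needed to state a model plus the bits needed to encode the data given that model. It is only an illustration under stated assumptions, not anything taken from the book. The data, the 8-bit parameter cost, and the function name `two_part_code_length` are all hypothetical.

```python
import math

def two_part_code_length(bits, p_one, param_bits):
    """Total bits = bits to state the model + bits to encode the data under it."""
    data_bits = sum(-math.log2(p_one if b else 1.0 - p_one) for b in bits)
    return param_bits + data_bits

# Hypothetical data: 100 binary observations, 90 of them ones.
data = [1] * 90 + [0] * 10

# Model A: a fair coin, which has no parameter to transmit.
fair = two_part_code_length(data, 0.5, param_bits=0)

# Model B: a biased coin; we ASSUME 8 bits are spent stating the bias p = 0.9.
biased = two_part_code_length(data, 0.9, param_bits=8)

# With only 4 observations the parameter cost is not repaid.
small = [1, 1, 1, 0]
small_fair = two_part_code_length(small, 0.5, param_bits=0)
small_biased = two_part_code_length(small, 0.75, param_bits=8)

print(f"100 samples: fair={fair:.1f} bits, biased={biased:.1f} bits")
print(f"  4 samples: fair={small_fair:.1f} bits, biased={small_biased:.1f} bits")
```

On the larger sample the extra parameter pays for itself; on the tiny one it does not, which is the sense in which the score penalizes model setup as well as prediction error.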

mjburgess · 2y ago

Science isn't interested in this kind of prediction. That's just engineering.

Causal models give counter-factual predictions for existence claims (eg., that a planet exists because the orbit of two other planets doesn't follow the causal model).

Science, in most cases, prefers models with poor "engineering predictions" (ie., point estimates of observables) because they have vastly superior explanatory power.

In most cases it would be a catastrophe for a scientific model to make good estimates of observables, because we know a priori that observables aren't fully determined by the model (eg., just consider that F = GMm/r^2 basically didn't apply to most observations of the solar system when it was formulated by Newton; nor really does it much today).

Explanatory power is not a property of compression, nor association, nor "prediction" in this engineering sense. Consider here that a lossless model of the solar system would never have yielded Newton's law of gravitation (since most of the objects in the solar system are unknown).

This entire project is just, "what if science were like ML?" -- an interesting question only because of how vast the gap is; and how absurd the suggestion.

Loquebantur · 2y ago

Scientific models clearly represent a compression of measurement data?

Scientific models aren't necessarily "causal" to begin with. They are functions that give predictions about measurements. It is these predictions that are tested against data, not the function's confabulated "causal" justifications.

People learn from data not adhering to predictions. This difference from model functions can be compressed, if not random. This compression then might reflect in the form of modularization in justifications, which again is interpreted as causal relationships.

mjburgess · 2y ago

> Scientific models clearly represent a compression of measurement data?

Nope! No theory of heat is a compression of thermometer readings; no theory of gravity, of orbits; no theory of atoms, of spectra. No theories compress measurements!

Such a thing is pure superstition. Heat is not the motion of thermometer fluid.

> not the function's confabulated "causal" justifications.

Nope!

We construct an experiment by counter-factual analysis of its causal semantics; we do not simply test whether observable quantities match prior data. Arbitrary associative models match arbitrary amounts of prior data. This is the opposite of science.

We test scientific models by creating new experiments; it isn't "the data" which matters here, but that the experiment is designed to test the causal assumptions of the model.

If the experiment doesn't: control causes, identify novel measures with potential causes, etc. then any data collected is useless.

This is why you need, you know: randomised controlled trials, microscopes, satellites, ... etc.

"Data" in the ML sense does not matter. This is pure superstitious pseudoscience. Science is a process of creating data under experimental conditions designed to be counter-factual tests of theories. Science is about the data generating process (reality), not our measurements of it.

gwern · 2y ago

A good compressor also needs to compress data from experiments using randomization. Causal data is also data.

I don't really know what more needs to be said there.

mjburgess · 2y ago

There is no such thing as "causal data". A causal model is an interpretation of data.

Eg., to say "increasingly energetic motion of molecules leads to increasingly hot water" is an interpretation of a very wide class of equations.

It posits the existence of molecules (a scientific discovery), water, energy, motion, heat, etc. and it provides a means of creating equations and measures tied to each of these terms.

Science is the production of those interpretations. There is no bare "data" which tells you how reality is.

Science isn't "magic trick engineering", it's Explanation. "Compressing tables of data" is something they do in the pseudosciences -- as you've seen, none of it is reproducible: "IQ" is just a compression of survey quizzes. Do you really think it exists?

Do you think you can just compress survey results and claim to have an explanatory model of the most complex system in the entire universe? (a person, society, and their joint interaction) etc.

ML is a temple to pseudoscience, permitted only because the situations it's used in are engineered and low-risk. The whole thing is a dumb trick. You cannot build models of the world from associations in data: that is called superstition.

d_burfoot · 2y ago · 4 in thread

(author here)

Recent events in ML make me feel about 2/3 vindicated of the claims made in the book. Based on the book's ideas, I began training LLMs based on large corpora in the early 2010s, well before it was "cool". I figured out that LLMs could scale to giga-parameter complexity without overfitting, and that the concepts developed under this training would be reusable for other tasks (I called this the Reusability Hypothesis, to emphasize that it was deeply non-obvious; other terms like "self-supervision" are more common in the literature).

I missed on two related points. Technically, I did not think DNNs would scale up forever; I thought that they would hit some barrier, and the engineers would not be able to debug the problem because of the black-box nature of DNNs. Philosophically, I wanted this work to resemble classical empirical science in that the humans involved should achieve a high degree of knowledge relating to the material. In the case of LLMs, I wanted researchers (including myself) to develop understanding of key concepts in linguistics such as syntax, semantics, morphology, etc.

This style of research actually worked! I built a statistical parser without using any labelled training data! And I did learn a ton about syntax by building these models. One nice insight was that the PCFG is a bad formalism for grammar; I wrote about this here:

https://ozoraresearch.wordpress.com/2017/03/17/chuckling-a-b...

Obviously, I fell into the "Bitter Lesson" trap described by Rich Sutton. The DNNs can scale up, and can improve their understanding much faster than a group of human researchers can.

One funny memory is that in 2013 I went to CVPR and told a bunch of CV researchers that they should give up on modeling P(L|I) - label given image - and just model P(I) instead - the probability of an image. They weren't too happy to hear that. I'm not sure that approach has yet taken over the CV world, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.

In hindsight, I regret the emphasis I placed on the keyword "compression". To me, compression is a nice and rigorous way to compare models, with a built-in Occam's principle. But "compression" means many different things to different people. The important idea is that we're modeling very large unlabelled datasets, using the most natural objective metric in this setting.
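As a rough illustration of compression as a comparison metric on unlabelled data, the sketch below scores two character models by the bits they need to encode the same text. It uses adaptive (prequential) coding, so no separate model description has to be sent. This is a toy under stated assumptions, not the method used in the book: the corpus is made up, the function name `adaptive_code_length` is hypothetical, and the alphabet is assumed to be known to both encoder and decoder.

```python
import math
from collections import defaultdict

def adaptive_code_length(text, order, alphabet_size=None):
    """Bits to encode `text` with an adaptive order-`order` character model.

    Each character is predicted from counts seen so far (add-one smoothing),
    then the counts are updated, so a decoder could mirror the same process.
    """
    if alphabet_size is None:
        # ASSUMPTION: the set of possible characters is shared in advance.
        alphabet_size = len(set(text))
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]  # preceding `order` characters
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        bits += -math.log2(p)
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits

sample = "the cat sat on the mat. " * 40  # hypothetical toy corpus
order0 = adaptive_code_length(sample, order=0)
order1 = adaptive_code_length(sample, order=1)
print(f"order-0: {order0 / len(sample):.2f} bits/char")
print(f"order-1: {order1 / len(sample):.2f} bits/char")
```

The model with the shorter total code length is preferred. A richer model has to relearn its statistics in every context, so it only wins when the data contains enough structure to repay that cost.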

edit: I used the wrong name in reference to the Bitter Lesson idea, here is the essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

gwern · 2y ago

> I'm not sure that approach has yet taken over the CV world, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.

Yeah, iGPT was the writing on the wall there, but CLIP gave cheap non-generative modeling a new lease on life. Contrastive learning sucks in many ways, but it's substantially cheaper: compare the cost of training a CLIP to the cost of training a DALL-E 1. (CLIP itself was originally generative, doing the obvious generation of caption & image separately, but they found it was like 8x cheaper to go full contrastive.) So, everyone flocked into that to avoid paying the Bitter Lesson. However, people increasingly run into the limits of contrastive learning (eg. about half the examples you'll see of DALL-E 2 or Midjourney or SD failing on a prompt are probably due solely to the use of contrastive embeddings) and compute/resources keep piling up, so we'll get to generative-everything in images eventually.

boberoni (OP) · 2y ago

Hi, Dan! Just want to say thanks for your work on this topic. I really loved your book so I wanted to share it with the HN community. Supervised learning always seemed to rub me the wrong way, both when I learned it in college and when I saw it used in practice in industry.

I was led to your book by recent research in self-supervised learning by LeCun et al [1] [2]. Since reading your book, I have been digging into the work by Rissanen [3], Grunwald [4], and Hinton [5], among many others. I'm trying to build up my knowledge so that I can apply it to TinyML [6] (e.g. running a neural network on a microcontroller with 256kb of RAM). In a TinyML context, power usage must be low and labeled data is non-existent. I have a vague intuition of how MDL can be used to guide the engineering constraints of TinyML, and I'm hoping to formalize this in my research.

Dan, if you know of any papers or research groups that would be related to this area, I'd love to read more about it.

[1] https://ai.meta.com/blog/self-supervised-learning-the-dark-m...

[2] https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/

[3] https://link.springer.com/book/10.1007/978-0-387-68812-1

[4] https://mitpress.mit.edu/9780262529631/the-minimum-descripti...

[5] https://www.researchgate.net/publication/5920308_To_recogniz...

[6] https://sites.google.com/g.harvard.edu/tinyml/home

d_burfoot · 2y ago

Hi Bob, thanks for the kind words and for sharing with HN. For TinyML, you need to go in almost the opposite direction of what my book suggests, since the model complexity limits are so strict! I think MDL should be very helpful, but make sure you understand the danger of "manual overfitting" that I described in the book. I would also encourage you to read Vapnik's book The Nature of Statistical Learning Theory, which shows the relation between MDL and VC theory. Feel free to reach out to me at firstname dot lastname at gmail.com, I'm always happy to chat about these ideas.

ed_westin · 2y ago

bitter lesson pdf can also be found here: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

guy98238710 · 2y ago · 2 in thread

Isn't noise in the data going to dominate the output size of lossless compression? Wouldn't linguistics and vision be better off with direct measurements of predictive strength?

d_burfoot · 2y ago

(author)

Noise certainly affects the compression rate. But you are not concerned with the absolute compression rate, you are only concerned with the relative rate achieved by two theories A and B. Both theories will be negatively impacted to the same degree by the noise, so the comparison still works to select which theory is better.
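A toy simulation can make the relative comparison concrete. The sketch below generates data from a hypothetical law y = 2x plus Gaussian noise, then measures roughly how many bits each of two theories needs to encode its residuals at a fixed precision. Everything here is an assumption for illustration: the "true" slope, the rival slope 1.5, the 0.01 precision, and the Gaussian residual code. The cost of stating each theory's slope and variance is ignored because it is the same for both.

```python
import math
import random

def residual_bits(residuals, precision=0.01):
    """Approximate bits to encode residuals with a Gaussian code.

    ASSUMPTION: residuals are quantized to `precision` and coded with a
    zero-mean Gaussian whose variance is the mean squared residual.
    """
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    per_sample = 0.5 * math.log2(2 * math.pi * math.e * var) - math.log2(precision)
    return n * per_sample

def compare(noise_std, rng):
    """Return (bits for theory A, bits for theory B) at one noise level."""
    xs = [i / 100 for i in range(1000)]
    ys = [2.0 * x + rng.gauss(0.0, noise_std) for x in xs]  # hypothetical true law
    bits_a = residual_bits([y - 2.0 * x for x, y in zip(xs, ys)])  # theory A: y = 2x
    bits_b = residual_bits([y - 1.5 * x for x, y in zip(xs, ys)])  # theory B: y = 1.5x
    return bits_a, bits_b

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {s: compare(s, rng) for s in (0.5, 2.0, 8.0)}
for s, (a, b) in results.items():
    print(f"noise={s}: A={a:.0f} bits, B={b:.0f} bits, margin={b - a:.0f}")
```

In this setup both code lengths grow as the noise grows, and theory A stays shorter at every noise level, so the ranking is preserved. The margin between the two does shrink with more noise, so a small difference between theories needs more data to show up.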

harperlee · 2y ago

…to the extent that the difference between theories dominates over the possible noise variability.
