(0) https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
(1) https://en.m.wikipedia.org/wiki/Normalized_compression_dista...
(2) for example: https://www.sciencedirect.com/science/article/pii/S037907381...
The better it gzips, the easier it is to read.
It's just one part of the score, I'm not certain what weight it has on it.
After all, take a dictionary. Lots of meaningful information there. But a lot less entropy than a random sequence of characters of the same length. That random sequence will have a lot more "information", like you said; its a real needle in the haystack, but its all haystack and no golden thread.
1. Gzip is not a suitable compressor for this use case, because it's limited to a 32KB window. So the input can only be correlated with the last 32KB of the reference texts.
2. You can save a great deal in computation by avoiding recompressing the reference texts over and over and over. Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression", which is a distinct capability from just streaming compression, which generally can only be continued once).
I would personally shill for using Zstandard [0] instead for this purpose. Although I should disclose my bias: I'm a developer of Zstd. A few salient facts:
1. Zstd supports very large windows (up to 128MB, or up to 2GB in long mode).
2. Zstd is much faster than zlib.
3. Zstd has well-developed support for dictionary-based compression.
4. Additionally, it has a dictionary trainer that can reduce a corpus of reference documents to a compact summary document that aims to capture as much of the content as possible of the reference corpus. [1]
5. It has (more than one) python binding available. [2][3]
[0] https://github.com/facebook/zstd
[1] https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40
Is it some kind of memoization?
While it's not the best-performing paradigm for text sequence tagging, it is intellectually intriguing as you say because of the parallel between the concepts "compression" and "understanding", even in the human brain. If we can't understand s.th., we need to memorize it; if we understand it, it doesn't need much space or cognitive load at all, basically a name that is well-linked to other concepts.
So feeding AI compressed data might allow it to be more efficient with its limited resources … I had never considered that, it’s very interesting idea
Auto encoder networks, or networks that have an auto encoder like structure (U-net) employ essentially compression internally in the model to extract latent features.
This basically suggests that generalization in a ML model is a function of compression efficiency. ML models memorizing data isn't actually an issue. It's memorizing data INEFFICIENTLY that a problem. Models which overfit have learned inefficient representations of the underlying relationships.
Normalized compression distance, for example, is a good start about this approach. I've used this to classify documents in categories or even to find out the author of a document (0)
It worked decently well, and surprisingly better than a lot of other, more standard approaches just because of the wild non-uniformity of human-generated content on the web.
that's not unlike using an autoencoder to score a document as an anomaly amongst the other labels?
I was wondering, what if someone would like to try a similar approach on images rather than text? What kind of image compression algorithms (if any) would be fit for use in this scenario?
Looking at the current state of the art (Transformer etc), computational cost certainly is not the main issue.