Text Classification by Data Compression

(maxhalford.github.io)

105 pointsLemaxoxo5y ago41 comments


woliveirajr5y ago· 3 in thread

Am I missing something, or could the author have used Kolmogorov complexity (0), normalized compression distance (1), some of the research by Rudi Cilibrasi and Paul Vitányi, and some of the research that came after it (2)?

(0) https://en.m.wikipedia.org/wiki/Kolmogorov_complexity

(1) https://en.m.wikipedia.org/wiki/Normalized_compression_dista...

(2) for example: https://www.sciencedirect.com/science/article/pii/S037907381...

hakuseki5y ago

Definitely could not have used Kolmogorov complexity, as it is uncomputable.

makeworld5y ago

Is the article not doing normalized compression distance already? It just finds the closest match instead of the distance, but it seems to be the same algorithm.

woliveirajr5y ago

Yes, seems to be so, but never mentions it. Kind of reaching the same idea by not knowing previous research. And it might find some old pitfalls, e.g., limits of using zip to do it...
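For readers who have not met normalized compression distance, here is a minimal sketch using Python's standard `zlib`. The helper names `csize` and `ncd` and the toy inputs are illustrative assumptions, not taken from the article; compressed size stands in for the uncomputable Kolmogorov complexity, as discussed above.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs (hypothetical): two similar texts and one unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy cat " * 20
c = bytes(range(256)) * 4

print(f"ncd(a, b) = {ncd(a, b):.2f}")  # similar texts: smaller distance
print(f"ncd(a, c) = {ncd(a, c):.2f}")  # unrelated data: distance close to 1
```

Picking the class with the smallest distance gives a nearest-neighbour classifier, which is why the article's "smallest size increase" rule looks so close to this.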

donatj5y ago· 3 in thread

I work for an educational company, and a proprietary metric we license includes how well the text gzips as one part of its scoring system for reading difficulty.

The better it gzips, the easier it is to read.

It's just one part of the score, I'm not certain what weight it has on it.

abriosi5y ago

Assuming the text is well written, from the standpoint of information theory, the better the text gzips, the less information it contains.

jonnycomputer5y ago

Ah, well, this kind of information--entropy--isn't really the same thing as meaningful information (semantic information).

After all, take a dictionary. Lots of meaningful information there. But a lot less entropy than a random sequence of characters of the same length. That random sequence will have a lot more "information", like you said; it's a real needle in the haystack, but it's all haystack and no golden thread.
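The gap between entropy and meaning is easy to see numerically. A small sketch, assuming a made-up vocabulary and Python's standard `zlib` (the `ratio` helper is a hypothetical name):

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Dictionary-like text: meaningful to a reader, but built from a small vocabulary.
words = ["apple", "banana", "cherry", "noun", "verb", "a fruit", "see also"]
text = " ".join(rng.choice(words) for _ in range(2000)).encode()

# Random bytes of the same length: high entropy, no meaning at all.
noise = bytes(rng.randrange(256) for _ in range(len(text)))

print(f"text ratio:  {ratio(text):.2f}")
print(f"noise ratio: {ratio(noise):.2f}")
```

The noise barely compresses while the word list shrinks a lot, even though only the word list says anything to a human reader.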

abriosi5y ago

We could also argue that an understandable text has a high degree of redundancy, e.g. explaining the same concept with different perspectives


felixhandte5y ago· 2 in thread

Two points worth noting:

1. Gzip is not a suitable compressor for this use case, because it's limited to a 32KB window. So the input can only be correlated with the last 32KB of the reference texts.

2. You can save a great deal in computation by avoiding recompressing the reference texts over and over and over. Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression", which is a distinct capability from just streaming compression, which generally can only be continued once).

I would personally shill for using Zstandard [0] instead for this purpose. Although I should disclose my bias: I'm a developer of Zstd. A few salient facts:

1. Zstd supports very large windows (up to 128MB, or up to 2GB in long mode).

2. Zstd is much faster than zlib.

3. Zstd has well-developed support for dictionary-based compression.

4. Additionally, it has a dictionary trainer that can reduce a corpus of reference documents to a compact summary document that aims to capture as much of the content as possible of the reference corpus. [1]

5. It has (more than one) python binding available. [2][3]

[0] https://github.com/facebook/zstd

[1] https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40

[2] https://pypi.org/project/zstandard

[3] https://pypi.org/project/zstd
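To make the "dictionary-based compression" idea concrete without guessing at any Zstandard binding's API, here is a rough sketch with the standard-library `zlib`, whose `compressobj` accepts a preset dictionary through `zdict`. The reference texts and the names `size_with_dict` and `classify` are hypothetical. Note that zlib keeps the 32KB window criticized in point 1, so this only illustrates the mechanism on tiny inputs.

```python
import zlib


def size_with_dict(doc: bytes, zdict: bytes) -> int:
    """Compressed size of doc when the compressor is primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(doc) + comp.flush())


# Hypothetical toy reference texts, one per class.
references = {
    "sports": b"the team won the match after scoring a late goal in the second half",
    "cooking": b"whisk the eggs, add flour and butter, then bake in a preheated oven",
}


def classify(doc: bytes) -> str:
    """Pick the class whose dictionary makes the document compress smallest."""
    return min(references, key=lambda label: size_with_dict(doc, references[label]))


print(classify(b"add butter and flour, then whisk the eggs"))
print(classify(b"a late goal won the match for the team"))
```

Each classification only compresses the new document, never the reference text, which is where the computational saving comes from.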

19965y ago

> Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression"

Is it some kind of memoization?

samus5y ago

Another toplevel comment claims it is relevant for the use case where you stuff the whole corpus into a single stream. When you want to add new data, you don't want to start over compressing everything.

https://news.ycombinator.com/item?id=27441474

jll295y ago· 2 in thread

The University of Waikato, New Zealand has had a lot of research going on to use compression for named entity tagging (name, location, date, person, ...) etc.

While it's not the best-performing paradigm for text sequence tagging, it is intellectually intriguing, as you say, because of the parallel between the concepts "compression" and "understanding", even in the human brain. If we can't understand something, we need to memorize it; if we understand it, it doesn't need much space or cognitive load at all, basically a name that is well-linked to other concepts.

malux855y ago

Yeah, it's interesting. This got me thinking: lossless compression is just removing redundancy, right? It doesn't introduce any ambiguity in the data (?)

So feeding AI compressed data might allow it to be more efficient with its limited resources … I had never considered that, it's a very interesting idea.

teruakohatu5y ago

You are correct, compression is essentially extracting latent features in the data and discarding the rest.

Autoencoder networks, or networks that have an autoencoder-like structure (e.g. U-Net), essentially employ compression internally in the model to extract latent features.

lovasoa5y ago· 2 in thread

You don't have to recompress the whole corpus to add a single document to it. All the compression algorithms mentioned here work in a streaming fashion. You could "just" save the internal state of the algorithm after compressing the training data, and then reuse that state for each classification task.

spullara5y ago

Was going to come here to say that. Played around with this a bit for compressing small fields using a learned dictionary:

https://github.com/spullara/corpuscompression

LemaxoxoOP5y ago

I suspected this. However, I wasn't able to grok the documentation well enough, and I wasn't able to find a convincing example. It seems to me that these Python compressors get "frozen" and can't be used to compress further data.
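For what it's worth, compressor objects in the standard-library `zlib` expose a `copy()` method, which looks like one way to get the "save the state and reuse it" behaviour described in this thread. A minimal sketch, assuming a hypothetical `PrimedCompressor` wrapper and toy data; this is not the article's code.

```python
import zlib


class PrimedCompressor:
    """Compress a reference corpus once, then measure documents against copies of that state."""

    def __init__(self, corpus: bytes) -> None:
        self._comp = zlib.compressobj(level=9)
        # Z_SYNC_FLUSH emits everything buffered so far without ending the stream,
        # so head_size is the cost of the corpus alone.
        head = self._comp.compress(corpus) + self._comp.flush(zlib.Z_SYNC_FLUSH)
        self.head_size = len(head)

    def extra_bytes(self, doc: bytes) -> int:
        """Bytes added by appending doc; the corpus itself is not recompressed."""
        clone = self._comp.copy()  # the primed state stays untouched and reusable
        return len(clone.compress(doc) + clone.flush())


corpus = b"whisk the eggs, add flour and butter, then bake in a preheated oven. " * 50
primed = PrimedCompressor(corpus)

print(primed.extra_bytes(b"add flour and butter"))    # related text
print(primed.extra_bytes(b"quarterly revenue grew"))  # unrelated text
print(primed.extra_bytes(b"add flour and butter"))    # same result again: state was not consumed
```

The object only becomes unusable after a final `flush()`, so copying before finishing each document sidesteps the "frozen" behaviour.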

Xcelerate5y ago· 2 in thread

The theory behind this kind of approach is actually very deep. If we had a "perfect compressor" that returned the shortest halting prefix-free Turing machine that, when run, output the original dataset, this would essentially be the best classifier we could create. But as the author notes, there are some heavy computational penalties to overcome for this, even to an approximation.

thunderbird1205y ago

Yeah it's an interesting concept. In the ML community recently there has been some informal chatter along similar lines. Essentially, you can prove that by modeling the probability of every element in a dataset given every other set of elements, what you are learning is the core underlying structure which defines the data you are modeling. A perfect model of this structure represents the most efficient possible representation of the information. This is kind of like how if you want to represent an arbitrarily long "game" in Conway's Game of Life you only need to give the starting position because we know exactly how the game state will look at every step because we know the rules of the game.

This basically suggests that generalization in an ML model is a function of compression efficiency. ML models memorizing data isn't actually an issue. It's memorizing data INEFFICIENTLY that's a problem. Models which overfit have learned inefficient representations of the underlying relationships.

woliveirajr5y ago

Kolmogorov complexity and the proofs of its incomputability are there to explain it.

sean_pedersen5y ago· 2 in thread

Cool idea! Shouldn't this also work by concatenating the single document (the one you want to classify) with the compressed version of the concatenated class corpus (saving compute time)?

woliveirajr5y ago

There is some "old" research about it...

Normalized compression distance, for example, is a good starting point for this approach. I've used it to classify documents into categories and even to find out the author of a document (0)

(0)https://pubmed.ncbi.nlm.nih.gov/23597746/

LemaxoxoOP5y ago

I think that is what is being suggested in the other comment. One would have to try! My instincts tell me the results would not be identical.

thomasluce5y ago· 2 in thread

I worked for an internet scraping/statistics gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically use LZW-type compression with entropy information only trained on known addresses, and then compress a document, looking for the section of the document with the highest compression ratio.

It worked decently well, and surprisingly better than a lot of other, more standard approaches just because of the wild non-uniformity of human-generated content on the web.

ta9885y ago

Does that mean you were doing an LZW compression but with a fixed table?

thomasluce5y ago

Yes, exactly. We pre-built the table with a ton of hand-picked mailing addresses copy-pasted out of a bunch of free-text and then just kept using that one.
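A rough analogue of this can be sketched with a zlib preset dictionary standing in for the fixed LZW table; that substitution is mine, not what the commenter ran. The addresses, the page text, and the names `window_cost` and `most_address_like` are all hypothetical.

```python
import zlib

# Hypothetical hand-picked examples standing in for the pre-built table.
KNOWN_ADDRESSES = (
    b"123 Main Street, Springfield, IL 62704\n"
    b"456 Oak Avenue, Suite 200, Portland, OR 97205\n"
    b"789 Pine Road, Austin, TX 78701\n"
)


def window_cost(chunk: bytes) -> int:
    """Compressed size of chunk with the compressor primed on known addresses."""
    comp = zlib.compressobj(level=9, zdict=KNOWN_ADDRESSES)
    return len(comp.compress(chunk) + comp.flush())


def most_address_like(page: bytes, width: int = 40, step: int = 5) -> bytes:
    """Return the fixed-width window that compresses best against the dictionary."""
    starts = range(0, max(1, len(page) - width + 1), step)
    return min((page[i:i + width] for i in starts), key=window_cost)


page = (
    b"Welcome to our bakery! We sell bread, cakes and fresh coffee daily. "
    b"Visit us at 12 Oak Avenue, Portland, OR 97205 or call during opening hours."
)
print(most_address_like(page))
```

Because every window has the same width, the smallest compressed size is also the highest compression ratio, which matches the "look for the section that compresses best" description.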


pxx5y ago· 1 in thread

Aren't the block sizes too small? gzip uses 64k block sizes and it seems like the compressed sizes are several times larger.

w-m5y ago

How about interleaving the test data then, instead of appending it to the very end? For gzip, if the block size is 64k (another comment says 32k?), split the corpus text into 32k blocks, and interleave it with 32k blocks of the test set.

userbinator5y ago

I've made use of the property of compression algorithms in detecting redundancy and similarity in a slightly different way: Compress a codebase, and the most compressible source code files are also ones that are likely candidates for refactoring.

autokad5y ago

> "Take a labeled training set of documents... Return the label for which the size increased the least."

that's not unlike using an autoencoder to score a document as an anomaly amongst the other labels?
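For reference, the quoted procedure fits in a few lines. This is a minimal sketch assuming `zlib` as the compressor and a made-up training set; `fit` and `predict` are illustrative names rather than the article's code.

```python
import zlib
from collections import defaultdict


def fit(docs: list[tuple[str, str]]) -> dict[str, bytes]:
    """Concatenate the training documents of each label into one byte string."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for label, text in docs:
        grouped[label].append(text)
    return {label: " ".join(texts).encode() for label, texts in grouped.items()}


def predict(corpora: dict[str, bytes], text: str) -> str:
    """Return the label whose compressed corpus grows least when text is appended."""
    doc = b" " + text.encode()

    def growth(label: str) -> int:
        base = corpora[label]
        return len(zlib.compress(base + doc, 9)) - len(zlib.compress(base, 9))

    return min(corpora, key=growth)


# Hypothetical toy training set of (label, text) pairs.
train = [
    ("sports", "the team won the match with a late goal"),
    ("sports", "the striker scored twice in the second half"),
    ("cooking", "whisk the eggs and add flour and butter"),
    ("cooking", "bake the cake in a preheated oven"),
]
corpora = fit(train)
print(predict(corpora, "add butter and whisk the eggs"))
print(predict(corpora, "the striker scored a late goal"))
```

This version recompresses each class corpus on every prediction, which is the cost the other comments in this thread try to avoid.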

timeinput5y ago

The Hutter prize [0] is based around the idea that compressing Wikipedia will lead to more advanced AI research. It doesn't seem like it has so far, but the solutions aren't all open source so it's hard to say, but it's definitely along the same lines.

[0] https://en.wikipedia.org/wiki/Hutter_Prize

euske5y ago

It is worth pointing out that Yann LeCun, a prominent ML researcher, also worked with DjVu, an image compression algorithm.

cf. https://en.wikipedia.org/wiki/DjVu

nightcracker5y ago

I took a course on this and similar techniques called "Information Theoretic Data Mining". You can see a bunch of relevant references on the course site: https://eda.liacs.nl/teaching/itdm/.

Fragoel25y ago

I really enjoyed the article, it is (in my opinion) a nice case of curiosity-driven research.

I was wondering, what if someone would like to try a similar approach on images rather than text? What kind of image compression algorithms (if any) would be fit for use in this scenario?

carschno5y ago

> However, it would most probably not be worth using such an approach because of the prohibitive computational cost.

Looking at the current state of the art (Transformer etc), computational cost certainly is not the main issue.

dedalus5y ago

The same technique was extended to images in this paper, where a CDN's corpus of cached images was classified and each image was optimized using the exemplar for its bucket.

h2odragon5y ago

Redundant information in natural language is often the more important part. Now think of the compression algorithm's input tokens as parsed, stemmed, etc. words instead of a bytestream; you get some interesting concept maps pretty easily.

totorovirus5y ago

This is truly interesting. Does this work for very long documents, e.g. an entire book? And if this is the case, why are NLP researchers still struggling with increasing transformer width? The answer is here
