Gzip distance: ~3 minutes, 78% accuracy
Euclidean distance: ~0.5 seconds, 93% accuracy
Jaccard distance * : ~0.7 seconds, 94% accuracy
Dice dissimilarity * : ~0.8 seconds, 94% accuracy
* after binarising the images
So, as a distance measure for classifying MNIST digits, gzip has lower accuracy and is much more computationally demanding than the other measures. I'm not that familiar with how the gzip algorithm works internally; it's kind of interesting that it scores so much lower. I wonder whether image-focused compression algorithms might do better?
(Edit: Btw, I enjoyed the post; it's a creative idea, the writing and code was great, and it's sparked some good discussion. But after having a closer look, I think the baselines above provide some context to the gzip scores.)
NMI skimage: ~30 seconds, 95% accuracy
NMI numba *: ~0.6 seconds, 95% accuracy
Here's the code, courtesy of ChatGPT:

import numpy as np
from numba import njit

@njit(cache=True)
def compute_joint(img1, img2):
    # img1/img2 assumed binarised to {0, 1}
    stacked = 2 * img1.ravel() + img2.ravel()
    return np.bincount(stacked, minlength=4).reshape(2, 2)

@njit(cache=True)
def entropy(counts, total):
    # Convert counts to probabilities
    p = counts / total
    # Only calculate for non-zero entries to avoid log(0)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

@njit(cache=True)
def normalized_mutual_information(img1, img2):
    joint_counts = compute_joint(img1, img2)
    total_pixels = 28 * 28
    # Marginal counts
    c1 = np.sum(joint_counts, axis=1)
    c2 = np.sum(joint_counts, axis=0)
    # Compute entropies directly from counts
    h1 = entropy(c1, total_pixels)
    h2 = entropy(c2, total_pixels)
    joint_entropy = entropy(joint_counts.flatten(), total_pixels)
    mutual_info = h1 + h2 - joint_entropy
    return 2 * mutual_info / (h1 + h2)

Could you post a snippet of the code that you used to achieve this? It would be really nice to have a baseline to work from, and I feel like this could be really useful in a few other areas (my personal obsession is speed-training on CIFAR10 :')))) )
Holy cow, that's insane. :O
I plugged in the following distance measures in place of the `compute_ncd` function. (Jaccard/Dice have been negated and the -1 removed, so they rank like distances.)

def euclidean(x1, x2):
    # Squared Euclidean distance; monotonic in the true distance, which is all 1-NN ranking needs
    return np.sum(np.square(x1 - x2))

def jaccard(x1, x2):
    x1_binary = x1 > 0.5
    x2_binary = x2 > 0.5
    # union / intersection: reciprocal of the Jaccard similarity, so larger = more dissimilar
    return np.logical_or(x1_binary, x2_binary).sum() / np.logical_and(x1_binary, x2_binary).sum()

def dice(x1, x2):
    x1_binary = x1 > 0.5
    x2_binary = x2 > 0.5
    # 2*union / (|A| + |B|) = 2 minus the Dice similarity, so larger = more dissimilar
    return 2 * np.logical_or(x1_binary, x2_binary).sum() / (x1_binary.sum() + x2_binary.sum())
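For context, these drop into a plain 1-NN loop; a minimal sketch (the function name `classify_1nn` and the synthetic data are illustrative, not from the post):

```python
import numpy as np

def euclidean(x1, x2):
    return np.sum(np.square(x1 - x2))

def classify_1nn(test_imgs, train_imgs, train_labels, dist):
    # For each test image, predict the label of its nearest training image
    preds = []
    for x1 in test_imgs:
        d = [dist(x1, x) for x in train_imgs]
        preds.append(train_labels[int(np.argmin(d))])
    return np.array(preds)

# Tiny synthetic check: class 0 clusters near 0.0, class 1 near 1.0
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal(0.0, 0.05, (10, 784)),
                        rng.normal(1.0, 0.05, (10, 784))])
labels = np.array([0] * 10 + [1] * 10)
test = np.concatenate([rng.normal(0.0, 0.05, (5, 784)),
                       rng.normal(1.0, 0.05, (5, 784))])
preds = classify_1nn(test, train, labels, euclidean)
```

Swapping `euclidean` for `jaccard` or `dice` is just a matter of passing a different `dist`.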
[1] https://github.com/Jakob-98/mono/blob/73168bc0ea904e75865815...

https://www.govexec.com/federal-news/1999/02/postal-service-...
This line freaking killed me:
https://github.com/benjamin-recht/mnist_1_pt_2/blob/c0fe96bc...
PNG : ~15.1 seconds, 83% accuracy
I also tried dropping in zstandard compression:

Zstd (level=3): ~3.5 seconds, 88% accuracy
Much faster than gzip, at least. And if I use (x1-x2)*2 instead of x1+x2 to calculate Cx1x2, that pushes zstd up to 93% accuracy. I'm somewhat interested in the fact that if I stack the two arrays on top of each other instead of adding them together, I get absolutely garbage performance (<20%), even though as far as I can tell that approach works well when classifying strings.
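For anyone wanting to reproduce the drop-in: NCD itself is compressor-agnostic. A sketch using stdlib zlib, where `zstandard.ZstdCompressor(level=3).compress` should slot in the same way (the zstd swap is stated as an assumption, not tested here):

```python
import zlib

def ncd(x: bytes, y: bytes, compress=lambda b: zlib.compress(b, 3)) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Near-identical inputs score near 0; unrelated high-entropy inputs score near 1
print(ncd(b"abcabcabc" * 50, b"abcabcabc" * 50))
```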
So the whole gzip thing - while fancy - really is extra steps for less bang.
Linear SVC (best performance): 92 %
SVC rbf (best performance): 96.4 %
SVC poly (best performance): 94.5 %
Logistic regression (prev assignment): 89 %
Naive Bayes (prev assignment): 81 %
From this blog page: https://dmkothari.github.io/Machine-Learning-Projects/SVM_wi...
Also it seems from reading online articles that people are able to obtain much better results just by using K-NN, so I imagine that the author just made his job harder by using gzip but I could be wrong about this.
Most people don't realize that Logistic regression can get ~90% accuracy on MNIST.
As a big fan of starting with simple models first and adding complexity later, I've frequently been told that "logistic regression won't work!" for problems where it can in fact perform excellently.
When faced with this resistance to logistic regression I'll often ask what they think the baseline performance of it would be on MNIST. The guesses I hear most often are 20-30%.
People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models. Sometimes it's worth it, but it's always good to start simple first.
It's also been my experience that if you don't get good performance with a simple model, you are very unlikely to get great performance from a more complex one.
A simple CNN as implemented in Keras tutorial can easily exceed 98%. 78% is very poor performance for MNIST even if model complexity is penalized.
I think that's why machine learning is the way to go for this type of detection. Why go with anything other than a CNN (in this case) when one is now trivial to set up and train? Again, unless it's just to mess around with, 90% MNIST accuracy is not useful in the real world.
Are you me? I was just having this argument at work with someone about using an old (fasttext/word2vec) model vs. the overhead on fine-tuned BERT model for a fairly simple classification problem.
It's really a tragedy, because so many classic models would work fine for real world applications.
Linear models are really well researched, and today with the compute and with proper training and data preparation they can easily get to satisfying levels of performance for a variety of tasks.
The original research paper that introduced the MNIST data set was getting around 98% accuracy, and today neural nets are getting 99.87% accuracy.
https://paperswithcode.com/sota/image-classification-on-mnis...
For example, it would be almost impossible to distinguish cat and dog images by using an SVM or k-NN directly on the raw pixels, but it would be much easier if the images were first passed through an image encoder and a small embedding vector were used for each one instead. In the article OP linked, it doesn't seem that gzip is very good at extracting useful patterns from the MNIST dataset.
(Also the terminology seems off here, as 100% of information survives gzip.)
Sure it's not great at differentiating between SotA techniques, but it's very useful for sanity checks like this one.
Even for SotA models, it's still useful to verify that you can get greater than 98% accuracy on MNIST, before exploring larger, more complex bench marks.
It certainly shouldn't be the only benchmark but it's a great place to start iterating on ideas.
Other models tend to add noise somewhere in there; what about feature engineering before the gzip? Maybe a Gaussian blur and a convolution first, and then deep learning for feature selection.
A dummy model written with TensorFlow easily reaches 90% accuracy. The best models rank at 99.87%; see the benchmark: https://paperswithcode.com/sota/image-classification-on-mnis...
To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review https://arxiv.org/abs/2304.09355
I can't find the original blog, but there's a note about it here - https://stackoverflow.com/questions/39142778/how-to-determin...
So another way to frame this might be that gzip costs a lot of accuracy but may lead to better performance.
https://newpblog.netlify.app/2018-01-24-knn-analysis-on-mnis...
The interesting thing is not in whether GZip can achieve SOTA, it's that it can do a decent job at all. (The interesting thing is not in whether the bear can recreate Mozart exactly, it's that it can play the piano at all.)
But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.
Replacing

distances = [(compute_ncd(x1, x), label) for x, _, label in compressed_lengths]

with the Euclidean distance

distances = [(np.sqrt(np.sum(np.square(x1 - x))), label) for x, _, label in compressed_lengths]

gives you +15% test accuracy and saves you a lot of compute.

I gather this is common knowledge (if not sufficiently emphasized at times) among those with serious educations, but as a self-taught, practical-application-type ML person, this profound thread running through all these topics (and seemingly into heavy-ass particle physics and cosmology and stuff like that) was a blinding "Aha!" moment. I'll venture this comment in the hope that even one other person gets to have that unforgettable moment too.
I was quite struck when I learned that the original Lempel-Ziv compression (which gzip is partly based on) came out of their study of the "complexity of finite sequences", not necessarily from trying to shrink things: https://ieeexplore.ieee.org/document/1055501
https://github.com/lmcinnes/umap_paper_notebooks/blob/master...
EDIT: I should add, unless it isn't clear, that we really should retire the dataset. Something like the QuickDraw dataset makes a lot more sense to me.
I will add a comment/edit to the post once I am home to clarify the relative ease of solving MNIST
I think it is indeed interesting, as it shows that gzip can capture some of the same things that e.g. UMAP and co. are capturing when they achieve good scores on MNIST.
Also, I'll add, that even despite some of the suspicions people have cast on the Gzip results in other experiments, I'm bullish on the utility of reducing entropy for classification problems.
From a research perspective, MNIST is basically solved. I think current performance on MNIST is better than humans, correct?
But it's still valuable as a teaching tool or a "Hello world" data set, because it's so simple and because most reasonable algorithms will reach 97% accuracy. Even if you build your tools from scratch, it makes a nice homework-sized problem. And it's an obviously useful task that anyone can understand. "Oh, digit recognition for mail!"
> UMAP, which stands for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique and data visualization method commonly used in machine learning and data analysis.
I sometimes see HN comments complaining that acronyms and terms are dropped with no explanation. These comments come across as entitled to me — instead of complaining, why not be curious and ask “what does that mean in this context?”
Imagine someone dropping in here and complaining, “the article sucks because it uses the word ‘compiler’ and assumes we all know what a ‘compiler’ is.” This is what it sounds like to those of us to work in specialized fields.
Instead if someone said, “I’m new to this, could someone help me understand what a ‘compiler’ is?”, it demonstrates curiosity and people are more inclined to explain.
My two cents is that if your problem can be solved with something like UMAP + kNN[2], then you really shouldn't be using Deep Learning to solve it.
[0] https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc...
[1] https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...
[2] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
See the Wikipedia article on Kolmogorov complexity, which has a short section about compression. https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compress...
Edit: One of my favorite concepts in the domain of compression is the pigeonhole principle, which implies that for any lossless compression algorithm that shrinks some inputs, some other outputs must be larger than their inputs.
Well-designed encrypted payloads can be fed through a compressor, but the output will on average be no smaller than the input, rendering compression useless; such data is said to be "incompressible".
https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...
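The principle is easy to see empirically; a quick sketch, with random bytes standing in for an encrypted payload:

```python
import gzip
import os

data = os.urandom(10_000)   # high-entropy stand-in for a well-encrypted payload
out = gzip.compress(data)
# gzip falls back to stored blocks plus its header/trailer overhead here,
# so the "compressed" output ends up larger than the input
print(len(data), len(out))
```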
If you change the first line from `gzip.compress` to `zlib.compress` you should get the same classifier performance with a 3x speedup.
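Plausibly that's because both produce the same deflate stream and the gzip wrapper just adds header/CRC framing (plus, in older CPython, extra per-call overhead), so every length in the NCD shifts by a constant and the rankings are unchanged. A quick check of the framing difference (sketch):

```python
import gzip
import zlib

data = b"the quick brown fox " * 200
g = gzip.compress(data, compresslevel=6)
z = zlib.compress(data, 6)
# Same deflate payload inside; gzip adds a 10-byte header and an 8-byte
# CRC32/size trailer versus zlib's 2-byte header and 4-byte Adler-32 trailer
print(len(g) - len(z))
```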
I've been playing around with an attention mechanism that combines the idea of using normalized compression distance (gzip) with discrete convolution between candidate sequences (sliding window of N bytes over each). Another round of normalization over the convolution outputs - accommodating varying lengths - allows for us to compare candidate sequences for relevant information on equal grounds.
The NCD formula I am using right now:
NCD(x,y) = (C(xy) - MIN(C(x),C(y))) / MAX(C(x),C(y))
No weird parameters or any other things to tune. The only parameters are the source documents and the input context/query.

Kind of reminds me of autocorrelation :)
Also, just a note about your NCD formula, if C(xy) is the compressed size of the concatenation of x and y, then I would recommend also trying (C(xy)+C(yx))/2 for that term, because a lot of compressors don't compress xy the same as yx, and you probably want your distance to be symmetrical.
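That symmetrisation is a one-line change; a sketch (zlib used for C, function names illustrative):

```python
import zlib

def C(b: bytes) -> int:
    return len(zlib.compress(b))

def ncd_sym(x: bytes, y: bytes) -> float:
    # Average C(xy) and C(yx) so the distance is symmetric by construction
    cxy = (C(x + y) + C(y + x)) / 2
    return (cxy - min(C(x), C(y))) / max(C(x), C(y))
```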
It's originally kitted for CIFAR10, but I've found the parameters to be quite general. The code is very easy to read and well-commented, and is a great starting place for exploration.
Min-cut deltas to run MNIST:
.datasets.CIFAR10(' -> .datasets.MNIST(' (both occurrences)
'whiten': Conv(3, -> 'whiten': Conv(1,
crop_size = 28
Compute for the project was funded by Carter Brown and Daniel Gross; my appreciation to them both for helping make this possible. Their support via encouragement has been very helpful as well. <3 :)

gzip first does LZ77 and then Huffman coding. ANS is a more modern alternative that can achieve higher compression ratios than Huffman coding.
For image classification, couldn't one use the residual from applying a jpeg codebook learned on a training example as metric?
Seems like a sufficiently novel and important advancement that should be taught in universities at this point, since we need to harden software around this kind of possibility.
I like short code! It's like a joke "this is so easy, 10 loc" :)
I can’t count how many times I’ve looked up at a line of text to spot check it, type in a couple more letters, hit save without looking, and find later that I sound like I’m having a stroke. Never mind the times I’ve simply forgotten to look. Yes that’s a word, stop trying to replace it.
(Five minutes ago I had to type “laissez-faire” three times. Yesterday it turned “understanding” into god knows what. And if iOS doesn’t stop adding apostrophes to “its” I’m gonna fuckin lose my shit)
I think it’s pretty hard to just improve a NN without more data or some non trivial amount of effort.
Why would it improve the result so much? Sounds very interesting so I'm curious.
It would be better on a problem with horizontal symmetry like CIFAR, especially if the zipping is serialized by unwinding the 2D pixel array into 1D.
One trick that _should_ work is by comparing the distances of starting the compression at the 4 different corners of the image, in the 2 separate directions for each corner. That should provide way more than enough information for k-means clustering.
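A sketch of what those eight serialisations might look like (all names are illustrative, and whether the compressed lengths of these scans actually separate digit classes is untested here):

```python
import zlib
import numpy as np

def corner_scans(img: np.ndarray) -> list:
    """Serialise a 2-D image 8 ways: starting from each of the 4 corners,
    scanning row-major or column-major."""
    scans = []
    for flip_v in (False, True):
        for flip_h in (False, True):
            a = img[::-1, :] if flip_v else img
            a = a[:, ::-1] if flip_h else a
            scans.append(a.ravel().tobytes())    # row-major from this corner
            scans.append(a.T.ravel().tobytes())  # column-major from this corner
    return scans

img = (np.random.rand(28, 28) > 0.5).astype(np.uint8)
features = [len(zlib.compress(s)) for s in corner_scans(img)]  # 8 lengths per image
```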
My apologies again for my mistake, thank you for asking, I wouldn't have really seen that otherwise :')))).
See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.
In order to turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is good, this stream of data will be very low entropy which can be compressed with arithmetic encoding or something similar.
In the case of arithmetic encoding you get lossless compression. (additionally, because arithmetic encoding is based on probabilities (frequencies), if the predictor outputs probabilities instead of a single prediction, this can also be used to crunch the entropy)
Now look at for instance the speech codecs in GSM. They use Linear Prediction Coding, which has the same concept of using a predictor and storing the error stream. Except the latter's coefficients are rounded down, making it a form of lossy compression.
And yes, you can probably make a pretty good (lossless or perhaps even lossy, but I don't think you want that) text compressor by using an LLM to (deterministically) predict (the likelihood of) tokens and storing only the errors/differences. It should be able to outperform zip or gzip, because it can make use of language knowledge to make predictions.
There's a catch, however, which is that in the case of LLM compression you also need to store all those weights somewhere, because you need the predictor to decode. This is always the catch with compression: there is always some kind of "model" or predictor algorithm that is implicit to the compressor. In the case of gzip it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable code. But with gzip we don't count the size of the gzip executable toward our compressed files either, because we get to compress a lot of files with one model. Similarly for this hypothetical LLM-text compression scheme, but you just need to compress a whole lot more text before it's worth it.
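A toy illustration of the predictor-plus-residual scheme (nothing LLM-grade: the "model" here just predicts that each byte repeats the previous one, and the resulting error stream goes through zlib):

```python
import zlib
import numpy as np

def residual_compress(data: bytes) -> bytes:
    arr = np.frombuffer(data, dtype=np.uint8)
    # Prediction-error stream: actual byte minus predicted (= previous) byte, mod 256
    residuals = np.diff(arr, prepend=np.uint8(0))
    return zlib.compress(residuals.tobytes())

def residual_decompress(blob: bytes) -> bytes:
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    # Undo the prediction by accumulating the errors, mod 256
    return np.cumsum(residuals, dtype=np.uint8).tobytes()

# A smooth ramp is awkward for "strings repeat" but trivial for a delta predictor:
# its residual stream is almost entirely the constant byte 1
ramp = bytes(range(256)) * 10
assert residual_decompress(residual_compress(ramp)) == ramp
```

A better predictor (a real language model, say) would shrink the residual entropy further; the mechanics stay the same.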
All that said, however, like many others pointed out, 78% isn't a great score for MNIST.
Then again, I also don't think that gzip compression is the best predictor for grayscale image similarity. For one, if you have two pixels of grayscale values 65 and 66, gzip will see them as "different bytes", regardless of them being very similar in grayscale level. You might even be able to increase the score by thresholding the training set to BW 0/255.
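That thresholding suggestion is a one-liner; a sketch, with random noise standing in for a grayscale image (real MNIST digits should benefit even more, since their backgrounds are uniform):

```python
import gzip
import numpy as np

img = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)  # stand-in grayscale image
bw = np.where(img > 127, np.uint8(255), np.uint8(0))            # threshold to black/white
# Two symbols instead of 256 slashes the per-pixel entropy, so gzip gets much further
print(len(gzip.compress(img.tobytes())), len(gzip.compress(bw.tobytes())))
```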
accuracy - but of what?
what's this about?
The challenge is to build a computer vision model that can tell which numeral each handwritten digit represents.
https://en.wikipedia.org/wiki/MNIST_database
78% accuracy on a solution is pretty bad, but achieving it just using GZIP is a very neat hack.
...
Oh, ok, not that.