Gzip distance: ~3 minutes, 78% accuracy
Euclidean distance: ~0.5 seconds, 93% accuracy
Jaccard distance * : ~0.7 seconds, 94% accuracy
Dice dissimilarity * : ~0.8 seconds, 94% accuracy
* after binarising the images
So, as a distance measure for classifying MNIST digits, gzip has lower accuracy and is much more computationally demanding than the other measures. I'm not that familiar with how the gzip algorithm works internally; it's kind of interesting that it scores so much lower. I wonder whether image-focused compression algorithms might do better?
(Edit: Btw, I enjoyed the post; it's a creative idea, the writing and code was great, and it's sparked some good discussion. But after having a closer look, I think the baselines above provide some context to the gzip scores.)
NMI skimage: ~30 seconds, 95% accuracy
NMI numba *: ~0.6 seconds, 95% accuracy
Here's the code, courtesy of ChatGPT:

import numpy as np
from numba import njit

@njit(cache=True)
def compute_joint(img1, img2):
    # img1/img2 assumed binarised to {0, 1}
    stacked = 2 * img1.ravel() + img2.ravel()
    return np.bincount(stacked, minlength=4).reshape(2, 2)

@njit(cache=True)
def entropy(counts, total):
    # Convert counts to probabilities
    p = counts / total
    # Only calculate for non-zero entries to avoid log(0)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

@njit(cache=True)
def normalized_mutual_information(img1, img2):
    joint_counts = compute_joint(img1, img2)
    total_pixels = 28 * 28
    # Marginal counts
    c1 = np.sum(joint_counts, axis=1)
    c2 = np.sum(joint_counts, axis=0)
    # Compute entropies directly from counts
    h1 = entropy(c1, total_pixels)
    h2 = entropy(c2, total_pixels)
    joint_entropy = entropy(joint_counts.flatten(), total_pixels)
    mutual_info = h1 + h2 - joint_entropy
    return 2 * mutual_info / (h1 + h2)

Could you post a snippet of the code that you used to achieve this? It would be really nice to have a baseline to work from, and I feel like this could be really useful in a few other areas (my personal obsession is speed-training on CIFAR10 :')))) )
Holy cow, that's insane. :O
I plugged in the following distance measures in place of the `compute_ncd` function. (Jaccard/Dice have been negated and the -1 removed, so they rank like distances.)

def euclidean(x1, x2):
    # Squared Euclidean distance; monotonic in the true distance, which is all 1-NN ranking needs
    return np.sum(np.square(x1 - x2))

def jaccard(x1, x2):
    x1_binary = x1 > 0.5
    x2_binary = x2 > 0.5
    # union / intersection: reciprocal of the Jaccard similarity, so larger = more dissimilar
    return np.logical_or(x1_binary, x2_binary).sum() / np.logical_and(x1_binary, x2_binary).sum()

def dice(x1, x2):
    x1_binary = x1 > 0.5
    x2_binary = x2 > 0.5
    # 2*union / (|A| + |B|) = 2 minus the Dice similarity, so larger = more dissimilar
    return 2 * np.logical_or(x1_binary, x2_binary).sum() / (x1_binary.sum() + x2_binary.sum())
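For context, these drop into a plain 1-NN loop; a minimal sketch (the function name `classify_1nn` and the synthetic data are illustrative, not from the post):

```python
import numpy as np

def euclidean(x1, x2):
    return np.sum(np.square(x1 - x2))

def classify_1nn(test_imgs, train_imgs, train_labels, dist):
    # For each test image, predict the label of its nearest training image
    preds = []
    for x1 in test_imgs:
        d = [dist(x1, x) for x in train_imgs]
        preds.append(train_labels[int(np.argmin(d))])
    return np.array(preds)

# Tiny synthetic check: class 0 clusters near 0.0, class 1 near 1.0
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal(0.0, 0.05, (10, 784)),
                        rng.normal(1.0, 0.05, (10, 784))])
labels = np.array([0] * 10 + [1] * 10)
test = np.concatenate([rng.normal(0.0, 0.05, (5, 784)),
                       rng.normal(1.0, 0.05, (5, 784))])
preds = classify_1nn(test, train, labels, euclidean)
```

Swapping `euclidean` for `jaccard` or `dice` is just a matter of passing a different `dist`.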
[1] https://github.com/Jakob-98/mono/blob/73168bc0ea904e75865815...

https://www.govexec.com/federal-news/1999/02/postal-service-...
This line freaking killed me:
https://github.com/benjamin-recht/mnist_1_pt_2/blob/c0fe96bc...
PNG : ~15.1 seconds, 83% accuracy
I also tried dropping in zstandard compression:

Zstd (level=3): ~3.5 seconds, 88% accuracy
Much faster than gzip, at least. And if I use (x1-x2)*2 instead of x1+x2 to calculate Cx1x2, that pushes zstd up to 93% accuracy. I'm somewhat interested in the fact that if I stack the two arrays on top of each other instead of adding them together, I get absolutely garbage performance (<20%), even though as far as I can tell that approach works well when classifying strings.
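For anyone wanting to reproduce the drop-in: NCD itself is compressor-agnostic. A sketch using stdlib zlib, where `zstandard.ZstdCompressor(level=3).compress` should slot in the same way (the zstd swap is stated as an assumption, not tested here):

```python
import zlib

def ncd(x: bytes, y: bytes, compress=lambda b: zlib.compress(b, 3)) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Near-identical inputs score near 0; unrelated high-entropy inputs score near 1
print(ncd(b"abcabcabc" * 50, b"abcabcabc" * 50))
```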
So the whole gzip thing - while fancy - really is extra steps for less bang.
Linear SVC (best performance): 92 %
SVC rbf (best performance): 96.4 %
SVC poly (best performance): 94.5 %
Logistic regression (prev assignment): 89 %
Naive Bayes (prev assignment): 81 %
From this blog page: https://dmkothari.github.io/Machine-Learning-Projects/SVM_wi...
Also it seems from reading online articles that people are able to obtain much better results just by using K-NN, so I imagine that the author just made his job harder by using gzip but I could be wrong about this.
Most people don't realize that Logistic regression can get ~90% accuracy on MNIST.
As a big fan of starting with simple models first and adding complexity later, I've frequently been told that "logistic regression won't work!" for problems where it can in fact perform excellently.
When faced with this resistance to logistic regression I'll often ask what they think the baseline performance of it would be on MNIST. The guesses I hear most often are 20-30%.
People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models. Sometimes it's worth it, but it's always good to start simple first.
It's also been my experience that if you don't get good performance with a simple model, you are very unlikely to get great performance from a more complex one.
A simple CNN as implemented in Keras tutorial can easily exceed 98%. 78% is very poor performance for MNIST even if model complexity is penalized.
I think that's why machine learning is the way to go for this type of detection. Why go with anything other than a CNN (in this case) when one is now trivial to set up and train? Again, unless it's just to mess around with, 90% MNIST accuracy is not useful in the real world.
Are you me? I was just having this argument at work with someone about using an old (fasttext/word2vec) model vs. the overhead on fine-tuned BERT model for a fairly simple classification problem.
It's really a tragedy, because so many classic models would work fine for real world applications.
Linear models are really well researched, and today with the compute and with proper training and data preparation they can easily get to satisfying levels of performance for a variety of tasks.
The original research paper that introduced the MNIST data set was getting around 98% accuracy, and today neural nets are getting 99.87% accuracy.
https://paperswithcode.com/sota/image-classification-on-mnis...
For example, it would be almost impossible to distinguish cat and dog images by using an SVM or k-NN directly on the raw pixels, but it would be much easier if the images were first passed through an image encoder and a small embedding vector were used for each one instead. In the article OP linked, it doesn't seem that gzip is very good at extracting useful patterns from the MNIST dataset.
(Also the terminology seems off here, as 100% of information survives gzip.)
Sure it's not great at differentiating between SotA techniques, but it's very useful for sanity checks like this one.
Even for SotA models, it's still useful to verify that you can get greater than 98% accuracy on MNIST, before exploring larger, more complex bench marks.
It certainly shouldn't be the only benchmark but it's a great place to start iterating on ideas.
Other models tend to add noise somewhere in there; what about feature engineering before the gzip? Maybe a Gaussian blur and a convolution first, and then deep learning for feature selection.
A dummy model written with TensorFlow easily reaches 90% accuracy. The best models rank at 99.87%; see the benchmark: https://paperswithcode.com/sota/image-classification-on-mnis...
To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review https://arxiv.org/abs/2304.09355
I can't find the original blog, but there's a note about it here - https://stackoverflow.com/questions/39142778/how-to-determin...
So another way to frame this might be that gzip costs a lot of accuracy but may lead to better performance.
https://newpblog.netlify.app/2018-01-24-knn-analysis-on-mnis...
The interesting thing is not in whether GZip can achieve SOTA, it's that it can do a decent job at all. (The interesting thing is not in whether the bear can recreate Mozart exactly, it's that it can play the piano at all.)
But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.
Replacing

distances = [(compute_ncd(x1, x), label) for x, _, label in compressed_lengths]

with the Euclidean distance

distances = [(np.sqrt(np.sum(np.square(x1 - x))), label) for x, _, label in compressed_lengths]

gives you +15% test accuracy and saves you a lot of compute.

I gather this is common knowledge (if not sufficiently emphasized at times) among those with serious educations, but as a self-taught, practical-application-type ML person, this profound thread running through all these topics (and seemingly into heavy-ass particle physics and cosmology and stuff like that) was a blinding "Aha!" moment. I'll venture this comment in the hope that even one other person gets to have that unforgettable moment too.
I was quite struck when I learned that the original Lempel-Ziv compression (which gzip is partly based on) came out of their study of the "complexity of finite sequences", not necessarily from trying to shrink things: https://ieeexplore.ieee.org/document/1055501
https://github.com/lmcinnes/umap_paper_notebooks/blob/master...
EDIT: I should add, unless it isn't clear, that we really should retire the dataset. Something like the QuickDraw dataset makes a lot more sense to me.
I will add a comment/edit to the post once I am home to clarify the relative ease of solving MNIST
I think it is indeed interesting, as it shows that gzip can capture some of the same things that e.g. UMAP and co. are capturing when they achieve good scores on MNIST.
Also, I'll add, that even despite some of the suspicions people have cast on the Gzip results in other experiments, I'm bullish on the utility of reducing entropy for classification problems.
From a research perspective, MNIST is basically solved. I think current performance on MNIST is better than humans, correct?
But it's still valuable as a teaching tool or a "Hello world" data set, because it's so simple and because most reasonable algorithms will reach 97% accuracy. Even if you build your tools from scratch, it makes a nice homework-sized problem. And it's an obviously useful task that anyone can understand. "Oh, digit recognition for mail!"
> UMAP, which stands for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique and data visualization method commonly used in machine learning and data analysis.
I sometimes see HN comments complaining that acronyms and terms are dropped with no explanation. These comments come across as entitled to me — instead of complaining, why not be curious and ask “what does that mean in this context?”
Imagine someone dropping in here and complaining, “the article sucks because it uses the word ‘compiler’ and assumes we all know what a ‘compiler’ is.” This is what it sounds like to those of us to work in specialized fields.
Instead if someone said, “I’m new to this, could someone help me understand what a ‘compiler’ is?”, it demonstrates curiosity and people are more inclined to explain.
My two cents is that if your problem can be solved with something like UMAP + kNN[2], then you really shouldn't be using Deep Learning to solve it.
[0] https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc...
[1] https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...
[2] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
See the Wikipedia article on Kolmogorov complexity, which has a short section about compression. https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compress...
Edit: One of my favorite concepts in the domain of compression is the pigeonhole principle, which implies that for any lossless compression algorithm that shrinks some inputs, some other outputs must be larger than their inputs.
Well-designed encrypted payloads can be fed through a compressor, but the output will on average be no smaller than the input, rendering compression useless; such data is said to be "incompressible".
https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...
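The principle is easy to see empirically; a quick sketch, with random bytes standing in for an encrypted payload:

```python
import gzip
import os

data = os.urandom(10_000)   # high-entropy stand-in for a well-encrypted payload
out = gzip.compress(data)
# gzip falls back to stored blocks plus its header/trailer overhead here,
# so the "compressed" output ends up larger than the input
print(len(data), len(out))
```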
If you change the first line from `gzip.compress` to `zlib.compress` you should get the same classifier performance with a 3x speedup.
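Plausibly that's because both produce the same deflate stream and the gzip wrapper just adds header/CRC framing (plus, in older CPython, extra per-call overhead), so every length in the NCD shifts by a constant and the rankings are unchanged. A quick check of the framing difference (sketch):

```python
import gzip
import zlib

data = b"the quick brown fox " * 200
g = gzip.compress(data, compresslevel=6)
z = zlib.compress(data, 6)
# Same deflate payload inside; gzip adds a 10-byte header and an 8-byte
# CRC32/size trailer versus zlib's 2-byte header and 4-byte Adler-32 trailer
print(len(g) - len(z))
```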
I've been playing around with an attention mechanism that combines the idea of using normalized compression distance (gzip) with discrete convolution between candidate sequences (sliding window of N bytes over each). Another round of normalization over the convolution outputs - accommodating varying lengths - allows for us to compare candidate sequences for relevant information on equal grounds.
The NCD formula I am using right now:
NCD(x,y) = (C(xy) - MIN(C(x),C(y))) / MAX(C(x),C(y))
No weird parameters or any other things to tune. The only parameters are the source documents and the input context/query.

Kind of reminds me of autocorrelation :)
Also, just a note about your NCD formula, if C(xy) is the compressed size of the concatenation of x and y, then I would recommend also trying (C(xy)+C(yx))/2 for that term, because a lot of compressors don't compress xy the same as yx, and you probably want your distance to be symmetrical.
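That symmetrisation is a one-line change; a sketch (zlib used for C, function names illustrative):

```python
import zlib

def C(b: bytes) -> int:
    return len(zlib.compress(b))

def ncd_sym(x: bytes, y: bytes) -> float:
    # Average C(xy) and C(yx) so the distance is symmetric by construction
    cxy = (C(x + y) + C(y + x)) / 2
    return (cxy - min(C(x), C(y))) / max(C(x), C(y))
```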
It's originally kitted for CIFAR10, but I've found the parameters to be quite general. The code is very easy to read and well-commented, and is a great starting place for exploration.
Min-cut deltas to run MNIST:
.datasets.CIFAR10(' -> .datasets.MNIST(' (both occurrences)
'whiten': Conv(3, -> 'whiten': Conv(1,
crop_size = 28
Compute for the project was funded by Carter Brown and Daniel Gross; my appreciation to them both for helping make this possible. Their support via encouragement has been very helpful as well. <3 :)

gzip first does LZ77 and then Huffman coding. ANS is a more modern alternative that can achieve higher compression ratios than Huffman coding.
For image classification, couldn't one use the residual from applying a jpeg codebook learned on a training example as metric?
Seems like a sufficiently novel and important advancement that should be taught in universities at this point, since we need to harden software around this kind of possibility.
I like short code! It's like a joke "this is so easy, 10 loc" :)
I can’t count how many times I’ve looked up at a line of text to spot check it, type in a couple more letters, hit save without looking, and find later that I sound like I’m having a stroke. Never mind the times I’ve simply forgotten to look. Yes that’s a word, stop trying to replace it.
(Five minutes ago I had to type “laissez-faire” three times. Yesterday it turned “understanding” into god knows what. And if iOS doesn’t stop adding apostrophes to “its” I’m gonna fuckin lose my shit)
I think it’s pretty hard to just improve a NN without more data or some non trivial amount of effort.
Why would it improve the result so much? Sounds very interesting so I'm curious.
It would be better on a problem with horizontal symmetry like CIFAR, especially if the zipping is serialized by unwinding the 2D pixel array into 1D.
One trick that _should_ work is by comparing the distances of starting the compression at the 4 different corners of the image, in the 2 separate directions for each corner. That should provide way more than enough information for k-means clustering.
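A sketch of what those eight serialisations might look like (all names are illustrative, and whether the compressed lengths of these scans actually separate digit classes is untested here):

```python
import zlib
import numpy as np

def corner_scans(img: np.ndarray) -> list:
    """Serialise a 2-D image 8 ways: starting from each of the 4 corners,
    scanning row-major or column-major."""
    scans = []
    for flip_v in (False, True):
        for flip_h in (False, True):
            a = img[::-1, :] if flip_v else img
            a = a[:, ::-1] if flip_h else a
            scans.append(a.ravel().tobytes())    # row-major from this corner
            scans.append(a.T.ravel().tobytes())  # column-major from this corner
    return scans

img = (np.random.rand(28, 28) > 0.5).astype(np.uint8)
features = [len(zlib.compress(s)) for s in corner_scans(img)]  # 8 lengths per image
```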
My apologies again for my mistake, thank you for asking, I wouldn't have really seen that otherwise :')))).
See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.
In order to turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is good, this stream of data will be very low entropy which can be compressed with arithmetic encoding or something similar.
In the case of arithmetic encoding you get lossless compression. (additionally, because arithmetic encoding is based on probabilities (frequencies), if the predictor outputs probabilities instead of a single prediction, this can also be used to crunch the entropy)
Now look at for instance the speech codecs in GSM. They use Linear Prediction Coding, which has the same concept of using a predictor and storing the error stream. Except the latter's coefficients are rounded down, making it a form of lossy compression.
And yes, you can probably make a pretty good (lossless or perhaps even lossy, but I don't think you want that) text compressor by using an LLM to (deterministically) predict (the likelihood of) tokens and storing only the errors/differences. It should be able to outperform zip or gzip, because it can make use of language knowledge to make predictions.
There's a catch, however, which is that in the case of LLM compression you also need to store all those weights somewhere, because you need the predictor to decode. This is always the catch with compression: there is always some kind of "model" or predictor algorithm that is implicit to the compressor. In the case of gzip it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable code. But with gzip we don't count the size of the gzip executable toward our compressed files either, because we get to compress a lot of files with one model. Similarly for this hypothetical LLM-text compression scheme, but you just need to compress a whole lot more text before it's worth it.
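A toy illustration of the predictor-plus-residual scheme (nothing LLM-grade: the "model" here just predicts that each byte repeats the previous one, and the resulting error stream goes through zlib):

```python
import zlib
import numpy as np

def residual_compress(data: bytes) -> bytes:
    arr = np.frombuffer(data, dtype=np.uint8)
    # Prediction-error stream: actual byte minus predicted (= previous) byte, mod 256
    residuals = np.diff(arr, prepend=np.uint8(0))
    return zlib.compress(residuals.tobytes())

def residual_decompress(blob: bytes) -> bytes:
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    # Undo the prediction by accumulating the errors, mod 256
    return np.cumsum(residuals, dtype=np.uint8).tobytes()

# A smooth ramp is awkward for "strings repeat" but trivial for a delta predictor:
# its residual stream is almost entirely the constant byte 1
ramp = bytes(range(256)) * 10
assert residual_decompress(residual_compress(ramp)) == ramp
```

A better predictor (a real language model, say) would shrink the residual entropy further; the mechanics stay the same.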
All that said, however, like many others pointed out, 78% isn't a great score for MNIST.
Then again, I also don't think that gzip compression is the best predictor for grayscale image similarity. For one, if you have two pixels of grayscale values 65 and 66, gzip will see them as "different bytes", regardless of them being very similar in grayscale level. You might even be able to increase the score by thresholding the training set to BW 0/255.
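That thresholding suggestion is a one-liner; a sketch, with random noise standing in for a grayscale image (real MNIST digits should benefit even more, since their backgrounds are uniform):

```python
import gzip
import numpy as np

img = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)  # stand-in grayscale image
bw = np.where(img > 127, np.uint8(255), np.uint8(0))            # threshold to black/white
# Two symbols instead of 256 slashes the per-pixel entropy, so gzip gets much further
print(len(gzip.compress(img.tobytes())), len(gzip.compress(bw.tobytes())))
```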
accuracy - but of what?
what's this about?
The challenge is to build a computer vision model that can tell which numeral each handwritten digit represents.
https://en.wikipedia.org/wiki/MNIST_database
78% accuracy on a solution is pretty bad, but achieving it just using GZIP is a very neat hack.
...
Oh, ok, not that.