story

The $5000 Compression Challenge (2001) (opens in new tab)

patrickcraig.co.uk

181 pointsekiauhce1y ago185 comments

185 comments

The original email thread was from 2001, and it gets posted to HN periodically:

https://news.ycombinator.com/from?site=patrickcraig.co.uk

For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):

http://prize.hutter1.net/

https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)

vlovich1231y ago

I have a fundamental problem with the Hutter prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression. Lossless is mainly a mechanistic act.

_hark1y ago

Say we have some dataset composed of D bytes.

Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.

Then, if N + H < D, my model compresses the data.

It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.

One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:

test error <= train error + model complexity

That is, we don't expect to do any worse on new (test) data, than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.

2 more replies

aleph_minus_one1y ago

Marcus Hutter is free to sponsor a prize according to his definition of intelligence. You are free to sponsor a prize according to your intelligence definition involving a lossy compression.

1 more reply

echoangle1y ago

Isn’t the intelligence shown by compressing lossless the scheme you use? Applying the algorithm is the easy part, the proof of intelligence is inventing the algorithm which compresses.

2 more replies

patmcc1y ago

Intelligence is knowing what you can lose, and losing that and no more.

1 more reply

pizza1y ago

"If it were a fact, it wouldn't be called intelligence."

I think the other way to read it is that you're fundamentally going to have to choose (by your freedom to control your algorithm's design) which of the maps from (2^|x|) -> (2^|compress(x)|) are the ones that actually end up getting no compression (ie |compress(x)| actually > |x| - bc of pigeonhole principle) and which ones are the ones that do get to be called compressed. So you do have an implied kind of lossiness w.r.t. to what parts of the input domain are actually compressed.

Dwedit1y ago

Totally just misread "hutter1" as "hunter2".

teo_zero1y ago

Mike supports $SPORT team the Compressors. He's so sure they are unbeatable that he accepts 50:1 bets against them. Patrick bets 100$ that they won't win this year's championship; Mike accepts. Later, the Compressors announce financial troubles and can't pay the fee to enter the championship, which is then won by another team. Patrick reclaims his 5000$. Mike refuses to pay saying that the Compressors have not been beaten, and the championship result does not disprove his claim that the Compressors are unbeatable, which was the spirit of the bet since the beginning. Mike offers to refund the 100$. Patrick holds that the terms of the bet were met and he deserves the 5000$.

optimalsolver1y ago

>$SPORT team the Compressors

They were lossless last season. :-D

l33t73322731y ago

And the FAQ for the bet said that if a team can’t afford to enter the playoffs then the bet is off.

Dylan168071y ago

1. Which line in the FAQ are you making an analogy to?

2. A FAQ for the newsgroup is not automatically part of the rules for the challenge.

3. If the entire FAQ is treated as rules text, then the rules directly say you cannot win, and that's not an acceptable way to do rules.

1 more reply

stavros1y ago

Sorry, if you're trying to hustle people by xstging $100 per try, don't catch the sleight of hand in the "multiple files" question, and accept, you were beaten at your own game, fair and square.

l33t73322731y ago

I feel like if the FAQ requires not using filename shenanigans then the slight of hand was illegal the whole way.

stavros1y ago

He didn't use filenames, he used files, and if that were illegal, Mike shouldn't have accepted it.

1 more reply

spott1y ago

The FAQ was not part of the challenge statement. It was part of the newsgroup I believe.

fxtentacle1y ago

This guy clearly failed because he didn't actually do any compression, he just ab-used the filesystem to store parts of the data and then tried to argue that metadata was not data...

But FYI someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml

Timwi1y ago

He clearly failed at doing any actual compression, but he did not fail at the challenge. He satisfied every single condition.

He obviously knows that. He knows exactly what compression is and that his solution is not it. He showed (successfully) that meeting the challenge as it was worded did not require compression, which the setter of the challenge didn't realize.

stavros1y ago

If Mike didn't want multiple files, he should have said no to multiple files. The fact that he wanted $100 per try shows that he just got outhustled at his own hustle. He should have been more particular about the rules, and upheld the ones he himself set.

PaulHoule1y ago

Should have charged $100 a file.

recursive1y ago

That argument was made, and addressed in the emails. How are you defining compression? Actual compassion doesn't seem to have been a requirement in the challenge.

maxmcd1y ago

Is there any more information about this solution?

elahieh1y ago

I'm rather skeptical of this claim that the data was compressed in 2018 because there is no further information, apart from a hash value given.

If it's a true claim they must have identified some "non-random" aspect of the original data, and then they could have given more info.

tugu771y ago

Easy.

Save the sha256 hash of original.dat in compressed.dat. The decompressor cats /dev/random until data of the right size comes out with the correct hash.

Now there are two cases.

1. The reconstructed data is actually equal to original.dat. Challenge won, cash in $5000.

2. The reconstructed data differs from original.dat. It has the same hash though, so you found a collision in sha256. World fame.

In either case, win!

hgomersall1y ago

Not necessarily. Consider a big file of random uniformly distributed bytes. It's easy to show that in practice some bytes are more common than others (because random), and that necessarily therefore the expected spacing between those specific bytes is less than 256, which gives you a small fraction of a bit you can save in a recoding of those specific bytes (distance from last byte of a specific value).

With a big enough file those fractions of a bit add up to a non trivial number of bits. You can be cunning about how you encode your deltas too (next delta makes use of remaining unused bits from previous delta).

I haven't worked through all the details, so it might be in the end result everything rebalances to say no, but I'd like to withhold judgement for the moment.

1 more reply

nkrisc1y ago

If the challenge allowed for a decompressor + compressed file size as little as 1 byte less than the original file, was it ever really about compression then? If that’s not in the spirit of the competition, then why was it allowed by the competition?

ball_of_lint1y ago

This strategy or something like it legitimately wins the challenge.

The challenge (based on expected value) is not to find some compression scheme that will compress 100% or even 50% of files to less than their original size. Instead it's to find any (set of) compression schemes that will compress more than 1/50 = 2% of arbitrary random files to less than their original size. You can construct such a program essentially as they described by asking for a large file size and then searching for a valid decompression program within that file. With standard tools you can make some valid decompression program quite short, which will allow you a reasonable probability that it will appear within a large randomly-generated file.

The cost that you pay for using a compression scheme like this is that where the compression scheme works you'll win some bytes, but in the more common case where it doesn't you'll lose some, even for the simple `cat $1 > original` default decompressor.

Now, a clever contest creator might arbitrarily choose files that win against such decompressors that they can think of, but Mike appears to have not have done so.

LegionMammal9781y ago

The problem is what you do outside the file: even if it contains a working decompressor, you have to actually run that decompressor, which most likely requires specifying the byte offset and doing some other work to extract it. And then the decompressor needs to shorten the file even past the length of that byte offset and extractor, which can't be done with reasonable probability.

There are some formats which search for a valid header anywhere in the file, but then they'd get tripped up by all the mostly-valid headers that don't actually yield a decompressed file.

hinkley1y ago

No. You sound as if you’re familiar with the culture but it’s clear that you’re not.

The problem with comp.compression was always that its Eternal September is new people showing up every week claiming they’ve found a universal compression algorithm that compresses everything - the perpetual motion machine of information theory. Having gotten tired of explaining to people who think they’re fighting “dogma” not the laws of the universe, people start trying to find other ways to make them put up or shut up, so they could have some peace and quiet.

Like sending them a file full of RNG output and wait for them to realize this was their teachable moment.

Winning the challenge - without engaging in out of band bullshit (compressor + output should be the giveaway) doesn’t prove you have found a universal compression algorithm. It only proves the RNG is flawed. Which would be very interesting but not break Shannon.

The problem is compressor + file means “no out of band data” to a reasonable person and we have already established we are dealing with unreasonable people.

Not

> My main motivation was to "out-trick the tricker". I thought the chances of me making any money were very remote.

ball_of_lint1y ago

I'm not familiar with the culture.

Nothing in the challenge says anything about doing universal compression. In fact the key point I'm trying to make is that you _don't_ have to make a universal compressor or break entropy to win, instead just find some scheme that makes 2% of random files shorter than the original, even if it makes 98% of files much much longer. The overall entropy is increased. I wouldn't use this for anything else. But it does beat the challenge as stated.

I'd agree that Patrick didn't follow the "spirit" of the challenge or do anything interesting with compression. But in doing this I think he made a good point, which is roughly that you need to be very careful and explicit when posing challenges like this or people are going to use your sloppy wording against you.

1 more reply

ttshaw11y ago

If they're trying to dissuade "universal compressors" then Mike needed to ask for the algorithm first, and then generate his file. If you tell me "I bet you can't compress this file!" then I can do whatever I want to write some stupid one-off compressor to shave a byte off and take your money.

2 more replies

GuB-421y ago

For a truly random file, your 1/50 compression scheme that will make it smaller than the original will only make it smaller by 5 or 6 bits, maybe more if you are lucky, but it is an exponential, so realistically, you won't be able to save more than a byte or two, and you can't write that decompressor in two bytes.

The only way to win is to reverse the random number generator used to create the contest file, or to exploit some bias if there is one. But if the file is generated properly, most likely with some CSPRNG, good luck going that route.

The contest creator has to use a random file, any attempt to be clever will actually decrease entropy and may result in the contestant winning.

teo_zero1y ago

> The only way to win is to reverse the random number generator used to create the contest file

In that case Mike, the contest creator, may declare the result invalid because you haven't satisfied the intent of the challenge.

1 more reply

piannucci1y ago

:shocked_pikachu:

Renegadry aside, for those who are more interested in the Information Theory perspective on this:

Kolmogorov complexity is a good teaching tool, but hardly used in engineering practice because it contains serious foot-guns.

One example of defining K complexity(S, M) is the length of the shortest initial tape contents P for a given abstract machine M such that, when M is started on this tape, the machine halts with final tape contents P+S. Obviously, one must be very careful to define things like “initial state”, “input”, “halt”, and “length”, since not all universal machines look like Turing machines at first glance, and the alphabet size must either be consistent for all strings or else appear as an explicit log factor.

Mike’s intuitive understanding was incorrect in two subtle ways:

1. Without specifying the abstract machine M, the K complexity of a string S is not meaningful. For instance, given any S, one may define an abstract machine with a single instruction that prints S, plus other instructions to make M Turing complete. That is, for any string S, there is an M_S such that complexity(S, M_S) = 1 bit. Alternatively, it would be possible to define an abstract machine M_FS that supports filesystem operations. Then the complexity using Patrick’s solution could be made well-defined by measuring the length of the concatenation of the decompressor P with a string describing the initial filesystem state.

2. Even without adversarial examples, and with a particular M specified, uniform random strings’ K complexity is only _tightly concentrated around_ the strings’ length plus a machine-dependent constant. As Patrick points out, for any given string length, some individual string exemplars may have much smaller K complexity; for instance, due to repetition.

l33t73322731y ago

>uniform random strings’ K complexity is only _tightly concentrated around_ the strings’ length plus a machine-dependent constant

What is the distribution of the complexity of a string? Is there some Chernof-like bound?

its-summertime1y ago

> I think that I did not make a mistake except to believe that you truly intended to attempt to compress the data I was sending you.

Weird phrase to hear from a digital carnival scam artist.

ccleve1y ago

I wonder if it would have been possible to win the challenge legitimately?

If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.

It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.

The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.

dmurray1y ago

I think you can make some argument about why this isn't possible at 50:1 odds.

A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.

This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.

lambdaone1y ago

This is actually an extremely interesting question. 'Weak' files that are more easily compressable than others certainly exist, but with low probability.

For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.

Is it possible to quantify this idea of a 'weak' file more accurately?

4 more replies

kittoes1y ago

What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.

changoplatanero1y ago

Neat idea but chances are the length of the seed is equal to the length of the original file.

2 more replies

Retr0id1y ago

Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.

Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.

phire1y ago

It can't exist.

Presume this compressor guarantees the output will always be at least one byte smaller (with the exception of the impossible to find input)

So just keep running your data in a loop through the compressor. If you start with a 1MB file, it will take a maximum of a million iterations until the output shrinks down to zero bytes, which is the smallest possible file. Record how many iterations it too.

You can now extract your file by feeding a zero byte file into the decompressor and running the same number of iterations. Which You can now store every 1MB (or smaller) file in the world in just 20 bits.... But that would means there are only 1 million possible 1MB files?

Even if you put some minimum output size limitation on the compressor, say it can't produce any file less than 512 bits, the same argument applies. It's just that the numbers get bigger.

spencerflem1y ago

That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.

3 more replies

Dylan168071y ago

You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.

But as stated, what you said is not possible. To compress by a single byte, only 1/256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.

teaearlgraycold1y ago

Don’t offer up a game you aren’t willing to lose. This should have had a mathematical definition and required a formal proof if Mike wanted to play it this way. By allowing for word play he made this a game of interpretation and not a true mathematical game.

echoangle1y ago

I didn’t quite get the method used to „compress“ the data from the article, maybe this rephrasing helps someone:

You basically split the file every time you encounter a specific character, and your compressor just combines all files it finds with the character you split by. If you split at every „X“ Char which might occur 1000 times in the file, the compressor only needs a small script which joins all files and puts an „X“ between them, which is less than 1000 bytes. The „trick“ is storing the location of the Xs you removed in the file sizes of the individual files.

mjevans1y ago

Almost.

The trick is the missing character is static, in the 'decompressor'. It's inserted between every segment which was trimmed to end where that character existed in the original.dat file. The numbers at the end of each segment file correspond to the segment number, as bourne shells increment numbers. It does not handle errors or validate the output.

It does fulfill the discussed terms of the challenge, which fail to include any reference to an external set of rules.

oezi1y ago

My take was that the information is stored in the ordering of the files. The decompressor doesn't care about the file size of each file, right?

Dylan168071y ago

Both are needed. If you want to transmit this solution from one computer to another you need to store the size of each file (or insert a fancy delimiter that takes even more space).

johnfn1y ago

This is hilarious, but Mike’s behavior boils my blood. To switch the rules when fairly beaten is just so scummy!

Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)

Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you’ll find some such that the result ends with n 0s (say), so then your result could be something like encrypted + key + n. Typically you’d expect to find that the length of the encryption key would be roughly the length of the number of 0s, so you wouldn’t actually net any savings. But I think if you searched long enough, you might “get lucky” and find a key that generated more 0s then the length of the key, so you could theoretically win the challenge. Of course you’d have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn’t seem impossible like Mike seems to suggest. Perhaps just requires an intractable amount of computing power. :)

I’m not sure if any of that made sense. I think I’m using the wrong terms for things.

crazygringo1y ago

I remember reading from a previous thread how yes you can do something like that (or lots of other techniques) but they will only work on a small percentage of files.

That sometimes the smallest decompressor (including the key, in your example) will result in overall less space, but usually it will result in more (because the smallest key that generates 10 zeros happens to be 14 digits long, etc.).

And so ultimately the more interesting question becomes about those probabilities, and therefore what is the right entry price for the bet to be even?

Xcelerate1y ago

There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM if fixed, you're out of luck for random data. But if the UTM is unspecified, then there exist an infinite number of UTMs that will compress any specific string.

To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the choice of language that encodes the missing data. For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.

If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.

There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x|K_U(x) - K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.

Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.

Nevertheless it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.

SideQuark1y ago

The problem with the UTM argument designed for a specific string is that the UTM size grows without bound. You also don’t need a UTM, which will be much larger than a simple TM or finite automaton. And now we’re back to the original problem: designing a machine capable of compressing the specific string and with smaller description.

UTM provides no gain.

Xcelerate1y ago

You cannot specify the “size” of an arbitrary Turing machine without knowing the index of that machine in some encoding scheme. If we are to account for the possibility of any string being selected as the one to compress, and if we wish to consider all possible algorithmic ways to do this, then the encoding scheme necessarily corresponds to that of a universal Turing machine.

It’s a moot point anyway as most programming languages are capable of universal computation.

Again, I’m not claiming we can compress random data. I’m claiming we can (sometimes) make it look like we can compress random data by selecting a specific programming language from one of many possibilities. This selection itself constitutes the “missing” information.

1 more reply

_hark1y ago

The issue with the invariance theorem you point out always bugged me.

Let s be an algorithmically random string relative to UTM A. Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small? I.e. the blank print statement of S produces s. And there always exists such an S for any s?

Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM? It seems intuitive that although some pathological UTM might exist that can "compress" whichever string you have, its construction appears very unnatural. Is there some way of formalizing this "naturalness"?

Xcelerate1y ago

> Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small

Yes. It’s not even that hard to create. Just take a standard UTM and perform a branching “if” statement to check if the input is the string of interest before executing any other instructions.

> Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM?

Haha, not that anyone knows of. This is one of the issues with Solomonoff induction as well. Which UTM do we pick to make our predictions? If no UTM is privileged over any other, then some will necessarily give very bad predictions. Averaged over all possible induction problems, no single UTM can be said to be superior to the others either. Solomonoff wrote an interesting article about this predicament a while ago.

(A lot of people will point to the constant offset of Kolmogorov complexity due to choice of UTM as though it somehow trivializes the issue. It does not. That constant is not like the constant in time complexity which is usually safe to ignore. In the case of Solomonoff induction, it totally changes the probability distribution over possible outcomes.)

1 more reply

Ono-Sendai1y ago

You are correct to be bugged IMO, I agree with you. My thoughts: https://forwardscattering.org/page/Kolmogorov%20complexity

Kolmogorov complexity is useless as an objective measure of complexity.

2 more replies

avmich1y ago

I particularly like this:

> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

PaulKeeble1y ago

I would not have accepted multiple files nor scripting. Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data. To really test compression/decompression the rules need to be a lot tighter. The challenge was met and beat because the criteria was bad. It leaves open a number of dishonest approaches.

The decompression program needs to be constrained to itself and the singular file to decompress and can not utilise anything other than basic operating system functions for the purpose of reading and writing the constrained files. Any bytes the program accesses otherwise can be used as part of the total count.

It's very hard to constrain this without some form of chest being possiblemespecially if you have some years to set it up.

echoangle1y ago

> Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data.

I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.

If you’re given a file with 100 bits, there are 2^100 possibilities. If you want to compress to 99 bits, there are only 2^99 possible compressed outputs. So when generating and trying to compress every 100 bit file, you’re not going to be able to compress half of them. The external data doesn’t help you unless you get to modify it after receiving the file to compress. If you can get patched merged into the Linux kernel, you could use the data, but if the data is fixed, you can’t reference the data with less data than the original file (for some cases atleast).

Edit: actually, when thinking about it: you could actually use the OS as another few bits of out of band information you get for free, which makes it possible again. If you have 8 different OS to choose from, you could use their differences to compress any file. At least theoretically, the 3 bits you gain practically probably aren’t enough to make a difference.

cozzyd1y ago

My decompressor:

curl ftp://path/to/original.dat

PaulKeeble1y ago

One possibility I considered is that your script could use an "apt install" to provide out of band data, you could put the information you need into the operating system from the universe repositories/your own package server.

1 more reply

dang1y ago

Related. Others?

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)

theyapper1y ago

This compression challenge discussion raises fascinating questions about how we approach data representation. While I agree with the core information theory principles being cited, I wanted to share something relevant - US Patent 12,136,933 B2 just granted for a fascinating approach called "Kinetic Data Primers" (KDP).

Let's step back and consider a different perspective that builds on some key mathematical principles:

Every binary file, regardless of its content or randomness, can be viewed as representing one specific large number. This isn't controversial - it's just a different way of looking at the data.

Many large numbers can be expressed more efficiently using mathematical notation (e.g., 10^100 is far more compact than writing out 1 followed by 100 zeros). Furthermore, this efficiency advantage tends to increase with larger numbers.

These numerical conversions are perfectly lossless and reversible. We can go from binary to decimal and back without losing any information.

This naturally leads to some interesting questions: What happens to our usual compression impossibility proofs when we consider perfect numerical transformations rather than traditional pattern matching? Could mathematical expressions capture patterns that aren't obvious in binary form? As numbers get larger, does the potential for more efficient mathematical representation increase?

The KDP patent explores some of these ideas in depth. I'm not claiming this solves all compression challenges - but I think it highlights how fresh perspectives and mathematical approaches might reveal new ways of thinking about data representation.

Would be curious to hear others' thoughts, especially from those with expertise in number theory or information theory. How do these mathematical properties interact with our traditional understanding of compression limits?

Panzer041y ago

I would have expected it to be possible to compress a single arbitrary random file with a small program. I would have thought an RNG could generate a file with some weakness that allows you to compress it, although said compressor would be worse at other inputs.

Likewise this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P

bo10241y ago

Absolutely, but here's the catch. You're designing the compression and decompression program based on the particular file. So the decompression program is really part of the data that's used to reconstruct the program. (For example you could replace a certain 1MB stretch of random data with a short magic word, but the decompressor would have to contain that 1MB so it knows what to insert when it sees the magic word.) That's why the original challenge measured the length of the data plus the decompressor program.

codazoda1y ago

I expected this too, so I asked an LLM to write a quick Go program that would tell me how many bytes repeated. I created a 1.44MB file from /dev/random and then ran it through the program. Here are the bytes that repeat the most.

  Top 10 repeated byte patterns and their counts:
  Bytes: 7eda16, Count: 3
  Bytes: 65b1a4, Count: 3
  Bytes: 71d745, Count: 3
  Bytes: b72808, Count: 2
  Bytes: 60e3ee, Count: 2
  Bytes: 6e9152, Count: 2
  Bytes: 26446b, Count: 2
  Bytes: e4a05a, Count: 2
  Bytes: 67f86a, Count: 2
  Bytes: 92c487, Count: 2

Since the most common three byte sequence only appears 3 times, it seems like a non-starter. No longer byte sequences appeared more than twice either.

I generated the random digits with this:

  dd if=/dev/random of=random.bin bs=1024 count=1440

brody_hamer1y ago

I thought the same thing. Surely a random binary sequence will have some (small) repeating sequences? So as long as we can find them and (efficiently) mark their locations, we can have some small size reduction.

Which is basically what he’s done here. He’s cheating by marking the location of the repeating sequence using unique files, rather than some other actually more efficient location system.

This is a fun discussion! I think it helps to make compression feel more approachable for armchair enthusiasts.

vunderba1y ago

Story time. In our Data Structures and Algorithms class during my undergrad, we had a similar challenge posed to us by an associate professor who was quite proud of the fact that it had never been beaten.

The sophistication of my compression understanding at the time extended to lempel-ziv and huffman and well... that's about it. So I was like, “Okay, I’m not going to create the world’s best compression method on a whim.” So I started thinking: if I can’t compress the file, maybe I can just fetch it from somewhere else.

The first thing I tried was writing a program that searches online by uploading the file onto a public FTP. From there, I could fetch it. Simple enough, right?

Well, here’s the issue with that: there’s no guarantee I’d even have internet connectivity. That’s probably the first thing he’d block.

So, he gives us the file. I think it was something like 10 MB, and when I looked through it, the entropy was high. Like, really high. Standard compression tools like 7Z or rar wanted NOTHING to do with it.

Then I started thinking again. Maybe this guy’s running all the submissions back-to-back. He’s probably processing them in sequence because he’s got tons of entries coming in. So I thought: if I can’t download the file, maybe there's a chance it's already on the machine from somebody's previous simulation - I can find it somewhere on the hard drive. My program might not be a great decompression system, but you know what it is? A fantastic file search.

So, I threw together a program that tree searched the entire disk with a simple checksum to verify I’d found the exact right file. If it matched, I’d just copy the file into the current directory and say, “I’m done.” - though if it found the file too quickly it would continue idling for a while so as to avoid suspicion. (I was also banking on the fact that someone else’s program might’ve run before mine, leaving the file lying around somewhere for me to grab.)

Of course, I couldn’t make this look too obvious. I mean, imagine submitting a tiny C++ program claiming it could compress a 10 MB file into just a few kilobytes. He’d immediately know it was fake.

So, I padded the compressed binary, adding a bunch of random hexadecimal and binary junk to bulk it up to around 5 MB. It was still phenomenal—better than any other method on the market for this type of file—but at least it seemed somewhat plausible. I mean, maybe I’d "discovered" some revolutionary new compression algorithm or something and smashed the weissman score.

The contest went on for nearly a month, and the guy had a leaderboard tracking scores. Most people were hovering just slightly above standard compression rates—around 9 MB. Nothing groundbreaking, though.

So, it’s a Friday night. I finish the program, pack it all up, and send it off. Then I go to bed. Needless to say, I woke up with a quite an excited response from my professor.

3eb7988a16631y ago

AND?! What was his response once you had to come clean?

Edit: I loved this (minus the missing ending). Really sucked me in with the level of detail.

alt2271y ago

Come on dude, great story, missing ending! Fill us in please.

samatman1y ago

I think Mike is correct. My reasoning by analogy follows.

There's a game called Skedaddle, and a Usenet group, rec.players.skedaddle. A well-known player, Mike, offers a $5000 challenge for a feat, called the four-froot-hoot, which he believes is simply impossible to achieve.

A fellow skedaddling enthusiast, Patrick, takes him up on the challenge, some baseline terms are negotiated, and Patrick presents his putative four-froot-hoot. Some might say it meet the terms of the challenge.

Mike objects, he says "you're a rat, Pat! look at the FAQ, Jack, that's not even skedaddle" and indeed, the F.A.Q. of rec.players.skedaddle does indeed state in plain language, that it ain't even skedaddle so it can't be a four-froot-hoot.

You make a bet in the club, you play by the club rules.

sfink1y ago

But the challenge wasn't for performing four-froot-hoot, the challenge was described in text and adhered to. Mike thought he was describing four-froot-hoot, but accidentally only made the challenge about four-froot. The club rules even described why four-froot was not an interesting challenge unless it were four-froot-hoot, which makes it doubly on Mike to issue the challenge for four-froot-hoot and not just four-froot, yet he flubbed it.

The "it ain't even skedaddle" aspect is just incorrect. Compression is a term of art, and it is very general. Saying that "hiding" things in the filesystem is cheating is no different than saying that using the previous frame to reduce the size of the following frame in a video compressor is cheating. Yes, you do have to be careful of exactly what your inputs are, but there is great value in taking advantage of additional forms of input even if they might be initially unexpected.

Besides, Mike's smugness deserves some comeuppance.

ultimafan1y ago

This is really how I see it too. If this spirit was really so important to the bet (and not just vaguely implied) it should have been stated clearly. Friendly ribbing over "you won by rules but not in spirit" is one thing when it's a free wager between buddies but not with real money on the line. Falling back on a position of spirit and intent trumping hard written rules in text in real money wagers basically guarantees that the banker has no real incentive to rule fairly on a win that can cost him money if he can get away with the loss of reputation- he can always weasel out of it one way or another by acting in bad faith and engaging in pedantry during the judging. No one would ever take up someone on a bet if they could turn around last second and say, well you didn't mind-read the additional rules and stipulations I'm adding on now after the fact to our original written agreement so in actuality you lost.

Additionally Mike's behavior during the exchange makes him feel all the more untrustworthy. His condescension, playing the fool and intentionally misinterpreting the test to "win", remarks made that seem specifically placed to needle and aggravate Pat knowing there's no way to force him to pay up, and threats of accusations of fraud were a show of really poor character. At the least it would've been more of a class act (even if he never paid out) to admit that Pat outplayed him due to naivety and a self inflated sense of cleverness on Mike's part, to admit that he is not familiar with betting culture and got in over his head.

moonchild1y ago

suppose the data were generated by a csprng with 256 bits of state. then in fact there are just slightly more than 256 bits of entropy there (prng state plus the length of the generated file). only, you'd have a snail's chance in hell of actually finding that seed

echoangle1y ago

If he really wanted to make sure he kept the money, he could just use a hardware RNG. I don’t know how common they were in 2001 but you could probably even generate some practically unpredictable random numbers using radio noise.

tugu771y ago

Mike thanked random.org for the data. It's exactly what they are doing.

andrei-akopian1y ago

I though he would try to net profit through probability by requesting to regenerate the file and hope to get a compressible one in at least 5000/100=50 times. (Though I don't think that would work.)

nojvek1y ago

I’d take this bet.

> With this offer, you can tune your algorithm to my data.

One can generate a 1GB file or 10GB file. It is highly likely that there is some form of a repeatable pattern in there to shave off 50-100 bytes by sliding window search.

Then the decompressor is essentially - at this index, expand this pattern. The compressed file excludes the range of pattern.

One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.

You could generate a 100GB file. The bigger the file the higher the chance of repeating pattern.

The challenge is won if compressed_file + decompressor is less one byte than original file.

One could have a self executing decompressor to save some file overhead bytes.

SideQuark1y ago

Your idea fails pretty easily. Simply do the math, and you’ll find your idea won’t work.

One argument: if your method worked, then we could compress nearly any file. This violates the pigeonhole principle.

Another example: for a 4GB file you’d need roughly a 4 byte integer to specify where the repeat started and where the second was, for 8 bytes. Then you need a length byte. Now you need a repeated 10+ bytes sequence to make this compress anything. There are 256^10=2^80 possible such 10 bytes sequence sequences, and only ~2^32 of them in a 4 gb file. So the odds of a repeated are around 1 in 2^50.

Tweak methods and estimate as you wish. It fails.

alt2271y ago

The whole point is Mike had control over creating the source file, and so obviously checked for random repeating patterns and removed them before giving the file to the challenger.

On top of this, it takes more data to store the address of the repeating byte pattern than the original pattern does. Therefore you are creating more data than you are saving.

If it was this easy then hundreds of people would have taken on the challenge and won the £5k.

crazygringo1y ago

As a general principle, absolutely.

In practice, I wonder what size of file we're talking about that would result in net compression on random data 50% of the time?

I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.

SideQuark1y ago

Nope. As your file gets longer so does the data needed to specify where the repeats are. See my other estimate in this thread. It fails spectacularly.

Jerrrry1y ago

good compression = pattern eventually

maximum compression = indelineable from random data

Uptrenda1y ago

Yeah, I'd strongly, strongly, in fact plead for people not to try this challenge. When it comes to 'compressing' random data: every single bit in random information is 'significant.' I am quite sure that the problem is logically meaningless, mathematically impossible, and a complete waste of effort. But more-so: if it weren't a waste of effort (it is) - I would still highly doubt any algorithm existed that was more efficient than brute force search. In this case -- the search for a function that solves the constraints is going to be like trying to brute force a cryptographic private key value. Simply because every 'bit' of entropy is valuable information.

Now lets look at how modern compression actually: works. Images, movies, text-documents, audio files... there is fundamental --structure-- in all this data. In text it might be using only a certain character range, a certain lexical word list, and so on. There may be geometric information in images and movies that could be turned into code to drastically cut down file size. With audio -- you can sample it and discard what you don't need. And so on and so fourth. Structure. But how do you apply any of these methods to random data? There is no structure, you can't sample it, reduce it, simplify it... Every bit has meaning.

So please, dear human thinking about compression of random data. It is an insane waste of intellectual effort. It's impossible. Don't do it. I made the same mistake (and setup vast amounts of HPC to work on algorithms.) COMPLETE. Waste of time.

hyperpape1y ago

It’s amazing that this commentary on the value of prediction markets was written in 2001.

GrumpyNl1y ago

Is there a link to the file that he had to compress?

trod12341y ago

This gets reposted quite often.

The crux of the ongoing debate over the post has to do with the ambiguity of the stated requirements.

For there to be any rational discussion, you need to have proper unique definitions for what is being communicated. Without a definition (identity) no system of truth seeking will provide a definitive truth.

Most notably, "compression" has many different meanings currently in practical use. For example many people use the word compression and simply mean that the uncompressed size is more than the compressed size; lossy vs lossless doesn't matter when its good enough, all that matters is its been reduced, and an example of that might be H264 video.

Other more rigorous individuals will use it to mean entropy encoding, which has a strict definition, part of which is lossless, and in these cases they often view Shannon's theorems as gospel, and at that point they often stop thinking and may not try to sharpen their teeth on a seemingly impossible task.

These theorems are often very nuanced, with the bounds very theoretical or still undiscovered, though still seemingly correct. When someone is told something is impossible, they don't fully apply themselves to try to find a solution. Its a worn path that leads nowhere, or so they imagine.

This mindset of learned helplessness prevents a nuanced intuitive understanding of such problems and any advances stagnate. People take the impossibility at face value, and overgeneralize it (without that nuanced understanding), this is fallacy and it can be very difficult to recognize it when there is no simple way to refute it on its face.

Here is something to consider. Shannon's source coding theorem relies on the average uncertainty of probabilities as a measure of entropy (information). Some exploration is needed to understand what makes this true, or where it fails. It has been my experience that only a very few exceptional people ever attempt this process.

The main question that jumps out in my mind since compression has been a thing I've been interested in for years; is, can an average entropy in statistics tell us accurately whether this is true in systems with probabilities and distributions that are a mix of mathematical chaos but also contain regularity within their domains? Statistics can be unwieldy beasts, and act as regular sources of misunderstanding and falsehood without a very rigorous approach.

For a similar problem, people may find this article particularly relevant with regards to the chaotic 3-body problem in astrophysics.

https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-...

Supermancho1y ago

> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.

recursive1y ago

The act of offering a cash prize to prove you wrong kind of makes you look like a jerk. It's assumed that you will use every loop hole at your disposal. Seeing the situation reversed so elegantly makes me root for the challenger.

stavros1y ago

As Patrick said, if he had gotten a 1000-byte file and sent a 999-byte file/decompressor file back, nobody would have counted the inode data as a part of the solution. Why count it now?

Supermancho1y ago

Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor). This topic ise the next level trolling that went on. The summary and ng faw, explained why these file metagames aren't useful. They obscure the lack of a compression strategy (eg if the inode equivalent was 2ft of tape) by nature of dispersing information into the fs.

1 more reply

ncruces1y ago

Who's the bigger troll?

Putting up the initial challenge, especially with the accompanying commentary, is just as much trolling as that.

Timwi1y ago

I do agree that this last sentence (the one you quoted) is very unfortunate and doesn't match the wit of the rest. A better closing remark might have been something on the lines of, “I never claimed to have performed any compression. I only claimed to have met the terms of the challenge. I asked if I can send multiple files and the response was ‘sure’. I provided the files; they meet the total file size criterion; and the script correctly reconstructs the original files.”

permo-w1y ago

this sounds like a Nigerian Prince scammer set-up

evilotto1y ago

This is why we can't have nice things.

Some volunteer puts in time maintaining something and makes a claim that is obviously correct and - most likely in jest - promises cash to anyone who shows it to be wrong. Then some rules lawyer decides that he's better than the unpaid volunteer and does his best to humiliate him, just for the lulz.

Is it any surprise that volunteers rarely stick around for long?

Nowadays lots of companies run bug bounty programs where if you find a bug and report it responsibly you can get real cash. This is a valuable service. And I'm positive that these programs get gobs of submissions from people trying to see if they can find a bug in the rules, rather than in the software. If the people vetting the submissions to such programs weren't getting paid for it, I'm sure they would quit in disgust.

ultimafan1y ago

The "unpaid" volunteer in question was raking in $100 per attempt for an arguably impossible task. That unquestionably moves the situation from a bounty/prize to a gambling house taking rigged bets and it's pretty clear by his language in the emails that he took pleasure in his position. How many fools did he part from their money before getting one upped by Patrick? And how did he act when he finally realized he wasn't as clever as he thought?

I'd agree that if this was a free entry situation he'd be fully within his rights turning down trolls, rules lawyers, etc. for the same reasons you mentioned. Trying to scam a well intentioned but naive bounty post would be sad behavior. But this guy was clearly taking money on a position he never intended to pay out on and not losing any sleep over it.

evilotto1y ago

That's not what it was at all.

Any time you say that something can't be done, you will attract people who say "I have found a way to do it!" Even more so if you have a proof that it can't be done. This phenomenon is far older than the internet; the term "morbus cyclometricus" suggests its origins in antiquity.

The FAQ maintainer was not looking to profit off fools. They wanted to chase them off; after spending several pages explaining why the task is impossible it is tedious to explain to a crank (the term generally used for such people, as described by Underwood Dudley) that they are in fact wrong, the task is impossible, and here's why. It's easier to say "Pay me if you want to waste me time" and reasonably assume that no one will do it.

It's similar in spirit to the James Randi challenge. You're not going to win it, because you're not going to prove that math or science is wrong. But the JREF had as its goal to expose the charlatans, and had the time and resources to devote to the task. A newsgroup FAQ maintainer has neither the time nor the resources. So they shouldn't have volunteered? Then you have no volunteers. Hooray for the internet.

Now yes, every once in a great while someone will come along and legitimately do something that was claimed un-doable; the obvious case is George Dantzig. But that case also shows that any scientist or mathematician who can be shown an error in something thought impossible will be thrilled at the discovery, because that is new knowledge, which is the whole point. Poking a hole in the rules of the contest, finding a weasel way around them, is not something interesting at all.

1 more reply

benchmarkist1y ago

Take random bytes from /dev/urandom, encrypt it. There is no way any compressor can reduce the size of that file. I'm pretty sure there is no way the second player can win this game. Actually, the encryption part isn't even necessary. There is no compressor that can reduce the size of a random stream of bytes.

cortesoft1y ago

There is no general purpose compressor that can achieve compression on all random streams of bytes... but there is a possibility to compress a specific random stream of bytes, if you get lucky

lambdaone1y ago

Absolutely! Can this probability be quantified?

1 more reply

quuxplusone1y ago

The encryption part might actually be harmful: for (silly) example, if you encrypt using the Bacon cipher, which turns every letter into five letters all of which are either "a" or "b". :) You'd need to pick an encryption method that doesn't expand the text at all, not even via a header or footer block. Better to stick with bytes generated uniformly at random, in which case you are correct.

[1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher

Retric1y ago

But you don’t need to compress every possible file to make playing such a game a good idea.

Suppose if you shave 1 bit off of a sample bit stream you win and you lose if that fails. Your compressor looks at the first 2 bits and if it starts 11 it starts with 1 and otherwise it starts 0 then the bit stream. Now you have a 1 in 4 chance to compress a random string by 1 bit and. A 3 in 4 chance of making it longer by 1 bit.

It’s ultimately not about compression but what odds are worth it.

misiek081y ago

I read „decompressor and compressed file”, not „fileS”. There is only one person correct here

stavros1y ago

You should read the rest of the thread.

optimalsolver1y ago

He asks Mike if he can send multiple files. Mike says "sure".

j / k navigate · click thread line to collapse

185 comments

omoikane1y ago

The original email thread was from 2001, and it gets posted to HN periodically:

https://news.ycombinator.com/from?site=patrickcraig.co.uk

For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):

http://prize.hutter1.net/

https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)

vlovich1231y ago

_hark1y ago

Say we have some dataset composed of D bytes.

Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.

Then, if N + H < D, my model compresses the data.

It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.

One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:

test error <= train error + model complexity

2 more replies

aleph_minus_one1y ago

Marcus Hutter is free to sponsor a prize according to his definition of intelligence. You are free to sponsor a prize according to your intelligence definition involving a lossy compression.

1 more reply

echoangle1y ago

Isn’t the intelligence shown by compressing lossless the scheme you use? Applying the algorithm is the easy part, the proof of intelligence is inventing the algorithm which compresses.

2 more replies

patmcc1y ago

Intelligence is knowing what you can lose, and losing that and no more.

1 more reply

pizza1y ago

"If it were a fact, it wouldn't be called intelligence."

Dwedit1y ago

Totally just misread "hutter1" as "hunter2".

teo_zero1y ago

optimalsolver1y ago

>$SPORT team the Compressors

They were lossless last season. :-D

l33t73322731y ago

And the FAQ for the bet said that if a team can’t afford to enter the playoffs then the bet is off.

Dylan168071y ago

1. Which line in the FAQ are you making an analogy to?

2. A FAQ for the newsgroup is not automatically part of the rules for the challenge.

3. If the entire FAQ is treated as rules text, then the rules directly say you cannot win, and that's not an acceptable way to do rules.

1 more reply

stavros1y ago

Sorry, if you're trying to hustle people by xstging $100 per try, don't catch the sleight of hand in the "multiple files" question, and accept, you were beaten at your own game, fair and square.

l33t73322731y ago

I feel like if the FAQ requires not using filename shenanigans then the slight of hand was illegal the whole way.

stavros1y ago

He didn't use filenames, he used files, and if that were illegal, Mike shouldn't have accepted it.

1 more reply

spott1y ago

The FAQ was not part of the challenge statement. It was part of the newsgroup I believe.

fxtentacle1y ago

This guy clearly failed because he didn't actually do any compression, he just ab-used the filesystem to store parts of the data and then tried to argue that metadata was not data...

But FYI someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml

Timwi1y ago

He clearly failed at doing any actual compression, but he did not fail at the challenge. He satisfied every single condition.

stavros1y ago

PaulHoule1y ago

Should have charged $100 a file.

recursive1y ago

That argument was made, and addressed in the emails. How are you defining compression? Actual compassion doesn't seem to have been a requirement in the challenge.

maxmcd1y ago

Is there any more information about this solution?

elahieh1y ago

I'm rather skeptical of this claim that the data was compressed in 2018 because there is no further information, apart from a hash value given.

If it's a true claim they must have identified some "non-random" aspect of the original data, and then they could have given more info.

tugu771y ago

Easy.

Save the sha256 hash of original.dat in compressed.dat. The decompressor cats /dev/random until data of the right size comes out with the correct hash.

Now there are two cases.

1. The reconstructed data is actually equal to original.dat. Challenge won, cash in $5000.

2. The reconstructed data differs from original.dat. It has the same hash though, so you found a collision in sha256. World fame.

In either case, win!

hgomersall1y ago

I haven't worked through all the details, so it might be in the end result everything rebalances to say no, but I'd like to withhold judgement for the moment.

1 more reply

nkrisc1y ago

ball_of_lint1y ago

This strategy or something like it legitimately wins the challenge.

Now, a clever contest creator might arbitrarily choose files that win against such decompressors that they can think of, but Mike appears to have not have done so.

LegionMammal9781y ago

There are some formats which search for a valid header anywhere in the file, but then they'd get tripped up by all the mostly-valid headers that don't actually yield a decompressed file.

hinkley1y ago

No. You sound as if you’re familiar with the culture but it’s clear that you’re not.

Like sending them a file full of RNG output and wait for them to realize this was their teachable moment.

The problem is compressor + file means “no out of band data” to a reasonable person and we have already established we are dealing with unreasonable people.

Not

> My main motivation was to "out-trick the tricker". I thought the chances of me making any money were very remote.

ball_of_lint1y ago

I'm not familiar with the culture.

1 more reply

ttshaw11y ago

2 more replies

GuB-421y ago

The contest creator has to use a random file, any attempt to be clever will actually decrease entropy and may result in the contestant winning.

teo_zero1y ago

> The only way to win is to reverse the random number generator used to create the contest file

In that case Mike, the contest creator, may declare the result invalid because you haven't satisfied the intent of the challenge.

1 more reply

piannucci1y ago

:shocked_pikachu:

Renegadry aside, for those who are more interested in the Information Theory perspective on this:

Kolmogorov complexity is a good teaching tool, but hardly used in engineering practice because it contains serious foot-guns.

Mike’s intuitive understanding was incorrect in two subtle ways:

l33t73322731y ago

>uniform random strings’ K complexity is only _tightly concentrated around_ the strings’ length plus a machine-dependent constant

What is the distribution of the complexity of a string? Is there some Chernof-like bound?

its-summertime1y ago

> I think that I did not make a mistake except to believe that you truly intended to attempt to compress the data I was sending you.

Weird phrase to hear from a digital carnival scam artist.

ccleve1y ago

I wonder if it would have been possible to win the challenge legitimately?

It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.

The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.

dmurray1y ago

I think you can make some argument about why this isn't possible at 50:1 odds.

This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.

lambdaone1y ago

This is actually an extremely interesting question. 'Weak' files that are more easily compressable than others certainly exist, but with low probability.

Is it possible to quantify this idea of a 'weak' file more accurately?

4 more replies

kittoes1y ago

What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.

changoplatanero1y ago

Neat idea but chances are the length of the seed is equal to the length of the original file.

2 more replies

Retr0id1y ago

Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.

phire1y ago

It can't exist.

Presume this compressor guarantees the output will always be at least one byte smaller (with the exception of the impossible to find input)

Even if you put some minimum output size limitation on the compressor, say it can't produce any file less than 512 bits, the same argument applies. It's just that the numbers get bigger.

spencerflem1y ago

That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.

3 more replies

Dylan168071y ago

You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.

But as stated, what you said is not possible. To compress by a single byte, only 1/256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.

teaearlgraycold1y ago

echoangle1y ago

I didn’t quite get the method used to „compress“ the data from the article, maybe this rephrasing helps someone:

mjevans1y ago

Almost.

It does fulfill the discussed terms of the challenge, which fail to include any reference to an external set of rules.

oezi1y ago

My take was that the information is stored in the ordering of the files. The decompressor doesn't care about the file size of each file, right?

Dylan168071y ago

Both are needed. If you want to transmit this solution from one computer to another you need to store the size of each file (or insert a fancy delimiter that takes even more space).

johnfn1y ago

This is hilarious, but Mike’s behavior boils my blood. To switch the rules when fairly beaten is just so scummy!

Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)

I’m not sure if any of that made sense. I think I’m using the wrong terms for things.

crazygringo1y ago

I remember reading from a previous thread how yes you can do something like that (or lots of other techniques) but they will only work on a small percentage of files.

And so ultimately the more interesting question becomes about those probabilities, and therefore what is the right entry price for the bet to be even?

Xcelerate1y ago

Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.

SideQuark1y ago

UTM provides no gain.

Xcelerate1y ago

It’s a moot point anyway as most programming languages are capable of universal computation.

1 more reply

_hark1y ago

The issue with the invariance theorem you point out always bugged me.

Xcelerate1y ago

> Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small

Yes. It’s not even that hard to create. Just take a standard UTM and perform a branching “if” statement to check if the input is the string of interest before executing any other instructions.

> Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM?

1 more reply

Ono-Sendai1y ago

You are correct to be bugged IMO, I agree with you. My thoughts: https://forwardscattering.org/page/Kolmogorov%20complexity

Kolmogorov complexity is useless as an objective measure of complexity.

2 more replies

avmich1y ago

I particularly like this:

> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

PaulKeeble1y ago

It's very hard to constrain this without some form of chest being possiblemespecially if you have some years to set it up.

echoangle1y ago

> Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data.

I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.

cozzyd1y ago

My decompressor:

curl ftp://path/to/original.dat

PaulKeeble1y ago

1 more reply

dang1y ago

Related. Others?

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)

The $5000 Compression Challenge - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)

theyapper1y ago

Let's step back and consider a different perspective that builds on some key mathematical principles:

Every binary file, regardless of its content or randomness, can be viewed as representing one specific large number. This isn't controversial - it's just a different way of looking at the data.

These numerical conversions are perfectly lossless and reversible. We can go from binary to decimal and back without losing any information.

Panzer041y ago

Likewise this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P

bo10241y ago

codazoda1y ago

  Top 10 repeated byte patterns and their counts:
  Bytes: 7eda16, Count: 3
  Bytes: 65b1a4, Count: 3
  Bytes: 71d745, Count: 3
  Bytes: b72808, Count: 2
  Bytes: 60e3ee, Count: 2
  Bytes: 6e9152, Count: 2
  Bytes: 26446b, Count: 2
  Bytes: e4a05a, Count: 2
  Bytes: 67f86a, Count: 2
  Bytes: 92c487, Count: 2

Since the most common three byte sequence only appears 3 times, it seems like a non-starter. No longer byte sequences appeared more than twice either.

I generated the random digits with this:

  dd if=/dev/random of=random.bin bs=1024 count=1440

brody_hamer1y ago

Which is basically what he’s done here. He’s cheating by marking the location of the repeating sequence using unique files, rather than some other actually more efficient location system.

This is a fun discussion! I think it helps to make compression feel more approachable for armchair enthusiasts.

vunderba1y ago

The first thing I tried was writing a program that searches online by uploading the file onto a public FTP. From there, I could fetch it. Simple enough, right?

Well, here’s the issue with that: there’s no guarantee I’d even have internet connectivity. That’s probably the first thing he’d block.

So, it’s a Friday night. I finish the program, pack it all up, and send it off. Then I go to bed. Needless to say, I woke up with a quite an excited response from my professor.

3eb7988a16631y ago

AND?! What was his response once you had to come clean?

Edit: I loved this (minus the missing ending). Really sucked me in with the level of detail.

alt2271y ago

Come on dude, great story, missing ending! Fill us in please.

samatman1y ago

I think Mike is correct. My reasoning by analogy follows.

You make a bet in the club, you play by the club rules.

sfink1y ago

Besides, Mike's smugness deserves some comeuppance.

ultimafan1y ago

moonchild1y ago

echoangle1y ago

tugu771y ago

Mike thanked random.org for the data. It's exactly what they are doing.

andrei-akopian1y ago

I though he would try to net profit through probability by requesting to regenerate the file and hope to get a compressible one in at least 5000/100=50 times. (Though I don't think that would work.)

nojvek1y ago

I’d take this bet.

> With this offer, you can tune your algorithm to my data.

One can generate a 1GB file or 10GB file. It is highly likely that there is some form of a repeatable pattern in there to shave off 50-100 bytes by sliding window search.

Then the decompressor is essentially - at this index, expand this pattern. The compressed file excludes the range of pattern.

One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.

You could generate a 100GB file. The bigger the file the higher the chance of repeating pattern.

The challenge is won if compressed_file + decompressor is less one byte than original file.

One could have a self executing decompressor to save some file overhead bytes.

SideQuark1y ago

Your idea fails pretty easily. Simply do the math, and you’ll find your idea won’t work.

One argument: if your method worked, then we could compress nearly any file. This violates the pigeonhole principle.

Tweak methods and estimate as you wish. It fails.

alt2271y ago

The whole point is Mike had control over creating the source file, and so obviously checked for random repeating patterns and removed them before giving the file to the challenger.

On top of this, it takes more data to store the address of the repeating byte pattern than the original pattern does. Therefore you are creating more data than you are saving.

If it was this easy then hundreds of people would have taken on the challenge and won the £5k.

crazygringo1y ago

As a general principle, absolutely.

In practice, I wonder what size of file we're talking about that would result in net compression on random data 50% of the time?

I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.

SideQuark1y ago

Nope. As your file gets longer so does the data needed to specify where the repeats are. See my other estimate in this thread. It fails spectacularly.

Jerrrry1y ago

good compression = pattern eventually

maximum compression = indelineable from random data

Uptrenda1y ago

hyperpape1y ago

It’s amazing that this commentary on the value of prediction markets was written in 2001.

GrumpyNl1y ago

Is there a link to the file that he had to compress?

trod12341y ago

This gets reposted quite often.

The crux of the ongoing debate over the post has to do with the ambiguity of the stated requirements.

For a similar problem, people may find this article particularly relevant with regards to the chaotic 3-body problem in astrophysics.

https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-...

Supermancho1y ago

> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

recursive1y ago

stavros1y ago

As Patrick said, if he had gotten a 1000-byte file and sent a 999-byte file/decompressor file back, nobody would have counted the inode data as a part of the solution. Why count it now?

Supermancho1y ago

1 more reply

ncruces1y ago

Who's the bigger troll?

Putting up the initial challenge, especially with the accompanying commentary, is just as much trolling as that.

Timwi1y ago

permo-w1y ago

this sounds like a Nigerian Prince scammer set-up

evilotto1y ago

This is why we can't have nice things.

Is it any surprise that volunteers rarely stick around for long?

ultimafan1y ago

evilotto1y ago

That's not what it was at all.

1 more reply

benchmarkist1y ago

cortesoft1y ago

There is no general purpose compressor that can achieve compression on all random streams of bytes... but there is a possibility to compress a specific random stream of bytes, if you get lucky

lambdaone1y ago

Absolutely! Can this probability be quantified?

1 more reply

quuxplusone1y ago

[1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher

Retric1y ago

But you don’t need to compress every possible file to make playing such a game a good idea.

It’s ultimately not about compression but what odds are worth it.

misiek081y ago

I read „decompressor and compressed file”, not „fileS”. There is only one person correct here

stavros1y ago

You should read the rest of the thread.

optimalsolver1y ago

He asks Mike if he can send multiple files. Mike says "sure".

j / k navigate · click thread line to collapse