What does "strategy" mean here? I might just happen to have a strategy which involves betting on the exact sequence of heads and tails in a given sequence. The analogy in terms of languages is that my language might just happen to have a short keyword that represents a given sequence of heads and tails.
I don't know much about Kolmogorow complexity so I'm certainly missing something here. Potentially there is a subtle clause in the technical definition that doesn't make it through to these articles.
That's a very narrow program.
> The analogy in terms of languages is that my language might just happen to have a short keyword that represents a given sequence of heads and tails.
The sequence still needs to be generated "somehow". Either by executing the program and producing the sequence, or by explicitly stating it. Even if you have it "cached" and "represented" in your language, you still need to generate the sequence. The resources spent here is the Kolmogorov complexity.
The easiest way to expand your little program is to say that you have a seed s.t. any consecutive generation results in a consecutive sequence that matches up to the period of the generator. Now it is more generic, but has a period. You can then expand this to accept multiple seeds and once it has reached a period, to simply take the next seed.
Should this sequence be finite, you are in luck. Your program can have length O(generator + N/P) where N is length of sequence, and P is the period of your RNG.
All this is is just compression which plays into the whole Kolmogorov complexity.
If you know everything about the process and still can't beat chance at predicting it, that's the quality we are after. In this definition "random" just means unpredictable, which is another way to explain why it can only be a meaningful distinction when you don't yet know the result.
Any function that outputs bets.
Lossless encoding works by taking advantage of the more commonly observed sequences of data having lower information entropy. For things like audio encoding, where discontinuous sequences aren't naturally observed (or pleasing to listen to), lossless encoding has a lot to work with.
However Huffman isn’t a fixed compression scheme since it makes a different frequency tree for different corpora.
https://en.wikipedia.org/wiki/Algorithmically_random_sequenc...
[1] An Introduction to Kolmogorov Complexity and Its Applications, M. Li & P. Vitnányi
[2] Grundlagen der Wahrscheinlichkeitsrechnung, R. von Mises
Why not specify it?
> That gives us the true language-agnostic definition of Kolmogorov Complexity as follows:
Choosing the language of Turing Machines does not make the definition language agnostic.
Aiming for the simplest definition of description complexity, I instead based my definitions on the older computational model of lambda calculus in [1].
Unlike the assumed UTM above, the universal lambda machine is easy to describe in detail:
(λ 1 1) (λ λ λ 1 (λ λ λ λ 3 (λ 5 (3 (λ 2 (3 (λ λ 3 (λ 1 2 3))) (4 (λ 4 (λ 3 1 (2 1)))))) (1 (2 (λ 1 2)) (λ 4 (λ 4 (λ 2 (1 4))) 5)))) (3 3) 2) (λ 1 ((λ 1 1) (λ 1 1)))
Furthermore, it allows almost identical definitions of various variations of descriptional complexity, namely1) plain complexity
2) prefix complexity
3) monotone complexity
all of which have their application in Algorithmic Information Theory [2].
[1] https://gist.github.com/tromp/86b3184f852f65bfb814e3ab0987d8...
Read on a bit and it looks like proof by contradiction:
> However, let’s bring back the paradox we discussed above. According to that paradox, U cannot exist or U cannot provide shorter descriptions than every arbitrary L.
If you have a Turing machine which can only print out binary digits, then it can't print out a chinese character, no matter how long the input program is.
Yeah, you can do something like unicode, and associate a binary string with each chinese character--but printing out that binary string is not printing out the chinese character. It's printing out a binary string.
In particular, your lambda-calculus based turing machine cannot print out chinese characters. It therefore cannot be used to define a universal complexity for any string.
Did you know the best compression methods out all have a variable length (measured in bits) alphabet? eg. Dynamic Markov Coding will start with just '0' and '1' and then predict the next bit but as it see's more symbols it will extend this to single characters (so see 'a' or 'b' and predict the next bit). They'll then continue as they learn more of the file and their alphabet will essentially include common pairwise letters, then words and entire common phrases.
This is actually a commonly missed aspect of Shannon entropy. A file of 0111101110111 repeated will give you a different result if you consider a 1 bit alphabet of 25% '0' and 75% '1' than a 4 bit alphabet of 100% '0111'. No one in the real world is using the character frequencies of english characters as a measure of Shannon entropy or Kolmogorov complexity. No algorithm expects that. They all work at the binary level and they will try to adjust the symbol lengths of the alphabet to common sequences to achieve the best result.
This is in fact the reason Kolmogorov complexity is used rather than Shannon entropy. Shannon entropy doesn't tell you how to define an optimal alphabet. That part is actually undefinable. It just tells you what to do if you have that already. Kolmogorov complexity says more completely 'find the optimal alphabet and the symbol probabilities and make a minimal sequence from that'.
Different human languages don't figure into this at all and are completely irrelevant.
It defines a complexity for anything which can be represented in binary. Which in practice is all we want. Who wants to define a new complexity measure for every new alphabet of symbols?
But how I see it is that for solving KC in full generality you'll have to:
- Start with the program that explicitly returns the original string. Let's say it has length N - run all possible programs that are shorter than N (just try all combinations of characters) - look at the results and pick the shortest program that compiles and outputs the original string
The problem there is that you have to wait for all programs to end, and you don't know if they will end or not. So you have a problem that's equivalent to the halting problem (and that's not solvable) (and the halting problem is loosely related to the interesting number problem).
(This is not a proof and I don't have a background in the field btw)
https://en.wikipedia.org/wiki/Kolmogorov_complexity#Chaitin'...
Of course trivially some KC can be proven, ex a language with 1 or 0 characters that is interpreted to a specific string. Or to prove KC(x) where the compressed value has length N and you can list out all the results for all strings of length less than N, and they don't equal x, proves KC(x)=N.
The interesting number paradox (Berry's paradox) is more related to Chaitin incompleteness.
Basically, given a language there’s some code which enumerates proofs that KC of a string is more than some constant L, and returns the first one it finds.
If the constant L is large enough, it becomes larger than the entire proof generating code. So the proof generating code will never find a proof of any KC larger than L.
It's interesting to think about that the language gets more complex, proofs for larger strings become possible. And what it would mean for the languages to keep getting more complex indefinitely.
it's a similar train of thought to busy beaver numbers and how systems of logic (PA,ZFC) become independent to values like BB(745), and what it could mean to have more and more advanced types of logic which don't become independent until some high target n.
There is an important difference between semantically complete and syntactically complete that may cause some barriers.
Gödels completeness theorem is about semantic completeness while his incompleteness theorems are about syntactic completeness.
From Wikipedia: > A formal system is syntactically complete if and only if no unprovable sentence can be added to it without introducing an inconsistency.
'This statement is false', which Gödel mapped to natural numbers is an example of that inconsistency.
If KC was computable, there would be an infinity of paradoxes like the interesting number paradox.
The Berry paradox that is linked to in the INP link in the page has a subheading that relates it to KC computability.
Given: TM length of a JS runtime is 1,000,000 cells.
Assume: KC is computable, and TM length of a `function KolmoglorovComplexity(string s)` is 4,000,000 cells.
Known: KC's of values grow infinitely large - only 2^n-1 possible values can ever be encoded by n bits.
Take: function Shortcut() { for (const s in generateEveryStringFromShortestUp()) { if ( KolomoglorovComplexity(s > 10,000,000) ) return s } }
You see that the Shortcut function is encoded in 5,000,135 cells (plus that string generator, but that's small/constant), but it computes a value of arbitrarily large complexity (rather, one cell increase in the program length causes 10x increase in the complexity). A contradiction.
..and Bob is 100% right.
>Bob credits his excellent luck. Alice is smart and cannot be easily convinced. She get’s back at Bob by claiming that probability cannot be used in this context as it reveals no information regarding the randomness of the obtained sequences. One can take a quick glance at the obtained sequences and easily point out that Alice’s sequence is more random than Bob’s sequence.
No, it is not. Given a perfectly random coin toss Bob's sequence is indeed just as likely as Alice's sequence and in no way "less random" because both sequences result from the same randomness with equal probability.
A nice example of human intuition being at odds with probability math, though. Bob's result seems less likely but it really is not. Which reminds me that I actually had to write my own computer simulation of the Monty Hall Problem before I was willing to believe the correct answer. I think (most?) human brains have a bug in the "understanding probability" subroutine.
But the question posed is different: given a specific sequence, how likely it to have come from independent coin tosses? That is, how likely is it that Bob is cheating and his sequence was in fact not a sequence of a fair coin tosses.
And for this KC is a reasonable measure. My 2c.
THE book https://link.springer.com/book/10.1007/978-0-387-49820-1 is absolutely a thing to read. It was for me 30 years ago and it aged well.
For example, in the article, the author first defines the basic idea of KC. But then they correctly point out that the basic idea depends very much on the exact language that is chosen. So they describe how theorists have defined the notion of universal computation. But even this adjustment doesn't seem to escape the fact the we still depend on a system of mathematical symbols to describe the theory. And the notion of a Turing machine itself depends on other abstract concepts such as time and space, each with their own inherent, conceptual complexity. What sorts of minds (i.e. brains) are required to make sense out of the theory and what physical system is required for them to operate correctly? If the definition of KC includes a notion of how complex the Turing machine is that is required to compute a string, then the further down you go, the less the difference in complexity should be between any one string and another. After all, they all exist in the same universe!
I guess it just goes to show how much the idea of KC lives in the realm of theory. As soon as you pose the question of complexity so abstractly, you invite in all kinds of theoretical considerations that make the meaning more slippery. That's why KC really doesn't deserve to be compared to Shannon entropy as it often is.
But let me draw a comparison anyway like I said you shouldn't! Because Alice from the article could also have made a strong argument against Bob by just pointing out that the Shannon entropy of his string was lower, which is very relevant in terms of the number of heads or tails and the likelihood of seeing a particular count of them.
2. If you want something with less physical grounding, you could use lambda calculus instead of Turing machines.
3. Kolmogorov Complexity and Shannon Entropy are compared with one another because they both are talking about the same thing: optimal compression. Kolmogorov Complexity talks about the compressibility of individual objects and Shannon Entropy talks about compressibility of streams of i.i.d. random variables.
As noted, Kolomogorov complexity depends on the specific UTM only up to a constant factor (this is known as the invariance theorem). But even the change in runtime, memory usage, and essentially anything else you might think are important are bounded by a factor that is either a constant or a “slow-growth” function (e.g. a polynomial) when you swap out one computational model for another. These small terms are generally dwarfed by the size of the data itself (even for small datasets) and the complexity of the algorithms used.
That said, I also share some of your confusion on the “specifics” when it comes to Solomonoff induction. I have yet to understand why the universal distribution uses negative exponentiated program size to weight the universal a priori probability of a particular string as opposed to some measure that involves program runtime or frequency over an equivalence class of programs that implement the same algorithm.
Solomonoff was careful to point out that his universal distribution is more of a class of distributions that have certain “universally optimal” convergence properties given a reasonable assumption on the underlying model of data generation: a deterministic algorithm with a short description that has access to a source of randomness. But I think many people since then have made the unwarranted leap that Solomonoff induction is the best induction scheme for all models of data generation, including data obtained via observation within our universe. I’m not sure that has been proven true. And if it has, I certainly haven’t come across the paper showing it.
For those now thinking about how to calculate Shannon entropy using the defined formula what are you using for the symbols? If you used one bit symbols of '1' and '0' and a probability of each appearing a file that was just 11101110... repeating would you would find a different Shannon entropy to someone using 4 bit symbols. Shannon entropy is literally uncomputable in the real world. You can only compute it if you are given a fixed alphabet and frequencies but in the real world the optimal alphabet for a given file to calculate the minimum Shannon entropy is actually unknowable.
That's where Kolmogorov complexity comes in. It states that "well we don't actually have a way to define the alphabet in Shannon entropy in the real world but if we pretend we have a system (the universal computation) we could calculate it". They then add in the size of the program length that does the calculation as well to prevent cheating by having a language that has a dictionary specific to the thing to encode and call that Kolmogorov complexity. But that's it. They are literally the same thing in essence.
Kolmogorov complexity is in fact better than Shannon entropy for real world usage. It's every bit as computable in the real world (ie. not at all but at the very least you can do the best compression you can and make a guess!) but it at least states that upfront.
For anyone wanting to claim that they had a CS assignment to calculate Shannon entropy and it's totally computable your teacher should probably have explained that the symbol frequencies for the alphabet given aren't actually computable like that in the real world as the optimal symbol lengths themselves aren't actually computable. You cannot in the real world just say "compute the Shannon entropy of an alphabet with two symbols - B 30% and A 70%" because you don't actually know if B and A are the optimal alphabet to define to minimize Shannon entropy. BBBAAAAAAA repeated has no entropy but it fits the definition of the question given and would give you a different result.
Sounds a bit puzzling. Surely for a particular programming language we can enumerate all programs, ordered by length etc. and check which is the shortest one giving the given string. So what's uncomputable here? For long strings that could take long time, but - ?
Deciding whether the universal machine will ever halt on a particular input. I.e. the good old halting problem.
The top rated answer for "how do i measure Shannon entropy" on stack overflow for example has an accepted answer of "count the probabilities of all 8bit sequences and then multiply the log of those probabilities together as per the equation". Which is a problematic answer. A file of all 8bit characters in sequence repeated many times over won't have any entropy but will have high entropy by this particular arbitrary measure. The problem with Shannon Entropy is that you have no way to define the optimal symbol lengths and frequencies for any given file.
Kolmogorov Complexity on the other hand at least gives some way for us to get a rough estimate. It's just as incalculable as Shannon entropy but at least by essentially explicitly stating "compress it using the best tool you have at hand and see how small it gets and also include the size of the compression program in the calculation to prevent cheating by using a dictionary" you can get some rough estimate.
Basically Kolmogorov Complexity is the best tool we have. It's not perfect because just like Shannon Entropy it's incalculable in reality but unlike Shannon Entropy we do have a good way to measure if one tool of calculating Kolmogorov Complexity is better than another tool. That measure is simply "does it compress better?".
It's literally the best way to measure randomness of an arbitrary file. Any other way is pretty game-able. If someone uses Shannon entropy to measure randomness just look at the alphabet they use for that measurement and repeat that alphabet sequentially over and over again and you'll have a high shannon entropy for a clearly non-random file. Likewise other measurements might be game-able with large dictionaries to lookup. Kolmogorov complexity includes the entire program so that game doesn't work here.
https://en.m.wikipedia.org/wiki/Compression_of_genomic_seque...
1.) Conflating usage of the term "random" and "complexity". After all, a set of "randomly" drawn sample permutations from an alphabet are all equally likely. However, their "complexity" may differ, which is basically the point of the article, but the term more or less "random" keeps being used to refer to permutations with more or less "complexity", which I think is probably going to perpetuate confusion on this topic.
2.) From the article: "Moreover, a string cannot be compressed if its KC(x)≥|x|". Shouldn't the expression accompanying this statement be KC(x)=|x| ?
Regarding 2), No, most strings x do not satisfy KC(x) = |x|, since you need to use some bits to specify that you're giving x literally. See the first theorem of [1].
[1] https://gist.github.com/tromp/86b3184f852f65bfb814e3ab0987d8...
re #2: No. Basically the > part of it handles the case when the smallest program which prints out the string is actually LARGER than the length of the string. In that case, the string is still incompressible. Compression means mapping from larger strings to smaller strings.
1. Note, kolmogorov complexity is defined by the length of the shortest program which prints out the string. What counts is the number of instructions, and not the complexity of those instructions.
2. So say S is a very complex spring. We can always construct a turing machine which could print out S using a zero length program: it could just start in a state which prints out S when you turn it on, and then halts.
3. So there is no such thing as a turing machine which prints out every string shorter than any other turing machine prints it out, QED.
That's the bad news. The good news is we don't even need to do that. For any string S, say that M and N are any two universal turing machines. Without loss of generality, specify that KM(S) <= KN(S). Then there is always some C for which KM(S) <= KN(S) + C. The constant C being the length of the program required to emulate machine M on machine N.
We are used to abstracting out constant sums and constant factors like this. The strings we are dealing with (as a species) are growing in length exponentially--that's why we went from 8-bit, to 16bit, etc computers. So as the length of S goes to infinity, the difference between the its complexity for any two machines becomes negligible.
Talk about a cliffhanger :)
Using [0] you get 32B for Alice and 40B for Bob.
[0] It has been demonstrated that KC(x), can be reasonably estimated by the number of bits required to encode x using a compressor C (such as gzip)