0 points · AnotherGoodName · 2y ago · 0 comments

This is quite misguided as you seem to think the alphabet for Shannon entropy or Kolmogorov complexity is in any way what we think of as an alphabet.

Did you know the best compression methods out there all use a variable-length (measured in bits) alphabet? E.g. Dynamic Markov Coding starts with just '0' and '1' and predicts the next bit, but as it sees more symbols it extends this to single characters (so it sees 'a' or 'b' and predicts the next bit). It then continues as it learns more of the file, and its alphabet essentially grows to include common letter pairs, then words and entire common phrases.

This is actually a commonly missed aspect of Shannon entropy. A file of 0111 repeated (011101110111...) will give you a different result if you consider a 1-bit alphabet of 25% '0' and 75% '1' than if you consider a 4-bit alphabet of 100% '0111'. No one in the real world is using the frequencies of English characters as a measure of Shannon entropy or Kolmogorov complexity. No algorithm expects that. They all work at the binary level and try to adjust the symbol lengths of the alphabet to common sequences to achieve the best result.
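To make the arithmetic concrete, here is a minimal sketch that computes the empirical Shannon entropy of the same repeated file under both choices of alphabet. The helper name `entropy_per_symbol` and the file length are illustrative assumptions, not anything from a particular compressor.

```python
from collections import Counter
from math import log2

def entropy_per_symbol(bits, symbol_len):
    """Empirical Shannon entropy, in bits per symbol, when `bits` is cut
    into non-overlapping symbols of `symbol_len` bits each."""
    symbols = [bits[i:i + symbol_len]
               for i in range(0, len(bits) - symbol_len + 1, symbol_len)]
    counts = Counter(symbols)
    total = len(symbols)
    # sum of p * log2(1/p) over the observed symbols
    return sum(c / total * log2(total / c) for c in counts.values())

data = "0111" * 1000  # the repeated file from the comment above

h1 = entropy_per_symbol(data, 1)  # alphabet {'0', '1'} at 25% / 75%
h4 = entropy_per_symbol(data, 4)  # alphabet {'0111'} at 100%

print(f"1-bit alphabet: {h1:.3f} bits per symbol")
print(f"4-bit alphabet: {h4:.3f} bits per symbol")
```

With 1-bit symbols the estimate is about 0.811 bits per symbol; with 4-bit symbols it is exactly zero, because only one symbol ever occurs. Same file, different alphabet, different entropy.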

This is in fact the reason Kolmogorov complexity is used rather than Shannon entropy. Shannon entropy doesn't tell you how to define an optimal alphabet. That part is actually undefinable. It just tells you what to do if you have that already. Kolmogorov complexity says more completely 'find the optimal alphabet and the symbol probabilities and make a minimal sequence from that'.

Different human languages don't figure into this at all and are completely irrelevant.

6 comments · 1 top-level

rhelz · 2y ago · 5 in thread

> Different human languages don't figure into this at all and are completely irrelevant.

Back to basics: A Turing machine is specified by a set of symbols it can read/write to a tape, and a state machine matching current state and read symbol to next state and actions.

If that set of symbols is just {1,0}, then it absolutely, positively, cannot print out the string "ABC".

> the best compression methods out all have a variable length (measured in bits) alphabet.

This is a category error: if the compression algorithm reads and writes binary data, its alphabet is "zero" and "one." The symbols read and written by a Turing machine are atomic; they are not composed of any simpler parts.

Sure, external to the Turing machine, you can adopt some conventions which map bit patterns to letters in a different alphabet. But a binary Turing machine cannot print out those letters; it can only print out a string of 1's and 0's.

The mapping from strings in the binary language to letters in the target language cannot participate in the calculations for the complexity of a string in another language. Because if it did, again, you could make the Kolmogorov complexity of any arbitrary string S you choose to be 0, because you could just say the null output maps to S.

This is a subtle problem, often glossed over or just missed entirely. We are so used to encoding things in binary that it might not occur to us unless we think about it deeply.

Nevertheless, it is a real, genuine problem.

AnotherGoodName (OP) · 2y ago

Just because a Turing machine prints out 0 and 1 at each step doesn't mean the sequences that factor into the calculation of what to print next can't be longer binary sequences.

Pretty much all the best compression methods are language-agnostic and work on bitwise sequences. They also pretty much all predict the next bit and feed that prediction into an arithmetic coder.
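As a rough illustration of "predict the next bit and feed that into a coder", here is a simplified stand-in. It is not Dynamic Markov Coding: it uses a fixed-length context of previous bits with adaptive counts, and instead of running a real arithmetic coder it sums the ideal code length, -log2(p), that such a coder would approach. The function name and the test input are assumptions made for this sketch.

```python
from collections import defaultdict
from math import log2

def ideal_code_length(bits, context_len):
    """Total bits an ideal arithmetic coder would spend if each bit is
    predicted adaptively from the previous `context_len` bits."""
    # Per-context counts of '0' and '1', starting at 1 each (Laplace smoothing)
    counts = defaultdict(lambda: [1, 1])
    total = 0.0
    for i, ch in enumerate(bits):
        context = bits[max(0, i - context_len):i]
        c = counts[context]
        bit = int(ch)
        p = c[bit] / (c[0] + c[1])   # predicted probability of the actual bit
        total += -log2(p)            # cost of coding that bit
        c[bit] += 1                  # learn from what was just seen
    return total

data = "0111" * 250  # 1000 bits
for k in (0, 1, 3):
    print(f"context of {k} bits: {ideal_code_length(data, k):.1f} bits")
```

The model never knows what the bits "mean". With no context it can only exploit the 25/75 split; with three bits of context every next bit becomes predictable and the cost collapses to a few dozen bits.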

E.g. look up Dynamic Markov Coding, which is commonly used by Hutter Prize winners. The paper is short and readable. It dynamically builds a binary tree as binary sequences are seen, so if the pattern '01101000 01100101' comes in, it walks down the binary tree. It'll probably predict the next bit as '0', because '0110100001100101' just so happens to be a common sequence in English that is likely to be followed by a '0', but the Dynamic Markov Coding model has no idea of that. It just has binary sequences and a prediction of the next bit given each one.

Likewise, it can continue reading in the bits of the file and walking down its binary tree, where a history of the next bit is stored in every node, and it sees '111001001011110110100000'. It predicts that the next bit is likely a '1' and feeds that into an arithmetic coder, which takes the predictions and forms an optimally minimal bit sequence from them. That second binary sequence forms part of 你好.
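For anyone who wants to check the two bit patterns, this short sketch prints the bits of a string, assuming UTF-8 (for plain English letters the bytes are the same as ASCII). The helper name `to_bits` is made up for the example.

```python
def to_bits(text):
    """Bits of `text` under UTF-8, one 8-bit group per byte."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

print(to_bits("he"))   # 01101000 01100101
print(to_bits("你"))   # 11100100 10111101 10100000
```

The first pattern is the letters "he"; the second is the three bytes of 你, the first character of 你好.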

In both cases the Turing machine doesn't care about that. It's also just writing 1's and 0's as per a Turing machine. Eventually those 1's and 0's form sequences that happen to map to characters in various languages, but it doesn't care about that.

>The mapping from strings in the binary language to letters in the target language cannot participate in the calculations for the complexity of a string in another language. Because if it did, again, you could make the Kolmogorov complexity of any arbitrary string S you choose to be 0, because you could just say the null output maps to S.

One other thing to address here is that Kolmogorov complexity explicitly includes any dictionary you use in its calculation. A dictionary consisting of the file you wish to compress would just blow out your Kolmogorov complexity to exactly that size. That's why Kolmogorov complexity is an excellent tool. You explicitly cannot cheat in this way.
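A hedged analogy using Python's `zlib`, which supports preset dictionaries through the `zdict` parameter. zlib is of course only a stand-in for a fixed description method, not Kolmogorov complexity itself, and the input size and seed below are arbitrary choices for the sketch.

```python
import random
import zlib

rng = random.Random(0)
# 1000 pseudo-random bytes: nothing for an ordinary compressor to exploit
data = bytes(rng.getrandbits(8) for _ in range(1000))

plain = zlib.compress(data, 9)

# The "cheat": preload the compressor with a dictionary that is the file itself
comp = zlib.compressobj(level=9, zdict=data)
cheated = comp.compress(data) + comp.flush()

# The decoder needs the very same dictionary to get the file back
decomp = zlib.decompressobj(zdict=data)
restored = decomp.decompress(cheated)

print("original:             ", len(data))
print("compressed, no dict:  ", len(plain))
print("compressed, with dict:", len(cheated))
print("with dict, honest sum:", len(cheated) + len(data))
```

The dictionary version looks tiny on its own, but once the dictionary is counted as part of the description the total is larger than the original file.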

rhelz · 2y ago

> the turing machine doesn't care about that. It's also just writing 1's and 0's as per a turing machine.

But a Turing machine does not have to be restricted to just printing out zeros and ones. Its alphabet can be any finite set of symbols. For example, the Soviets built a computer which used base 3; its symbol set was {-1, 0, 1}. It didn't have bits which could just store "1" or "0", it had trits which could store "-1", "0", or "1".

And why the Soviets built such a computer is germane to Kolmogorov complexity, simply because you can make shorter strings in base 3 than you can in base 2. The choice of symbol set absolutely impacts the length of strings and programs, and therefore impacts the Kolmogorov complexity of strings relative to the computer.
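A quick sketch of the length difference. Note the assumption: this uses ordinary ternary digits {0, 1, 2} rather than the balanced {-1, 0, 1} set mentioned above, and the number chosen is arbitrary.

```python
def digits(n, base):
    """Digits of a non-negative integer in the given base, most significant first."""
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(r)
    return out[::-1] or [0]

n = 1_000_000
print("base 2:", len(digits(n, 2)), "symbols")
print("base 3:", len(digits(n, 3)), "symbols")
```

One million takes 20 symbols in base 2 and 13 in base 3, close to the log 2 / log 3 ≈ 0.63 ratio you would expect.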

With this in mind, please consider three Turing Machines: A, B, and C

The symbols A prints out on its tape are {"A", "B", "C"}

The symbols B prints out on its tape are {"0", "1"}

The symbols C prints out on its tape are {"0", "1", "A", "B", "C"}

Now consider two strings: "1000001" and "A". Turing machine B could print out "1000001". Turing machine A could print out "A".

You might be tempted to equate "1000001" and "A". But consider the same two strings printed out by machine C:

"1000001" and "A"

Clearly, these are different strings. If C printed out "A", it did not print out "1000001", and vice versa.

> Kolmogorov complexity explicitly includes any dictionary you use in it's calculation.

Sure, but on a binary Turing machine, like Machine B above, that dictionary is not going to be matching between binary strings and, say, Roman letters. It's going to be mapping from one binary string to another binary string.

Using Machine C, you certainly could write a program which took "1000001" as input and output "A". But you absolutely, positively, cannot write such a program on either machine A or machine B. Machine A cannot print the string "1000001". And B cannot print the string "A".

Different strings, different things.
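The three machines can be sketched as plain sets of tape symbols. The names `ALPHABETS` and `can_print` are made up for the illustration, and a real Turing machine is of course more than its alphabet.

```python
# Tape alphabets of the three machines described above
ALPHABETS = {
    "A": set("ABC"),
    "B": set("01"),
    "C": set("01ABC"),
}

def can_print(machine, s):
    """True if every symbol of `s` is in the machine's tape alphabet."""
    return set(s) <= ALPHABETS[machine]

for s in ("1000001", "A"):
    print(s, {m: can_print(m, s) for m in ALPHABETS})
```

Only machine C can print both strings, and on C they are plainly two different strings.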

tromp · 2y ago

> cannot print out the string "ABC"

When you write a program in any modern programming language to print "ABC", that program merely outputs the 3 bytes 01000001 01000010 01000011, ASCII codes which your console window (not the program you wrote) decides to display graphically as those characters.

So any machine outputting bits can be said to print out "ABC" just as well.

Furthermore, your comment above was transmitted by your browser to Hacker News not as English characters but as those very bits, and everyone reading your comment was receiving those bits. The way that the browser decides to display those bits does not change their character.
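A small sketch of that point. The `io.BytesIO` buffer is an assumed stand-in for a program's standard output; the rest just shows three ways a viewer might choose to display the same three bytes.

```python
import io

buf = io.BytesIO()                  # stand-in for the program's stdout
buf.write("ABC".encode("ascii"))    # what the program actually emits
raw = buf.getvalue()

as_bits = " ".join(f"{b:08b}" for b in raw)   # shown as bits
as_numbers = list(raw)                        # shown as integers
as_text = raw.decode("ascii")                 # shown the way a console would

print(as_bits)      # 01000001 01000010 01000011
print(as_numbers)   # [65, 66, 67]
print(as_text)      # ABC
```

The bytes never change; only the display convention applied to them does.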

rhelz · 2y ago

> When you write a program in any modern programming language

Think about it this way. Say you have 1 bit of memory. Can you store any of the three numbers -1, 0, 1 with that bit? No. The only thing a bit can store is 0 or 1.

There have been ternary computers built. They don't use bits, they use trits. A trit can store -1, 0, or 1.

A bit cannot. Saying that a Turing machine whose symbol set is {0,1} can print "A" on its tape is like saying you can store "-1" >>>in a single bit<<< on a binary computer.

DemocracyFTW2 · 2y ago

> The way that the browser decides to display those bits does not change their character

... pun intended—?
