> in 5000 years time we'll need both the Unicode specification and those 2-4000 bytes to decipher the author's post.
Nitpick: In all likelihood, an English dictionary would be enough. Even if the Unicode spec is lost, the text can probably be deciphered by using frequency analysis plus the dictionary to associate codepoints with characters.
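To make the frequency-analysis idea concrete, here is a minimal sketch. It assumes the decoder has some reference English text and nothing else. The helper names (`guess_mapping`, `decode`) and the "+1000" stand-in for the lost encoding are made up for illustration. Rank matching alone is only a first guess; on real text you would still check candidate words against the dictionary to fix the letters it gets wrong.

```python
from collections import Counter

def guess_mapping(codes, reference_text):
    """Pair each unknown code with a reference character by frequency rank."""
    # Most frequent unknown code first.
    code_ranked = [code for code, _ in Counter(codes).most_common()]
    # Most frequent known character first.
    char_ranked = [ch for ch, _ in Counter(reference_text).most_common()]
    # The n-th most common code is guessed to be the n-th most common character.
    return dict(zip(code_ranked, char_ranked))

def decode(codes, mapping, unknown="?"):
    """Translate codes back to text, marking anything unmapped."""
    return "".join(mapping.get(code, unknown) for code in codes)

# Toy reference text with deliberately distinct letter frequencies (e > t > a > o > n).
reference = "eeeeettttaaaoon"

# Hypothetical "lost" encoding: every character shifted by 1000.
# The decoder only ever sees these numbers, never the shift.
message = "tatteeoeaneoate"
codes = [ord(ch) + 1000 for ch in message]

mapping = guess_mapping(codes, reference)
print(decode(codes, mapping))
```

The toy message has the same frequency ranking as the reference, so the guess is exact here. Real messages are shorter and noisier than the reference, which is where the dictionary earns its keep.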
Sure, but the English dictionary is a bit longer than the Unicode spec :)
My point was that a lot of implied data undermines this "compression" of information into "simply bytes" today: I am pointing out what is left implicit in OP's assumption.
Maybe I did not choose the right addendum, but a pretty comprehensive one is needed either way (in all likelihood, an "ancient English dictionary" is needed anyway).