hobbes@namagiri:~/scratch$ echo gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarmPJLOZYMEILGNIPwNOgEazuBVJcyVjBRL|ent
Entropy = 5.352821 bits per byte.
Optimum compression would reduce the size
of this 108 byte file by 33 percent.
Chi square distribution for 108 samples is 650.52, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 92.6019 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.142915 (totally uncorrelated = 0.0).
...Which is meaningless without context. Let's see if we can create some context...
hobbes@namagiri:~/scratch$ echo "You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them." |base64 |sed -e :a -e '$!N; s/\n//; ta'
WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K
OK, that's base64. Let's turn that into base52...
CL-USER> (to-base 52 (from-base 64 "WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K"))
"6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO"
Astute readers will note that this output contains numeric digits. There's more than one representation of "base-52". The one above goes from 0 through P (the digits 0-9, then a-z, then A-P). An alternative would use only the letters A-Z and a-z. But none of that matters for measuring the entropy of base-52 encoded English text, ya dig?
hobbes@namagiri:~/scratch$ echo "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO" | ent
Entropy = 5.610518 bits per byte.
Optimum compression would reduce the size
of this 452 byte file by 29 percent.
Chi square distribution for 452 samples is 2093.27, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 86.3496 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is 0.018253 (totally uncorrelated = 0.0).
Conclusion: I don't think the original text was pseudorandom. If you can convince me that the original isn't "base-52 encoded" at all (and I'd be easily convinced), I'm willing to reevaluate that conclusion. I'm also interested to see if anyone sees a flaw in my process.
So the question is why a test update would have "hidden" meaning underneath the random-looking strings.
As for the flaw in your methodology: your ent command (I'm not familiar with it, so I'm going only by what I see) assumes the data uses the full byte range, which is why it takes 127.5 as the mean of random data. No base52 data will use the full 256-value space, by definition. Random base52 (azAZ) data will actually have a mean of 93.5, extremely close to the measured mean of 92.6. The serial correlation coefficient should also be higher for azAZ than for 09azAP, because more of the alphabet is contiguous.
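To make the baseline argument concrete, here is a minimal Python sketch that computes what a uniformly random string over a 52-symbol alphabet should look like: the expected mean byte value and the entropy ceiling, which is log2(52), about 5.70 bits per byte, rather than 8. The two alphabets are assumptions read off this thread (A-Z plus a-z for the original string, 0-9 plus a-z plus A-P for the Lisp output), and `alphabet_stats` and `shannon_entropy` are illustrative names, not part of ent.

```python
import math
import string
from collections import Counter

def alphabet_stats(alphabet):
    """Return (mean byte value, max bits per symbol) for uniform random text."""
    codes = [ord(c) for c in alphabet]
    return sum(codes) / len(codes), math.log2(len(codes))

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte frequencies, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Assumed alphabets; neither is confirmed by the thread.
AZ_az = string.ascii_uppercase + string.ascii_lowercase
DIGITS_az_AP = string.digits + string.ascii_lowercase + string.ascii_uppercase[:16]

for name, alphabet in [("A-Za-z", AZ_az), ("0-9a-zA-P", DIGITS_az_AP)]:
    mean, bits = alphabet_stats(alphabet)
    print(f"{name}: mean={mean:.2f}, max entropy={bits:.3f} bits/byte")
```

Measured entropy for a short sample will generally sit somewhat below the ceiling even for truly random input, because with only a hundred or so symbols the observed frequencies cannot be perfectly flat.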
I did a little bit of analysis on the data as well to determine whether it was random or gibberish typed by a human on a keyboard, and found that most of the data lined up well with what true (or pseudo) random data would produce. (Ugly code here: http://pastebin.com/9YN93xhi)
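For anyone who can't reach the pastebin, this is a rough Python sketch of how figures of this kind could be computed. It is not the linked code. It assumes a QWERTY layout whose home row holds the nine letters asdfghjkl, which would give an expected home-row share of 9/26, about 34.6%, for uniformly random letters; `keyboard_stats` is a made-up name.

```python
HOME_ROW = set("asdfghjkl")  # QWERTY home-row letters (assumed layout)

def keyboard_stats(s):
    """Return (home-row share, uppercase share, adjacent same-case share)."""
    letters = [c for c in s if c.isalpha()]
    home = sum(c.lower() in HOME_ROW for c in letters) / len(letters)
    upper = sum(c.isupper() for c in letters) / len(letters)
    # Compare the case of each letter with the letter that follows it.
    pairs = list(zip(letters, letters[1:]))
    same_case = sum(a.isupper() == b.isupper() for a, b in pairs) / len(pairs)
    return home, upper, same_case

# Expected home-row share if every letter is equally likely.
expected_home = len(HOME_ROW) / 26
print(f"expected home row: {expected_home:.1%}")
```

For a string whose case is chosen independently with probability one half, both the uppercase share and the adjacent same-case share should come out near 50%.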
Home row % expected: 34.6%
Home row % actual: 38.3%
Expected upper: 50%
Actual upper: 47.7%
Expected sequential case match: 50%
Actual sequential case match: 55.7%
$ dd if=/dev/urandom bs=512 count=1 2>/dev/null | base64 -w 0 | ent
Entropy = 5.933850 bits per byte.
Optimum compression would reduce the size
of this 684 byte file by 25 percent.
Chi square distribution for 684 samples is 2310.15, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 85.0731 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.018269 (totally uncorrelated = 0.0).
I don't have the means to convert base64 to base52, but the result shouldn't be much different.
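One way to do that conversion, sketched in Python under some assumptions: treat the base64 string as one big positional number, then re-express it in base 52. This mirrors what the `(to-base 52 (from-base 64 ...))` call earlier in the thread appears to do, but those Lisp functions were never shown, so the digit orderings here (the standard base64 alphabet, and 0-9 then a-z then A-P for base 52) are guesses, and the output is not guaranteed to match the string posted above.

```python
import string

# Standard base64 alphabet, and an assumed base-52 digit ordering.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
B52 = string.digits + string.ascii_lowercase + string.ascii_uppercase[:16]

def from_base(text, alphabet):
    """Interpret text as a positional number whose digits come from alphabet."""
    value = 0
    for ch in text:
        value = value * len(alphabet) + alphabet.index(ch)
    return value

def to_base(value, alphabet):
    """Render a non-negative integer using alphabet as the digit set."""
    if value == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while value:
        value, rem = divmod(value, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

def base64_to_base52(text):
    # Padding characters carry no digit value, so drop them first.
    return to_base(from_base(text.rstrip("="), B64), B52)

print(base64_to_base52("WW91IGhhdmU="))
```

Note that this numeric approach drops leading zero digits (leading `A` characters in base64), so it is fine for measuring statistics but would not round-trip every input back to the same bytes.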