0 points · peterwwillis 10y ago

It's not random letters. It's random numbers encoded in base52. Because the number strings are encoded, they're probably not random at all.

5 comments · 2 top-level

dpark 10y ago · 3 in thread

You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base of 52 or higher. Or they could be randomly-generated strings with no meaningful number underlying them.

daveloyall 10y ago

    hobbes@namagiri:~/scratch$ echo gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarmPJLOZYMEILGNIPwNOgEazuBVJcyVjBRL|ent
    Entropy = 5.352821 bits per byte.
    
    Optimum compression would reduce the size
    of this 108 byte file by 33 percent.
    
    Chi square distribution for 108 samples is 650.52, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 92.6019 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.142915 (totally uncorrelated = 0.0).

...Which is meaningless without context. Let's see if we can create some context...

    hobbes@namagiri:~/scratch$ echo "You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them." |base64 |sed -e :a -e '$!N; s/\n//; ta'
    WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K

ok, that's base64. Let's turn that into base52...

    CL-USER> (to-base 52 (from-base 64 "WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K"))
    "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO"

Astute readers will note that this output contains numeric digits. There's more than one representation of "base-52". The one above uses 0-9, a-z, and A-P. An alternative would use only letters, a-z and A-Z. But none of that matters for measuring the entropy of base-52 encoded English text, ya dig?
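The `to-base` and `from-base` helpers are not shown in the thread, so the following is only a sketch of the idea, assuming the data is treated as one big integer and rewritten with a 52-symbol alphabet. The function names and both alphabets are illustrative, and the sketch starts from raw bytes rather than from base64 text, so its output will not match the string above.

```python
import string

# Two possible 52-symbol alphabets (both are assumptions for illustration).
DIGITS_FIRST = string.digits + string.ascii_lowercase + string.ascii_uppercase[:16]  # 0-9, a-z, A-P
LETTERS_ONLY = string.ascii_lowercase + string.ascii_uppercase  # a-z, A-Z


def to_base52(data: bytes, alphabet: str = DIGITS_FIRST) -> str:
    """Interpret data as a big-endian integer and write it in base 52."""
    if len(alphabet) != 52:
        raise ValueError("alphabet must have exactly 52 symbols")
    n = int.from_bytes(data, "big")
    if n == 0:
        return alphabet[0]
    out = []
    while n:
        n, rem = divmod(n, 52)
        out.append(alphabet[rem])
    return "".join(reversed(out))


def from_base52(text: str, alphabet: str = DIGITS_FIRST) -> int:
    """Inverse of to_base52, returning the integer value."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in text:
        n = n * 52 + index[ch]
    return n


message = b"You have no way of knowing that."
encoded = to_base52(message)
# Round trip: decoding the base-52 text gives back the same integer.
assert from_base52(encoded) == int.from_bytes(message, "big")
print(encoded)
print(to_base52(message, LETTERS_ONLY))
```

Either alphabet carries the same information; only the byte values seen by a tool like `ent` change.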

    hobbes@namagiri:~/scratch$ echo "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO" | ent
    Entropy = 5.610518 bits per byte.
    
    Optimum compression would reduce the size
    of this 452 byte file by 29 percent.
    
    Chi square distribution for 452 samples is 2093.27, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 86.3496 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.018253 (totally uncorrelated = 0.0).

Conclusion: I don't think the original text was pseudorandom. If you can convince me that the original isn't "base-52 encoded" at all (and I'd be easily convinced), I'm willing to reevaluate that conclusion. I'm also interested to see if anyone sees a flaw in my process.
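One piece of context for reading these figures: as far as I understand it, `ent` reports the Shannon entropy of the byte frequencies, so a string drawn from a 52-symbol alphabet cannot score above log2(52), about 5.70 bits per byte, however random it is. A minimal sketch of that calculation follows (the function name is mine, not part of `ent`).

```python
import math
from collections import Counter


def bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Upper bounds for symbols drawn uniformly from alphabets of each size.
for size in (52, 64, 256):
    print(f"alphabet of {size}: at most {math.log2(size):.3f} bits per byte")

# The first sample from the thread, plus the trailing newline that echo adds.
sample = (
    b"gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarm"
    b"PJLOZYMEILGNIPwNOgEazuBVJcyVjBRL\n"
)
print(f"sample: {bits_per_byte(sample):.6f} bits per byte")
```

On a sample this short, even perfectly random letters will usually land somewhat below the 5.70 ceiling, because about a hundred characters cannot cover 52 symbols evenly.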

dpark 10y ago

Well, we know that it was a test update that was released unintentionally. http://www.zdnet.com/article/microsoft-accidentally-issued-a...

So the question is why a test update would have "hidden" meaning underneath the random-looking strings.

As for the flaw in your methodology: your ent command (I'm not familiar with it, so I'm just going by what I see) assumes full use of the binary space (hence the 127.5 it gives as the mean of random data). No base52 data will use the full 256-value space, by definition. Random base52 data (azAZ) will actually have a mean of 93.5, extremely close to the measured mean. The serial correlation coefficient should also be higher for azAZ than for 09azAP, because more of the alphabet is contiguous.
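The 93.5 figure can be checked directly: for uniformly random symbols, the expected byte mean is just the average ASCII code of the alphabet. A small sketch, with alphabet names of my own choosing:

```python
import string


def mean_byte_value(alphabet: str) -> float:
    """Expected byte mean when every symbol of the alphabet is equally likely."""
    codes = alphabet.encode("ascii")
    return sum(codes) / len(codes)


letters_only = string.ascii_lowercase + string.ascii_uppercase  # azAZ
digits_first = string.digits + string.ascii_lowercase + string.ascii_uppercase[:16]  # 09azAP
base64_chars = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

print(mean_byte_value(letters_only))  # 93.5
print(mean_byte_value(digits_first))  # about 87.15
print(mean_byte_value(base64_chars))  # about 85.58
```

None of these comes anywhere near 127.5, so a mean in the 85 to 94 range says which alphabet is in use rather than whether the underlying data is random.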

I did a little bit of analysis on the data as well to determine whether it was random or gibberish typed by a human on a keyboard, and found that most of the data lined up well with what true (or pseudo) random data would produce. (Ugly code here: http://pastebin.com/9YN93xhi)

    Home row % expected: 34.6%
    Home row % actual: 38.3%
    Expected upper: 50%
    Actual upper: 47.7%
    Expected sequential case match: 50%
    Actual sequential case match: 55.7%

kaoD 10y ago

    $ dd if=/dev/urandom bs=512 count=1 2>/dev/null | base64 -w 0 | ent
    Entropy = 5.933850 bits per byte.
    
    Optimum compression would reduce the size
    of this 684 byte file by 25 percent.

    Chi square distribution for 684 samples is 2310.15, and randomly
    would exceed this value 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 85.0731 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.018269 (totally uncorrelated = 0.0).

I don't have the means to convert base64 to base52, but the result shouldn't be much different.
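For anyone without `ent` or a base converter at hand, a rough equivalent of this experiment can be sketched in Python. Drawing 52 letters directly is a stand-in for converting to base52, not the same operation, and the helper name is my own.

```python
import base64
import math
import os
import secrets
import string
from collections import Counter


def bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# 512 random bytes encode to 684 base64 characters, as in the run above.
random_base64 = base64.b64encode(os.urandom(512))

# Stand-in for "random base52": the same number of letters drawn uniformly from a-zA-Z.
random_base52 = "".join(
    secrets.choice(string.ascii_letters) for _ in range(len(random_base64))
).encode("ascii")

print(f"base64: {len(random_base64)} bytes, {bits_per_byte(random_base64):.6f} bits per byte")
print(f"base52: {len(random_base52)} bytes, {bits_per_byte(random_base52):.6f} bits per byte")
print(f"ceilings: {math.log2(64):.3f} and {math.log2(52):.3f}")
```

The base52 figure should sit a little below the base64 one, simply because its ceiling is log2(52) rather than log2(64).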

jrochkind1 10y ago

I don't think anyone actually knows that. Which is exactly why it was inaccurate to describe it as "Base52": it made you think someone did know that.
