International Longest Tweet Contest

(blog.ksplice.com)

30 points · 0xdeadc0de · 16y ago · 10 comments

9 comments · 3 top-level

sp332 · 16y ago · 4 in thread

I feel like I should point out that Unicode is not being used correctly in the article. A better understanding of Unicode encodings might help in the competition. http://www.joelonsoftware.com/articles/Unicode.html

ximeng · 16y ago

At first I thought you were right, but after reading the article more closely I think it's OK. It's talking about the relationship between ISO/IEC 10646 and Unicode. There's more information about that in the Unicode standard here:

http://www.unicode.org/versions/Unicode5.0.0/appC.pdf

See section C.3 for the differences between the UTF-8 encodings. The key paragraph is:

"The definition of UTF-8 in Annex D of ISO/IEC 10646:2003 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters. ISO/IEC 10646 does not allow mapping of surrogate code positions, known as RC-elements in that standard; that restriction is identical to the restriction for the Unicode definition of UTF-8."

That's where the extra characters come from: the UTF-8 encoding used by Twitter apparently does not check whether the characters are valid Unicode characters, as Unicode's definition of UTF-8 requires. This is the extra restriction referred to in the passage above, imposed by the Unicode version of UTF-8 but not by the ISO version. As quoted in the article, http://en.wikipedia.org/wiki/UTF-8#Description says that the ISO version of UTF-8 can encode 31 bits. I couldn't find a source for the encoding of ISO UTF-8.
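For illustration, here is a minimal Python sketch of the older 1-to-6-byte UTF-8 scheme (the layout described in RFC 2279, before RFC 3629 restricted UTF-8 to four bytes). The function name `encode_utf8_31bit` is made up for this example, and nothing here is taken from the article or from Twitter's code; it only shows how five- and six-byte sequences reach 31 bits.

```python
def encode_utf8_31bit(code_point: int) -> bytes:
    """Encode a code position with the original 1-to-6-byte UTF-8 layout.

    Hypothetical helper for illustration only. Unlike Unicode's UTF-8,
    it does not reject surrogates or values above 0x10FFFF.
    """
    if not 0 <= code_point < 2**31:
        raise ValueError("code position must fit in 31 bits")
    if code_point < 0x80:
        return bytes([code_point])  # plain ASCII, one byte

    # (payload bits available, lead-byte marker, continuation byte count)
    layouts = [
        (11, 0xC0, 1),  # 110xxxxx + 1 continuation byte
        (16, 0xE0, 2),  # 1110xxxx + 2
        (21, 0xF0, 3),  # 11110xxx + 3
        (26, 0xF8, 4),  # 111110xx + 4 (not allowed in Unicode's UTF-8)
        (31, 0xFC, 5),  # 1111110x + 5 (not allowed in Unicode's UTF-8)
    ]
    for bits, lead, n_cont in layouts:
        if code_point < (1 << bits):
            out = [lead | (code_point >> (6 * n_cont))]
            # Each continuation byte carries 6 payload bits: 10xxxxxx.
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: value already checked to fit in 31 bits")


# Inside the Unicode range the result matches Python's own encoder.
print(encode_utf8_31bit(0x20AC) == "\u20ac".encode("utf-8"))  # True
# The largest 31-bit position needs six bytes.
print(len(encode_utf8_31bit(0x7FFFFFFF)))  # 6
```

A strict decoder such as Python's `bytes.decode("utf-8")` rejects the output for anything above 0x10FFFF, which is the validity check the comment above says Twitter was apparently skipping.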

The other part I wasn't sure about at first was where the figure of 1,112,064 possible characters came from. It turns out that's the 17 Unicode planes of 65,536 code points each, less the 2,048 code points from 0xD800 to 0xDFFF reserved for surrogate pairs.

In other words:

1+0x10ffff-(0xdfff-0xd800+1) = 1 112 064
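The same arithmetic can be checked in a few lines of Python. The constant names are just labels chosen for this sketch:

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000          # 65,536
SURROGATES = 0xDFFF - 0xD800 + 1         # 2,048 code points reserved for UTF-16

total_code_points = PLANES * CODE_POINTS_PER_PLANE
# 17 planes cover exactly 0x000000 through 0x10FFFF.
assert total_code_points == 0x10FFFF + 1

encodable_characters = total_code_points - SURROGATES
print(encodable_characters)  # 1112064
```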

sp332 · 16y ago

Thanks, that helps!

keithwinstein · 16y ago

Heya, let me know your beef with the article's description of Unicode -- I think it's all correct but would be happy to fix any problems. Note that the way we got to 4.2 kilobits per tweet was by using only UCS code positions that are NOT part of Unicode. In other words, the scheme works by not using Unicode at all!
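As a rough sanity check on the 4.2 kilobit figure, here is a back-of-the-envelope calculation in Python. It rests on assumptions not stated in this thread: a 140-character tweet limit, a 31-bit UCS code space, and excluding every code position from 0 through 0x10FFFF. The article's exact counting may differ.

```python
import math

TWEET_LENGTH = 140            # assumption: characters allowed per tweet
UCS_POSITIONS = 2**31         # assumption: 31-bit UCS code space
UNICODE_POSITIONS = 0x110000  # positions 0 through 0x10FFFF, all excluded here

usable_positions = UCS_POSITIONS - UNICODE_POSITIONS
bits_per_char = math.log2(usable_positions)
bits_per_tweet = TWEET_LENGTH * bits_per_char

print(f"{bits_per_char:.4f} bits per character")
print(f"{bits_per_tweet / 1024:.1f} kilobits per tweet")
```

Under those assumptions each character carries just under 31 bits, so dropping the Unicode range costs almost nothing.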

sp332 · 16y ago

That's pretty cool, but using valid UTF-8 should get you up to 6 bytes per character, since each half of a UTF-16 surrogate pair takes 3 bytes in UTF-8.
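A quick way to test the 6-byte idea is to try it in Python. This is only a sketch using Python's built-in codecs, with U+1F600 as an arbitrary example of a character outside the Basic Multilingual Plane. As far as I can tell, encoding each surrogate half separately does give 3 + 3 bytes, but a strict UTF-8 decoder rejects the result (that form is usually called CESU-8), while standard UTF-8 encodes the same character in 4 bytes.

```python
code_point = 0x1F600  # arbitrary example outside the Basic Multilingual Plane

# Standard UTF-8: one 4-byte sequence.
utf8 = chr(code_point).encode("utf-8")
print(len(utf8))  # 4

# Split into the UTF-16 surrogate pair.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

# Encoding each surrogate half on its own needs the 'surrogatepass'
# error handler, because strict UTF-8 refuses surrogates.
halves = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
print(len(halves))  # 6

# A strict decoder does not accept the 6-byte form.
try:
    halves.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)  # False
```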

mnemonicsloth · 16y ago · 2 in thread

Fun Facts: Ben and Alyssa make appearances in Structure and Interpretation of Computer Programs, and in Sussman's supercool symbolic programming class: http://groups.csail.mit.edu/mac/users/gjs/6.945/

"Alyssa P Hacker" is a pun on "A Lisp Hacker". She has a friend named Imelda Macros.

d0m · 16y ago

Alyssa P. Hacker NEVER answers a question... she only asks tricky ones.

chanux · 16y ago

And don't forget Eva Lu Ator :D

chaosmachine · 16y ago

I wonder if this was inspired by my Tweet Compressor project:

http://tweetcompressor.com/

It got a lot of attention on Reddit last week (20k visitors).
