Well, if you can do better, please enter the contest! Glittering prizes are on the line here. I didn't completely follow your reasoning -- first, the surrogate code points (U+D800 through U+DFFF) are not Unicode scalar values, and so can't be expressed in well-formed UTF-8. But also, remember that the goal is to encode the most bits per tweet, not to come up with the most verbose encoding. :-) How many bits of arbitrary source information would your scheme be able to carry in a tweet?
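To make both points concrete, here is a rough Python sketch, not anything from the contest itself. The helper names (`is_scalar_value`, `encodes_as_utf8`, `bits_per_message`) are my own, and the per-character figure is only a theoretical ceiling: it assumes every Unicode scalar value is usable and survives transmission unchanged, which is almost certainly not true in practice.

```python
import math

SURROGATE_FIRST = 0xD800   # first surrogate code point
SURROGATE_LAST = 0xDFFF    # last surrogate code point
MAX_CODE_POINT = 0x10FFFF  # top of the Unicode code space


def is_scalar_value(cp: int) -> bool:
    """A Unicode scalar value is any code point that is not a surrogate."""
    in_range = 0 <= cp <= MAX_CODE_POINT
    is_surrogate = SURROGATE_FIRST <= cp <= SURROGATE_LAST
    return in_range and not is_surrogate


def encodes_as_utf8(cp: int) -> bool:
    """Ask Python's strict UTF-8 codec whether it accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for lone surrogates: they have no well-formed UTF-8 form.
        return False


def bits_per_message(num_chars: int, alphabet_size: int) -> int:
    """Whole bits carried by num_chars symbols drawn from alphabet_size choices.

    Both arguments are hypothetical inputs: plug in whatever character limit
    and usable alphabet your scheme actually has.
    """
    return math.floor(num_chars * math.log2(alphabet_size))


# Every code point, minus the 2048 surrogates.
scalar_count = (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)

# Ceiling on bits per character if every scalar value were usable.
bits_per_char_upper_bound = math.log2(scalar_count)

if __name__ == "__main__":
    print("U+D800 is a scalar value:", is_scalar_value(0xD800))
    print("U+D800 encodes as UTF-8:", encodes_as_utf8(0xD800))
    print("scalar values available:", scalar_count)
    print("upper bound, bits per character:", bits_per_char_upper_bound)
```

To score a scheme, call `bits_per_message` with the character limit and the number of distinct characters the scheme actually uses. A scheme that spends several characters per source byte will come out well below the ceiling, which is the point about verbosity.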