UTF-16 is not a very good encoding. It only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficiency).
4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles 3-byte characters (i.e., a huge percentage of what you'll ever see) but doesn't handle 4-byte characters.
MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....
So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.
Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.
All of the Unicode 6.0 emoji are.
In Python:
len(u'汉字') == 2
len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
len(u'汉字'.encode('utf8')) == 6

If it were as simple as making everything Unicode and having it Just Work, it would be possible. But the number of difficulties and problems I've seen have made me decide -- and tell everyone I know -- to avoid dealing with internationalization if you value your sanity.
Issues discussed here:
* Different incompatible variable-length encodings
* Broken implementations
* Character count != length
Issues discussed elsewhere:
* It's the major showstopper keeping people away from Python 3
* Right-to-left vs. left-to-right [1]
* BOM at the beginning of the stream
Conceptual issues -- questions I honestly don't know the answer to when it comes to internationalization. I don't even know where to look to find answers to these:
* If I split() a string, does each piece get its own BOM?
* If I copy-paste text characters from A into B, what encoding does B save as? If B isn't a text editor, what happens?
* If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?
* When it encounters right-to-left, does the renderer have to scan the entire string to figure out how far to the right it goes? Wouldn't this mean someone could create a malicious length-n string that took O(n^2) time to process?
* What happens if I try to print() a very long line -- more than a screenful -- with a right-to-left escape in a terminal?
* If I have a custom stream object, and I write characters to it, how does it "know" when to write the BOM?
* Do operators like [] operate on characters, bytes, 16-bit words, or something else?
* Does getting the length of a string really require a custom for loop with a complicated bit-twiddling expression?
* Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C APIs that expect zero-terminated strings?
* If I split() a string to extract words, how do the substrings know the BOM, right-to-left, and other states that apply to themselves? What if those strings are concatenated with other strings that have different values for those states?
* What exactly does "generating locales" do on Debian/Ubuntu and why aren't those files shipped with all the other binary parts of the distribution? All I know about locale generation is that it's some magic incantation you need to speak to keep apt-get from screaming bloody murder every time you run it on a newly debootstrapped chroot.
* Is there a MIME type for each historical, current, and future encoding? How do Web things know which encoding a document's in?
* How do other tools know what encoding a document uses? Is this something the user has to manually tell the tool -- should I be saying nano thing.txt --encoding=utf8? If the information about the format isn't stored anywhere, do you just guess until you get something that seems not to cause problems?
* If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?
* Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?
* What automated tools exist to remove BOMs or change accented characters into regular ones, if other automated tools don't accept Unicode? Once upon a time, I could not get javac to recognize source files I'd downloaded whose comments contained the author's name, which included an 'o' with two dots over it. That was the only non-ASCII character in the files, and I ended up removing those characters; syncing local patches with upstream would have been a nightmare. Do people in different countries run incompatible versions of programming languages that won't accept source files that are byte-for-byte identical? It sounds ridiculous, but this experience suggests it may be the case.
One last general comment I have is that a lot of your questions relate to things that you shouldn't necessarily need to understand the exact details of to do a lot of things. Instead, you should use an off-the-shelf, battle-tested unicode library. As far as I know, they exist for every platform by now. Of course this doesn't free you from knowing stuff, but it means that instead of knowing exactly the range of the surrogate pairs, all you need is a mental model of what's going on. When you're surprised, you can begin to fill in the gaps.
1. Use UTF-8 if you're using byte strings, or your platform's Unicode string type. If the latter, it will have its own internal encoding, and you'll work at the character level. In either case, as soon as a string comes in from elsewhere (network, disk, OS), sanitize it into a known state, i.e., a UTF-8-encoded byte string or your platform's Unicode string type. If you can't do that, reject the string, log it, and figure out what's going on.
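For instance, a minimal sketch of that sanitize-at-the-boundary step in Python (the function name and logging setup are made up for illustration, not from any particular library):

import logging

log = logging.getLogger(__name__)

def to_known_state(raw_bytes, source="network"):
    """Decode incoming bytes into a Unicode string, or reject them loudly."""
    try:
        return raw_bytes.decode('utf-8')   # known state: the platform's Unicode string type
    except UnicodeDecodeError:
        log.warning("rejecting non-UTF-8 data from %s: %r", source, raw_bytes)
        raise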
2. Use a non-broken UTF-8 implementation.
3. Yeah. Your UTF-8 implementation is handling this for you now.
4. Not familiar with this/don't use Python enough.
5. Haven't dealt with this, but it's definitely got some complications to it. I would guess more for layout than for programming, but I can't be sure.
6. The BOM tells your decoder that the forthcoming stream is little endian or big endian. It itself is not actually part of the string. Admittedly, a lot of programs still have trouble with BOMs, which is why you're using UTF-8 (without BOMs, because you don't need them).
7. No, when you .split, the BOM is no longer part of the string. You don't have a BOM again until you transmit the string or write it to disk, as it's not needed (your implementation uses whatever byte ordering it likes internally, or the one you specified if using a byte string.)
8. The string is probably transmitted in whatever the internal encoding of your OS is. That means UTF-16 on Windows and UTF-8 on Linux, AFAIK. If you're writing a desktop app, your paste event should treat this string as a pile of unsafe garbage until you've put it into a known state (i.e., a unicode object or byte string in a known encoding.) When you save it, it saves in whatever encoding you specify. You should specify UTF-8.
9. I'm not sure exactly what we're talking about here. The byte 0x20 is a complete character on its own in UTF-8 (the space); in UTF-16 it can only ever be half of a 2-byte code unit. However, as long as you're working with a Unicode string type, your .split function operates on the logical character level, not the byte level. If you're using a byte string (e.g., Python's str type), then yes, the byte 0x20 is a space, because your split method assumes ASCII. If you try to append a byte string containing 0x20 to a Unicode string, you should get an exception, unless you also specify an encoding which takes your byte string and turns it into a Unicode string. Your Unicode string implementation may have a default encoding, in which case the byte string would be interpreted as that encoding, and an exception would only be thrown if it's invalid (so if the default encoding were UTF-16, a lone 0x20 byte would throw, because it's not a complete UTF-16 code unit). This answer is long, and HN's formatting options are lacking, so let me know, and I'll try to be clearer.
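To make that concrete, a minimal sketch in Python 2 style (matching the u''/str examples elsewhere in this thread):

# Unicode string: split() works on logical characters.
u'汉 字'.split()                    # [u'\u6c49', u'\u5b57']

# Byte string: split() works on bytes. With UTF-8 that's still safe, because
# a 0x20 byte can never appear inside a multibyte sequence.
u'汉 字'.encode('utf-8').split()    # ['\xe6\xb1\x89', '\xe5\xad\x97']

# In UTF-16 the space is the two bytes 20 00 (little-endian), and 0x20 can
# also appear inside other code units, so splitting the raw bytes would tear
# characters apart -- decode first.
u'汉 字'.encode('utf-16-le')        # 'Il \x00W['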
10. Again I haven't yet dealt with RTL, but the characters are in the same order internally regardless of whether they're to be displayed RTL or LTR. It's a sequence of bytes or characters, the encoder and decoder do not care what those characters actually are. So if I have "[RTL Char]ABC ", that's the order it will be in memory, even though it will display as " CBA." In UTF-8, this string would be 7 bytes long, in UTF-16, 10. In both cases, the character length of the string is 5.
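That's easy to check in Python, assuming the "[RTL Char]" is U+200F RIGHT-TO-LEFT MARK:

s = u'\u200fABC '               # RTL mark, then "ABC "
len(s)                          # 5 logical characters
len(s.encode('utf-8'))          # 7 bytes (the RTL mark takes 3 bytes in UTF-8)
len(s.encode('utf-16-le'))      # 10 bytes (every BMP character takes 2 bytes)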
11. I'm not sure why this would be a problem, provided your terminal can handle Unicode, which most can in my experience (there's some fiddling you have to do on Windows). It should wrap or break the line the same as with LTR text. I believe the Unicode standard includes rules for how to break text, but I'm not positive.
12. I'm not really sure what you mean. Your object will write whatever bytes it writes. If you're using a UTF-16 encoder, you can usually specify the endianness and whether to write a BOM or not.
13. If you're using a unicode type in a language like python, [] operates on logical characters. If you're using a byte string type (python's string, a PHP or C string, for ex), [] operates on bytes.
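In Python 2 terms (a quick illustration of the same point):

u'汉字'[0]    # u'\u6c49' -- one logical character
'汉字'[0]     # '\xe6'    -- the first byte of the UTF-8 encoding, assuming the literal ended up as UTF-8 bytes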
14. If you're using a unicode type, split returns unicode objects, which have their own internal encoding. Again, right-to-left characters look exactly like any other character to the encoder and decoder. If you're using byte strings, you need to use a unicode-aware split function, and tell it what encoding the string is in. It will return to you strings in the same encoding (and endianness.)
16. Not familiar with this.
17. MIME types are separate from encoding. I can have an HTML page that is in UTF-8 or UTF-16; both have the MIME type text/html, and the same goes for text/plain and so on. MIME and encoding operate at different levels. Web things knowing the encoding is actually fairly complicated. The correct thing to do is to send an HTTP header that specifies the encoding, and add a <meta charset="bla"> tag in the HEAD of the page. If you don't do this, I think browsers implement various heuristics to attempt to detect the encoding. Having a type for every future encoding is an unreasonable demand, because the future is notoriously difficult to predict. If you have a crystal ball, I'm sure the standards committees would love to hear from you.
18. Also somewhat complicated! I think there are tools which can guess for you, using similar heuristics that I mentioned that browsers use. You should specify the encoding using your text editor, as you showed. It is not too much to ask to tell people to set their text editors up correctly. Your projects should have a convention that all files use, just like with code formatting. If someone tries to check in something that's not valid UTF-8, you can have a hook that rejects the commit if you want. Then they can fix it. The format is not stored anywhere, which is why you should have the convention and yell at people who mess up (not literally, be nice.) If you don't know what a file is, you can use aforementioned tools, or try a few encodings and see what works. Yes, it's a hassle, which is why you should set a convention.
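One such guessing tool is the chardet library (not mentioned above; just one example of the heuristic approach):

import chardet

with open('thing.txt', 'rb') as f:
    guess = chardet.detect(f.read())
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...} -- a guess, not a guarantee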
19. You can specify LE/BE when you say what encoding something is. As in, you say, hey encode this here text as UTF-16 LE, and it says right-o, here we go!
20. A C char is not aware of the encoding, so that wouldn't have any effect! Some other guy said that a char is always a byte, so the answer is no.
21. These are two wildly different things. A BOM is always the code point U+FEFF; in UTF-16 it appears as the bytes FE FF or FF FE depending on endianness (and as EF BB BF in UTF-8), so you can look for those and chop them off, I guess. For dealing with accents, look into Unicode normal forms; they define a specific way to compose and decompose accents. I'm not sure about your javac woes; there ought to be some way to tell javac what encoding to expect, like you can do with Python. It may be the case that javac guesses the encoding based on your locale, and his was different than yours.
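A minimal sketch of both of those in Python (standard codecs and unicodedata modules, Python 2 style like the other examples here; the 'Töst' sample is made up):

import codecs
import unicodedata

raw = codecs.BOM_UTF8 + u'Töst'.encode('utf-8')    # UTF-8 bytes with a leading BOM
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]               # chop the BOM off

text = raw.decode('utf-8')
# Decompose accented characters (NFKD), then drop the combining marks:
ascii_only = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
# ascii_only == 'Tost'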
> If I split() a string, does each piece get its own BOM?
Conceptually, each piece is a sequence of code points. The BOM stuff only comes into play when you turn it into an external encoding. And frankly, I would much rather use UTF-8, explicitly specify that the encoding is UTF-8, and not have to worry about adding a BOM.
> If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?
In valid UTF-8, all bytes in multibyte characters will have the high bit set. A space can only be represented as the 0x20 byte, and an 0x20 byte can only be a space. If you've got malformed input, then that's a whole other can of worms.
> Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C APIs that expect zero-terminated strings?
In UTF-8, the answer is no. In other multibyte encodings (e.g. UTF-16), you should not expect to be able to treat it at all like ASCII.
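A quick way to see the difference in Python (Python 2 reprs shown):

u'AB'.encode('utf-8')      # 'AB' -- no zero bytes unless the text itself contains NUL
u'AB'.encode('utf-16-le')  # 'A\x00B\x00' -- embedded zero bytes; a NUL-terminated C API stops after 'A'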
> If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?
When reading external text, you can detect this from the BOM -- byte order, after all, is why you have a byte order marker. When converting from your internal format to UTF-16, you pick whatever is most convenient.
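In Python terms (standard library codecs; the exact bytes depend on your machine's byte order):

data = u'hi'.encode('utf-16')   # native byte order with a BOM prepended, e.g. '\xff\xfeh\x00i\x00'
data.decode('utf-16')           # the decoder reads the BOM, picks the byte order, and strips it
u'hi'.encode('utf-16-le')       # 'h\x00i\x00' -- no BOM; you promised little-endian explicitly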
> Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?
I don't know any popular non-embedded platform on which sizeof(char) != 1. That said, it can't hurt to get it Right.
> What automated tools exist to remove BOMs or change accented characters into regular ones, if other automated tools don't accept Unicode?
In Python, there's a library called "unidecode" which does a pretty good job of punching Unicode text until it turns into ASCII.
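For example (unidecode is a third-party package, so treat this as illustrative):

from unidecode import unidecode

unidecode(u'Gödel über naïve')   # -> 'Godel uber naive'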
sizeof(char) == 1 by definition.
Same here.
The one to blame should be Unicode. It's really a near-impractical MULTIcode defined by a committee.
I think this is probably semantics, but I just wanted to point that out in case anyone is confused, which would be understandable because this shit is whack.
UTF-16 solves problems that don't exist.
(Honestly, I would love it if someone could explain what the purpose of counting characters is, because I don't know why you'd ever do that, except when you're posting to Twitter.)
/* Counts UTF-8 continuation bytes; the code point count is then len minus this. */
int count_multibytes_encountered(const char *text, unsigned long len) {
    int count = 0;
    for (unsigned long i = 0; i < len; i++) {
        unsigned char c = (unsigned char) text[i];
        if ((c & 0x80) == 0x80 &&   /* high bit set: part of a multibyte sequence */
            (c & 0x40) == 0x00) {   /* second bit clear: a continuation byte, not a leading byte */
            count++;
        }
    }
    return count;
}

Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.
> Why do you care if it is fast to do so?
Efficient full text search that can ignore decorative combining characters.
Because there are other countries that use languages other than English?
I fucking hate you ASCII-centric ignorant morons sometimes, you know. For example:
- display a welcome message character by character from left to right
- extract the first character because it's always the surname
- catch two non-ASCII keywords and find their indexes in a string
In the first example, should I just output it byte by byte, which displays as garbage, until suddenly three bytes become a recognizable character?
>>> len(u'épicé')
5
>>> len(u'épicé')
7

(The two strings look identical, but the second uses combining accents for the é's, so it is two code points longer.)

TL;DR:
- Javascript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).
- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a Javascript string that contains such a character, it will be treated as two consecutive 2-byte characters.
var x = '𝌆';
x.length; // 2
x[0]; // \uD834
x[1]; // \uDF06
Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).

First you say that engines will "internally" replace non-BMP glyphs with the replacement character, but then you give an example that seems to work fine (and I think would work fine as long as you don't cut that character in half, or try to inspect its character code without doing the proper incantations[1]).
So, I guess what I'm asking is, at what point does the string become "internal", such that the engine will replace the character with the replacement character?
[1]: As given in the article you linked to.
javascript:var x = '𝌆';document.write(x);
(On the other hand, I'm sure you knew that. But probably there are people reading your comment who didn't. :))
UCS-2 is not a valid Unicode encoding any more, because there are several sets of characters encoded outside the BMP. The spec should be updated to require UTF-16 support in all implementations.
If a modern programming language like JavaScript doesn't provide a way to represent characters outside the BMP in its character data type, that needs to be fixed too. Indexing and counting characters in a JavaScript string need to reflect the human and Unicode notion of characters, not the arbitrary 2-byte blocks that UCS-2 happens to use.
The language authors should be ashamed of this situation - having a modern language without proper Unicode support is simply awful.
By writing up this pile of excuses, this guy wasted a substantial amount of time that he could have better spent researching the issue. Your consumers don't care how many bits an emoji takes, and it doesn't matter to them that you're running your infrastructure on poorly chosen software - there is absolutely no excuse for not supporting this in a native iOS app, especially now that emoji are so widely used and deeply integrated in iOS.
How is that a problem they are focusing on, anyway, when their landing page features awful, out of date mockups of the app? (not even actual screenshots - notice the positions of menu bar items) They are also featuring Emoji in every screenshot - ending support might be a fresh development, but I still find that ironic.
JavaScript is a joke in this respect, and is keeping horrors like Shift-JIS alive long after they should have been retired.
My question is why does V8 (or anything else) still use UCS-2?
APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).
Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2-with-visible-surrogates). Like many other languages (e.g. Java), its strings were created during, and inherited from, Unicode 1.0, which fit in 16 bits (UTF-16 is a retrofitting of Unicode 1.0's fixed width, accommodating the full range of later Unicode versions by adding surrogate pairs).
That was actually really fun to read, even as a now non-technical guy. I can't put a finger on it, but there was something about his style that gave off a really friendly vibe even through all the technical jargon. That's a definite skill!
[edit] Node currently uses V8 version 3.11.10.25, which was released after this fix was made, but not sure if the fix was merged to trunk
[edit2] actually, looks like it has, though I can't identify the merge commit
- The spec says UCS2 or UTF16. Those are the only options.
- UCS2 allows random access to characters, UTF-16 does not.
- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!
- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers)
- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version, an encoding, and one engine has to run everything on the web. No migration path to other encodings!
I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).
http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...
If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)
Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.
See https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2... and https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2....
http://news.ycombinator.com/item?id=4834731
I want to agree with you simply because I don't like Node, but it's hardly fair to damn something over a bug that was fixed 9 months ago.
I'm pretty sure work on V8 started after 1996.
More likely is the idea that the authors of V8 felt that UCS-2 was an acceptable speed/correctness trade-off.
> Here's a little known secret about MS-DOS. The DOS developers weren't particularly happy about this state of affairs - heck, they all used Xenix machines for email and stuff, so they were familiar with the *nix command semantics. So they coded the OS to accept either "/" or "\" character as the path character (this continues today, btw - try typing "notepad c:/boot.ini" on an XP machine (if you're an admin)). And they went one step further. They added an undocumented system call to change the switch character. And updated the utilities to respect this flag.
[1] http://blogs.msdn.com/b/larryosterman/archive/2005/06/24/432...
Or am I missing something obvious here?
In JavaScript, a string is a series of UTF-16 code units, so the smiley face is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a length-2 string as far as regexes, etc., which is too bad. But even though you don't get the character-counting APIs you might want, the JavaScript engine knows this is a surrogate pair and represents a single code point (character). (It just doesn't do much with this knowledge.)
You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome, Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and Firefox, you don't, but the empty space is selectable and even copy-and-pastable as a smiley! So the character is actually there, it just doesn't render as a smiley.
The bug that may have been present in V8 or Node is: what happens if you take this length-2 string and write it to a UTF8 buffer, does it get translated correctly? Today, it does.
What if you put the smiley directly into a string literal in JS source code, not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.
And I'm saying it doesn't really matter, because unicode codepoints are already a form of "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is a tentative standardization of historical human production, and if you expect that to end up clean and simple you're going to have a hard time.
> Can one "character" span multiple codepoints?
Yes.
> Do you have an example of this?
Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्) or maybe yet another different count, but this is a sequence of 9 codepoints (regardless of the normalization, it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell):
स: DEVANAGARI LETTER SA
ं: DEVANAGARI SIGN ANUSVARA
स: DEVANAGARI LETTER SA
्: DEVANAGARI SIGN VIRAMA
क: DEVANAGARI LETTER KA
ृ: DEVANAGARI VOWEL SIGN VOCALIC R
त: DEVANAGARI LETTER TA
म: DEVANAGARI LETTER MA
्: DEVANAGARI SIGN VIRAMA
Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond knowing that it's troublesome for computers, as are jamo), so I can't even tell you how many "symbols" a native reader would see there.

So emoji were probably invented by J-Phone, while Softbank was mostly taking care of Yahoo Japan.
Is there a reason that the workaround in comment 8 won't address some of these issues?
If you read closely you'll see the original linked message is from January and there's an update on that issue from March when a fix was made in V8.
http://www.emoji-cheat-sheet.com/
Before I read the article I guessed that maybe the icon set had some licensing issues for Github. Luckily, not so! (:smiley:)
Good rule of thumb for implementers: get over it and use 32 bits internally. Always use UTF-8 when encoding into a byte stream. Add UTF-16 encoding if you must interface with archaic libraries.
There's no such thing as "all done": Unicode 1.0 was 16-bit, and Unicode 6 was released recently.
All that aside: emoji should not be in Unicode. Fullstop.
* emoji were invented by NTT DoCoMo, not Softbank
* even if that had been right Softbank's copyrighting of their emoji representations has no bearing on NTT and KDDI/au using completely different implementations (and I do mean completely, KDDI/au essentially use <img> tags)
* lack of cooperation is endemic to japanese markets (especially telecoms) and has nothing to do with "ganging up"
* if NTT and au/KDDI wanted to gang up on Softbank you'd think they'd share the same emoji
* you didn't have to run "adware apps" to unlock the emoji keyboard (there were numerous ways to do so from dedicated — and usually quickly nuked – apps to apps "easter eggs" to jailbreak to phone backup edit/restore)
That's barely the first third.
Edit: v8 in general is pretty cool, but not supporting Unicode outside UCS-2 is pretty bad.
Good on the V8 developers for recognizing these conditions that their code didn't fully handle and refusing to muddle on through with broken processing.
:)
Problem solved. Why is this front page material (#6 as of this writing)?