What developers should know about Unicode and character sets in 2013 (opens in new tab)

(the-pastry-box-project.net)

47 pointsoyvindeh12y ago20 comments

20 comments

16 comments · 8 top-level

PeterisP12y ago· 4 in thread

The concluding statement is a bit wierd: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"

That isn't accurate, ASCII text would appear identical even if 'you view the hex', because it is identical in UTF-8, that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.

ygra12y ago

Notepad also doesn't save as ASCII by default but »ANSI«, the default legacy codepage configured for your Windows installation.

apaprocki12y ago

Yes, the default Windows code page -- many pieces of software don't realize that registry keys, file paths, etc. are all encoded in a different code page if you are running, for example, Japanese Windows. (Also, it isn't exactly Shift-JIS...)

lambda12y ago

Yeah, I think he might have meant ISO-8859-1 or Windows 1252 rather than ASCII; but still, all of those characters would take up two bytes in UTF-8, not 3, unless you used combining diacritics rather than precomposed.

jrochkind112y ago

all those characters? You mean except the straight ascii-compatible ones, which will just take up one byte.

1 more reply

golergka12y ago· 2 in thread

Oh, one more fun fact: some emoji characters occupy more than one _Unicode_ character, and can be encoded in different ways depending on the device that uses them. (Before they were introduced into Unicode, they used character codes designated for custom platform-specific stuff).

Debugging a text input field where user can enter emoji & RTL text is FUN.

twic12y ago

Are there really multi-character emoji? Or is it that they are single characters on an astral plane which are encoded as two code units in UTF-16, and therefore behave rather like two characters if your language uses 16-bit chars?

golergka12y ago

Several characters, yes. And those characters, in turn, can be presented as low and hi surrogate pairs in UTF-16.

http://apps.timwhitlock.info/emoji/tables/unicode

Look for flags and numbers. Here's German flag in ASCII: \xF0\x9F\x87\xA9\xF0\x9F\x87\xAA 8 bytes, 2 unicode symbols, 4 UTF-16 symbols.

2 more replies

jrochkind112y ago· 1 in thread

> Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8).

Um, what? This is just wrong. ascii-equivalent characters only take one byte in UTF-8. Other characters may take two, or three, bytes.

If the author actually viewed text in ascii that, when in UTF-8, had three-bytes per character.... I don't know what they were looking at, but it wasn't UTF-8.

jrochkind112y ago

Also, if the data is ASCII, and includes only legal 7-bit ASCII characters -- it is simultaneously ALSO valid and legal UTF-8. UTF-8 is a superset of ASCII.

I'm not sure this guy understands what he's talking about.

ohwp12y ago· 1 in thread

Note that some browser do use the <meta charset="UTF-8"> even if the content-type header already sent the charset.

Another thing to add: always open a database connection in the charset of choice. And if you are a PHP user (like I am): there are still functions that don't support multibyte so be careful.

oneeyedpigeon12y ago

This is the biggest current driver towards me trying to muster the effort to move off of PHP. Also, I had no end of trouble working with filenames that contained UTF-8 characters using PHP, and had to give up in the end.

VLM12y ago

Some background not covered in an otherwise pretty good article:

"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8, and historically could cause problems."

This attitude comes from agony in processing from UTF-16 files. I interface with a group that finds it hilarious to send me textual data in UTF-16 format and the first hard won lesson you learn with UTF-16 is superficially the default order should be correct 50% of the time if guessed randomly but somehow its always wrong. So say you read one line of a UTF-16 text file and process it accordingly after passing it thru a UTF-16 decoder. OK no problemo, it had a BOM as the first glyph/byte/character/whatever and was converted and interpreted correctly. Then you read another line, just like you'd read a line process a line with ASCII or UTF-8. However they only give me a BOM at the start of a file not a start of line, so invariably I translate that to garbage because the bytes are swapped.

Now there are program methods to analyze the BOM and memorize it. Or read the whole blasted multi-gig file into memory all at once and then de-UTF-16 it all at once and then line by line the file. But fundamentally its a simple one liner sysadmin type job to just shove the file thru a UTF-16 to UTF-8 translator program before it hits my processing system. I already had to unencrypt it, and unzip it, and verify its hash so I know they sent the whole file to me (and correctly), so adding a conversion stage is no big deal.

And this kind of UTF-16 experience is what leads people to do things like say "oh, its unicode? That means I should squirt out BOMs as often as possible" even though that technically only applies to unicode UTF-16 and is not helpful for UTF-8.

danso12y ago

I hate to be "that SEO guy", but the OP needs to do some SEO. The submitted title here is nowhere to be seen, which is too bad because it's a great title and one that I would try to Google after forgetting to bookmark this page.

Luckily I do use Pinboard, which auto-grabs the title, if it existed. But this is a helpful reference to many devs who don't read HN, and it's all but obscured.

ygra12y ago

Site appears to be down; Google cache: http://webcache.googleusercontent.com/search?q=cache:A8oNdl-...

hcarvalhoalves12y ago

> While there are a ton of encodings you could use, for the web use UTF-8. You want to use UTF-8 for your entire stack. So how do we get that?

You should use your language's internal unicode representation, and decode from/encode to UTF-8 on I/O.

j / k navigate · click thread line to collapse

20 comments

16 comments · 8 top-level

PeterisP12y ago· 4 in thread

The concluding statement is a bit wierd: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"

ygra12y ago

Notepad also doesn't save as ASCII by default but »ANSI«, the default legacy codepage configured for your Windows installation.

apaprocki12y ago

lambda12y ago

jrochkind112y ago

all those characters? You mean except the straight ascii-compatible ones, which will just take up one byte.

1 more reply

golergka12y ago· 2 in thread

Debugging a text input field where user can enter emoji & RTL text is FUN.

twic12y ago

golergka12y ago

Several characters, yes. And those characters, in turn, can be presented as low and hi surrogate pairs in UTF-16.

http://apps.timwhitlock.info/emoji/tables/unicode

Look for flags and numbers. Here's German flag in ASCII: \xF0\x9F\x87\xA9\xF0\x9F\x87\xAA 8 bytes, 2 unicode symbols, 4 UTF-16 symbols.

2 more replies

jrochkind112y ago· 1 in thread

> Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8).

Um, what? This is just wrong. ascii-equivalent characters only take one byte in UTF-8. Other characters may take two, or three, bytes.

If the author actually viewed text in ascii that, when in UTF-8, had three-bytes per character.... I don't know what they were looking at, but it wasn't UTF-8.

jrochkind112y ago

Also, if the data is ASCII, and includes only legal 7-bit ASCII characters -- it is simultaneously ALSO valid and legal UTF-8. UTF-8 is a superset of ASCII.

I'm not sure this guy understands what he's talking about.

ohwp12y ago· 1 in thread

Note that some browser do use the <meta charset="UTF-8"> even if the content-type header already sent the charset.

Another thing to add: always open a database connection in the charset of choice. And if you are a PHP user (like I am): there are still functions that don't support multibyte so be careful.

oneeyedpigeon12y ago

VLM12y ago

Some background not covered in an otherwise pretty good article:

"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8, and historically could cause problems."

danso12y ago

Luckily I do use Pinboard, which auto-grabs the title, if it existed. But this is a helpful reference to many devs who don't read HN, and it's all but obscured.

ygra12y ago

Site appears to be down; Google cache: http://webcache.googleusercontent.com/search?q=cache:A8oNdl-...

hcarvalhoalves12y ago

> While there are a ton of encodings you could use, for the web use UTF-8. You want to use UTF-8 for your entire stack. So how do we get that?

You should use your language's internal unicode representation, and decode from/encode to UTF-8 on I/O.

j / k navigate · click thread line to collapse