Node's Unicode Dragon

(cirw.in)

94 points · foobar2k · 12y ago · 63 comments

stormbrew · 12y ago · 8 in thread

Wish I'd known about this when I was pointing out in another HN thread how utf-16 is a terrible encoding for, among other reasons, pushing the corner case where you find out your encoding/decoding is broken to the very edge of likelihood. It's ridiculous that v8 doesn't properly support utf16, but it's to be expected I suppose.

UTF-8 does not have this problem. That's the way we should be moving.

ender7 · 12y ago

This behavior is actually part of the ECMAScript standard [0], so it's unlikely that V8 (or any other conformant JS engine) would behave the way you (and many others) would want.

JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. no surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.

See [1] for a very clear explanation of this muddy subject.

[0] http://www.ecma-international.org/ecma-262/5.1/#sec-8.4

[1] http://mathiasbynens.be/notes/javascript-encoding
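
A minimal sketch of the "looks like UCS-2" behavior described above, in TypeScript. The variable names are made up for illustration; only standard `String` methods are used.

```typescript
// U+1F409 DRAGON, written as its two UTF-16 code units.
const dragonStr: string = "\uD83D\uDC09";

// The string API counts 16-bit code units, not characters.
const unitCount: number = dragonStr.length;

// Indexing hands back the individual surrogates.
const firstUnit: number = dragonStr.charCodeAt(0);  // lead surrogate
const secondUnit: number = dragonStr.charCodeAt(1); // trail surrogate

// Each half on its own is a lone surrogate, which is not valid UTF-16.
const loneLead: string = dragonStr.charAt(0);

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16), loneLead.length);
```

So one visible character reports a length of 2, and nothing stops code from pulling the two halves apart.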

stormbrew · 12y ago

That is all incredibly depressing.

sillysaurus2 · 12y ago

This. Why doesn't everybody use UTF-8? Nobody seems to have any problems with UTF-8. It seems to work almost perfectly, and it's efficient.

est · 12y ago

Because some of us are pissed that some BMP characters take 3 bytes in UTF-8; that's 50% more storage space and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/
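
For a rough sense of the size trade-off in the comment above, here is a small TypeScript sketch that counts bytes for both encodings. The helper names `utf8ByteLength` and `utf16ByteLength` are hypothetical, and a lone surrogate is assumed to be encoded as U+FFFD (3 bytes).

```typescript
// Count the bytes a JS string needs in UTF-8, walking its UTF-16 code units.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // valid surrogate pair: one code point outside the BMP
      i++;        // skip the trail surrogate
    } else {
      bytes += 3; // rest of the BMP (assumption: a lone surrogate becomes U+FFFD)
    }
  }
  return bytes;
}

// UTF-16 always uses two bytes per code unit.
function utf16ByteLength(s: string): number {
  return s.length * 2;
}

const cjkSample = "\u65E5\u672C\u8A9E"; // three BMP characters above U+07FF
const asciiSample = "abc";

console.log(utf8ByteLength(cjkSample), utf16ByteLength(cjkSample));
console.log(utf8ByteLength(asciiSample), utf16ByteLength(asciiSample));
```

The 50% figure holds for text made entirely of BMP characters above U+07FF; for ASCII-heavy text the comparison flips, with UTF-16 taking twice the space.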

pcwalton · 12y ago

1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.

2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.

millstone · 12y ago

Did you read the article? The problem occurs precisely because V8 mishandles UTF-8.

Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875

ximeng · 12y ago

A lot of Windows is UTF-16 or UCS-2, including Office, which forces their use for working with APIs or transferring data.

millstone · 12y ago

Why do you think that UTF-16's corner cases, by which you presumably mean surrogate pairs, are less likely than UTF-8's corner cases, like invalid code units and non-shortest forms?

I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
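
To illustrate the UTF-8 corner cases named above (invalid code units and non-shortest forms), here is a sketch of a strict validator in TypeScript. The function name `isWellFormedUtf8` is hypothetical, and each array element is assumed to be a byte in the range 0..255.

```typescript
// Returns true only if `bytes` is well-formed UTF-8.
function isWellFormedUtf8(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) {
      i++; // single-byte ASCII
      continue;
    }
    let needed: number; // continuation bytes expected after the lead byte
    let min: number;    // smallest code point this sequence length may encode
    let cp: number;     // code point being assembled
    if (b0 >= 0xc0 && b0 <= 0xdf) {
      needed = 1; min = 0x80; cp = b0 & 0x1f;
    } else if (b0 >= 0xe0 && b0 <= 0xef) {
      needed = 2; min = 0x800; cp = b0 & 0x0f;
    } else if (b0 >= 0xf0 && b0 <= 0xf7) {
      needed = 3; min = 0x10000; cp = b0 & 0x07;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (i + needed >= bytes.length) return false; // truncated sequence
    for (let k = 1; k <= needed; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) !== 0x80) return false; // not a continuation byte
      cp = (cp << 6) | (b & 0x3f);
    }
    if (cp < min) return false;                     // non-shortest (overlong) form
    if (cp > 0x10ffff) return false;                // beyond the Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // encoded surrogate
    i += needed + 1;
  }
  return true;
}

console.log(isWellFormedUtf8([0xf0, 0x9f, 0x90, 0x89])); // U+1F409
console.log(isWellFormedUtf8([0xc0, 0xaf]));             // overlong "/"
```

The overlong `"/"` is the kind of input behind the security concern: a filter that inspects raw bytes for `0x2F` will not see it, while a lenient decoder further down the line may still turn it into a slash.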

baddox · 12y ago · 5 in thread

Despite that being a rather interesting technical article, I am upset that my expectation of an actual Unicode depiction of a dragon was not met.

greenyoda · 12y ago

There is actually a Unicode dragon character at code point U+1F409:

http://www.fileformat.info/info/unicode/char/1f409/index.htm

Also, since any ASCII dragon is also a valid Unicode dragon (in UTF-8, at least), the following might satisfy your needs:

http://www.dougsartgallery.com/ascii-art-dragon.html
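
Since U+1F409 sits outside the BMP, UTF-16 has to store it as a surrogate pair. A sketch of the arithmetic in TypeScript (`toSurrogatePair` is a hypothetical helper name):

```typescript
// Split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const lead = 0xd800 + (offset >> 10);    // top 10 bits
  const trail = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [lead, trail];
}

const [dragonLead, dragonTrail] = toSurrogatePair(0x1f409);
const dragonFromUnits: string = String.fromCharCode(dragonLead, dragonTrail);

console.log(dragonLead.toString(16), dragonTrail.toString(16), dragonFromUnits);
```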

cirwin · 12y ago

🐉

To see this dragon, either:

1. Use Safari or Firefox on OS X.

2. Install custom fonts for Linux or Windows.

3. Install https://chrome.google.com/webstore/detail/chromoji-emoji-for... for Chrome

pavlov · 12y ago

The dragon glyph is rendered correctly in IE10 on Windows 8 without any custom fonts. Hooray for the most underestimated browser ever ;)

Wilya · 12y ago

Next time I have some "Here be dragons" code, I'm going to use this.

lelf · 12y ago

There is also 🐲 U+1F432 DRAGON FACE

Also: didn't know that for every emoji there is https://en.wikipedia.org/wiki/🐉

dsj36 · 12y ago · 4 in thread

how did the error JSON include the undecodable bytes? JSON strings are all unicode sequences, so there would have had to be some way that the raw bytes were mapped into codepoints.

on the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...

jlarocco · 12y ago

From the article:

> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.

To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...
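
The split described in the quoted passage is easy to reproduce. Below is a TypeScript sketch that shows `substr` cutting a pair in half, followed by one possible guard. `safeTruncate` is a hypothetical helper, not something from the article, and it assumes a non-negative `maxUnits`.

```typescript
function isLeadSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isTrailSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Truncate to at most `maxUnits` code units without cutting a surrogate pair.
function safeTruncate(s: string, maxUnits: number): string {
  if (s.length <= maxUnits) return s;
  let end = maxUnits;
  // If the cut falls between a lead and its trail, back up by one unit.
  if (end > 0 && isLeadSurrogate(s.charCodeAt(end - 1)) && isTrailSurrogate(s.charCodeAt(end))) {
    end--;
  }
  return s.slice(0, end);
}

const sampleText = "ab\uD83D\uDC09cd";

// Naive truncation keeps "ab" plus a lone lead surrogate.
const naiveCut: string = sampleText.substr(0, 3);

// The guarded version drops the whole pair instead.
const guardedCut: string = safeTruncate(sampleText, 3);

console.log(naiveCut.length, guardedCut.length);
```

`slice` and `substring` split pairs the same way `substr` does, since all three index by code unit.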

twoodfin · 12y ago

These kinds of isolated surrogates are pretty easy to create if you're doing the right kind of processing on the right kind of data.

Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
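
A sketch of the fixed-size chunking scenario in TypeScript, with a guard that moves the chunk boundary when it would land inside a pair. `chunkByCodeUnits` is a hypothetical name; the minimum size of 2 is an assumption that keeps the loop making progress.

```typescript
// Split `s` into chunks of at most `size` code units, never between a lead
// surrogate and the trail surrogate that follows it.
function chunkByCodeUnits(s: string, size: number): string[] {
  if (size < 2) throw new RangeError("size must be at least 2");
  const chunks: string[] = [];
  let start = 0;
  while (start < s.length) {
    let end = Math.min(start + size, s.length);
    if (end < s.length) {
      const last = s.charCodeAt(end - 1);
      const next = s.charCodeAt(end);
      const splitsPair =
        last >= 0xd800 && last <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
      if (splitsPair) end--; // leave the lead surrogate for the next chunk
    }
    chunks.push(s.slice(start, end));
    start = end;
  }
  return chunks;
}

const chunkInput = "a\uD83D\uDC09b\uD83D\uDC09";

// Naive chunking: the first chunk ends in a lone lead surrogate.
const naiveFirstChunk: string = chunkInput.slice(0, 2);

// Guarded chunking keeps every pair intact.
const guardedChunks: string[] = chunkByCodeUnits(chunkInput, 2);

console.log(guardedChunks.length);
```

The trade-off is that chunks are no longer all the same length, which matters if anything downstream relies on a fixed size.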

cirwin · 12y ago

In the example I looked at to debug this, the sequence of events was:

1. One of our customers' JavaScript apps sent a truncated string to their web-server in a JSON payload. This string ended with a leading surrogate (this is another instance of the V8 bug discussed in the blog post).

2. Their ruby backend exploded when they tried to use a regular expression on the string (because ruby's regexp library is strict about valid utf-8).

3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid utf8 — another bug :p).
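
One way to stop step 1 at the source is to scrub lone surrogates before the string is serialized. The TypeScript sketch below is an assumption about how that could look, not the fix the article describes; `replaceLoneSurrogates` is a hypothetical name.

```typescript
// Replace every unpaired surrogate with U+FFFD REPLACEMENT CHARACTER.
function replaceLoneSurrogates(s: string): string {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s.charAt(i) + s.charAt(i + 1); // valid pair: keep both halves
        i++;
      } else {
        out += "\uFFFD"; // lead surrogate with no trail
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      out += "\uFFFD"; // trail surrogate with no lead
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}

// A truncated string ending in a lead surrogate, as in step 1.
const truncatedParam = "hi\uD83D";
const cleanedParam: string = replaceLoneSurrogates(truncatedParam);

console.log(JSON.stringify({ param: cleanedParam }));
```

Newer JavaScript runtimes ship `String.prototype.toWellFormed`, which performs the same replacement; the manual loop is only needed where that method is unavailable.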

sujayakar · 12y ago

ah yeah step 3 seems pretty bad -- cool that you found that bug!

shawnz · 12y ago · 3 in thread

> Unfortunately for us, Javascript has never been updated to support UTF-16. Instead it continues to treat strings as UCS-2.

So really, they were parsing the JSON as if it were UTF-16, but really it was UCS-2. How is that an error in Node?

justincormack · 12y ago

JSON is defined as UTF-8, UTF-16 or UTF-32 [1]. The escaped characters are UTF-16, not UCS-2. It is unfortunate if JavaScript can't parse it correctly!

[1] http://www.ietf.org/rfc/rfc4627.txt
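
A quick TypeScript sketch of how the `\u` escapes behave in practice. `JSON.parse` is the standard built-in; the variable names are made up for illustration.

```typescript
// A character outside the BMP is escaped as two \u sequences, one per surrogate.
const parsedDragon: string = JSON.parse('"\\ud83d\\udc09"');

// A single escaped surrogate is accepted too, producing a lone surrogate.
const parsedLone: string = JSON.parse('"\\ud83d"');

console.log(parsedDragon.length, parsedLone.length);
```

The parser combines the escaped pair into one dragon, but it does not reject the unpaired escape, so an ill-formed string can pass through JSON untouched.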

kansface · 12y ago

This is true of JSON, but it's not true of Javascript which gives no fucks about utf16 (or valid surrogate pairs). It's a very strange world where JSON and Javascript have incompatible interpretations of strings.

http://mathiasbynens.be/notes/javascript-encoding

kansface · 12y ago

They wanted to parse some bytes as utf-16, but are unable to do so because V8 only understands ucs2 (with invalid surrogate pairs). This is a major problem with node, i.e., it happily produces/consumes invalid unicode encoded strings.

justin_vanw · 12y ago · 2 in thread

Man, I'm starting to think there is a cult around JSON.

If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.

If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?

The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
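
For reference, the base64 approach suggested above might look like this in TypeScript. The encoder is hand-rolled so the sketch has no runtime dependencies; `base64Encode` and the payload field names are hypothetical, and each array element is assumed to be a byte in the range 0..255.

```typescript
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 with "=" padding.
function base64Encode(bytes: number[]): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasB1 = i + 1 < bytes.length;
    const hasB2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasB1 ? bytes[i + 1] : 0;
    const b2 = hasB2 ? bytes[i + 2] : 0;
    // Three input bytes become four 6-bit groups.
    out += B64_ALPHABET.charAt(b0 >> 2);
    out += B64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    out += hasB1 ? B64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : "=";
    out += hasB2 ? B64_ALPHABET.charAt(b2 & 0x3f) : "=";
  }
  return out;
}

// "hi" followed by bytes that are not valid UTF-8 (an encoded lone surrogate).
const rawParamBytes = [0x68, 0x69, 0xed, 0xa0, 0xbd];

// The JSON document itself stays plain ASCII, whatever the bytes were.
const reportPayload: string = JSON.stringify({
  param: base64Encode(rawParamBytes),
  encoding: "base64",
});

console.log(reportPayload);
```

The original bytes survive exactly, at the cost of roughly a third more space and a payload that is no longer readable without decoding.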

derefr · 12y ago

Any wire-serialization format that wants to send arbitrary data should really have a "raw binary payload" type. XML has CDATA. ASN.1 has bitstrings. BERT has Binaries. But JSON doesn't really have anything like that.

I wonder... at some point, Javascript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to Javascript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?

Dylan16807 · 12y ago

CDATA disallows null bytes, so it's even worse than non-support: illusory support

XML doesn't even allow escaped null bytes, so you're basically forced to use base64 or weird custom app-internal escapes.

JSON never tracked javascript. It has one version, period. But you could get people to adopt a superset with a new data type, if you kept it simple.

bsaul · 12y ago · 2 in thread

Reminds me of a previous discussion about Go being more "mature" than node.js, where I said having someone like Pike on board gives you more than 30 years of "maturity". I'm pretty sure you wouldn't find that kind of leaky UTF encoding handling in Go.

ygra · 12y ago

Well, Node builds atop an established language, while Go is a new development. It's probably easier to build sane Unicode semantics into a new language than to change the JS spec.

pjscott · 12y ago

Since Rob Pike and Ken Thompson are the guys who came up with UTF-8, you'd expect them to write decent Unicode encoding for Go. It would be surprising if they didn't.

nonchalance · 12y ago

String encoding in general is a mess. Wait till you get to code pages. Incidentally, the largest JS script I've ever seen pertained to encoding and decoding characters under various codepages: https://raw.github.com/Niggler/js-codepage/master/cptable.js [github complains "(Sorry about that, but we can't show files that are this big right now.)"]

jrochkind1 · 12y ago

The OP describes an environment where data goes from node to Rails.

If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.

So I wrote a gem for it: https://github.com/jrochkind/ensure_valid_encoding

state · 12y ago

Whew. This explains a bug from six months ago that drove me up the wall. I could never figure it out.

scoopr · 12y ago

This same problem manifests with Java as well, where some methods that claim to return UTF-8 actually return “modified UTF-8” on closer inspection, which is broken the same way. Notably, I ran across this with the JNI function GetStringUTFChars, but you may also come across it in DataOutputStream's writeUTF, etc.
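
To show what “modified UTF-8” does differently, here is a TypeScript sketch that encodes each UTF-16 code unit on its own, which is how that format treats surrogates, and writes NUL as two bytes. `encodePerCodeUnit` is a hypothetical name and this is an illustration of the byte layout, not Java's actual implementation.

```typescript
// Encode every UTF-16 code unit separately instead of combining surrogate pairs.
function encodePerCodeUnit(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u === 0) {
      out.push(0xc0, 0x80); // modified UTF-8 writes NUL as an overlong pair
    } else if (u < 0x80) {
      out.push(u);
    } else if (u < 0x800) {
      out.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    } else {
      // Surrogates land here and get three bytes each.
      out.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
    }
  }
  return out;
}

// Standard UTF-8 for U+1F409 is the four bytes F0 9F 90 89.
const modifiedDragonBytes: number[] = encodePerCodeUnit("\uD83D\uDC09");

console.log(modifiedDragonBytes.map((b) => b.toString(16)).join(" "));
```

The result is six bytes holding two encoded surrogates, which a strict UTF-8 decoder rejects.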

scott_karana · 12y ago

Is it just me, or is the two-column layout a bit tricky for readability?

(1440x900)

oceanstone · 12y ago

I can't believe NodeJS doesn't support Dragon symbols. This is a dealbreaker.
