UTF-8 does not have this problem. That's the way we should be moving.
JS's treatment of strings is even more wacky than you might think: it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to JS code makes them look like UCS-2 strings (i.e. no awareness of surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.
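To make the "looks like UCS-2" point concrete, here is a minimal sketch using the dragon character (U+1F409) discussed in this thread. It assumes an ES2015-or-later engine, since `\u{...}` escapes, `codePointAt` and `Array.from` did not exist in older engines; the variable names are just for illustration.

```javascript
// U+1F409 DRAGON lies outside the BMP, so it is stored as a surrogate pair.
const dragon = "\u{1F409}";

// The classic string API counts UTF-16 code units, not characters.
const codeUnits = dragon.length;                       // 2
const firstUnit = dragon.charCodeAt(0).toString(16);   // "d83d" (leading surrogate)
const secondUnit = dragon.charCodeAt(1).toString(16);  // "dc09" (trailing surrogate)

// The ES2015 additions are code-point aware.
const codePoint = dragon.codePointAt(0).toString(16);  // "1f409"
const codePoints = Array.from(dragon).length;          // 1

console.log({ codeUnits, firstUnit, secondUnit, codePoint, codePoints });
```

So one visible character reports a length of 2 through the old API, which is exactly where index-based slicing starts to go wrong.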
See [1] for a very clear explanation of this muddy subject.
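I like the design of Python 3.3's string representation. ASCII takes 1 byte per character, BMP takes 2 bytes, everything else 4 bytes, and each string uses the narrowest width that fits its widest character.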
2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.
Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875
I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
http://www.fileformat.info/info/unicode/char/1f409/index.htm
Also, since any ASCII dragon is also a valid Unicode dragon (in UTF-8, at least), the following might satisfy your needs:
To see this dragon, either:
1. Use Safari or Firefox on OS X.
2. Install custom fonts for Linux or Windows.
3. Install https://chrome.google.com/webstore/detail/chromoji-emoji-for... for Chrome
Also: didn't know that for every emoji there is https://en.wikipedia.org/wiki/🐉
On the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...
> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.
To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...
Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
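A rough sketch of that failure and one way around it. `chunkSafely` is a hypothetical helper written for this comment, not something from the blog post or any library; it assumes chunk sizes are measured in UTF-16 code units and simply backs off by one unit when a chunk would end on a leading surrogate.

```javascript
// Hypothetical helper: split `text` into chunks of at most `size` UTF-16
// code units without separating a leading surrogate from its trailing one.
function chunkSafely(text, size) {
  if (size < 2) throw new RangeError("size must be at least 2");
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    const last = text.charCodeAt(end - 1);
    // If the chunk would end on a leading surrogate (0xD800-0xDBFF) and
    // more text follows, end the chunk one code unit earlier.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) {
      end -= 1;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

const text = "ab\u{1F409}cd";        // 6 code units, 5 characters

// Naive fixed-size slicing cuts the dragon in half.
const naive = text.slice(0, 3);      // ends with a lone 0xD83D

// The surrogate-aware version keeps the pair together.
const safe = chunkSafely(text, 3);

console.log(naive.charCodeAt(2).toString(16), safe);
```

The chunks come out uneven in length, which is the price of never emitting half a character.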
1. One of our customers' JavaScript apps sent a truncated string to their web server in a JSON payload. This string ended with a leading surrogate (this is another instance of the V8 bug discussed in the blog post).
2. Their Ruby backend exploded when they tried to use a regular expression on the string (because Ruby's regexp library is strict about valid UTF-8).
3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (Ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid UTF-8, another bug :p).
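The chain above starts with a string that ends on a leading surrogate, and that is cheap to detect on the sending side before it ever reaches JSON. A minimal sketch, where `indexOfLoneSurrogate` is a hypothetical helper name (not part of any library mentioned here) and what to do once one is found is left to the caller:

```javascript
// Hypothetical helper: return the index of the first unpaired surrogate
// code unit in `str`, or -1 if every surrogate is part of a valid pair.
function indexOfLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: must be followed by a trailing surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the trailing half
      } else {
        return i;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trailing surrogate with no leading surrogate before it.
      return i;
    }
  }
  return -1;
}

const intact = "ab\u{1F409}";
const truncated = intact.slice(0, 3); // cuts the pair, leaving a lone 0xD83D

console.log(indexOfLoneSurrogate(intact), indexOfLoneSurrogate(truncated));
```

Returning the index rather than quietly patching the string keeps the decision (reject, trim, or escape) visible to whoever is building the payload.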
So really, they were parsing the JSON as if it were UTF-16, but really it was UCS-2. How is that an error in Node?
If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.
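For what it's worth, the base64 approach is only a few lines. This sketch assumes Node.js (it relies on the global `Buffer`), and the `payload` field name and the sample bytes are made up for illustration:

```javascript
// Arbitrary bytes, including some that are not valid UTF-8 on their own.
const original = Buffer.from([0xff, 0x00, 0xed, 0xa0, 0xbd]);

// Encode the bytes as base64 so the JSON document contains only ASCII text.
const json = JSON.stringify({ payload: original.toString("base64") });

// The receiver parses ordinary JSON, then decodes the field back into bytes.
const decoded = Buffer.from(JSON.parse(json).payload, "base64");

console.log(json, decoded.equals(original));
```

The JSON stays valid no matter what the bytes are, at the cost of roughly a third more payload size.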
If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?
The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
I wonder... at some point, JavaScript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to JavaScript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?
XML doesn't even allow escaped null bytes, so you're basically forced to use base64 or weird custom app-internal escapes.
JSON never tracked JavaScript. It has one version, period. But you could get people to adopt a superset with a new data type, if you kept it simple.
If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.
So I wrote a gem for it: https://github.com/jrochkind/ensure_valid_encoding