UTF-8 Encoding Debugging Chart

(i18nqa.com)

116 points · tard · 10y ago · 11 comments

rspeer · 10y ago

The fortunate thing is, almost all of the broken sequences are unambiguous enough to be signs that the text should be encoded and then re-decoded as UTF-8. (This is not the case with any arbitrary encoding mixup -- if you mix up Big5 with EUC-JP, you might as well throw out your text and start over -- but it works for UTF-8 and the most common other encodings because UTF-8 is well-designed.)

So if you want a Python library that can do this automatically with an extremely low rate of false positives: https://github.com/LuminosoInsight/python-ftfy
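The encode-then-re-decode repair can be sketched with the standard library alone. This is a minimal sketch and not how ftfy works internally: the helper name `fix_mojibake` and the default guess of Windows-1252 as the wrong codec are assumptions made for illustration.

```python
def fix_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo one round of UTF-8 bytes having been decoded with the wrong codec.

    Hypothetical helper; `wrong_encoding` is a guess supplied by the caller.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean round trip, so the text was probably fine: leave it alone.
        return text


garbled = "café".encode("utf-8").decode("cp1252")  # simulate the mixup
print(garbled)
print(fix_mojibake(garbled))
```

Text that was already correct, such as "café" itself, comes back unchanged, because its Windows-1252 bytes are not valid UTF-8. That strictness is what keeps false positives low in a sketch like this.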

pixelbeat · 10y ago

I previously wrote about this common double-encoding issue at http://www.pixelbeat.org/docs/unicode_utils/ which references tools and techniques for fixing up such garbled data.

plank · 10y ago

Missing: how the default substitutions go wrong between UTF-8 and EBCDIC. For example, a UTF-8 character outside the MES2 subset ('latin1') gets mapped to EBCDIC's x3F 'unknown' character, which then gets mapped back to the x1A character ('Ctrl-Z') in UTF-8...
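The round trip described here can be reproduced in Python. This is a sketch under assumptions: it uses code page 037 as the EBCDIC variant, and `to_ebcdic` is a hypothetical converter that substitutes x3F for anything it cannot encode (Python's own `errors="replace"` substitutes a question mark instead).

```python
EBCDIC_SUB = b"\x3f"  # EBCDIC substitute character, the 'unknown' byte


def to_ebcdic(text: str, codec: str = "cp037") -> bytes:
    """Encode to EBCDIC, replacing unmappable characters with x3F.

    Hypothetical converter; real gateways differ in their substitution rules.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += EBCDIC_SUB
    return bytes(out)


ebcdic = to_ebcdic("A€B")      # the euro sign is not in cp037
back = ebcdic.decode("cp037")  # x3F decodes to U+001A, i.e. Ctrl-Z
print(ebcdic.hex(), repr(back))
```

The original character is gone after the first conversion, so unlike the mojibake case there is nothing to recover on the way back.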

julie1 · 10y ago

Lol, the biggest bug is developers ignoring that latin1 and Unicode encoded as UTF-8 can coexist in the same stream of data:

- HTTP 1.1 headers are ISO-8859-1 (CERN legacy) while the content can be UTF-8
- SIP, being based on the HTTP RFC, has the same flaw.

The CTO of my last VoIP company is still wondering why some caller IDs break his nice Python program, which assumes everything is UTF-8, and he still does not understand this...

Yes, the encoding can change; I also saw it in logs while using regionalisation with C# .NET.
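One defensive way to handle a field that may arrive in either encoding is to try UTF-8 first and fall back to ISO-8859-1. This is a minimal sketch, not a description of any particular SIP or HTTP stack; `decode_field` is a hypothetical helper name.

```python
def decode_field(raw: bytes) -> str:
    """Decode a header-like field that may be UTF-8 or ISO-8859-1.

    Hypothetical helper: strict UTF-8 first, then ISO-8859-1, which
    accepts every byte value and therefore never raises.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


print(decode_field("José".encode("utf-8")))
print(decode_field("José".encode("iso-8859-1")))
```

The order matters: ISO-8859-1 text containing non-ASCII characters is rarely valid UTF-8, so the strict attempt fails cleanly, whereas trying ISO-8859-1 first would silently turn UTF-8 input into mojibake.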

guelo · 10y ago

According to newer HTTP specs clients should ignore weird ISO-8859 characters. https://tools.ietf.org/html/rfc7230#section-3.2.4:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

Though I guess you'd still need to decode it correctly in order to ignore the right characters.
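In Python, one way to carry such octets as opaque data without picking an encoding for them is the `surrogateescape` error handler, which round-trips undecodable bytes exactly. A minimal sketch follows; the helper names are hypothetical and this is only one possible reading of "opaque".

```python
def decode_opaque(raw: bytes) -> str:
    """Decode ASCII, smuggling any other octet through as a lone surrogate."""
    return raw.decode("ascii", errors="surrogateescape")


def encode_opaque(text: str) -> bytes:
    """Inverse of decode_opaque: restores the original octets exactly."""
    return text.encode("ascii", errors="surrogateescape")


raw = b"Jos\xe9"                    # ISO-8859-1 octet in a field value
value = decode_opaque(raw)
print(value.isascii())              # tells us obs-text is present
print(encode_opaque(value) == raw)  # the bytes survive untouched
```

The ASCII part of the value can be parsed as text, while the remaining octets are passed along unchanged instead of being interpreted.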

AnthonyMouse · 10y ago

IETF should just publish an RFC that says "all text without a field specifying its encoding shall be UTF-8, even if this conflicts with a previous RFC." The only real objection to doing this is that it would break things, but almost all of those things are already broken.

julie1 · 10y ago

Well, there are still situations where coders mix different codesets / encodings, willingly or not. And SIP still exists.

arm · 10y ago

Currently down. Here’s a snapshot from January:

http://archive.is/t2tB3

alien3d · 10y ago

Is utf8mb4 affected as well? I have been converting my tables from utf8 unicode to utf8mb4 unicode to support emoticons.
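For context on why the conversion is needed: MySQL's `utf8` character set stores at most three bytes per character, while emoji take four bytes in UTF-8. A small sketch for checking whether a string needs the wider character set; `needs_utf8mb4` is a hypothetical helper name.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane,
    i.e. takes four bytes in UTF-8 and will not fit a 3-byte utf8 column."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(len("é".encode("utf-8")), needs_utf8mb4("é"))
print(len("😀".encode("utf-8")), needs_utf8mb4("😀"))
```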
