So if you want a Python library that can do this automatically with an extremely low rate of false positives: https://github.com/LuminosoInsight/python-ftfy
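For anyone curious what the simplest case of that repair looks like, here is a minimal stdlib-only sketch. It is not ftfy's implementation: `undo_latin1_mojibake` is a hypothetical helper that only handles UTF-8 text mistakenly decoded as ISO-8859-1, and it has none of the heuristics that a dedicated library uses to keep false positives low.

```python
def undo_latin1_mojibake(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes decoded as ISO-8859-1'.

    Hypothetical helper for illustration only. If the text cannot be
    re-encoded as ISO-8859-1, or the resulting bytes are not valid
    UTF-8, the input is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = "café".encode("utf-8").decode("latin-1")  # simulate the damage
    print(repr(broken), "->", repr(undo_latin1_mojibake(broken)))
```

Text that was already correct usually survives because its ISO-8859-1 bytes are rarely valid UTF-8, but that is a tendency, not a guarantee.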
- HTTP/1.1 headers are ISO-8859-1 (a CERN legacy) while the content can be UTF-8. SIP, being based on the HTTP RFC, has the same flaw.
The CTO of my last VoIP company is still wondering why some caller IDs break his nice Python program, which assumes everything is UTF-8, and he still does not understand this...
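A rough sketch of how a program could avoid crashing on such values: try strict UTF-8 first and fall back to ISO-8859-1, which maps every possible byte to a character and so never fails. `decode_header_value` is a hypothetical name, and the fallback order is an assumption of mine, not something any RFC prescribes.

```python
def decode_header_value(raw: bytes) -> str:
    """Decode a header value without raising on non-UTF-8 input.

    Hypothetical helper: strict UTF-8 first, then ISO-8859-1 as a
    fallback that accepts any byte sequence.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    for name in ("José", "Zoë"):
        for charset in ("utf-8", "iso-8859-1"):
            raw = name.encode(charset)
            print(charset, raw, "->", decode_header_value(raw))
```

This is a heuristic: an ISO-8859-1 value whose bytes happen to form valid UTF-8 will be decoded as UTF-8, so it reduces breakage rather than eliminating ambiguity.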
Yes, the encoding can change. I also saw it in logs when using regionalisation with C# .NET.
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
Though I guess you'd still need to decode it correctly in order to ignore the right characters.
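One possible way around that, at least for ISO-8859-1 and UTF-8, is to filter at the byte level before decoding anything: obs-text is defined as octets 0x80 to 0xFF, and in both of those charsets every byte of a non-ASCII character falls in that range. A minimal sketch, where `strip_obs_text` is a hypothetical helper and dropping the bytes (rather than passing them through untouched) is my assumption about what "ignore" means here:

```python
def strip_obs_text(raw: bytes) -> str:
    """Drop obs-text octets (0x80-0xFF) and return the ASCII remainder.

    Hypothetical helper. Safe for ISO-8859-1 and UTF-8 input; it is NOT
    safe for multi-byte charsets whose trailing bytes can fall in the
    ASCII range (Shift_JIS, for example).
    """
    ascii_only = bytes(b for b in raw if b < 0x80)
    return ascii_only.decode("ascii")


if __name__ == "__main__":
    for charset in ("utf-8", "iso-8859-1"):
        raw = "Café au lait".encode(charset)
        print(charset, raw, "->", repr(strip_obs_text(raw)))
```

Both encodings of the same value reduce to the same ASCII string, so no guess about the original charset is needed for this particular operation.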