takeda · 6y ago

The rule of thumb (not just for Python, but for anything that deals with encodings) is to handle encoded bytes only at the boundaries of your program (reading/writing files, sending/receiving data over the network, etc.) and to work with decoded text everywhere in between. It applies to everything, including tools like this. If you follow it, your life will be simpler.

You just need to be aware that in some cases the language already does the work for you. For example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.
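
A minimal sketch of that rule, assuming UTF-8 at the boundary; `shout` and `handle_request` are hypothetical names used only for illustration. Bytes are decoded once on the way in, the logic works on `str`, and the result is encoded once on the way out.

```python
def shout(text: str) -> str:
    """Core logic: operates on decoded text only, never on bytes."""
    return text.upper()


def handle_request(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Boundary layer: decode on the way in, encode on the way out."""
    text = raw.decode(encoding)       # bytes -> str at the input boundary
    result = shout(text)              # everything in between is str
    return result.encode(encoding)    # str -> bytes at the output boundary


# The caller only ever hands over and receives bytes.
reply = handle_request("héllo wörld".encode("utf-8"))
```

Keeping the encoding as a parameter of the boundary function, rather than of the core logic, is the point of the design: the core never needs to know which encoding was on the wire.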

15 comments · 2 top-level

slavik81 · 6y ago · 9 in thread

> in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it

Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.

I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.
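
A small sketch of how that kind of crash comes about, with `can_encode` as a hypothetical helper. CP-1252 does cover some non-ASCII characters (é, for instance), so the failure only shows up once a character outside the code page is written.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = ["plain ascii", "café", "smile 🙂"]
for encoding in ("cp1252", "utf-8", "utf-16"):
    results = {s: can_encode(s, encoding) for s in samples}
    # ascii() keeps the printout safe on consoles that cannot show emoji.
    print(encoding, ascii(results))
```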

mark-r · 6y ago

Every Python program should be tested with emoji characters; they're a real torture test.

slavik81 · 6y ago

Note that you need to test on every platform, as the default file encoding may vary. I missed that bug in part because it worked correctly on Linux.
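
One way to approximate "every platform" on a single machine is to run the same write against several encodings that stand in for different platform defaults. This is only a sketch: `roundtrips` and `SIMULATED_DEFAULTS` are made-up names, and the list of encodings is an assumption, not an inventory of real defaults.

```python
import os
import tempfile


def roundtrips(text: str, encoding: str) -> bool:
    """Write text to a temp file with the given encoding and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        try:
            with open(path, "w", encoding=encoding, newline="") as f:
                f.write(text)
        except UnicodeEncodeError:
            return False  # this simulated default cannot store the text
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read() == text


# Encodings standing in for the defaults of different platforms (assumed).
SIMULATED_DEFAULTS = ["utf-8", "cp1252", "utf-16"]

report = {enc: roundtrips("status: 🙂", enc) for enc in SIMULATED_DEFAULTS}
```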

mark-r · 6y ago

Good point. I do almost all of my Python on Windows where it's much easier to get an error.

WorldMaker · 6y ago

Every program in general should be tested with Emoji characters at this point.

mark-r · 6y ago

Not a bad idea, but I think Python is more likely to have hidden bugs that this will uncover. A language that accepts bytes as input and emits the same on output will probably work fine on UTF-8 for example.
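
The bytes-in, bytes-out case can be sketched in Python as well, by never decoding the payload. `number_lines` is a hypothetical example; it relies on the fact that UTF-8 never reuses ASCII byte values (such as the newline byte) inside a multi-byte sequence.

```python
def number_lines(raw: bytes) -> bytes:
    """Prefix each line with its number without ever decoding the payload."""
    lines = raw.split(b"\n")
    numbered = [
        str(i).encode("ascii") + b": " + line
        for i, line in enumerate(lines, start=1)
    ]
    return b"\n".join(numbered)


payload = "first 🙂\nsecond é".encode("utf-8")
out = number_lines(payload)  # still valid UTF-8, emoji untouched
```

Code like this cannot raise an encoding error, but it also cannot do anything that needs to know where one character ends and the next begins.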

takeda (OP) · 6y ago

It defaults to the system encoding. I don't use Python on Windows, but Windows evolved its default encoding over time: code pages were popular in Windows 9x; starting with the NT-based versions (2000, XP...) they used UTF-16, I believe; and then with Windows 7(?) it became UTF-8. Perhaps Python needs to be updated to reflect that?

You can also specify encoding when calling open.

Dylan16807 · 6y ago

> Windows evolved its default encoding over time: code pages were popular in Windows 9x; starting with the NT-based versions (2000, XP...) they used UTF-16, I believe; and then with Windows 7(?) it became UTF-8.

They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.

They put together a code page for UTF-8 but it's behind a 'beta' warning.

ygra · 6y ago

> They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.

> They put together a code page for UTF-8 but it's behind a 'beta' warning.

Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character, and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does sort of work for applications that don't do anything too weird with text, though, and in such cases it can be a useful workaround for getting UTF-8 support into legacy applications.

But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.
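
To make the variable-width point concrete, here is a quick sketch that measures how many bytes one character takes in UTF-8 and in UTF-16LE; `widths` is a hypothetical helper. Note that UTF-16 is not fixed-width either once a character falls outside the Basic Multilingual Plane.

```python
def widths(ch: str) -> dict:
    """Bytes needed for one character in UTF-8 and in UTF-16LE."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-le": len(ch.encode("utf-16-le")),
    }


# ascii() keys keep the table printable on any console.
table = {ascii(ch): widths(ch) for ch in ("A", "é", "€", "🙂")}
for name, sizes in table.items():
    print(name, sizes)
```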

slavik81 · 6y ago

Python 3.6 changed the Windows filesystem and console encodings to UTF-8 (PEP 529 and PEP 528), but the default encoding for text-mode `open` still follows the locale, so that particular problem on that particular platform has not gone away.

It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.
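
As a rough illustration of what "error-handling strategies" can mean in Python, the sketch below encodes the same text with each of the standard `errors=` handlers. `try_all` and `STRATEGIES` are names made up for this example.

```python
STRATEGIES = (
    "strict",
    "replace",
    "ignore",
    "backslashreplace",
    "xmlcharrefreplace",
)


def try_all(text: str, encoding: str) -> dict:
    """Encode text once per error handler and collect what each one does."""
    results = {}
    for strategy in STRATEGIES:
        try:
            results[strategy] = text.encode(encoding, errors=strategy)
        except UnicodeEncodeError as exc:
            # "strict" raises instead of producing output.
            results[strategy] = f"UnicodeEncodeError at index {exc.start}"
    return results


outcome = try_all("hi 🙂", "ascii")
```

Which handler is right depends on the output: silently dropping characters may be acceptable in a log line and unacceptable in a data file.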

masklinn · 6y ago · 4 in thread

> You just need to be aware that in some cases the language already does the work for you. For example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.

Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least convenient possible time and on the devices least accessible for debugging.

Do not ever use text-mode `open` without specifying an encoding.
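
A sketch of the failure mode being warned about: a file written as UTF-8 and then read back with a different encoding raises nothing here and simply returns the wrong text. Here `cp1252` stands in for a mismatched locale default, and `read_with` is a hypothetical helper.

```python
import os
import tempfile


def read_with(path: str, encoding: str) -> str:
    """Read a whole text file using an explicitly chosen encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("café")

    right = read_with(path, "utf-8")    # what the writer intended
    wrong = read_with(path, "cp1252")   # no exception, just mojibake

print(ascii(right), ascii(wrong))
```

Passing `encoding=` on both the write and the read is what keeps the two sides in agreement regardless of the machine's locale.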

mintplant · 6y ago

Node.js tries to be helpful in defaulting file writes to UTF-8, but defaults file reads to returning a raw byte buffer [0]. So you have to either remember to treat the two operations differently, or, like in Python, manually specify the encoding for both.

[0] I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.

takeda (OP) · 6y ago

The locales are provided by LANG and other locale variables, so Python will use whatever is set in the environment. You can also specify the encoding in one of open()'s parameters.

masklinn · 6y ago

> The locales are provided by LANG and other locale variables

Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.

> you can also specify …

And what I’m saying is that this is not a “can also”, it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.

takeda (OP) · 6y ago

Oh, I see your point. Looks like they changed the behavior in 3.7 (they added the `-X utf8` option), but being able to set it from the application would be great.
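
For reference, a sketch of how an application can at least inspect the situation. `sys.flags.utf8_mode` reports whether UTF-8 mode is on (enabled via `-X utf8` or the `PYTHONUTF8=1` environment variable). `default_text_encoding` and `open_utf8` are hypothetical helpers; the latter is one application-level workaround, not a way to change the interpreter's default.

```python
import functools
import locale
import sys


def default_text_encoding() -> str:
    """Best guess at what text-mode open() uses when no encoding is given."""
    if sys.flags.utf8_mode:
        return "utf-8"
    return locale.getpreferredencoding(False)


# Hypothetical project-wide helper: text-mode open() pinned to UTF-8.
# It is meant for text mode only; binary mode rejects an encoding argument.
open_utf8 = functools.partial(open, encoding="utf-8")

print("UTF-8 mode:", bool(sys.flags.utf8_mode))
print("default text encoding:", default_text_encoding())
```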
