undefined | Better HN

0 pointsafiori6y ago0 comments

some would say that the distinction was in the wrong places, like assuming that the command line arguments or file paths were utf8

0 comments

5 comments · 2 top-level

ynik6y ago· 3 in thread

Fundamentally, the "right place" here differs between Windows and Linux. On Windows, command line arguments really are unicode (UTF-16 actually). On Linux, they're just bytes. In Python 2, on Linux you got the bytes as-is; but on Windows you got the command line arguments converted to the system codepage. Note that the Windows system codepage generally isn't a Unicode encoding, so there was unavoidable data loss even before the first line of your code started running (AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2). However, on Linux, Python 2 was just fine.

Now with Python 3 it's the other way around -- Windows is fine but Linux has issues. However, the problems for linux are less severe: often you can get away with assuming that everything is UTF-8. And you can still work with bytes if you absolutely need to.

pdonis6y ago

> On Windows, command line arguments really are unicode (UTF-16 actually)

No, they're not. Windows can't magically send your program Unicode. It sends your program strings of bytes, which your program interprets as Unicode with the UTF-16 encoding. The actual raw data your program is being sent by Windows is still strings of bytes.

> you can still work with bytes if you absolutely need to

In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes, or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

> AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2)

That's because none was needed. You got strings of bytes and you could decode them to whatever you wanted, if you knew the encoding and wanted to work with them as Unicode. That's exactly what a language/library should do when it can't rely on a particular encoding or on detecting the encoding--work with the lowest common denominator, which is strings of bytes.

takeda6y ago

> In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes,

Actually you can, you should use sys.std{in,out,err}.buffer, which will be binary[1]

> or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

I'm assuming you're talking about scenario where LANG/LC_* was not defined, then Python assumed us-ascii encoding. I think in 3.7 they changed default to UTF-8.

[1] https://docs.python.org/3/library/sys.html#sys.stdin

1 more reply

takeda6y ago

Linux had issues whenever LANG/LC_* variables weren't defined, python assumed us-ascii, I believe that was changed recently to just assume utf-8.

int_19h6y ago

Python 3 does not make such assumptions; it uses the appropriate locale encoding.

(IIRC it used to do that in 3.0, but they backtracked very quickly - and 3.0 was effectively treated as a beta by the community at large, anyway.)

j / k navigate · click thread line to collapse

0 comments

5 comments · 2 top-level

ynik6y ago· 3 in thread

pdonis6y ago

> On Windows, command line arguments really are unicode (UTF-16 actually)

> you can still work with bytes if you absolutely need to

> AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2)

takeda6y ago

> In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes,

Actually you can, you should use sys.std{in,out,err}.buffer, which will be binary[1]

> or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

I'm assuming you're talking about scenario where LANG/LC_* was not defined, then Python assumed us-ascii encoding. I think in 3.7 they changed default to UTF-8.

[1] https://docs.python.org/3/library/sys.html#sys.stdin

1 more reply

takeda6y ago

Linux had issues whenever LANG/LC_* variables weren't defined, python assumed us-ascii, I believe that was changed recently to just assume utf-8.

int_19h6y ago

Python 3 does not make such assumptions; it uses the appropriate locale encoding.

(IIRC it used to do that in 3.0, but they backtracked very quickly - and 3.0 was effectively treated as a beta by the community at large, anyway.)

j / k navigate · click thread line to collapse