0 points · eru · 10mo ago · 0 comments

Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.
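To make the defaults concrete, here is a minimal sketch using only built-in `str` and `bytes` methods. The sample text is an assumption chosen for illustration. Note that `str.encode()` and `bytes.decode()` default to UTF-8, whereas `open()` without an `encoding` argument falls back to a locale-dependent default.

```python
text = "naïve café"

utf8_bytes = text.encode()             # UTF-8 is the default for str.encode()
latin1_bytes = text.encode("latin-1")  # another encoding, named explicitly

# Decoding with the matching codec gives back the original text.
round_trip_utf8 = utf8_bytes.decode()
round_trip_latin1 = latin1_bytes.decode("latin-1")

# 10 code points; "ï" and "é" take two bytes each in UTF-8, one in Latin-1.
print(len(text), len(utf8_bytes), len(latin1_bytes))  # 10 12 10
```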

13 comments · 2 top-level

xigoi · 10mo ago · 8 in thread

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

zahlman · 10mo ago

Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.

But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
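A minimal sketch of what "obtaining the bytes" can look like in Python 3, assuming UTF-8 as the chosen encoding and an illustrative sample string:

```python
text = "héllo"

raw = text.encode("utf-8")  # immutable bytes object
buf = bytearray(raw)        # mutable copy of the same bytes

first_byte = raw[0]         # indexing bytes yields an int, not a 1-char string
buf[0:1] = b"H"             # byte-level edit, no decoding involved

restored = buf.decode("utf-8")  # interpret the bytes as text again
```

The choice of encoding is made by the caller at the `encode`/`decode` boundary; the byte buffer itself carries no encoding information.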

capitainenemo · 10mo ago

My personal experience with python3 (and repeated interactions with about a dozen Python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions: a tiny python2 script that I have used for decades to convert character streams in terminals has repeatedly proved unportable to python3. The last attempt was much larger and still failed; they thought they could probably do it, but it would require far more code and was not worth their effort.

I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...
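For readers wondering what byte-stream transcoding looks like in Python 3, here is a rough sketch built on the standard `codecs` incremental decoders and encoders. It is not the script described above, whose requirements are not given in the thread; the function name `transcode`, the chunk size, and the `errors="replace"` policy are all assumptions made for illustration.

```python
import codecs
import io


def transcode(src, dst, from_enc, to_enc, chunk_size=4096):
    """Copy bytes from src to dst, converting from_enc to to_enc (hypothetical helper)."""
    # Incremental codecs keep state between chunks, so a multi-byte
    # sequence split across two reads is still decoded correctly.
    decoder = codecs.getincrementaldecoder(from_enc)(errors="replace")
    encoder = codecs.getincrementalencoder(to_enc)(errors="replace")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush any state still buffered in the decoder and encoder.
    dst.write(encoder.encode(decoder.decode(b"", final=True), final=True))


# Usage with in-memory streams; sys.stdin.buffer and sys.stdout.buffer
# are the binary streams one would pass for a terminal filter.
source = io.BytesIO("café".encode("utf-8"))
sink = io.BytesIO()
transcode(source, sink, "utf-8", "latin-1", chunk_size=1)
```

Reading one byte at a time in the usage example is deliberate: it forces the two-byte UTF-8 sequence for "é" to be split across reads.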

afiori · 10mo ago

I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then UTF-8 encoding/decoding becomes a no-op, and UTF-8-specific APIs can quickly check whether the string is malformed.
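A simplified sketch of that idea, with a hypothetical class name `Utf8Bag`. It caches a validity flag and drops it on every edit, but it revalidates the whole buffer on demand rather than only the roughly n+8 bytes around an edit, so it shows the interface and not the optimization.

```python
class Utf8Bag:
    """Byte buffer that remembers whether its contents are valid UTF-8 (hypothetical)."""

    def __init__(self, data=b""):
        self._buf = bytearray(data)
        self._valid = None  # None means "not checked since the last edit"

    def splice(self, start, end, new_bytes):
        """Replace the bytes in [start, end) and forget the cached validity."""
        self._buf[start:end] = new_bytes
        self._valid = None

    @property
    def is_valid_utf8(self):
        # Simplification: a full-buffer check instead of a local window.
        if self._valid is None:
            try:
                self._buf.decode("utf-8")
                self._valid = True
            except UnicodeDecodeError:
                self._valid = False
        return self._valid

    def __bytes__(self):
        return bytes(self._buf)


bag = Utf8Bag("héllo".encode("utf-8"))  # b"h\xc3\xa9llo"
valid_before = bag.is_valid_utf8
bag.splice(1, 2, b"")                   # drop the lead byte of "é"
valid_after_cut = bag.is_valid_utf8
bag.splice(1, 2, b"e")                  # replace the stray continuation byte
valid_after_fix = bag.is_valid_utf8
```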

account42 · 10mo ago

But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.
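In Python 3, one way to get this pass-through behaviour while still working with `str` is the `surrogateescape` error handler, sketched below with an illustrative byte string. Invalid bytes are mapped to lone surrogates on decode and restored unchanged on encode.

```python
raw = b"ok \xff\xfe bad"  # contains two bytes that are not valid UTF-8

# Decode without raising: each invalid byte becomes a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string processing leaves the escaped bytes alone.
shouted = text.upper()

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
shouted_bytes = shouted.encode("utf-8", errors="surrogateescape")
```

The trade-off is that such a string cannot be encoded with the default `strict` handler, so the handler has to be applied consistently at every I/O boundary.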

bawolff · 10mo ago

Me too.

The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. That is most of them, but it seems like the worst of both worlds.

Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
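A short illustration of the normalization problem, using the standard `unicodedata` module. The two spellings of "é" are an assumed example: both are valid Unicode and render identically, yet compare unequal until normalized.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

equal_raw = composed == decomposed  # False: different code point sequences

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
equal_nfc = nfc_a == nfc_b          # True after normalizing both to NFC
```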

jibal · 10mo ago

Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.

account42 · 10mo ago

Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.

astrange · 10mo ago

C strings are not bags of bytes because they can't contain 0x00.
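The NUL-termination point can be seen from Python through the standard `ctypes` module. This is a small sketch with an assumed payload: the buffer holds every byte, but reading it as a C string stops at the first `0x00`.

```python
import ctypes

payload = b"ab\x00cd"

# create_string_buffer copies the payload and appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

as_c_string = buf.value  # C-string view: ends at the first NUL byte
as_raw_bytes = buf.raw   # full buffer contents, embedded NUL included
```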

ynik · 10mo ago · 3 in thread

Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
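To check which defaults a particular interpreter ended up with, something like the following sketch can help. The function name `encoding_report` is hypothetical; the calls are from the standard `sys` and `locale` modules, and the values they return vary by platform and settings.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings this interpreter uses by default (hypothetical helper)."""
    return {
        # Default for str.encode() / bytes.decode(); always "utf-8" in Python 3.
        "str_default": sys.getdefaultencoding(),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # What open() falls back to when no encoding argument is given.
        "locale_preferred": locale.getpreferredencoding(False),
        # True when UTF-8 mode is on (e.g. via -X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


report = encoding_report()
```

Passing `encoding="utf-8"` explicitly to `open()` sidesteps the locale-dependent default altogether.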

arcticbull · 10mo ago

Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.

It would be pretty silly for them to explode all strings to 4-byte characters.
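The variable-width storage can be observed indirectly on CPython with `sys.getsizeof`. This is a rough sketch that assumes CPython's flexible string representation; other implementations may behave differently or not support `getsizeof` at all. The helper name `bytes_per_char` is hypothetical.

```python
import sys


def bytes_per_char(ch):
    """Estimate the storage used per code point for strings made of ch (CPython only)."""
    # Comparing two lengths cancels out the fixed object header.
    short = sys.getsizeof(ch * 2)
    long = sys.getsizeof(ch * 1002)
    return (long - short) // 1000


widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\u00e9"),     # é
    "bmp": bytes_per_char("\u20ac"),        # €
    "astral": bytes_per_char("\U0001F600")  # outside the BMP
}
```

On a typical CPython build this is expected to report 1, 1, 2 and 4 bytes per code point respectively.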

jibal · 10mo ago

You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len("🤦🏼‍♂️") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
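The different "lengths" under discussion can be reproduced with the standard library alone. The string below is an assumption: a facepalm emoji sequence written with escapes so that its five code points are explicit. Counting grapheme clusters needs a third-party module such as those named in the comment, so it is left out of this sketch.

```python
# Face palm + skin tone modifier + zero-width joiner + male sign + variation selector.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(s)                           # what Python's len() counts
utf16_units = len(s.encode("utf-16-le")) // 2  # what a UTF-16 .length counts
utf8_bytes = len(s.encode("utf-8"))            # what a byte-oriented length counts

third = s[2]  # direct index into the code points: the zero-width joiner

print(code_points, utf16_units, utf8_bytes)  # 5 7 17
```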

account42 · 10mo ago

It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.
