Strings, bytes, runes and characters in Go

(blog.golang.org)

76 points · kisielk · 12y ago · 35 comments


SigmundA · 12y ago · 9 in thread

Coming from C#, it seems odd that a string would index on byte and not char (rune), and that it is essentially a read-only byte array. If you wanted a byte array, why wouldn't you use a byte array? Why have both strings and byte arrays?

In C# you can encode/decode strings to byte arrays based on your desired encoding, but a string is composed of characters; its in-memory representation is abstracted.

Is this a performance or zero copy thing? Not having to encode/decode to get to the bytes?

bazzargh · 12y ago

As someone's already pointed out, C# strings are composed of UTF-16 code units, not characters. This means that a character outside the Basic Multilingual Plane is represented as two code units (a surrogate pair), and the character count in the C# string will be wrong (the same is true of Java and JS, for example).

That's a hard problem, and avoiding it in every situation would require scanning the strings for surrogates beforehand, when you might never need to know that information. Go makes it explicit that knowing the exact character position and string length in characters comes at a cost.

There's a good discussion of this on Tim Bray's blog: http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
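
The surrogate-pair behaviour described above can be reproduced from Go with the standard unicode/utf16 package. A minimal sketch; the helper name utf16Units is made up for illustration:

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Units returns how many UTF-16 code units are needed to encode s.
// (Hypothetical helper, not part of any library.)
func utf16Units(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	s := "😃" // U+1F603, outside the BMP
	fmt.Println(utf16Units(s))             // 2: a surrogate pair
	fmt.Println(utf8.RuneCountInString(s)) // 1: a single code point
}
```

A language that reports string length in UTF-16 code units would report 2 for this one-character string.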

twotwotwo · 12y ago

Just for fun, here's Go handling a char outside the BMP (😃, U+1F603):

http://play.golang.org/p/qg7POYAAOL
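
In case the playground link no longer resolves, here is a sketch along the same lines. It is not necessarily the linked code, and the helper describe is invented for this example:

```go
package main

import "fmt"

// describe returns the byte length of s and the runes it decodes to.
// (Hypothetical helper for illustration.)
func describe(s string) (int, []rune) {
	return len(s), []rune(s)
}

func main() {
	n, runes := describe("😃")
	fmt.Println(n)          // 4: bytes of UTF-8
	fmt.Println(len(runes)) // 1: one rune
	for i, r := range "a😃b" {
		fmt.Printf("%d: %U\n", i, r) // byte offsets 0, 1 and 5
	}
}
```

The range loop decodes one rune per iteration but reports byte offsets, which is why the last offset is 5 rather than 2.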

1 more reply

neild · 12y ago

Indexing on runes is expensive, since either you need to store strings as arrays of runes or iterate on each index operation. Indexing on runes is also less useful than it may first seem. Consider that a rune is distinct from a glyph--a single "character" on the screen may be composed of several runes. The occasions when you care about specific runes as opposed to substrings are uncommon. When dealing with substrings, there is no advantage to substring-of-runes as opposed to substring-of-bytes.

Note that many languages that appear to offer indexing by rune (e.g., Java) do not in fact do so, since their 16-bit "character" type is incapable of representing all runes. The fact that this is only rarely an issue points at the fundamental rarity with which code needs to deal with runes-qua-runes.
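
The two options mentioned above (iterate on each index operation, or store an array of runes) might look roughly like this in Go. The function runeAt is a made-up name for this sketch, not a standard library call:

```go
package main

import "fmt"

// runeAt returns the n-th rune (0-based) of s by scanning from the start.
// It is O(n) because UTF-8 runes have variable width.
// (Hypothetical helper for illustration.)
func runeAt(s string, n int) (rune, bool) {
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	s := "h\u00e9llo" // "héllo" with a precomposed é

	r, _ := runeAt(s, 1)
	fmt.Printf("%c\n", r) // é, found by scanning

	runes := []rune(s)           // O(n) copy, 4 bytes per rune
	fmt.Printf("%c\n", runes[1]) // é, constant-time after the copy

	fmt.Println(s[1]) // 195: the first byte of é's UTF-8 encoding
}
```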

porges · 12y ago

Note that in C# indexing on char is really indexing on UTF-16 code units, which is arguably worse because it seems to work, until it doesn't.

SigmundA · 12y ago

Interesting, I've never had to use StringInfo to get into surrogate pair support. It seems like this could be fixed, since both char and string are abstracted from bytes, although indexing on variable-length chars would increase computation, and UTF-32 would increase memory.

skybrian · 12y ago

Well, the first question is what a string's internal representation should be, and they went with UTF8. Once you've decided on UTF8, the question is whether to hide the representation or expose it. If you decide to expose it, there's hardly any difference between an immutable byte array and string, so it's simpler in a way to have one type that can be used both ways.

dsymonds · 12y ago

No, read the article more carefully. A Go string's internal representation is a sequence of bytes. There's nothing UTF-8 about it.
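
A quick way to see this: a Go string can hold bytes that are not valid UTF-8 at all. A small sketch, where replacementCount is a name invented for this example:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replacementCount reports how many times ranging over s yields U+FFFD.
// Note that a string which genuinely contains U+FFFD is counted too.
// (Hypothetical helper for illustration.)
func replacementCount(s string) int {
	n := 0
	for _, r := range s {
		if r == utf8.RuneError {
			n++
		}
	}
	return n
}

func main() {
	s := "\xff\xfeA" // two bytes that are not valid UTF-8, then 'A'
	fmt.Println(len(s))              // 3
	fmt.Println(utf8.ValidString(s)) // false
	fmt.Println(replacementCount(s)) // 2: each bad byte decodes to U+FFFD
}
```

UTF-8 only enters the picture when something interprets the bytes, such as a range loop or a []rune conversion.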

2 more replies

dsymonds · 12y ago

The string type is immutable, which is the main reason for using it instead of []byte.
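
To illustrate the immutability point: changing a byte means converting to []byte, which copies, so the original string is untouched. A sketch with an invented helper, upperFirst:

```go
package main

import "fmt"

// upperFirst returns a copy of s whose first byte is upper-cased
// if it is an ASCII lower-case letter.
// (Hypothetical helper for illustration.)
func upperFirst(s string) string {
	b := []byte(s) // the conversion copies the bytes
	if len(b) > 0 && b[0] >= 'a' && b[0] <= 'z' {
		b[0] -= 'a' - 'A'
	}
	return string(b)
}

func main() {
	s := "gopher"
	// s[0] = 'G' would not compile: strings are immutable.
	t := upperFirst(s)
	fmt.Println(s, t) // gopher Gopher
}
```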

SigmundA · 12y ago

Same in C#, but why not have an immutable byte array for bytes and a string for char/runes? Honestly never needed an immutable byte array that wasn't for chars.

2 more replies

mseepgood · 12y ago · 5 in thread

Some people here seem to think that indexing, measuring and slicing operations based on runes (code points) instead of bytes (UTF-8 code units) by default would be a good idea. It's not. You get the worst of both worlds: indexing is not a constant-time operation, and a code point is still not a user-perceived character, because combining character sequences consist of multiple code points. Even normalization doesn't help in general.

Other languages like C# seem to be different on the surface, but in fact they index and measure by code units as well (2 byte UTF-16 code units), not by code points.
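
The gap between code points and user-perceived characters shows up with a single accented letter. A minimal sketch, using a helper named counts that exists only in this example:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and the number of code points in s.
// (Hypothetical helper for illustration.)
func counts(s string) (nb, nr int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as one code point
	combining := "e\u0301"  // e followed by COMBINING ACUTE ACCENT

	fmt.Println(counts(precomposed))      // 2 1
	fmt.Println(counts(combining))        // 3 2
	fmt.Println(precomposed == combining) // false
}
```

Both strings render as one character, yet they differ in byte length, rune count and equality.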

masklinn · 12y ago

> It's not - you get the worst of both worlds: indexing is not a constant-time operation

You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway due to combining codepoints which may not have precombined forms (if only because there is no defined limit to the number of combining codepoints tacked onto the base) (so normalization will not save you) or codepoints which are not visible to the user and which you may or may not want to see depending on the work you're doing.

People really need to come to terms with the fact that a Unicode stream is exactly that: a stream.

iv_08 · 12y ago

> You can't usefully index a unicode stream in constant time and do correct and useful textual stuff anyway

To find an index of a substring you need to scan the string, right. But once you have the byte index you can quickly jump to its position in the string, e.g. when you do a slice operation based on that index: s[i:]. If strings.Index() returned a code point index and not a byte index you would have to scan the string again.
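
A sketch of that pattern, with a hypothetical helper suffixFrom built on strings.Index:

```go
package main

import (
	"fmt"
	"strings"
)

// suffixFrom returns the part of s starting at the first occurrence of sep,
// or "" if sep does not occur.
// (Hypothetical helper for illustration.)
func suffixFrom(s, sep string) string {
	i := strings.Index(s, sep) // byte offset, or -1
	if i < 0 {
		return ""
	}
	return s[i:] // constant-time slice, no second scan
}

func main() {
	s := "na\u00efve caf\u00e9" // "naïve café"
	fmt.Println(strings.Index(s, "caf")) // 7, not 6: ï takes two bytes
	fmt.Println(suffixFrom(s, "caf"))    // café
}
```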

1 more reply

derefr · 12y ago

> and a code point is still not a user-perceived character

How about indexing, measuring and slicing operations based on user-perceived characters, then?

iv_08 · 12y ago

I think the number of displayed characters is even font-dependent.

codeka · 12y ago

Because even that is not exactly trivial, particularly for non-Latin languages.

lmm · 12y ago · 2 in thread

No distinction between string and byte array? I foresee all the fun of python 3 in go's future. Those of us programming in the real world need to deal with legacy character sets in strings obtained from elsewhere, and it's no fun at all to discover that what you thought was a string is actually an array of SJIS or iso8859-1 bytes.

mediocregopher · 12y ago

It sounds like you've decided to dislike go without ever having actually used it for anything in the "real world".

There is a difference between a string and a byte array. A string is a string. A byte array is a []byte (byte slice). You have to explicitly cast from one to another. Neither are inherently utf8. A string is represented by a byte array under-the-hood, and string literals in your code are read as utf8 encoded. Strings themselves are not necessarily utf8 encoded, and if you need to use a different encoding there's libraries for that (unless you're using something really esoteric).
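
As an illustration of the explicit conversion and of handling a legacy encoding, here is a standard-library-only sketch for ISO-8859-1. The function latin1ToUTF8 is invented for this example; for other encodings you would more likely reach for an external package such as golang.org/x/text/encoding.

```go
package main

import "fmt"

// latin1ToUTF8 decodes ISO-8859-1 bytes into a UTF-8 encoded Go string.
// Each Latin-1 byte value equals the Unicode code point it stands for.
// (Hypothetical helper for illustration.)
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	raw := []byte{'c', 'a', 'f', 0xE9} // "café" in ISO-8859-1

	asIs := string(raw)    // explicit conversion; the bytes are kept unchanged
	fmt.Println(len(asIs)) // 4: still Latin-1 bytes, not valid UTF-8

	decoded := latin1ToUTF8(raw)
	fmt.Println(decoded, len(decoded)) // café 5
}
```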

lmm · 12y ago

So what do you get when you read a file, or when a file is uploaded to your web server? What happens if you write a function that accepts a string as a parameter, but haven't noticed that you're implicitly assuming the string is utf8? (e.g. a function that formats one string using another - if the encodings are different you'll end up with a string that's invalid for either encoding, no?)

The distinction between a string with one encoding and a string with another is subtle but vitally important - exactly the sort of thing a type system should take care of.
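
Go's string type does not track encoding, but a program can opt into that kind of checking with its own type. A rough sketch under that assumption; UTF8String, NewUTF8String and Join are all hypothetical names, not anything from the standard library:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// UTF8String is a hypothetical type whose values are known to be valid UTF-8.
type UTF8String struct{ s string }

// NewUTF8String validates s once, at the boundary where data enters.
func NewUTF8String(s string) (UTF8String, error) {
	if !utf8.ValidString(s) {
		return UTF8String{}, errors.New("not valid UTF-8")
	}
	return UTF8String{s}, nil
}

// String returns the underlying text.
func (u UTF8String) String() string { return u.s }

// Join only accepts validated values, so a plain string of unknown
// encoding cannot be passed in by accident.
func Join(a, b UTF8String) UTF8String { return UTF8String{a.s + b.s} }

func main() {
	a, _ := NewUTF8String("caf\u00e9")
	_, err := NewUTF8String("caf\xe9") // Latin-1 bytes, rejected
	fmt.Println(Join(a, a), err)
}
```

This only separates "validated UTF-8" from "unknown"; distinguishing several legacy encodings would need one such type per encoding.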

1 more reply

pygy_ · 12y ago · 2 in thread

Julia uses the same strategy, but indexing a UTF-8 string returns the rune rather than the byte. If you try to get a byte in the middle of a rune representation, it raises an error.

The `next(string, index)` function used for the iteration protocol works like the `utf8.DecodeRuneInString()` shown in the example, but it returns the next valid index rather than the character width.
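
For comparison, the same iteration style can be sketched in Go on top of `utf8.DecodeRuneInString()`. The wrapper `next` below is a made-up helper that mimics the Julia convention of returning the next index instead of the width:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// next decodes the rune starting at byte index i of s and returns it
// together with the byte index of the following rune.
// (Hypothetical helper for illustration.)
func next(s string, i int) (rune, int) {
	r, width := utf8.DecodeRuneInString(s[i:])
	return r, i + width
}

func main() {
	s := "a\u00e9😃"
	for i := 0; i < len(s); {
		var r rune
		r, i = next(s, i)
		fmt.Printf("%c next=%d\n", r, i) // next is 1, then 3, then 7
	}
}
```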

dbaupp · 12y ago

Rust has a similar approach (in that it raises an error when you attempt to do something not on a rune boundary), although `string[index]` still returns a byte rather than a character; strong static typing means that it isn't a huge problem.

elithrar · 12y ago

Go also has utf8.RuneCount([]byte) and utf8.RuneCountInString(string), which return the number of runes: http://golang.org/pkg/unicode/utf8/#RuneCount

(check the docs, there's also RuneStart, which reports whether a byte could be the first byte of an encoded rune, so it returns false if you index mid-rune)
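
A short sketch of those three functions together; onRuneBoundary is a hypothetical wrapper added here for bounds checking:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// onRuneBoundary reports whether byte index i of s starts an encoded rune.
// (Hypothetical helper for illustration.)
func onRuneBoundary(s string, i int) bool {
	return i >= 0 && i < len(s) && utf8.RuneStart(s[i])
}

func main() {
	s := "h\u00e9llo" // "héllo"
	fmt.Println(len(s))                    // 6 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5 runes
	fmt.Println(utf8.RuneCount([]byte(s))) // 5 runes
	fmt.Println(onRuneBoundary(s, 1))      // true: first byte of é
	fmt.Println(onRuneBoundary(s, 2))      // false: continuation byte
}
```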

frou_dh · 12y ago · 2 in thread

That blog post is an example of good technical writing.

stevvooe · 12y ago

After following the development of Go for a while, I've come to idolize Rob Pike's terse, accurate communication style.

4ad · 12y ago

Check out his books too: The Unix Programming Environment and The Practice of Programming. Both co-authored by Brian Kernighan.
