mannykannot · 5y ago

It is not clear to me whether there is a material difference here. Any text string is a sequence of bytes for which some interpretation is intended, and many meaningful operations on those bytes will not be meaningful unless that interpretation is taken into account.
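As a small illustration of that point (a sketch, not anything from the original discussion), the Python snippet below reads the same two bytes under two assumed encodings and compares upper-casing the raw bytes with upper-casing the decoded text. The variable names are made up for the example.

```python
# The same two bytes, read under two different assumed encodings.
raw = "\u00e9".encode("utf-8")      # the letter e-acute, encoded as b'\xc3\xa9'

as_utf8 = raw.decode("utf-8")       # one character
as_latin1 = raw.decode("latin-1")   # two characters: A-tilde, copyright sign

# bytes.upper() only touches ASCII letters, so the raw bytes are unchanged;
# str.upper() works on the decoded text and yields the capital letter.
bytes_upper = raw.upper()
text_upper = as_utf8.upper()

print(len(raw), len(as_utf8), len(as_latin1))
print(bytes_upper == raw, text_upper)
```

The byte-level operation is well defined, but it only means something for text once an encoding has been assumed.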

The problem that you have raised here seems to be one of what alphabet or language is being used, but that issue cannot even arise without taking the interpretation into account. If you want alphabet-aware, language-aware, spelling-aware or grammar-aware operators, these will all have to be layered on top of merely byte-aware operations, and this cannot be done without taking into account the intended interpretation of the byte sequence.

Note that it is not unusual to embed strings of one language within strings written in another. I do not suppose it would be surprising to see some French in a Russian-language War and Peace.

3 comments · 1 top-level

naniwaduni · 5y ago

This implies that you should have types for every intended use of a text string. This is, in fact, a sensible approach, reasonably popular in languages with GADTs, even if a bit cumbersome to apply universally.
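A rough sketch of the "one type per intended use" idea, using Python's `typing.NewType` as a much weaker stand-in for what GADTs allow. The names `Username`, `SqlFragment` and `greet` are hypothetical and exist only for this example.

```python
from typing import NewType

# Distinct static types for two different uses of the same underlying str.
Username = NewType("Username", str)
SqlFragment = NewType("SqlFragment", str)


def greet(user: Username) -> str:
    """Accepts only values that were explicitly marked as usernames."""
    return f"Hello, {user}"


name = Username("alice")
query = SqlFragment("SELECT 1")

greeting = greet(name)
# greet(query) would be flagged by a static checker such as mypy,
# although it would still run: NewType has no effect at runtime.
print(greeting)
```

The cumbersome part is visible even here: every boundary where a plain string becomes a `Username` needs an explicit wrapping call.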

A type to specify encoding alone? Totally useless. You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c.

mannykannot (OP) · 5y ago

To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizarre and tendentious.

I see you have been editing your post concurrently with my reply:

> You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c.

Of course you can (though maybe not "just as well"), but that does not mean it is the best way to do so, and certainly not that it is "totally useless" to implement the decoding as a separate step. Separation of concerns is a key aspect of software engineering.

naniwaduni · 5y ago

> To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizarre and tendentious.

Codepoints are not glyphs. Nor are any useful operations generally performed on glyphs in the first place. Almost all interpretable operations you might want to do are better conceived of as operating on substrings of arbitrary length, rather than on glyphs, and byte substrings do this better than Unicode codepoint sequences anyway.
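To make the distinction concrete, here is a minimal Python sketch (the sample strings and variable names are invented for the example). It shows one visible character that occupies two codepoints, and a substring search carried out directly on UTF-8 bytes.

```python
import unicodedata

# One visible character, written as a base letter plus a combining accent.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single codepoint U+00E9

codepoints_decomposed = len(decomposed)        # 2 codepoints
codepoints_composed = len(composed)            # 1 codepoint
utf8_length = len(decomposed.encode("utf-8"))  # 3 bytes

# Substring search on the encoded bytes. UTF-8 is self-synchronizing, so a
# match of a validly encoded needle always falls on character boundaries.
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
byte_offset = haystack.find(needle)
matched = haystack[byte_offset:byte_offset + len(needle)].decode("utf-8")

print(codepoints_decomposed, codepoints_composed, utf8_length)
print(byte_offset, matched)
```

Note that the search only finds the needle if both strings use the same normalization form, which is an assumption about the text that neither the byte view nor the codepoint view checks for you.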

So I contest the position that interpreting bytes as a glyph sequence is a viable step at all.
