
lolc · 5y ago

> In the end, the only layer at which it really matters whether your byte sequence can be decoded is the font renderer

OK, that explains how we ended up here. I'm considering some other common uses, though! A search index, for example, profits greatly from being able to normalize representations and split words.



naniwaduni · 5y ago

Search index use cases probably also benefit from normalizing inputs across encodings, so that's no shining example of UTF-8-onlyism.

You can still best-effort split words! You can do a pretty good job splitting words without ensuring that the words decode in your preferred encoding.
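A minimal sketch of what best-effort splitting on raw bytes could look like, assuming an ASCII-compatible encoding such as UTF-8 (the function name `split_words_bytes` and the choice of separators are illustrative, not something from the thread). It treats ASCII whitespace and punctuation as separators and every other byte, including all bytes >= 0x80, as part of a word, so it never needs the input to decode:

```python
import re
from typing import List

# ASCII control characters, whitespace, and punctuation act as separators.
# ASCII letters, digits, and every byte >= 0x80 count as word bytes.
_SEPARATORS = re.compile(rb"[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")


def split_words_bytes(data: bytes) -> List[bytes]:
    """Split on ASCII separators without decoding (hypothetical helper)."""
    return [word for word in _SEPARATORS.split(data) if word]


# Valid UTF-8 followed by a stray 0xFF byte that would not decode.
sample = "na\u00efve caf\u00e9, ok".encode("utf-8") + b" bad\xffword"
words = split_words_bytes(sample)
```

This works on UTF-8 because multi-byte sequences never contain bytes below 0x80, so an ASCII separator can never appear inside an encoded character. It does not recognize non-ASCII separators or scripts that do not delimit words with spaces.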

lolc (OP) · 5y ago

Here's the thing: I don't want to work in UTF-8. I want to work in Unicode. Big difference, because tracking the encoding of my strings would increase complexity. So at the earliest convenience, I validate my assumptions about encoding and let a lower layer handle it from then on.

I understand you're arguing about some sort of equivalency between byte-arrays and Unicode strings. Sure there are half-baked ways to do word-splitting on a byte-array. But why do you consider that a viable option? Under what circumstances would you do that?
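For comparison, the decode-early approach described in this comment might be sketched as follows, assuming UTF-8 input and a search index that wants case-insensitive, compatibility-normalized terms (the name `to_index_terms` and the choice of NFKC plus case folding are assumptions for illustration). Decoding happens once at the boundary, and invalid input is rejected there:

```python
import re
import unicodedata
from typing import List


def to_index_terms(raw: bytes) -> List[str]:
    """Decode once, then work purely on Unicode text (hypothetical helper)."""
    # Validate the encoding assumption up front; invalid bytes raise
    # UnicodeDecodeError instead of leaking into later layers.
    text = raw.decode("utf-8")
    # Normalize equivalent representations, then fold case for matching.
    text = unicodedata.normalize("NFKC", text).casefold()
    # \w on a str pattern matches Unicode letters, digits, and underscore.
    return re.findall(r"\w+", text)


# "e" + combining acute accent, and the "fi" ligature (U+FB01).
terms = to_index_terms("Cafe\u0301 \ufb01le".encode("utf-8"))
```

Here the decomposed "é" and the ligature both normalize to their common forms, which is the kind of equivalence a byte-level splitter cannot see without knowing the encoding.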

naniwaduni · 5y ago

Every circumstance. Why do you consider it unviable? What problems do you think having a Unicode sequence solves?

