acdha · 12y ago

Your data would have to be rather unusual for UTF-8 to come anywhere near 50% larger on real text. Even if you write only in Chinese, once you have punctuation, Arabic numerals, quotes, URLs, HTML, etc., the averages cancel out more than you might think. And the search engine design would have to be completely incompetent for that to remotely approach 50% more time to query or index.
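A rough way to check the size claim yourself is sketched below. The sample strings and the helper names (`encoded_sizes`, `utf8_vs_utf16`) are made up for illustration and are not from any real corpus; the only thing measured is the encoded byte length of each string.

```python
def encoded_sizes(text):
    """Byte length of `text` in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def utf8_vs_utf16(text):
    """Ratio of UTF-8 size to UTF-16 size; above 1.0 means UTF-8 is larger."""
    sizes = encoded_sizes(text)
    return sizes["utf-8"] / sizes["utf-16"]


# Hypothetical worst case: nothing but BMP Chinese characters (3 bytes vs 2).
pure_cjk = "中文文本" * 100

# Hypothetical "real text" case: Chinese inside markup, with a URL and digits.
mixed = '<p>参见 <a href="http://example.com/docs?id=123">文档</a>,共 42 页。</p>' * 100

print("pure CJK:", utf8_vs_utf16(pure_cjk))
print("mixed   :", utf8_vs_utf16(mixed))
```

The pure-CJK string hits the 50% worst case, while the marked-up string comes out smaller in UTF-8 than in UTF-16, because every ASCII character costs one byte instead of two. How far a given corpus sits between those two cases depends entirely on its mix of scripts and markup.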

If you were assigned the task of indexing the UTF-8 worst-case corpus, nothing would stop you from designing a custom internal encoding while still enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated with everyone else. (This is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup.)

est · 12y ago

> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is that this beast is called Unicode. How ironic.

acdha (OP) · 12y ago

The difference is internal vs. external: if you use UTF-8, things simply work better any time you exchange data. If you had lots of high code points (U+0800 and above, where UTF-8 needs three or more bytes per character), you would in most cases get better results from a full compression system than from playing with 2-byte encoding hacks, which is really all you're talking about in both cases.
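To make the compression point concrete, here is a minimal sketch using Python's standard `zlib` module. The vocabulary, the random sample, and the names `make_sample` and `raw_and_packed` are assumptions for illustration only; a real corpus will compress differently, so treat the numbers as a direction rather than a benchmark.

```python
import random
import zlib

# Hypothetical vocabulary of two-character Chinese words (all in the BMP,
# so 3 bytes per character in UTF-8 and 2 bytes in UTF-16).
WORDS = ["搜索", "引擎", "索引", "文本", "编码", "数据", "压缩", "查询", "字符", "存储"]


def make_sample(n_words=5000, seed=0):
    """Build a deterministic pseudo-text by picking words at random."""
    rng = random.Random(seed)
    return "".join(rng.choice(WORDS) for _ in range(n_words))


def raw_and_packed(text):
    """Map each encoding name to (raw byte length, zlib-compressed length)."""
    out = {}
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        out[name] = (len(raw), len(zlib.compress(raw, 9)))
    return out


sample = make_sample()
result = raw_and_packed(sample)

raw_ratio = result["utf-8"][0] / result["utf-16-le"][0]
packed_ratio = result["utf-8"][1] / result["utf-16-le"][1]

print("raw    UTF-8 / UTF-16:", raw_ratio)
print("packed UTF-8 / UTF-16:", packed_ratio)
```

Before compression UTF-8 is 50% larger on this sample. After a general-purpose compressor has run, most of that gap should disappear, since the extra UTF-8 bytes are highly predictable. That is the sense in which a 2-byte encoding is a weak substitute for real compression.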

est · 12y ago

I totally agree that UTF-8 is pretty good for exchanging data, both because it is better overall than UCS-* and the other UTF-* encodings and because everyone is (and should be) using it.

But you know, there are other cases besides exchanging data. Like I said, if your text is mainly Latin you are fine, but it is not so good if you are stuck with non-Latin BMP text.
