0 points · mojuba · 5y ago · 0 comments

So string is no longer a "string of characters", it is in fact a program (not Turing complete) that you need to execute.

Though substring(m, n) still makes sense in at least interactive text manipulation: how do you do copy/paste?

7 comments · 4 top-level

truefossil · 5y ago · 2 in thread

The safest path is to consider it a blob. There is some library that can render it magically, and using that library is the only wise thing you can do. The internal structure is hard to understand. Also, definitions change over time. So you'd better leave it all to professionals.

spookthesunset · 5y ago

The thing about Unicode is... anybody who tried to make it “simpler” would eventually just develop a crappier version of Unicode.

Unicode is complex because the sum of all human language is complex. Short of a ground-up rewrite of the world's languages, you cannot boil away most of that complexity... it has to go somewhere.

And even if you did manage to “rewrite” the world's languages to be simple and remove accidental complexity, I assert that over centuries they would devolve right back into a complex mess again. Why? Languages represent (and literally shape and constrain) how humans think, and humans are a messy bunch of meat sacks living in a huge world rich in weird crazy things to feel and talk about.

kps · 5y ago

There are definitely crappy things about Unicode that are separate from language.

- Several writing systems are widely scattered across multiple ‘Supplement’/‘Extended’/‘Extensions’ blocks.

- Operators (e.g. combining forms, joiners) are a mishmash of postfix, infix, and halffix. They should have been (a) in an easily tested reserved block (e.g. 0xF0nn for binary operators, 0xFmnn for unary), so that you could parse over a sequence even if it contains specific operators from a later version — i.e. separate syntax from semantics, and (b) uniformly prefix, so that read-ahead isn't required to find the end of a sequence (and dead keys become just like normal characters).
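To make the read-ahead point concrete, here is a minimal Python sketch. The helper name `split_combining_sequences` is made up for illustration, and it is a deliberate simplification: it only looks at the canonical combining class and ignores joiners, emoji sequences, and the rest of the real segmentation rules. Because combining marks are postfix, the scanner cannot know that the sequence starting at "e" has ended until it has already read the next base character.

```python
import unicodedata
from typing import List

def split_combining_sequences(text: str) -> List[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified on purpose: only the canonical combining class is checked,
    so joiners and other sequence-forming characters are not handled.
    """
    groups: List[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch   # postfix mark: attach to the previous base
        else:
            groups.append(ch)  # only now do we know the previous group ended
    return groups

sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT + 'a'
print(split_combining_sequences(sample))
```

With uniformly prefix operators, the same loop could close a group as soon as it consumed the base character, with no need to peek at what follows.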

goto11 · 5y ago · 1 in thread

No, it is not a program, at least not any more than an ASCII string is a program.

It is just that, unlike in ASCII, there isn't a simple 1:1 correspondence between bytes and characters and glyphs in Unicode, so you can't just extract an arbitrary byte sequence from a string and expect it to render correctly.
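A small Python sketch of that, assuming the string is stored as UTF-8 (the sample word and the helper `try_decode` are just for illustration): a byte slice that ends on a character boundary decodes fine, while one that ends inside a multi-byte character does not.

```python
text = "na\u00efve"            # 'ï' (U+00EF) takes two bytes in UTF-8
data = text.encode("utf-8")    # 5 code points, 6 bytes

def try_decode(chunk: bytes) -> str:
    """Return the decoded text, or a marker if the slice is not valid UTF-8."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return "<invalid UTF-8>"

print(len(text), len(data))
print(try_decode(data[:2]))    # cut falls on a character boundary
print(try_decode(data[:3]))    # cut falls inside the two-byte 'ï'
```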

mojuba (OP) · 5y ago

> there isn't a simple 1:1 correspondence between bytes and characters and glyphs

There isn't a simple 1:1 correspondence between anything at all. The only definitive thing about Unicode strings is the beginning where you should start your parsing.

And the way things are supposed to be displayed in order to be Unicode-compliant looks more like some virtual machine analyzing code. How is this different from any other declarative language?

techdragon · 5y ago

Not really. A Unicode string is more like a sequence of data built from simple binary structs, which belong to a smallish group of valid structs. Additionally, some, but not all, of these structs can be used to infer the validity of subsequent structs in the sequence if you're parsing in a more byte-at-a-time fashion. Alternatively, if you're happy dealing with a little less forward compatibility and go for explicit enumeration of all groups of valid bytes, you can be a lot more sure of things, but it's harder to make this method as performant as the byte-at-a-time method, which, given the complete ubiquity of string processing in software... leads to the dominance of the byte-at-a-time method.
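As a rough illustration of the byte-at-a-time idea, here is a Python sketch that assumes the encoding is UTF-8 (the comment above does not name one). The function names are hypothetical. The first byte of each sequence says how many continuation bytes must follow, and that is all the scanner checks: it is a structural check only, and does not reject every ill-formed case (for example encoded surrogates or some overlong three-byte forms).

```python
def expected_length(lead: int) -> int:
    """Length of a UTF-8 sequence judged from its first byte; 0 if not a valid lead."""
    if lead < 0x80:
        return 1               # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2               # 110xxxxx (0xC0 and 0xC1 would be overlong)
    if 0xE0 <= lead <= 0xEF:
        return 3               # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4               # 11110xxx, capped at U+10FFFF
    return 0                   # continuation byte or a byte never used in UTF-8

def is_continuation(b: int) -> bool:
    return (b & 0xC0) == 0x80  # 10xxxxxx

def scan(data: bytes) -> list:
    """Split data into one byte string per code point, checking structure only."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        seq = data[i:i + n]
        if n == 0 or len(seq) < n or not all(is_continuation(b) for b in seq[1:]):
            raise ValueError(f"malformed sequence at byte {i}")
        out.append(seq)
        i += n
    return out

print(scan("a\u00e9\u20ac".encode("utf-8")))
```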

roel_v · 5y ago

> So string is no longer a "string of characters"

It hasn't been for 30 years.
