> That's a really good explanation of your position and reasons for it, thank you.
Cheers, I've had time to think and some sleep. I apologize to you and the people I've offended with my cranky trollish manner.
> Unicode is not intended to, or try to encode how a text should be displayed.
This made me realize that "text" has traditionally been exactly language that is displayed somehow. The whole concept of storing writing as digital bits is metaphysical. Barely so for e.g. English, but quite a lot for e.g. Arabic.
> [Unicode] is purely and only about encoding the graphemes.
If it's just a catalog mapping numbers to little pictures (technically to collections, or families, of glyphs, or even to non-specific heuristics for deciding if a graphical structure counts as a glyph for a grapheme [1]) then I'll shut up. But what about the modifiers and stuff?
Maybe I am being unfair to Unicode. I don't want to deny or denigrate the cool and useful things it actually does do. As I said I think it's a combination of a good idea (encoding graphemes) with an impossible idea (encoding written human languages). If Unicode isn't the latter then I've been shouting at the wrong cloud!
- - - -
Here's what I'm trying to say: Imagine a conceptual "space" with ASCII on one side and PostScript on the other. In between there's a countably infinite set of formalisms that can describe and render human languages. From this point of view, the Unicode standard is a small part of that domain but it is absorbing (in my opinion) so much of the available time and attention that other potentially more-useful regions of the domain are completely neglected.
- - - -
So, yeah, I think we should study languages and writing systems and computerize them carefully with native speakers and writers and linguistic experts in the room. And I think we would need what are in effect DSLs for each kind of writing system. (Not every language, but rather every kind of way that languages are written down.)
> how would you process the content of a string that is actually a DSL
Parse it into a data structure, the simplest one that suffices for the language. Work with it through defined functions (an API). This is what we do already, but the fact that English can be represented reasonably well as array<char> tends to obscure it.
string_value.split()
Or better yet:
>>> s = "What is the type of text?"
>>> s.title()
'What Is The Type Of Text?'
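To make "parse it, then work through an API" concrete, here is a minimal sketch in the same Python as above, doing explicitly what split() and title() do implicitly for English. The names (EnglishText, title_case, render) are made up for illustration, not from any existing library.

from dataclasses import dataclass

@dataclass
class EnglishText:
    words: list[str]  # the simplest structure that suffices for English

    @classmethod
    def parse(cls, raw: str) -> "EnglishText":
        return cls(raw.split())

    def title_case(self) -> "EnglishText":
        # the operation is defined on the parsed structure, not on a byte stream
        return EnglishText([w.capitalize() for w in self.words])

    def render(self) -> str:
        return " ".join(self.words)

print(EnglishText.parse("What is the type of text?").title_case().render())
# What Is The Type Of Text?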
> With Unicode it's possible to write a library that can process text in any script
That seems like it ought to be true, but I don't think it is in practice. In your reply to mjevans elsewhere in this thread,
> You can't determine [the correct way of connecting the characters] purely from Unicode, you have to also know the conventions used in writing Arabic script. However Unicode is not intended to encode such conventions.
And you point out that Unicode won't help you properly support cut-and-paste for Arabic. So you can't process text using Unicode alone if that text is Arabic. In fact, there may not be "text" in Arabic the way there is in English! There is written Arabic but not textual Arabic. In other words, Unicode may well be engaged in creating the textual form of Arabic (and other languages).
> any one of thousands of different domain specific languages
I think there would be fewer than a hundred distinct formalisms that together could capture the ways we have come up with to write, perhaps fewer than a dozen, but I wouldn't want to bet on it.
> how would you ever be able to write one piece of code to work with all of them and all possible future permutations?
Maybe you can't.
But if it's possible, it will be by figuring out the type of text, which means exactly figuring out the set of functions that make sense on text. At which point your code can use those functions (the API of the TextType) to abstract over text. Like the str.title() method. Does that even make sense in Chinese or Arabic?
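A quick check in the same REPL style as above suggests not really: for scripts with no letter case, title() just hands the string back unchanged (the example strings here are mine, picked only to make the point).

>>> # Chinese has no letter case, so title() has nothing to capitalize:
>>> "文本的类型是什么？".title()
'文本的类型是什么？'
>>> # likewise Arabic:
>>> "ما هو نوع النص؟".title()
'ما هو نوع النص؟'

The method is there on every str whether or not the question it answers exists for the script.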
The comment by int_19h in this thread speaks to this point really well:
> It's not about encodings at all, actually. It's about the API that is presented to the programmer.
> And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).
> Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.
> It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.
What you're asking for is the base type for "text" for all languages, the ur-basestring, if you will. (It may not exist.)
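Here's a rough sketch of what that proposal could look like, assuming the third-party regex module's \X pattern as the grapheme-segmentation backend; the class and method names are made up for illustration.

import regex  # third-party; pip install regex

class ExplicitString:
    """A string wrapper with no default length and no default indexing."""

    def __init__(self, value: str):
        self._value = value

    # Deliberately no __len__ and no __getitem__: callers have to say
    # which notion of length or position they actually mean.

    def length_in_code_points(self) -> int:
        return len(self._value)

    def length_in_graphemes(self) -> int:
        # \X matches one extended grapheme cluster per UAX #29
        return len(regex.findall(r"\X", self._value))

    def length_in_utf8_bytes(self) -> int:
        return len(self._value.encode("utf-8"))

    def nth_grapheme(self, n: int) -> str:
        return regex.findall(r"\X", self._value)[n]

s = ExplicitString("cafe\u0301")  # "café" with a combining acute accent
# three different questions, three different answers: 5, 4, 6
print(s.length_in_code_points(), s.length_in_graphemes(), s.length_in_utf8_bytes())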
> Finally if your DSL is producing display output, how does that work with fonts? What if you want to vary the appearance of the output, how do you apply that to the encoding output? It just seems that this approach produces an enormous monolithic super-complex rabbit hole with no bottom in sight.
Well again, computerized text is a new thing under the sun, different from writing, which has been happening all over the world for thousands of years (cf. Rongorongo [2]). Separating the "text" from the written form of the text (the display) is a new and metaphysical thing to do. For languages like English we get pretty far with encoding the alphabet and some punctuation marks and putting them in a row. We completely bunted on capitalization, though: we pretend that 'a' and 'A' are two different things. Typefaces can be abstracted from the stream of encoded bytes/characters and treated as metadata. If you want to include that metadata in a digital document you immediately have to define a DSL (Rich Text Format, for example) to shoehorn it back into the byte stream. Complications ensue.
For some languages (e.g. Arabic) it may not make sense to abstract the display of the text from the text. (Again, writing is exactly display. It is literally (no pun intended) the act of displaying language.) You have to include metadata in addition to the graphemes in order to recreate the correct display of the text, so you have to have some kind of DSL for the task.
As I said above, I don't think there are more than one or two dozen truly different ways of writing. A set of DSLs (perhaps not dissimilar to the generative L-Systems that can produce myriad realistic plant-like images from a small set of operations) could presumably model those ways of writing.
Unicode was a start on computerization of written languages. I think an approach that treats each kind of writing system as a first-class object of study in its own right will give us standard models for dealing with text in each kind in digital form. We should strive for computerized writing systems that are "as simple as possible, but no simpler." And, yes, it seems to me that some of them will have to include producing display output.
[1] DuckDuckGo image search for "letter A" https://duckduckgo.com/?q=letter+a&t=ffsb&atb=v60-2_b&iax=1&...
[2] https://en.wikipedia.org/wiki/Rongorongo
- - - -
Here's my "Cartoon History of Unicode":
1. ASCII exists
2. Europe does too! Extend ASCII with the funky umlauts or whatever.
3. Oh shit! Japan! Mojibake!
4. I know! Let's use *sixteen* bits! That'll solve everything.
5. What do you mean Chinese is different from Japanese?
6. WTF Arabic!?
7. Boy there sure are a lot of graphemes. Gotta collect 'em all.
8. PIZZA SLICE
9. POOP
At which point we reach "peak internet" and Doge appears to say "wow".