Was Unicode designed well? If it were designed from scratch today, with no legacy considerations, would the ideal design look like the current design? What would you change?
Being extremely ignorant of the problem space, the first thing I would consider for the chopping block would be combining characters. Just make every character a precomposed character (one code point), so there's no need for normalization. I'm curious if such a scheme could fit every code point into 32 bits, though. Would this be feasible?
It’s like giving me a list of numbers and asking me to “combine” them. What does that mean? Do I sum them up, or concatenate them, or something else entirely? A lot of string reversal solutions are “incorrect” because there isn’t even a correct question in the first place.
Even with an infinitely large code space, doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone. If Unicode had been the first digital character set ever created, it would not contain a single precomposed code point because they are utterly impractical. As such, normalisation – or at least the canonical reordering part of it – is always going to be a necessity.
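The canonical-reordering point is easy to see with Python's `unicodedata` module, which implements Unicode normalization. A minimal sketch, using two marks with different combining classes:

```python
import unicodedata

# Two visually identical sequences: dot above and dot below on "q",
# applied in different orders.
a = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW
b = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE

print(a == b)  # False: the raw code point sequences differ

# Canonical reordering (part of every normalization form) sorts marks
# by combining class, making the two sequences compare equal.
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True
```

Even in a Unicode with zero precomposed characters, this reordering step would still be needed, which is the point above.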
I'm not sure why you focused on this one example, which was just meant to indicate the nature of the issue, not cite a broad concrete problem. There are plenty of situations where you'd want to operate on graphemes, not code points, like deleting the previous grapheme in a text editor. It would certainly help programmers write correct code if the two were the same.
>doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone
It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.
Well, 10 code points because vowels can be capitalized, and 12 because ÿ (and its capital Ÿ) are used in other languages.
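A quick check with Python's `unicodedata`: all twelve of those base-letter + COMBINING DIAERESIS (U+0308) pairs do in fact have precomposed code points, i.e. NFC folds each into a single character:

```python
import unicodedata

# Which base + COMBINING DIAERESIS pairs precompose under NFC?
results = {}
for base in "aeiouyAEIOUY":
    composed = unicodedata.normalize("NFC", base + "\u0308")
    results[base] = composed
    print(base, "->", composed, f"({len(composed)} code point(s))")
```

Every entry comes out as one code point, ÿ (U+00FF) and Ÿ (U+0178) included.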
That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.
At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.
Far from it. Even if you limit yourself to just Latin, the number of valid (whatever “valid” even means) combinations is already unmanageably gargantuan. Just look at phonetic notation as one example of many. The basic IPA alone uses over 100 letters for consonants and vowels, plus dozens of different diacritics, many of which need to be present concurrently on the same base letter. Make the jump to extended IPA or any number of other, more specialised transcription systems – and there are plenty – and you’ll never see the end of it.
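A rough back-of-the-envelope in Python makes the scale concrete. The letter and diacritic counts, and the cap of three concurrent diacritics, are illustrative assumptions, not actual IPA limits:

```python
from math import comb

# ~100 base IPA letters, ~30 diacritics, up to 3 concurrent diacritics
# per letter (unordered combinations, including the bare letter).
letters, diacritics = 100, 30
total = letters * sum(comb(diacritics, k) for k in range(0, 4))
print(total)  # 452600
```

Nearly half a million code points for one notation system, under conservative assumptions, before touching extended IPA.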
Sure, it may be technically possible to create an exhaustive list of letter-and-diacritic combinations, just like you can technically create an exhaustive list of every single human on Earth, but good luck getting there. And good luck making sure you didn’t miss anything in the process.
Of course, you don’t need to limit yourself to Latin, because Unicode has 160 other writing systems to offer.
Writing systems like Tibetan and Newa where consonants can be stacked vertically to arbitrary heights and then have vowel signs and other marks attached as a bonus as well.
Or Hangul, which would occupy no less than 1,638,750 code points if all possible syllable blocks were encoded atomically, and that doesn’t even account for the archaic tone marks, or those novel letters that North Korea once tried to establish that aren’t even in Unicode yet.
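The arithmetic behind those figures (the archaic jamo counts are the ones commonly cited for Old Hangul):

```python
# Modern Hangul: 19 initial x 21 medial x 28 final jamo (incl. no final).
modern = 19 * 21 * 28
print(modern)   # 11172 -- exactly the size of Unicode's Hangul Syllables block

# Old Hangul: 125 initials, 95 medials, 138 finals (incl. none).
archaic = 125 * 95 * 138
print(archaic)  # 1638750
```

Unicode did encode the 11,172 modern blocks atomically, and even that is often cited as a mistake; the Old Hangul figure shows why it doesn't scale.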
Or Sutton SignWriting whose system of combining marks and modifiers is so complex that I’m not even gonna explain it here.
If you eschew combining characters then yes, you will create an encoding where every code point is also a full grapheme cluster, and that definitely has concrete advantages. But as a consequence you have now assigned yourself the unenviable task of possessing perfect, nigh-omniscient knowledge of every single thing that a person has ever written down in the entirety of human history. Because unless you possess that knowledge, you will leave out things that some people need to type on a computer under some circumstances.
Every time some scholar discovers a previously forgotten vowel sign in an old Devanagari manuscript, you need to encode not only that one new character, but every combination of that vowel sign with any of the (currently) 53 Devanagari consonants, plus Candrabindu, Anusvara, and Visarga at the very least, just in case these combinations pop up somewhere, because they’re all linguistically meaningful and well-defined.
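A rough count for a single such discovery, assuming each of the three marks can independently be present or absent (an illustrative simplification, not a claim about Devanagari orthography):

```python
# One newly discovered vowel sign, combined with each consonant,
# optionally carrying any subset of {Candrabindu, Anusvara, Visarga}.
consonants, optional_marks = 53, 3
new_code_points = consonants * 2 ** optional_marks
print(new_code_points)  # 424 precomposed forms for one new vowel sign
```

One manuscript find, four hundred-odd new code points; with combining characters it would be exactly one.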
It’s doable, in a sense, but why would you subject yourself to that if you can just make characters combine with each other instead?
For Unicode, a “design from scratch” design would remove duplicate legacy code points. Why have “é” both as a single code point and as “e” plus a combining character?
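The duplication is easy to demonstrate with Python's `unicodedata`:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single legacy code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False, despite rendering identically
# Only normalization reconciles the two encodings:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Without the legacy precomposed form, string comparison would not need this reconciliation step for this case.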
It also wouldn’t have any of the deprecated characters (https://en.wikipedia.org/wiki/Unicode_character_property#Dep...)
I also would remove the few special flag code points (https://home.unicode.org/the-past-and-future-of-flag-emoji/)
If “design from scratch” also means “drop the goal of encompassing old character encodings”, more code points probably could go. Why are DOS box characters in Unicode, while Atari/PET, etc, ones aren’t, for example?
Finally, I would look into making it easier to retrieve character class from a code point (the ‘these code points are digits, these are combining marks, etc’ tables are a bit of a wart, and getting rid of them could be useful in small embedded devices).
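For reference, this is what those table lookups look like through Python's `unicodedata`; each call consults exactly the kind of property tables described:

```python
import unicodedata

# General-category lookup for a digit, a combining mark, a letter,
# and a Devanagari digit.
for ch in ["7", "\u0301", "A", "\u096F"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
```

Note that `"7"` and U+096F both report `Nd` (decimal digit): the category is a property lookup, not something derivable from the code point's numeric range.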
I doubt a solution exists there that is future-proof against extensions of Unicode and doesn’t blow up memory use, though, and I’m not sure any embedded devices too small to host those tables could actually use that info.
Reversing a string merely indicates the problem. There are many cases for operating on graphemes instead of code points. For example, deleting the previous grapheme in a text editor when pressing backspace/delete. I think most programmers assume they're dealing with graphemes when they're actually dealing with code points. See, for example, the rune type in Go.
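A minimal Python sketch of the mismatch. Note that the standard library offers no grapheme segmentation at all (the third-party `regex` module's `\X` pattern does):

```python
# A naive "delete last character" that works on code points:
s = "cafe\u0301"   # "café" spelled with a combining acute accent
print(s[:-1])      # "cafe": only the accent is deleted, not the whole é

# Multi-code-point emoji make it worse:
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl
print(len(family))        # 5 code points, though it renders as one grapheme
print(repr(family[:-1]))  # chops off the girl, leaving a dangling ZWJ sequence
```

This is roughly what a backspace implemented over code points does to users.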
Where I would make the change isn't Unicode itself but the APIs. All of the problems you're talking about basically come down to legacy language design: people think they're working with grapheme clusters when they're really working with code points. Making that distinction explicit in the tools would be good, similar to how Python 3 forced you to think about whether you wanted encoded binary data or a decoded string. But there's so much history around that, it's hard to change without a lot of griping from people who don't want to update decades of habit.
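The Python 3 split mentioned above, as a sketch:

```python
raw = b"caf\xc3\xa9"        # encoded bytes: UTF-8 on the wire
text = raw.decode("utf-8")  # decoded str: a sequence of code points

print(len(raw), len(text))  # 5 bytes vs 4 code points
# Mixing the two is a TypeError in Python 3, not silent mojibake:
try:
    raw + text
except TypeError as e:
    print("refused:", e)
```

An analogous forced distinction between code-point sequences and grapheme-cluster views would surface the bugs discussed in this thread at the API boundary instead of in production.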
Did you mean to say that not all programs support Unicode? It's been a long time — at least a decade — since I ran into a device which doesn't support it at all, as opposed to something like PHP code which has built-in support but didn't enable it.
That has nothing to do with Unicode.
It's a question of whether the data is little-endian or big-endian.
Unicode is as fixed and settled now as ASCII was in its era.
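Right: byte order is a property of the serialization, not of Unicode's code point assignments. A Python sketch:

```python
# The same character serialized under the two byte orders:
print("A".encode("utf-16-le").hex())  # 4100
print("A".encode("utf-16-be").hex())  # 0041

# With plain "utf-16", Python prepends a BOM (U+FEFF) so a reader
# can detect the byte order of the stream.
print("A".encode("utf-16").hex())
```

The BOM exists precisely because the code point U+0041 is fixed while its byte-level representation is not.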
And of course which characters count as mark-bearing letters is application-dependent, and there might be collateral requirements (e.g. if the reversed text is meant to be displayed, swapping closing and opening delimiters such as parentheses: "(note)" -> ")eton(" or "(eton)").
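A sketch of that display-oriented reversal in Python, with a hypothetical swap table covering a few delimiter pairs (it still operates on code points, so combining marks would need the separate handling discussed above):

```python
# Pairs whose open/close roles flip when text is mirrored for display.
SWAP = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}

def reverse_for_display(s: str) -> str:
    """Reverse a string, swapping paired delimiters so they still nest."""
    return "".join(SWAP.get(ch, ch) for ch in reversed(s))

print(reverse_for_display("(note)"))  # (eton)
```

Unicode actually tracks this role-flipping via the Bidi_Mirrored property, which is one more reason "reverse a string" has no single correct answer.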