And I'm saying it doesn't really matter, because Unicode codepoints are already a "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is an attempt at standardizing historical human production, and if you expect that to end up clean and simple you're going to have a hard time.
> Can one "character" span multiple codepoints?
Yes.
> Do you have an example of this?
Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्) or maybe yet another different count, but this is a sequence of 9 codepoints (regardless of the normalization, it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell):
स: DEVANAGARI LETTER SA
ं: DEVANAGARI SIGN ANUSVARA
स: DEVANAGARI LETTER SA
्: DEVANAGARI SIGN VIRAMA
क: DEVANAGARI LETTER KA
ृ: DEVANAGARI VOWEL SIGN VOCALIC R
त: DEVANAGARI LETTER TA
म: DEVANAGARI LETTER MA
्: DEVANAGARI SIGN VIRAMA
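If you want to check this yourself, here is a minimal Python sketch using only the standard `unicodedata` module. The word is written with escapes so nothing depends on how your editor pastes it; the variable names (`word`, `names`, `same_in_all_forms`) are just mine for illustration.

```python
import unicodedata

# The word from the list above, spelled out codepoint by codepoint.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"

# One Unicode name per codepoint, in order.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

# len() on a Python 3 str counts codepoints, not user-perceived characters.
print(len(word))

# Compare the string against each of the four normalization forms.
forms = {form: unicodedata.normalize(form, word)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}
same_in_all_forms = all(normalized == word for normalized in forms.values())
print(same_in_all_forms)
```

The point is only that `len()` reports codepoints here; it says nothing about how many "characters" a reader would count.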
Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond knowing that it's troublesome for computers, as are jamo) so I can't even tell you how many "symbols" a native reader would see there.
I'm curious if a Sanskrit speaker would see each of the codepoints as a symbol or not.
Edit: thinking about it, i guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer..
There is one, kind-of: "grapheme cluster"[0]. This is the "unit" used by UAX29 to define text segmentation, and aliases to "user-perceived character"[1].
Most languages/APIs don't really consider them (although they crop up often in e.g. browser bug trackers), let alone provide first-class access to them. One of the very few APIs which actually acknowledges them is Cocoa's NSString, which has very good Unicode support (probably the best I know of, though Factor may have an even better one[3]); Apple even provides a document explaining grapheme clusters and how they relate to NSString[2]. Even so, it handles grapheme clusters by providing messages which work on ranges within an NSString; it doesn't treat clusters as first-class objects.
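To make the codepoint vs. cluster gap concrete, here is a rough Python sketch. Big caveat: this is NOT the UAX29 algorithm, just a crude approximation I'm assuming for illustration (glue every combining mark, i.e. general category `M*`, onto the preceding codepoint). The helper name `rough_clusters` is made up, and a real UAX29 implementation may well group things differently, especially around the virama.

```python
import unicodedata

def rough_clusters(text):
    """Crude approximation of grapheme clusters (not UAX29):
    attach each combining mark to the codepoint before it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Same Devanagari word as earlier in the thread, written with escapes.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"
clusters = rough_clusters(word)

print(len(word))      # number of codepoints
print(len(clusters))  # number of approximate clusters
print(clusters)
```

With this approximation the word comes out as the 5-piece split mentioned above rather than the 4-piece one, which is a decent illustration of why the "right" count depends on which segmentation rules you apply.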
> i guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer..
Indeed.
[0] http://www.unicode.org/glossary/#grapheme_cluster
[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...
[2] https://developer.apple.com/library/mac/#documentation/Cocoa...
[3] The original implementor detailed his whole route through creating Factor's Unicode library, and I learned a lot from it: http://useless-factor.blogspot.be/search/label/unicode