It's infuriating how many Japanese sites still don't use Unicode, purportedly because of this issue (though I suspect that it's just another example of Japan lagging when it comes to web/computer tech).
To make the problem more understandable to the people that are used to alphabetic scripts, suppose that tomorrow an Asian committee starts creating Uniword, a repertoire that maps complete words to numerical IDs. At a certain point they get to "colour".
Uniword committee: Well, that word shares meaning and origin with the other word "color", for which we already have a codepoint, so we will encode them under the same codepoint.
GB, Australia and Canada: Hey! No! To us those are different words; especially, we do not want Mr. Colours to appear as Mr. Color.
Uniword committee: No problem, just add some out-of-band information like "nationality" or "<span lang='en-GB'>"
"colour"-people: that will not work, there are so many cases in which this can go wrong. Whenever I copy a field from a DB I also have to extract this extra information?
Uniword: yes, is that a problem? C'mon!
"colour"-people: but do you need to do that in your applications?
Uniword: no, we have one code for every single word in our languages, including codes for very old languages that exist only in two palimpsests.
"colour"-people: and why cannot we have the same level of granularity?
Uniword: because you have too many words!!! And when we started we had only 100k available integers.
"colour"-people: and now?
Uniword: now we have 2^32. But, yeah, that is not the point; just do as we suggest. This dialog is getting too long.
"colour"-people: "dialogue", please.
That was perceived as happening more than a few times in the Han Unification debate.
I remember being concerned about Han unification around the time Ruby 1.9 was released, since this seemed to be one of Ruby's major reasons for being encoding-independent instead of standardizing on Unicode. But I hadn't heard about this issue in a while, except to hear occasionally someone say it's not a problem (maybe it was a Chinese person instead of a Japanese person -- the Wikipedia page says that the Chinese aren't as concerned about Han unification since Traditional Chinese didn't get unified with Simplified Chinese).
Until Unicode stops breaking people's names, it will continue to be the one standard for Japanese systems on and offline. Even when (if?) it stops breaking Japanese names, it will take a very, very long time to roll over existing systems, and that's barring unforeseen problems during the conversion.
We should stop before we adopt the "not following the standard" = "broken" ideology. Especially when we consider whom the standard serves best.
Edit: By "it will continue to be the one standard for Japanese" in the 2nd paragraph, I meant ShiftJIS not Unicode. That looked a bit unclear.
This, naturally, has an impact beyond those systems' borders.
There are two ways of dealing with glyphs that share code points. The first is TTC (TrueType Collection) fonts. A TTC is basically one set of glyphs with several sets of mappings (i.e. which code point maps to which glyph). When you install it, assuming your computer groks TTC, your system shows you a separate font for each mapping. Taking for example Source Han Sans, which Adobe just released - if you go to the download page[0] and get the complete version (the "OTC" one), you get a bunch of files like "SourceHanSans-Bold.ttc". If you install one of them you'll see four new fonts: "Source Han Sans J", K, SC, and TC. Then when you use the font, depending on which font name you used the system will change which mapping it applies to the combined set of glyphs. (Hence the choice of font name is the selection mechanism you described.)
The second way is that TrueType fonts have a way to build locale settings into the font. I'm less clear on the details here but apparently it's similar to TTC behind the scenes, except that the mappings are associated with locales - so in an app that supports TT locales, even if you select "Foo J" as your font, when the locale is Simplified Chinese you'd get the SC glyph. Of course now the selection mechanism is whether the application knows what locale the content is in. (And also whether it supports the mechanism - I don't know how widespread this is.) Either way though, in principle you get different glyphs for the same code point, depending on context.
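A toy model of the two selection mechanisms may help here. This is only a sketch: the glyph ids, the `GLYPH_POOL`, `CMAPS` and `LOCALE_TO_FONT` tables, and the `pick_glyph` helper are all made up for illustration, and real fonts store this information in binary tables rather than dictionaries. It shows one shared glyph pool, several code-point-to-glyph mappings, and the mapping being chosen either by font name or by content locale.

```python
# Toy model: one shared pool of glyphs, several code-point -> glyph mappings.
# All names are hypothetical; this is not how a real font file is laid out.

GLYPH_POOL = {
    "g_zhi_J": "直 (Japanese form)",
    "g_zhi_SC": "直 (Simplified Chinese form)",
    "g_a": "A",
}

# One mapping per "font" that the collection exposes.
CMAPS = {
    "Foo J": {ord("直"): "g_zhi_J", ord("A"): "g_a"},
    "Foo SC": {ord("直"): "g_zhi_SC", ord("A"): "g_a"},
}

# Assumed association between content locale and mapping (second mechanism).
LOCALE_TO_FONT = {"ja": "Foo J", "zh-Hans": "Foo SC"}


def pick_glyph(char, font_name, locale=None):
    """Return the glyph id for char.

    If the locale is known and has a mapping, it wins (second mechanism);
    otherwise the selected font name decides (first mechanism).
    """
    chosen = LOCALE_TO_FONT.get(locale, font_name)
    return CMAPS[chosen][ord(char)]
```

For example, `GLYPH_POOL[pick_glyph("直", "Foo J", locale="zh-Hans")]` looks up the Simplified Chinese form even though "Foo J" was selected, which is the behaviour described for locale-aware apps.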
Or anyway that's the understanding I took away as a font layperson - happy to be corrected.
[0] http://sourceforge.net/projects/source-han-sans.adobe/files/
I don't know if that's a decent solution, but just guesswork doesn't sound like a good idea, because there are bound to be edge cases where it wouldn't work, and then we're back where we started...
Over the years, I am starting to think Han Unification is a Western way of hacking around the CJK Han problem rather than actually solving it.
But speaking as a front-end webdev guy, it's been a looong time since I came in contact with any encoding here besides utf-8.
Edit: Besides, a TTF font doesn't have to always use Unicode internally. It supports an arbitrary mapping from bytes (could be in UTF-8 or SJIS) to a glyph number. People who really care about the looks (i.e. printing) have been using a charset for each specific language, such as Adobe-Japan1, which is different from both UTF-8 and Shift JIS.
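As a small illustration of why the byte-to-glyph mapping has to know the encoding: the same text yields different byte sequences under UTF-8 and Shift JIS. This sketch uses only Python's built-in codecs, and the sample string is arbitrary.

```python
# The same two characters under two encodings (Python's built-in codecs).
text = "漢字"

utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

print(utf8_bytes.hex(" "))
print(sjis_bytes.hex(" "))

# Decoding with the wrong codec fails (or, for other inputs, yields mojibake),
# so the encoding has to travel with the bytes somehow.
try:
    wrong = sjis_bytes.decode("utf-8")
except UnicodeDecodeError:
    wrong = None
```

Anything that maps raw bytes to glyph numbers is in the same position: it needs to be told which of these byte layouts it is looking at.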
http://blog.typekit.com/2014/07/15/introducing-source-han-sa...
Fortunately, the Git guys added a fix to convert it into Unicode internally.
On the other hand, Oriya, which has over 33 million native speakers, including 80% of India's Odisha state, does not appear to be supported.
Oriya appears to be quite complicated to render: http://www.microsoft.com/typography/OpenTypeDev/oriya/intro....
Meanwhile, I wonder if this means we'll see OCR and ePubs for all kinds of scripts now; or if this will help enable Google Translate in more languages? ;-)
Devanagari is what Hindi, Marathi, and Sanskrit use, so I am certain that it isn't any more complex to render than those languages.
I wouldn't be surprised if there happens to be an Osmanya geek in Google, but none of his teammates has ever heard of Oriya. For the same reason, I wouldn't be surprised if they added a bunch of geeky fictional languages before actual ones.
http://www.unicode.org/charts/PDF/U10480.pdf
already contains reference glyphs. If you want to preserve scripts, preserve Unicode tables instead of making fonts.
The other is simple practicality - these things take time to develop; you can either wait until all the glyphs are done, or release subsets that cover languages as you work. A subset that covers part of a language isn't very useful, but subsets that cover whole languages are.
If your website is written in English with an occasional accented character from other Western languages, there's no need to load a 50MB web font containing all the Traditional Chinese characters.
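One way to picture this: check which broad script ranges a page's text actually touches before deciding which font files to load. The sketch below is illustrative only. The `RANGES` table lists just a handful of Unicode blocks (it is not a complete script database), and `fonts_needed` is a hypothetical helper, not part of any font-loading API.

```python
# Rough check of which script-specific font subsets a text would need.
# RANGES covers only a few Unicode blocks, chosen for illustration.
RANGES = {
    "latin": (0x0000, 0x024F),   # Basic Latin through Latin Extended-B
    "cjk": (0x4E00, 0x9FFF),     # CJK Unified Ideographs
    "hangul": (0xAC00, 0xD7A3),  # Hangul Syllables
}


def fonts_needed(text):
    """Return the names of the subsets whose range contains a character of text."""
    needed = set()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in RANGES.items():
            if lo <= cp <= hi:
                needed.add(name)
    return needed
```

With this, `fonts_needed("café")` reports only the Latin subset, while a page that mixes in a single ideograph would also pull in the CJK one.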
Another part is that some CJK characters look somewhat different in C, J, and K. http://www.unicode.org/faq/han_cjk.html#3
Then there are practical considerations. While Latin, Greek and Cyrillic are similar enough to warrant the same styles (serif, sans-serif, script, italic, and various weights) not all of them make very much sense for, say, CJK or a variety of other scripts. So having different fonts for different scripts that are still designed to go together is actually not that bad a solution.
It does mean that for good typography you need a matrix of fonts based on style and script. Word includes two fonts per style for this, to treat CJK differently, which might not be enough, depending on the numbers of different scripts involved in a single document. But a) several dozen scripts per document are somewhat rare apart from Wikipedia's language list per article and font demonstrations; and b) good typography needs effort, this won't change.
[1] https://en.wikipedia.org/wiki/Nasta%CA%BFl%C4%ABq_script
[2] https://medium.com/@eteraz/the-death-of-the-urdu-script-9ce9...
the title is messed up, but I hope the message is clear.
But true, no Nastaliq (yet). Sadly not even mentioned as "unsupported".
edit: Found a different demo page that renders the webfont client-side instead of showing images, and looks much better to me: http://www.google.com/fonts/specimen/Noto+Serif. Maybe it's just that the pre-rendered specimens are made with a poor rendering engine?
How you might write the conversation
"Does he know how to speak Mandarin?"
"No, he doesn't."
他會說普通話嗎?
他不會。
in Modern Standard Chinese characters contrasts with how you would write
"Does he know how to speak Cantonese?"
"No, he doesn't."
佢識唔識講廣東話?
佢唔識。
in the Chinese characters used to write Cantonese. As will readily appear even to readers who don't know Chinese characters (if you have a good Unicode implementation enabled as you read Hacker News), many more words than "Mandarin" and "Cantonese" differ between those sentences in Chinese characters.
Obviously I'm wrong, because these are just regular Unicode characters, without an HTML "lang" attribute.
What gives?
The Han Unification problem arises from the inverse case - characters that are used in several languages but rendered differently depending on locale[0]. For those characters, they'll render even without a pan-CJK font, but the problem is they'll render in a way that's not appropriate for their locale.
[0] Another way to phrase this would be "distinct characters which share a code point because Unicode mistakenly thinks they're a single character whose rendering differs by locale". The difference is basically subjective.
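To make that concrete: a plain string stores only code points, so a unified character is identical whether the author meant it as Japanese or Chinese. The sketch below uses only the standard library. The `TaggedText` type is a hypothetical stand-in for whatever out-of-band language tag (an HTML `lang` attribute, a DB column) an application would have to carry alongside the text.

```python
import unicodedata
from dataclasses import dataclass

ch = "直"  # a commonly cited example of a unified ideograph

# The character database has a single entry for it, with no locale attached.
name = unicodedata.name(ch)


@dataclass(frozen=True)
class TaggedText:
    """Text plus the language tag that the code points alone cannot carry."""
    text: str
    lang: str


ja = TaggedText(ch, "ja")
zh = TaggedText(ch, "zh-Hans")

same_code_points = ja.text == zh.text  # identical strings
same_tagged = ja == zh                 # differs only because of the tag
```

Once the tag is dropped (say, by copying just the text field), nothing in the string itself says which rendering was intended.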
Sadly, it's probably still not possible to use as a webfont. A single font weight is over 8 MB, but there is a distinct possibility this could go into mobile devices and operating systems, which would be awesome.
http://www.monotype.com/services/screen-imaging-solutions/dy...
I also couldn't find any font that covers mathematical symbols from the SMP.
EDIT: Just downloaded the zip archive. Unix permissions for the Bengali and Gurmukhi fonts are different from the rest of them.
mathematical symbols from the SMP
as I did the expected Google search, and I am not sure that the search results I see refer to what you were referring to.
http://developers.google.com/speed/pagespeed/insights/?url=h...
I'm not aware of any other font that does a decent job of handling all of Simplified Chinese, Traditional Chinese, Japanese, and Korean simultaneously, and with light, bold, thin etc variants to boot. Most existing fonts, even expensive commercial ones, are lucky to support two, and even then usually regular text only.
Cherokee (US) is one fine looking set of glyphs.
Smallest .otf in collection: 4093KB