There are that many Han characters alone, so I’m not sure what the surprise is. It’s not like you have to hard-code them in your grammar.
If anything, I’d hope that new languages in 2010 allow any of the roughly 100,000 non-control non-whitespace [edit: non-punctuation] Unicode characters. For a lot of the code I see, ASCII is at least as constraining as, say, fixnums would be.
And that is why this particular parser won't pass the Java TCKs.
> # There are 46,908 different valid characters that you can use in an identifier
And yet among all those characters I can't have a question mark at the end of a boolean variable or function name...
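For anyone who wants to check the figure on their own JVM, here is a rough sketch that counts the code points accepted inside an identifier. It assumes `Character.isJavaIdentifierPart` is a fair stand-in for the grammar rule; the class and method names are made up for illustration. The result depends on the Unicode version your JDK ships with, and it also includes "identifier-ignorable" characters, so don't expect it to match the quoted number exactly.

```java
// Hypothetical helper, not taken from the article or its parser.
class IdentifierCount {
    // Counts every Unicode code point for which the running JVM says
    // "this may appear inside a Java identifier".
    static int countIdentifierParts() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierPart(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("identifier-part code points: " + countIdentifierParts());
        // The question mark is not among them, the dollar sign is.
        System.out.println("'?' allowed: " + Character.isJavaIdentifierPart('?'));
        System.out.println("'$' allowed: " + Character.isJavaIdentifierPart('$'));
    }
}
```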
Anyway, weirdness is no excuse for not supporting a feature in a parser!
condition ? result-if-true : result-if-false
although you're probably right that a smarter parser should be able to distinguish between both cases. But then the next guy comes along and complains that he can't use a colon as a valid character in his variable names...
Also, I am not sure rewording the title added any value. The title of the article was, IMO, just fine. See http://ycombinator.com/newsguidelines.html
Depending on your purpose, it might be a whole lot easier to re-use an existing implementation and focus on whatever you're planning to use this parser for...
I was unable to find an existing Java parser in C#. Do you know of one?
Often, this is really just another way of saying, "I didn't think of this before, and I don't want to start thinking about it just now."
The floating point notation would be understandable for those who need to be intimately familiar with floats at the bit level. There are a few people who have to do this.
Allowing Unicode identifier names is a feature I've seen many people ask for, and it doesn't seem like that big a deal. Languages that don't support it must be frustrating for people who don't write in English. Of course, some characters can appear in an identifier but not start it; this is true of most languages. You can't start a variable name with a digit in C.
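To make the start-versus-rest distinction concrete, here is a minimal sketch of an identifier check built on `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart`. The class and method names are hypothetical, and it deliberately ignores reserved words (it would accept `class`), so treat it as an illustration rather than what a real Java parser does.

```java
// Hypothetical helper; ignores keywords and literals such as "true" or "null".
class IdentifierCheck {
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        // The first code point must be allowed to *start* an identifier.
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Every later code point only needs to be a valid identifier *part*.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("x1"));           // digit after a letter
        System.out.println(isValidIdentifier("1x"));           // digit in first position
        System.out.println(isValidIdentifier("\u53d8\u91cf")); // two Han characters
        System.out.println(isValidIdentifier("ready?"));       // trailing question mark
    }
}
```

Walking the string by code point instead of by `char` matters here, since characters outside the Basic Multilingual Plane take two `char` values.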
I see no issue with allowing numbers in different bases. And of course the fractional part would also be in that base. It would be weird if the digits left of the point were in hex and the digits right of it were decimal.
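For reference, this is roughly how Java's hexadecimal floating-point literals behave: the digits on both sides of the point are hex, while the exponent after `p` is written in decimal and is a power of two. The class name and the `parse` helper below are made up for the example; `Double.parseDouble` and `Double.toHexString` are the standard library calls.

```java
// Hypothetical demo class for hex floating-point literals.
class HexFloatDemo {
    // Double.parseDouble accepts the same hex notation as the source literal.
    static double parse(String literal) {
        return Double.parseDouble(literal);
    }

    public static void main(String[] args) {
        double a = 0x1.8p1;  // (1 + 8/16) * 2^1 = 3.0
        double b = 0xA.8p0;  // (10 + 8/16) * 2^0 = 10.5
        System.out.println(a);
        System.out.println(b);
        System.out.println(parse("0x1.8p1") == a);
        // Round trip: shows the exact bits of a double as a hex literal.
        System.out.println(Double.toHexString(a));
    }
}
```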