Rule of thumb: parsing/lexing shouldn’t take more than 10% of your compiler course.
I get that parsing is more of an implementation detail and doesn't really belong to the space-brained realm of language design per se, but it's a bit annoying that most texts refuse to give any space to the topic, and instead rely on your language being S-expression based, or assume you're going to use a parser generator. Like, in the real world, even if you never actually implement a fully-fledged programming language, you're still probably going to have to parse things sometimes. I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that
It reminds me somewhat of the situation in analysis, where there are lots of theorems that aren't written down anywhere because literally every book states them as "easy" exercises. Maybe I'm looking in the wrong places, but I can't find much in the way of concrete guidance on implementing parsers. I'm aware of the beautiful series on parsing theory by Aho & Ullman ("The Theory of Parsing, Translation, and Compiling"), but those volumes are focused more on theory than on implementation
> I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that
Terence Parr's "Language Implementation Patterns" spends quite a bit of time on parsing, and on parse tree -> AST conversions.
That is definitely true, but in practice there isn't much to say about it, because sophisticated parsers turn out not to be particularly important; it works out better overall to design simple grammars, and then the parsing is easy.
- If you're a beginner, you'll write a recursive descent parser, because that's the simplest technique, and it lets you focus on your project instead of a new, unfamiliar tool.
- If you're writing a domain-specific language, or a config format, or something of that nature, you'll use whichever parser generator integrates most conveniently into your workflow, and you'll design your grammar around whatever its manual tells you to do.
- If you're writing a full-scale language compiler, you'll go back to recursive descent, because that offers the easiest way to recover from errors and report informative messages. Maybe you'll throw in precedence-climbing for operators.
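The last two bullets together can be sketched in a few dozen lines. Below is a minimal, hypothetical recursive-descent parser for arithmetic expressions that uses precedence climbing for binary operators; all names and the token/AST representations are illustrative, not taken from any particular compiler or from the series under discussion.

```python
# Recursive descent + precedence climbing, sketched over a tiny
# expression grammar: integers, + - * /, and parentheses.
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    tokens = []
    for num, other in TOKEN_RE.findall(src):
        if num:
            tokens.append(("num", int(num)))
        elif other.strip():
            tokens.append(("op", other))
    tokens.append(("eof", None))
    return tokens

# operator -> precedence; all operators here are left-associative
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

class Parser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_atom(self):
        kind, value = self.advance()
        if kind == "num":
            return value
        if (kind, value) == ("op", "("):
            expr = self.parse_expr(0)
            if self.advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return expr
        raise SyntaxError(f"unexpected token {value!r}")

    def parse_expr(self, min_prec):
        # Precedence climbing: keep consuming binary operators whose
        # precedence is at least min_prec, binding tighter ones to the right.
        lhs = self.parse_atom()
        while True:
            kind, op = self.peek()
            if kind != "op" or op not in PRECEDENCE or PRECEDENCE[op] < min_prec:
                return lhs
            self.advance()
            # prec + 1 makes the operator left-associative
            rhs = self.parse_expr(PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)  # AST node as a plain tuple

def parse(src):
    return Parser(src).parse_expr(0)

print(parse("1 + 2 * 3"))    # ('+', 1, ('*', 2, 3))
print(parse("(1 + 2) * 3"))  # ('*', ('+', 1, 2), 3)
```

The appeal for error reporting is visible even here: each `raise SyntaxError` fires at a point where the parser knows exactly what it was expecting and where, which is much harder to arrange with a generated table-driven parser.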
> I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that
I would also read such a book, but it would be more of a book about parser generators than a book about parsers.
For every big, deep, native code compiler, there are a hundred template languages, config files, report generators, etc. all of which are real programs providing real value for actual people.
Emphasizing parsing provides the most value for the greatest number of people. The folks that do end up needing more back end depth will still have the resources available to learn it.
Everybody and their dog thinks it necessary to inflict some new sub-par language on us, when in about 99.9% of cases they should have stuck either to s-expressions or to some suitable subset of a popular programming language or an existing config language with a relatively sane syntax (blaze/bazel got that right; cmake got that very, very wrong).
When was the last time you looked at some config file and thought, wow, I'm so glad they didn't use toml or python or whatever, but instead made up some completely new syntax that nothing in the world apart from this tool itself can parse and that I can't programmatically manipulate?
When was the last time you thought, wow, I am so glad that someone invented a new templating language that creates some new injection vulnerabilities? No one apart from the lisp people ever seems to have worked out that if you want to interpolate into something tree-shaped, you should have a tree-based interpolation syntax. Although sexps and quasiquote solve this very nicely and concisely, everyone else still seems to love string-bashing plus some ad-hoc "escaping" system. And one reason for this is of course precisely the enormous abundance of idiotic config languages that can't be easily manipulated as anything other than opaque strings.
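The tree-based-interpolation point can be sketched in Python. The thread is talking about Lisp quasiquote; this uses the standard library's ElementTree instead, which is just the same idea applied to XML, and the payload string is of course made up:

```python
# String templating vs tree-based interpolation of untrusted input.
import xml.etree.ElementTree as ET

user_input = '<script>alert("pwned")</script>'  # hypothetical attacker payload

# String-bashing: the payload lands in the markup verbatim,
# and correctness depends on remembering an ad-hoc escaping step.
string_templated = f"<p>{user_input}</p>"

# Tree-based: the value is attached as a text node of the tree,
# so serialization escapes it; there is no way to "break out" of the node.
p = ET.Element("p")
p.text = user_input
tree_templated = ET.tostring(p, encoding="unicode")

print(string_templated)  # <p><script>alert("pwned")</script></p>
print(tree_templated)    # <p>&lt;script&gt;alert("pwned")&lt;/script&gt;</p>
```

The injection vulnerability simply has no place to live in the second version, because "text of a node" and "structure of the tree" are distinct in the data model rather than distinguished by escaping conventions in one big string.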
[Edit: if you do create a new config file language, pretty please provide some means to directly query and losslessly manipulate it. For the lossless part you will need either first-class comments unambiguously attached to a particular syntactic construct plus agreed-upon, deterministic formatting, or IDE-style complexity; the first is probably the better idea]
Unlike most compiler articles, this one actually covers code generation in every chapter, which is really great.
I also like that each chapter focuses on a specific feature and describes how to implement it end to end: lexing, syntactic parsing, the AST, and x86_64 code generation.
Great series!
https://en.wikipedia.org/wiki/Logic_for_Computable_Functions
> print("You have some form of undefined behavior, which means printing this is a valid response per the C standard")

For C++ IFNDR ("Ill-formed, No diagnostic required") the situation is trickier, because the affected programs (some unknowable but likely large proportion of all purported C++ code) are not well-formed C++, and the standard offers no hint as to what happens or why, since it constrains only the behaviour of a C++ compiler for well-formed C++ programs.
† It's possible the C lexer claims to have some "Undefined Behaviour" cases like the C++ lexer, hence P2621 "UB? In my lexer?" (a reference to a 2005 meme, because C++ standards committee members are down with the kids). If so, that's clearly a bug in the standards text: it makes no sense to have UB in the lexer. These should just be ill-formed programs, and you get a compiler error.