https://news.ycombinator.com/item?id=30414683
https://news.ycombinator.com/item?id=30414879
I spent a year or two working with PEGs, and ran into similar issues multiple times. Adding a new production could totally screw up seemingly unrelated parses that worked fine before.
As the author points out, Earley parsing with some disambiguation rules (production precedence, etc.) has been much less finicky/annoying to work with. It's also reasonably fast for small parses even with a dumb implementation. Would suggest for prototyping/settings when runtime ambiguity is not a showstopper, despite the remaining issues described in the article re: having a separate lexer.
> Earley parsing with some disambiguation rules
Any idea why GLR always gets ignored?
GLR would probably have (much) better performance but I'm usually not parsing huge files (or would hand-roll one if I were). I've not yet found an explanation of GLR (or even LR for that matter) that's quite as simple as PEGs or Earley (suggestions welcome tho!).
Both GCC and LLVM implement recursive descent parsers for their C compilers.
Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.
A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article, composing or embedding one parser in another, is a complete non-issue with recursive descent: it's just a function call. Other basics of parsing (informative/useful error messages, error recovery, i.e. not just blowing up at the first error, and seamless integration with the rest of the host language) remain problems with every parser generator but simply aren't issues with recursive descent.
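A minimal sketch of the approach described above, assuming nothing beyond the standard library; the token set, the `PRECEDENCE` table, and all names are made up for illustration, not taken from any real compiler:

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    tokens.append("<eof>")
    return tokens

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def next(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def parse_primary(self):
        tok = self.next()
        if tok == "(":                       # "composing parsers is just a function call"
            expr = self.parse_expr(0)
            assert self.next() == ")", "expected ')'"
            return expr
        return int(tok)

    def parse_expr(self, min_prec):
        # The embedded Pratt loop: precedence climbing handles
        # operators without any grammar rewriting.
        lhs = self.parse_primary()
        while PRECEDENCE.get(self.peek(), 0) > min_prec:
            op = self.next()
            rhs = self.parse_expr(PRECEDENCE[op])
            lhs = (op, lhs, rhs)
        return lhs

print(Parser(tokenize("1 + 2 * 3")).parse_expr(0))  # ('+', 1, ('*', 2, 3))
```

Note how left associativity falls out of the loop rather than the recursion: `1 - 2 - 3` parses as `(1 - 2) - 3`.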
And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.
This is the controversial part, Lisp aficionados to the contrary.
You can parse quite a lot more than Lisp with techniques from 1965.
My brain can do it, why can't my computer?
Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.
Parsing: The Solved Problem That Isn't (2011) - https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70 comments)
Parsing: the solved problem that isn't - https://news.ycombinator.com/item?id=2327313 - March 2011 (47 comments)
Certain parser generators make life easier by supporting actions on parser/lexer rules. This is great and all, but it has the downside that the grammar you provide is no longer reusable. There's no way for others to import that grammar and provide custom actions for them.
I don't know. In my opinion parsing theory is already solved. Whether it's PEG, LL, LR, LALR, whatever. One of those is certainly good enough for the kind of data you're trying to parse. I think the biggest annoyance is the tooling.
Pros: * They're just a technique/library that you can use in your own language without a separate generation step.
* They're simple enough that I often roll my own rather than using an existing library.
* They let you stick code into your parsing steps: logging, extra information, constructing your own results directly, etc.
* The same technique works for lexing and parsing - just write a parser from bytes to tokens, and a second parser from tokens to objects.
* Depending on your language's syntax, you can get your parser code looking a lot like the BNF grammar you're trying to implement.
Cons: * You will eventually run into left-recursion problems. It can be nightmarish trying to change the code so it 'just works'. You really need to step back and grok left-recursion itself - no handholding from parser combinators.
* Same thing with precedence - you just gotta learn how to do it. Fixing left-recursion didn't click for me until I learned how to do precedence.
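To make the left-recursion point concrete, here is a toy combinator sketch. The names (`chainl1`, etc.) follow the Haskell parsec tradition but this is hand-rolled, not any real library; a parser here is just a function from a string to `(value, rest)` or `None`:

```python
def digit():
    def p(s):
        return (int(s[0]), s[1:]) if s and s[0].isdigit() else None
    return p

def minus():
    def p(s):
        if s.startswith("-"):
            return (lambda a, b: a - b), s[1:]
        return None
    return p

def chainl1(term, op):
    # The standard combinator answer to left recursion: instead of
    # writing  expr -> expr '-' term  (which loops forever in a naive
    # combinator), parse  term (op term)*  and fold left.
    def p(s):
        r = term(s)
        if r is None:
            return None
        acc, s = r
        while True:
            ro = op(s)
            if ro is None:
                return acc, s
            f, s1 = ro
            rt = term(s1)
            if rt is None:
                return acc, s      # operator without operand: stop here
            rhs, s = rt
            acc = f(acc, rhs)
    return p

expr = chainl1(digit(), minus())
print(expr("9-3-2"))   # (4, '') -- left-associative: (9-3)-2
```

The folding loop is exactly the "step back and grok left recursion" move: the grammar shape changes, and associativity comes from the fold direction.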
A somewhat breathless description of all of this is in the Marpa parser documentation:
https://jeffreykegler.github.io/Marpa-web-site/
In practice, I've found that computers are so fast that, with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enough™: https://loup-vaillant.fr/tutorials/earley-parsing/ (I may be talking out of my ass here.)
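For a sense of how little code a "naive" Earley recognizer needs, here is a deliberately dumb sketch (my own grammar encoding and names, no Leo optimization, so right recursion can go quadratic; fine for small inputs):

```python
GRAMMAR = {                          # Sum -> Sum '+' Prod | Prod
    "Sum":  [["Sum", "+", "Prod"], ["Prod"]],
    "Prod": [["Prod", "*", "n"], ["n"]],
}

def recognize(grammar, start, tokens):
    # An Earley item is (head, body, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                        # iterate to a fixed point
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in grammar:        # predict
                        for b in grammar[sym]:
                            item = (sym, tuple(b), 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
                    elif i < len(tokens) and tokens[i] == sym:   # scan
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                         # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(recognize(GRAMMAR, "Sum", ["n", "+", "n", "*", "n"]))  # True
print(recognize(GRAMMAR, "Sum", ["n", "+", "*"]))            # False
```

Note the grammar is left-recursive and ambiguity-free here, and Earley doesn't care either way; that tolerance is the whole appeal for prototyping.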
In particular, you can do (a subset of) the following in sequence:
* write your own grammar in whatever bespoke language you want
* compose those grammars into a single grammar
* generate a Bison grammar from that grammar
* run `bison --xml` instead of actually generating code
* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues
In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.
https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
Also, Tree-sitter only does half the parsing job: you get a tree of nodes, but you have to do your own pass over that tree to get useful structures out.
I prefer Chumsky or Nom which go all the way.
No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).
For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.
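As a rough illustration of that claim, a hand-written reader for a BNF-ish rule notation fits in a dozen lines. This is a hypothetical mini-format of my own (rules shaped like `lhs : alt | alt ;`), not any real tool's grammar language:

```python
import re

def parse_grammar(text):
    """Parse 'lhs : alt | alt ;' rules into {lhs: [[sym, ...], ...]}."""
    rules = {}
    for rule in re.split(r"\s*;\s*", text.strip()):
        if not rule:
            continue
        lhs, rhs = rule.split(":", 1)
        rules[lhs.strip()] = [alt.split() for alt in rhs.split("|")]
    return rules

g = parse_grammar("expr : expr '+' term | term ; term : NUM ;")
print(g)
```

A real grammar language needs quoting, comments, and error reporting, but the core shape stays this simple.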
Double quotes in C code mean begin and end of a string. But strings contain quotes too. And newlines. Etc.
So we got the cumbersome invention of escape codes, and so character strings in source (itself a character string) are not literally the strings they represent.
Ugly, yes. Problematic? No.
You can't always just copy and paste some text into code, without adding escape encodings.
Now write code that generates C code with strings, that generates C code with strings, and ... (ahem!)
It's not a big deal, but it isn't zero friction either. Relevant here because it might be the most prevalent example of what happens when even two simple grammars collide.
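The backslash escalation in that nested-generator scenario is easy to see with a small sketch (Python for brevity; `c_string()` is a hypothetical helper that only handles quotes and backslashes, not a library function):

```python
def c_string(s):
    """Render s as a C string literal (escaping quotes and backslashes only)."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

msg = 'say "hi"'
gen1 = f'puts({c_string(msg)});'    # C code that prints the message
gen2 = f'puts({c_string(gen1)});'   # C code that prints gen1

print(gen1)   # puts("say \"hi\"");
print(gen2)   # puts("puts(\"say \\\"hi\\\"\");");
```

Each layer of generation roughly doubles the backslashes: the two grammars (the host string syntax and the embedded one) never truly compose, they just re-encode each other.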
Unless it's a regex....
But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.
And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)
Even English can be parsed first into the sounds. This is why puns work. Consider the joke, "why should you wear glasses to math class? It helps with division." That only works if you go to the sounds first. And you will have optionality in where to go from there.
So, for parsing programs, we often first decide on primitives for execution. For teaching, this is often basic math operations. But in reality, you have far more than the basic math operations. And, as I was saying, you can do more with the intermediate representation than you probably realize at the outset.