https://news.ycombinator.com/item?id=30414683
https://news.ycombinator.com/item?id=30414879
I spent a year or two working with PEGs, and ran into similar issues multiple times. Adding a new production could totally screw up seemingly unrelated parses that worked fine before.
As the author points out, Earley parsing with some disambiguation rules (production precedence, etc.) has been much less finicky/annoying to work with. It's also reasonably fast for small parses even with a dumb implementation. Would suggest for prototyping/settings when runtime ambiguity is not a showstopper, despite the remaining issues described in the article re: having a separate lexer.
> Earley parsing with some disambiguation rules
Any idea why GLR always gets ignored?
GLR would probably have (much) better performance but I'm usually not parsing huge files (or would hand-roll one if I were). I've not yet found an explanation of GLR (or even LR for that matter) that's quite as simple as PEGs or Earley (suggestions welcome tho!).
Both GCC and LLVM implement recursive descent parsers for their C compilers.
Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.
A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article, composing or embedding one parser in another, is a complete non-issue with recursive descent: it's just a function call. Other basics of parsing (informative/useful error messages, error recovery, i.e. not just blowing up at the first error, and seamless integration with the rest of the host language) remain problems with every parser generator but simply aren't issues with recursive descent.
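A minimal sketch of the approach described above, assuming nothing beyond the standard library; the token set, the `PRECEDENCE` table, and all names are made up for illustration, not taken from any real compiler:

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    tokens.append("<eof>")
    return tokens

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def next(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def parse_primary(self):
        tok = self.next()
        if tok == "(":                       # "composing parsers is just a function call"
            expr = self.parse_expr(0)
            assert self.next() == ")", "expected ')'"
            return expr
        return int(tok)

    def parse_expr(self, min_prec):
        # The embedded Pratt loop: precedence climbing handles
        # operators without any grammar rewriting.
        lhs = self.parse_primary()
        while PRECEDENCE.get(self.peek(), 0) > min_prec:
            op = self.next()
            rhs = self.parse_expr(PRECEDENCE[op])
            lhs = (op, lhs, rhs)
        return lhs

print(Parser(tokenize("1 + 2 * 3")).parse_expr(0))  # ('+', 1, ('*', 2, 3))
```

Note how left associativity falls out of the loop rather than the recursion: `1 - 2 - 3` parses as `(1 - 2) - 3`.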
And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.
This is the controversial part, Lisp aficionados to the contrary.
You can parse quite a lot more than Lisp with techniques from 1965.
My brain can do it, why can't my computer?
Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.
Parsing: The Solved Problem That Isn't (2011) - https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70 comments)
Parsing: the solved problem that isn't - https://news.ycombinator.com/item?id=2327313 - March 2011 (47 comments)
Certain parser generators make life easier by supporting actions on parser/lexer rules. This is great and all, but it has the downside that the grammar you provide is no longer reusable. There's no way for others to import that grammar and provide custom actions for them.
I don't know. In my opinion parsing theory is already solved. Whether it's PEG, LL, LR, LALR, whatever. One of those is certainly good enough for the kind of data you're trying to parse. I think the biggest annoyance is the tooling.
Pros: * They're just a technique/library that you can use in your own language without a separate generation step.
* They're simple enough that I often roll my own rather than using an existing library.
* They let you stick code into your parsing steps: logging, extra information, constructing your own results directly, etc.
* The same technique works for lexing and parsing - just write a parser from bytes to tokens, and a second parser from tokens to objects.
* Depending on your language's syntax, you can get your parser code looking a lot like the BNF grammar you're trying to implement.
Cons: * You will eventually run into left-recursion problems. It can be nightmarish trying to change the code so it 'just works'. You really need to step back and grok left-recursion itself - no handholding from parser combinators.
* Same thing with precedence - you just gotta learn how to do it. Fixing left-recursion didn't click for me until I learned how to do precedence.
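To make the left-recursion point concrete, here is a toy combinator sketch. The names (`chainl1`, etc.) follow the Haskell parsec tradition but this is hand-rolled, not any real library; a parser here is just a function from a string to `(value, rest)` or `None`:

```python
def digit():
    def p(s):
        return (int(s[0]), s[1:]) if s and s[0].isdigit() else None
    return p

def minus():
    def p(s):
        if s.startswith("-"):
            return (lambda a, b: a - b), s[1:]
        return None
    return p

def chainl1(term, op):
    # The standard combinator answer to left recursion: instead of
    # writing  expr -> expr '-' term  (which loops forever in a naive
    # combinator), parse  term (op term)*  and fold left.
    def p(s):
        r = term(s)
        if r is None:
            return None
        acc, s = r
        while True:
            ro = op(s)
            if ro is None:
                return acc, s
            f, s1 = ro
            rt = term(s1)
            if rt is None:
                return acc, s      # operator without operand: stop here
            rhs, s = rt
            acc = f(acc, rhs)
    return p

expr = chainl1(digit(), minus())
print(expr("9-3-2"))   # (4, '') -- left-associative: (9-3)-2
```

The folding loop is exactly the "step back and grok left recursion" move: the grammar shape changes, and associativity comes from the fold direction.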
A somewhat breathless description of all of this is in the Marpa parser documentation:
https://jeffreykegler.github.io/Marpa-web-site/
In practice, I've found that computers are so fast that, with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enough™: https://loup-vaillant.fr/tutorials/earley-parsing/ (I may be talking out of my ass here.)
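For a sense of how little code a "naive" Earley recognizer needs, here is a deliberately dumb sketch (my own grammar encoding and names, no Leo optimization, so right recursion can go quadratic; fine for small inputs):

```python
GRAMMAR = {                          # Sum -> Sum '+' Prod | Prod
    "Sum":  [["Sum", "+", "Prod"], ["Prod"]],
    "Prod": [["Prod", "*", "n"], ["n"]],
}

def recognize(grammar, start, tokens):
    # An Earley item is (head, body, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                        # iterate to a fixed point
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in grammar:        # predict
                        for b in grammar[sym]:
                            item = (sym, tuple(b), 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
                    elif i < len(tokens) and tokens[i] == sym:   # scan
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                         # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(recognize(GRAMMAR, "Sum", ["n", "+", "n", "*", "n"]))  # True
print(recognize(GRAMMAR, "Sum", ["n", "+", "*"]))            # False
```

Note the grammar is left-recursive and ambiguity-free here, and Earley doesn't care either way; that tolerance is the whole appeal for prototyping.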
In particular, you can do (a subset of) the following in sequence:
* write your own grammar in whatever bespoke language you want
* compose those grammars into a single grammar
* generate a Bison grammar from that grammar
* run `bison --xml` instead of actually generating code
* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues
In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.
https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
Also, Tree-sitter only does half the parsing job: you get a tree of nodes, but you have to do your own pass over that tree to get useful structures out.
I prefer Chumsky or Nom which go all the way.
No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).
For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.
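As a rough illustration of that claim, a hand-written reader for a BNF-ish rule notation fits in a dozen lines. This is a hypothetical mini-format of my own (rules shaped like `lhs : alt | alt ;`), not any real tool's grammar language:

```python
import re

def parse_grammar(text):
    """Parse 'lhs : alt | alt ;' rules into {lhs: [[sym, ...], ...]}."""
    rules = {}
    for rule in re.split(r"\s*;\s*", text.strip()):
        if not rule:
            continue
        lhs, rhs = rule.split(":", 1)
        rules[lhs.strip()] = [alt.split() for alt in rhs.split("|")]
    return rules

g = parse_grammar("expr : expr '+' term | term ; term : NUM ;")
print(g)
```

A real grammar language needs quoting, comments, and error reporting, but the core shape stays this simple.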
Double quotes in C code mean begin and end of a string. But strings contain quotes too. And newlines. Etc.
So we got the cumbersome invention of escape codes, and so character strings in source (itself a character string) are not literally the strings they represent.
Ugly, yes. Problematic? No.
You can't always just copy and paste some text into code, without adding escape encodings.
Now write code that generates C code with strings, that generates C code with strings, and ... (ahem!)
It's not a big deal, but it isn't zero friction either. Relevant here because it might be the most prevalent example of what happens when even two simple grammars collide.
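The backslash escalation in that nested-generator scenario is easy to see with a small sketch (Python for brevity; `c_string()` is a hypothetical helper that only handles quotes and backslashes, not a library function):

```python
def c_string(s):
    """Render s as a C string literal (escaping quotes and backslashes only)."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

msg = 'say "hi"'
gen1 = f'puts({c_string(msg)});'    # C code that prints the message
gen2 = f'puts({c_string(gen1)});'   # C code that prints gen1

print(gen1)   # puts("say \"hi\"");
print(gen2)   # puts("puts(\"say \\\"hi\\\"\");");
```

Each layer of generation roughly doubles the backslashes: the two grammars (the host string syntax and the embedded one) never truly compose, they just re-encode each other.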
Unless it's a regex....
But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.
And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)
Even English can be parsed first into the sounds. This is why puns work. Consider the joke, "why should you wear glasses to math class? It helps with division." That only works if you go to the sounds first. And you will have optionality in where to go from there.
So, for parsing programs, we often first decide on primitives for execution. For teaching, this is often basic math operations. But in reality, you have far more than the basic math operations. And, as I was saying, you can do more with the intermediate representation than you probably realize at the outset.