Yacc is dead (opens in new tab)

(arxiv.org)

154 pointsbenblack15y ago44 comments

44 comments

36 comments · 11 top-level

johkra15y ago· 9 in thread

What is actually state-of-the-art in parsing?

When I, as an Amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.

[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)

sb15y ago

I am not so sure about the actual state-of-the-art (I have read some PEG papers and some OMeta stuff from vpri, plus used ANTLR, Bison, and Coco/R), but a very interesting comment on HN (http://news.ycombinator.com/item?id=1643715) points out that most production compilers use hand-written recursive-descent parsers, primarily due to practical reasons.

I have taken two compiler construction courses during my college years, one with LL(1) grammar using a recursive descent parser, the other one with LR(1) grammar using flex+bison tool-chain. I found the experience from the LL case helped me tremendously in the course of writing the LR-language compiler. I think that recursive-descent experience also helps understanding attribute grammars, too. Probably the best introduction to recursive-descent parsers is from Niklaus Wirth's compiler construction book (http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf).

krakensden15y ago

What are the practical reasons? Dealing with non-uniform edge cases and other strange constraints? Integrating the stuff the compiler needs to do with the parser-generator?

3 more replies

dill_day15y ago

There was a really cool parsing paper at POPL this year [1] that goes 'beyond CFGs'... but as far as I know their system YAKKER is still in development / unreleased.

I think from a user's perspective, e.g. someone wanting to design a DSL, it would be nice to be able to write down arbitrary CFGs and not necessarily have to know lots of technical parsing details ("what are all these shift-reduce conflicts!?"). Maybe PEGs aren't the best choice for this (no left recursion)-- MetaBorg's SGLR is probably state-of-the-art then like Zef said.

Though one of the nice things about PEGs is no ambiguity; MetaBorg will do type-based disambiguation but only after parsing finishes. I'm working on an undergraduate thesis on 'extensible syntax' right now, trying to jump off from some of their stuff (but using a variation on Earley parsing) in particular to see if we could use types to help disambiguate incrementally during parsing.

[1] http://portal.acm.org/citation.cfm?id=1706347

jmatt15y ago

The PDF from AT&T Labs for those of us without ACM digital library access:

http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...

jules15y ago

PDF of paper that isn't behind a paywall: http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...

Zef15y ago

State of the art in parsing is SGLR (http://strategoxt.org/Sdf/SGLR) and GLL parsing

ScottBurson15y ago

See also Adam Megacz' SBP ("Scannerless Boolean Parser") (http://research.cs.berkeley.edu/project/sbp/). Boolean grammars are a superset of context-free grammars that can do some interesting things. For example, one can write a scannerless grammar for Python that handles the significant indentation in the grammar.

johkra15y ago

Thanks a lot for the link, I'll look through it.

I also found the GLL paper at p. 113 [123] of http://ldta.info/2009/ldta2009proceedings.pdf

fhars15y ago

The SGLR parser in Strtego/XT is not state of the art, at least not for production use. It only supports pure seven bit ASCII, it is a throwback to the 1970ies.

mattmight15y ago· 8 in thread

(Article author here.)

I'm delighted to see this get some attention here.

I absolutely love these techniques for parsing, but as my primary research areas is static analysis, I haven't had time to revise this paper and resubmit.

As it stands, I may never get the time to do so. :(

I posted it on arxiv so that David could reference it for his Ph.D. school apps.

Since some of you have asked, here are the reviews:

http://matt.might.net/papers/reviews/esop2010-derivatives.tx...

I do have an updated implementation that's much cleaner and faster, and I've been planning to do that in a blog post. (Alas, no opportunity yet.)

David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.

If you're interested in getting your name on a scientific publication and helping this get the attention of the scientific community, you can help us by creating an implementation of either technique in your favorite language and beating on it to help find the inefficiencies.

(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)

Please email me if that's something you're interested in doing.

Can HN do science? I'd love to find out.

copper15y ago

"this is not quantum theory, after all..." is one of the funnier comments I've seen on a review. Judging by the rest of it, I'm tempted to guess that it was written by someone from team-PLT :)

Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) something fun to do.

davdar15y ago

Here is the git repo (over http) for my Haskell implementation which exploits the technique to be linear for LL(k) (the Zip module is where this technique is implemented). The constant overhead for the implementation is still extremely high because we need to compute a fixed point computation on the whole parse graph for every input token. I'm working on getting all that down...

http://david.darais.com/git/research/der-parser-3/

1 more reply

davdar15y ago

Here is my latest Haskell implementation that Matt is referring to (git repo):

http://david.darais.com/git/research/der-parser-3/

k4st15y ago

I haven't finished the paper yet, but this type of technique excites me as well. Recently, I did an assignment on the paper "memoization in top-down parsing" by Mark Johnson. Have you seen this paper, and do you think that there are any similarities between the derivatives of a CFG and continuations of parsing procedures (for variables and their related productions)?

mattmight15y ago

I'll have to check out Mark's paper, but just by your description, I bet they're related.

For finite-state and pushdown automata, derivatives and continuations are two sides of the same coin.

jules15y ago

How does your approach compare with Koen Claessen's Parallel Parsing Processes?

http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...

I haven't studies either approach well but they seem similar in that both are like a breadth first traversal of the parser combinators, instead of the traditional depth first (backtracking) traversal. That is, when you have a union A u B you process this by advancing both sides in tandem D_c A u D_c B, instead of the traditional approach of first computing all results of A on the entire input and then computing all results of B on the entire input?

Parallel Parsing Processes don't seem to handle left recursion. On the other hand they parse S expressions in linear time without a special coded repetition operation ;)

When I find time I'm going to try to do an F# implementation.

davdar15y ago

Yes, breadth first traversal of the solution space is very similar in spirit to our approach.

I have since completely rewritten the Haskell implementation. You should really check it out, especially if you are thinking about writing one of your own.

(git repo) http://david.darais.com/git/research/der-parser-3

This implementation no longer requires a special coded repetition operator to get good complexity for regular and LL(k) grammars. This is a result of our progress w.r.t. performance of the theory since the paper's submission. My technique to achieve this uses structure derivatives (context based derivatives, or pseudo-equivalently, continuations) to avoid taking unnecessary parser derivatives. It in a sense "focuses" the parser computation on a small sub-parser.

Ping me if you have any questions; this paper has been getting passed around quite a bit lately and Matt and I have made progress on certain areas since the writing of the paper -- like that of the Rep hack.

timtadh15y ago

emailed!

RiderOfGiraffes15y ago· 4 in thread

Fascinating. I don't understand it all yet - I'm reading it carefully but it'll be weeks before I really get it.

However ...

I'm getting the feeling that this encompasses properly a feeling I've had about parsing for some time, that there should be a way of parsing the entire text, with the parse settling onto the text all at the same time. I don't know if this is what it's saying, but that's the sense I get from it.

But even if it isn't, it looks intriguing.

jerf15y ago

You mean something like this?: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...

Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.

Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.

This sort of thing is one of the admittedly-rare exceptions where computer science is actually making surprising amounts of progress in relatively practical fields. Parsing has gotten noticeably easier in the past ten years, if you know where to look for the right libraries, and it has been affecting my programming quite a bit. Often, a parser is the "correct" solution, but we used to reach up for hacked up crap with regular expressions or worse because it was ten times easier and did 80% of the job (and ignore the 10% that is a serious security vulnerability since everybody always does). Now it's maybe twice as hard, or, given how easy it is to underestimate the difficulty of getting the hacked up crap to actually work everywhere in the real world you need it to, sometimes it's just flat-out easier if you make a full accounting of costs to actually do it correctly.

pjscott15y ago

Here's a quick summary of the broader implications of that link, because I had trouble wrapping my head around it at first.

Suppose you have the ability to take a chunk of text and construct a partial parse state from it. In the case of regexp matching, these partial parse states are functions mapping one state of the regexp matching automaton to another. You need one more thing: the ability to append two of these partial parse states, combining them into one. In the regexp example, this is just function composition. The key here is that this operation must be associative, and there must be an identity element: some partial parse state such that combining it with another state doesn't change anything. And its result must also be a partial parse state. This combination of parse states and an associative binary operation is called a monoid.

Once you have these conditions fulfilled, you can do all sorts of fun stuff. For instance, you can represent a string as a tree of chunks, and cache partial parse states at the nodes in the tree. That way, when you change the string, you can recompute the changed parse states in something like O(lg n) time, rather than going through and re-parsing the entire string. Or you can almost trivially parallelize your parser.

A week ago, I did exactly this: I had a language that needed parsing, and I wanted to incrementally reparse when I changed the (potentially very long) string, so I used a finger tree and wrote an incremental parser. It works beautifully.

1 more reply

RiderOfGiraffes15y ago

Cool reference, and I'll be working on that too over the next week or so to see if that's close to what I mean.

Thanks.

StrawberryFrog15y ago

regular expressions are always the wrong tool for the job of parsing languages with recursive syntax. See, for instance, http://stackoverflow.com/questions/1732348/regex-match-open-...

1 more reply

amichail15y ago· 2 in thread

Why was it rejected by ESOP?

stupidsignup15y ago

Well, if you want to know what ESOP rejection reviews look like: http://phlegmaticprogrammer.wordpress.com/2010/11/21/reviews...

and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...

fizx15y ago

Possibly because they hand-waved away a lot of formal specification of their solution. For example, there was no attempt to analyze the complexity of the generated rules trees.

DanielRibeiro15y ago· 1 in thread

Well, Antlr (http://www.antlr.org/) has replaced yacc for most of the practical work already. Daniel Spiewak also mentions parser combinators and how GLL parsers can be much easier to use and yet fast enough most of the time (http://www.codecommit.com/blog/scala/unveiling-the-mysteries...). He mentions parser combinators as domain specif languages for creating parsers (http://www.codecommit.com/blog/scala/the-magic-behind-parser...).

Despite all of these, adoption of better techniques is really slow. As usual.

sliverstorm15y ago

Heh... Yacc, Bison, Antlr...

antimatter1515y ago· 1 in thread

The implementations are at http://www.ucombinator.org/projects/parsing/

davdar15y ago

I have since rewritten the Haskell implementation to compute fixed points on cyclic graphs without using pointers or Monads. Check it out if it interests you (git repo):

http://david.darais.com/git/research/der-parser-3/

fizx15y ago

Here's what a grammar actually looks like in their scala version. This grammar is for arithmetic over the language where x represents 1, and s represents +. This only generates the parse tree, not the final answer. Comments are added by me:

  // The Nodes in the parse tree
  abstract class Exp
  case object One extends Exp
  case class Sum(e1 : Exp, e2 : Exp) extends Exp
  
  // Terminals
  lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
  lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
  
  // Definition of an expression
  // I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
  lazy val EXP : Parser[Char,Exp] = 
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ X) ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
    rule(X ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP) ==> { case e => e } ||
    rule(Epsilon[Char]) ==> { case () => One } 
  
  // Actually run the rule
  val xin = Stream.fromIterator("xsxsxsxsx".elements)
  EXP.parseFull(xin)
  // return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)

benblackOP15y ago

So many useful links in this discussion, I consolidated them (and a few extras) here http://post.b3k.us/modern-parsing-bibliography

mahmud15y ago

Brzozowski's derivatives of regular expressions has been in use in the PLT compiler tools for ages now.

herdrick15y ago

Better comments at programming.reddit, sadly: http://www.reddit.com/r/programming/comments/ed2pb/yacc_is_d...

kleiba15y ago

Do I hear John McCarthy chuckle?

j / k navigate · click thread line to collapse

44 comments

36 comments · 11 top-level

johkra15y ago· 9 in thread

What is actually state-of-the-art in parsing?

When I, as an Amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.

[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)

sb15y ago

krakensden15y ago

What are the practical reasons? Dealing with non-uniform edge cases and other strange constraints? Integrating the stuff the compiler needs to do with the parser-generator?

3 more replies

dill_day15y ago

There was a really cool parsing paper at POPL this year [1] that goes 'beyond CFGs'... but as far as I know their system YAKKER is still in development / unreleased.

[1] http://portal.acm.org/citation.cfm?id=1706347

jmatt15y ago

The PDF from AT&T Labs for those of us without ACM digital library access:

http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...

jules15y ago

PDF of paper that isn't behind a paywall: http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...

Zef15y ago

State of the art in parsing is SGLR (http://strategoxt.org/Sdf/SGLR) and GLL parsing

ScottBurson15y ago

johkra15y ago

Thanks a lot for the link, I'll look through it.

I also found the GLL paper at p. 113 [123] of http://ldta.info/2009/ldta2009proceedings.pdf

fhars15y ago

The SGLR parser in Strtego/XT is not state of the art, at least not for production use. It only supports pure seven bit ASCII, it is a throwback to the 1970ies.

mattmight15y ago· 8 in thread

(Article author here.)

I'm delighted to see this get some attention here.

I absolutely love these techniques for parsing, but as my primary research areas is static analysis, I haven't had time to revise this paper and resubmit.

As it stands, I may never get the time to do so. :(

I posted it on arxiv so that David could reference it for his Ph.D. school apps.

Since some of you have asked, here are the reviews:

http://matt.might.net/papers/reviews/esop2010-derivatives.tx...

I do have an updated implementation that's much cleaner and faster, and I've been planning to do that in a blog post. (Alas, no opportunity yet.)

David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.

(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)

Please email me if that's something you're interested in doing.

Can HN do science? I'd love to find out.

copper15y ago

"this is not quantum theory, after all..." is one of the funnier comments I've seen on a review. Judging by the rest of it, I'm tempted to guess that it was written by someone from team-PLT :)

Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) something fun to do.

davdar15y ago

http://david.darais.com/git/research/der-parser-3/

1 more reply

davdar15y ago

Here is my latest Haskell implementation that Matt is referring to (git repo):

http://david.darais.com/git/research/der-parser-3/

k4st15y ago

mattmight15y ago

I'll have to check out Mark's paper, but just by your description, I bet they're related.

For finite-state and pushdown automata, derivatives and continuations are two sides of the same coin.

jules15y ago

How does your approach compare with Koen Claessen's Parallel Parsing Processes?

http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...

Parallel Parsing Processes don't seem to handle left recursion. On the other hand they parse S expressions in linear time without a special coded repetition operation ;)

When I find time I'm going to try to do an F# implementation.

davdar15y ago

Yes, breadth first traversal of the solution space is very similar in spirit to our approach.

I have since completely rewritten the Haskell implementation. You should really check it out, especially if you are thinking about writing one of your own.

(git repo) http://david.darais.com/git/research/der-parser-3

timtadh15y ago

emailed!

RiderOfGiraffes15y ago· 4 in thread

Fascinating. I don't understand it all yet - I'm reading it carefully but it'll be weeks before I really get it.

However ...

But even if it isn't, it looks intriguing.

jerf15y ago

You mean something like this?: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...

Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.

Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.

pjscott15y ago

Here's a quick summary of the broader implications of that link, because I had trouble wrapping my head around it at first.

1 more reply

RiderOfGiraffes15y ago

Cool reference, and I'll be working on that too over the next week or so to see if that's close to what I mean.

Thanks.

StrawberryFrog15y ago

regular expressions are always the wrong tool for the job of parsing languages with recursive syntax. See, for instance, http://stackoverflow.com/questions/1732348/regex-match-open-...

1 more reply

amichail15y ago· 2 in thread

Why was it rejected by ESOP?

stupidsignup15y ago

Well, if you want to know what ESOP rejection reviews look like: http://phlegmaticprogrammer.wordpress.com/2010/11/21/reviews...

and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...

fizx15y ago

Possibly because they hand-waved away a lot of formal specification of their solution. For example, there was no attempt to analyze the complexity of the generated rules trees.

DanielRibeiro15y ago· 1 in thread

Despite all of these, adoption of better techniques is really slow. As usual.

sliverstorm15y ago

Heh... Yacc, Bison, Antlr...

antimatter1515y ago· 1 in thread

The implementations are at http://www.ucombinator.org/projects/parsing/

davdar15y ago

I have since rewritten the Haskell implementation to compute fixed points on cyclic graphs without using pointers or Monads. Check it out if it interests you (git repo):

http://david.darais.com/git/research/der-parser-3/

fizx15y ago

  // The Nodes in the parse tree
  abstract class Exp
  case object One extends Exp
  case class Sum(e1 : Exp, e2 : Exp) extends Exp
  
  // Terminals
  lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
  lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
  
  // Definition of an expression
  // I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
  lazy val EXP : Parser[Char,Exp] = 
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
    rule(EXP ~ S ~ X) ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
    rule(X ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
    rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
    rule(EXP) ==> { case e => e } ||
    rule(Epsilon[Char]) ==> { case () => One } 
  
  // Actually run the rule
  val xin = Stream.fromIterator("xsxsxsxsx".elements)
  EXP.parseFull(xin)
  // return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)

benblackOP15y ago

So many useful links in this discussion, I consolidated them (and a few extras) here http://post.b3k.us/modern-parsing-bibliography

mahmud15y ago

Brzozowski's derivatives of regular expressions has been in use in the PLT compiler tools for ages now.

herdrick15y ago

Better comments at programming.reddit, sadly: http://www.reddit.com/r/programming/comments/ed2pb/yacc_is_d...

kleiba15y ago

Do I hear John McCarthy chuckle?

j / k navigate · click thread line to collapse