When I, as an Amateur, last looked into it, PEG[1] and extensions like OMeta[2] seemed to be the best options. I've heard good things about Parsec[3], too.
[1](http://en.wikipedia.org/wiki/Parsing_expression_grammar) [2](http://www.tinlizzie.org/ometa/) [3](http://legacy.cs.uu.nl/daan/parsec.html)
I have taken two compiler construction courses during my college years, one with LL(1) grammar using a recursive descent parser, the other one with LR(1) grammar using flex+bison tool-chain. I found the experience from the LL case helped me tremendously in the course of writing the LR-language compiler. I think that recursive-descent experience also helps understanding attribute grammars, too. Probably the best introduction to recursive-descent parsers is from Niklaus Wirth's compiler construction book (http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf).
I think from a user's perspective, e.g. someone wanting to design a DSL, it would be nice to be able to write down arbitrary CFGs and not necessarily have to know lots of technical parsing details ("what are all these shift-reduce conflicts!?"). Maybe PEGs aren't the best choice for this (no left recursion)-- MetaBorg's SGLR is probably state-of-the-art then like Zef said.
Though one of the nice things about PEGs is no ambiguity; MetaBorg will do type-based disambiguation but only after parsing finishes. I'm working on an undergraduate thesis on 'extensible syntax' right now, trying to jump off from some of their stuff (but using a variation on Earley parsing) in particular to see if we could use types to help disambiguate incrementally during parsing.
http://www2.research.att.com/~yitzhak/publications/ddg-tr.pd...
I also found the GLL paper at p. 113 [123] of http://ldta.info/2009/ldta2009proceedings.pdf
I'm delighted to see this get some attention here.
I absolutely love these techniques for parsing, but as my primary research areas is static analysis, I haven't had time to revise this paper and resubmit.
As it stands, I may never get the time to do so. :(
I posted it on arxiv so that David could reference it for his Ph.D. school apps.
Since some of you have asked, here are the reviews:
http://matt.might.net/papers/reviews/esop2010-derivatives.tx...
I do have an updated implementation that's much cleaner and faster, and I've been planning to do that in a blog post. (Alas, no opportunity yet.)
David's also done another implementation in Haskell that's screaming fast and efficient on many restricted classes of grammars (like LL(k)). I'll encourage him to post that as well.
If you're interested in getting your name on a scientific publication and helping this get the attention of the scientific community, you can help us by creating an implementation of either technique in your favorite language and beating on it to help find the inefficiencies.
(For instance, the original Scala version linked from this paper has memory leaks from the way it caches derivatives. We got around them by rolling top-level repetition by hand.)
Please email me if that's something you're interested in doing.
Can HN do science? I'd love to find out.
Could either you or David put the source for the LL(k) version up somewhere? Comparing it head-to-head with parsec (or its faster cousins) something fun to do.
For finite-state and pushdown automata, derivatives and continuations are two sides of the same coin.
http://www.cse.chalmers.se/edu/course/afp/Papers/parser-clae...
I haven't studies either approach well but they seem similar in that both are like a breadth first traversal of the parser combinators, instead of the traditional depth first (backtracking) traversal. That is, when you have a union A u B you process this by advancing both sides in tandem D_c A u D_c B, instead of the traditional approach of first computing all results of A on the entire input and then computing all results of B on the entire input?
Parallel Parsing Processes don't seem to handle left recursion. On the other hand they parse S expressions in linear time without a special coded repetition operation ;)
When I find time I'm going to try to do an F# implementation.
I have since completely rewritten the Haskell implementation. You should really check it out, especially if you are thinking about writing one of your own.
(git repo) http://david.darais.com/git/research/der-parser-3
This implementation no longer requires a special coded repetition operator to get good complexity for regular and LL(k) grammars. This is a result of our progress w.r.t. performance of the theory since the paper's submission. My technique to achieve this uses structure derivatives (context based derivatives, or pseudo-equivalently, continuations) to avoid taking unnecessary parser derivatives. It in a sense "focuses" the parser computation on a small sub-parser.
Ping me if you have any questions; this paper has been getting passed around quite a bit lately and Matt and I have made progress on certain areas since the writing of the paper -- like that of the Rep hack.
However ...
I'm getting the feeling that this encompasses properly a feeling I've had about parsing for some time, that there should be a way of parsing the entire text, with the parse settling onto the text all at the same time. I don't know if this is what it's saying, but that's the sense I get from it.
But even if it isn't, it looks intriguing.
Note that the post has 'prerequisites' at the beginning, which you will need to read, but they are actually pretty cool.
Also I am not saying this is exactly what you mean, I'm just suggesting it as a possible connection.
This sort of thing is one of the admittedly-rare exceptions where computer science is actually making surprising amounts of progress in relatively practical fields. Parsing has gotten noticeably easier in the past ten years, if you know where to look for the right libraries, and it has been affecting my programming quite a bit. Often, a parser is the "correct" solution, but we used to reach up for hacked up crap with regular expressions or worse because it was ten times easier and did 80% of the job (and ignore the 10% that is a serious security vulnerability since everybody always does). Now it's maybe twice as hard, or, given how easy it is to underestimate the difficulty of getting the hacked up crap to actually work everywhere in the real world you need it to, sometimes it's just flat-out easier if you make a full accounting of costs to actually do it correctly.
Suppose you have the ability to take a chunk of text and construct a partial parse state from it. In the case of regexp matching, these partial parse states are functions mapping one state of the regexp matching automaton to another. You need one more thing: the ability to append two of these partial parse states, combining them into one. In the regexp example, this is just function composition. The key here is that this operation must be associative, and there must be an identity element: some partial parse state such that combining it with another state doesn't change anything. And its result must also be a partial parse state. This combination of parse states and an associative binary operation is called a monoid.
Once you have these conditions fulfilled, you can do all sorts of fun stuff. For instance, you can represent a string as a tree of chunks, and cache partial parse states at the nodes in the tree. That way, when you change the string, you can recompute the changed parse states in something like O(lg n) time, rather than going through and re-parsing the entire string. Or you can almost trivially parallelize your parser.
A week ago, I did exactly this: I had a language that needed parsing, and I wanted to incrementally reparse when I changed the (potentially very long) string, so I used a finger tree and wrote an incremental parser. It works beautifully.
Thanks.
and the response to these reviews here: http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...
Despite all of these, adoption of better techniques is really slow. As usual.
// The Nodes in the parse tree
abstract class Exp
case object One extends Exp
case class Sum(e1 : Exp, e2 : Exp) extends Exp
// Terminals
lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
// Definition of an expression
// I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
lazy val EXP : Parser[Char,Exp] =
rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
rule(EXP ~ S ~ X) ==> { case e1 ~ s ~ e2 => Sum(e1,One).asInstanceOf[Exp] } ||
rule(X ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
rule(EXP) ==> { case e => e } ||
rule(Epsilon[Char]) ==> { case () => One }
// Actually run the rule
val xin = Stream.fromIterator("xsxsxsxsx".elements)
EXP.parseFull(xin)
// return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)