undefined | Better HN

0 pointspcwalton11y ago0 comments

> This is a perfect example of how text parsing is really inherently non-parallelizable. It's very rare that you can do anything useful with a buffer of text without knowing the precise state of the parse at the beginning of that buffer.

There are two mechanisms that are usually used to get around this:

(1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser. This is what some of the parallel XML parsing work [1] did.

(2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].

[1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...

[2]: http://www.cs.wm.edu/~xshen/Publications/taco14.pdf

0 comments

1 comments · 1 top-level

haberman11y ago

> (1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser.

I'm not able to access the full text of this paper. But from the description I wouldn't really consider this "parallel parsing." For the "skeleton parser" to be correct, it must transition through a state machine that is exactly as complex as the real parser. I suspect (again, not being able to read the paper right now) that what makes the "skeleton parse" faster than the "real parse" is not the speed of the parser itself, but the speed of the "load" on the parser.

For the skeleton parse, the "load" on the parser is just finding split points (cheap). At the application level, the "load" on the parser in many cases is building a tree of some sort. The tree-building is often significantly more expensive than the parse itself because it usually involves a lot of dynamic memory allocation.

So yes, if you do a preliminary parse that chunks up the document, and then a second parse that has a heavier load on it like tree-building, the second parse can indeed be parallelized. But I wouldn't consider this parallelizing a parser, I would consider it parallelizing the tree-building. In raw terms you have probably spent more CPU on the actual parsing logic than in the single-threaded case.

I don't mean for this to be a semantic quibble. I'm really interested in parsing architectures that decouple the parser itself from its "load." Event-based parsers like SAX parsers do this. I'm interested specifically in the parser part itself, and the limits of how it can be optimized.

> (2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].

Looks like an interesting paper, I'll have to dig more into that.

j / k navigate · click thread line to collapse