Tries and Lexers (opens in new tab)

(blog.ircmaxell.com)

67 pointsaberatiu11y ago28 comments

28 comments

25 comments · 8 top-level

lolptdr11y ago· 7 in thread

"Parsers work at the grammatical level, lexers work at the word level."

Is this correct to say?

gliese133711y ago

Yes. For a given definition of "word", anyway, but "word" is fiddly enough we can assume they mean something sensible.

A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.

(Creds: lexing / wordbreaking is the topic of my masters research in linguistics. Tries are a good starting point for that, but it turns out that they are a special case of simple morphological analyzer, and you need more complicated ones to do a good job on natural languages.)

munificent11y ago

Yes, as long as you include punctuation in your definition of "word". I like to think of it like this:

A lexer takes a sequence of characters and produces a sequence of tokens. List in, list out. It just chunks runs of characters together.

A parser takes in a sequence of tokens and produces a tree. It produces a nested data structure from a flat input.

ch11y ago

Right, a lexer is a mapping function. A parser is a fold.

1 more reply

jahewson11y ago

Lexers work at the lexeme level, which are generally understood to be words, assuming that a word is a non-decomposable unit of language.

agumonkey11y ago

Similar to the atom / compound forms described by McCarthy.

TheLoneWolfling11y ago

I've always understood it to be this:

A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.

Or, for a less formal description. A lexer's transformation is one that can be expressed as a function of <<input symbol or end of file>, position in file, state> -> <<some finite number of output symbols (including zero)>, <state or done>>, where state is finite. A lexer can work without backtracking or arbitrary memory usage. Whereas a parser potentially requires either arbitrary (but finite) backtracking or arbitrary (but finite) memory usage.

The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.

eliben11y ago

I guess only if you consider punctuation to be "words". A more accurate description would be "lexers work at the atom level", IOW breaking the input into atoms - the smallest non-breakable lexical units that are used in a grammar.

bane11y ago· 5 in thread

Tries are amazing data structures, simple and extraordinarily fast -- O(m) on look ups. But they also eat memory at extraordinary rates as well. They're a classic speed vs. memory data structure.

However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting.

One easy optimization to make with Tries is to set a maximum branch length (based on some statistical analysis of lexeme usage. For example, make 90% of your lookups reachable under that length), any lexeme longer than that length simply gets hung off of the end of the branch in a more space-efficient data structure (like a hash table).

Your lookup then is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever.

But your memory usage will shrink dramatically. And you can improve it by fiddling with your branch length and choose say 80% reachable without hashing.

FreakLegion11y ago

Indeed there are a lot of good optimization options for tries.

At a little bit of a speed penalty you can use a bitmap for navigation instead of pointers, which obliterates the memory footprint. In some use cases the result is actually smaller than the equivalent Bloom filter with a sane FP rate (though generally not smaller than a Golomb-compressed set or cuckoo filter).

A compromise between the speed and memory bloat of pointers is to use Radix/Patricia tries, which are a solid go-to option. In some cases they'll be slower than naive tries, but in others they'll be faster.

The fastest and most memory-efficient implementation I've seen, though, boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit. Locality of reference can work wonders here -- we got a 20x speedup over an already very fast naive trie this way.

bane11y ago

> boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit

Oh that's interesting - like a very fast cmp before bothering to check the rest of the branches.

When you say pattern are you talking about specific letters/morphemes or subsequence? And were you pulling the [dq]words from the initial pattern or generating them some otherway (through some kind of hash or encoding function?).

1 more reply

bunderbunder11y ago

They can eat memory at extraordinary rates, but I've successfully used them to reduce memory usage as well. If you need to store a mess of strings that tend to have a lot of duplication toward the front - URIs, for example - then a trie might fare pretty well in that department.

TheLoneWolfling11y ago

A DAWG can be even better, at the expense of considerable setup cost.

Though a naive Trie can be converted to a DAWG incrementally and on-the-fly relatively easily. If you're careful, you can even do this on another thread in the background.

TheLoneWolfling11y ago

One thing about tries that's really nice is that you can transparently combine nodes to turn it into a DAWG.

Especially as you can do it on-the-fly. Memory usage is getting excessive? Stop and do a suffix combination pass until it's decent again. Otherwise? Don't bother.

djoldman11y ago· 2 in thread

Want to build an ultra-fast lexer? Ragel is the way.

http://en.wikipedia.org/wiki/Ragel

http://www.colm.net/open-source/ragel/

JadeNB11y ago

I may be mixing this language up with another text-manipulation language starting with 'R' (not being coy—I genuinely don't remember its name), but I seem to remember being uncertain of the reach of this project.

The Wikipedia page links two publications by (I suppose) the creator of Ragel, and two publications that do not seem to be specifically about Ragel; and the product page mentions nothing in the way of publications or references to uses in industry. Is Ragel actually widely used?

djoldman11y ago

I'm not sure how widely it is used. However the documentation is pretty comprehensive[1], the interface is intuitive, and the compiled code is fast.

[1] http://www.colm.net/files/ragel/ragel-guide-6.9.pdf

haberman11y ago· 1 in thread

I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.

At a high level, if compiling a lexer for a run-of-the-mill language like JavaScript takes 5 minutes and 2.5GB of RAM, you are most likely doing it wrong. By "doing it wrong," I mean that there is almost certainly a better approach that is far superior in every measurable aspect (CPU, memory, code size, etc).

I don't fully understand what kind of algorithm the author was using, so I can't comment in detail on it, but in general lexers are better thought of as finite automata (NFAs and DFAs) than tries. The two are related, but unlike tries NFAs and DFAs can have cycles, which are essential for representing a lexing state machine efficiently.

Another observation: it's not too terribly surprising that you could beat preg_match_all() with the latter being given a huge regular expression. Most regex engines are backtracking (RE2 being a notable counterexample), which means that a regular expression with high-cardinality alternation (ie. A|B|C|D|E|F|G ...etc) is one of their worst cases. This isn't what they are designed for. A backtracking regex engine will basically try each alternative in sequence until that alternative no longer matches, then back up and try the next one. NFA/DFA based parsers will be much faster for this case.

The right tool for this job, as another commenter mentioned, is Ragel. It's basically designed for exactly this. It doesn't appear to support PHP though...

ircmaxell11y ago

> I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.

Author here. The actual regex implementation uses a NFA. The start of it used a Trie, but it moved away.

The majority of what I wanted to get across here was the use of a minimal structure (single-character).

The next step was using a maximal radix implementation (as long of a prefix as possible). Then finally, throwing all of it away and going straight to parsing using a state machine.

bluetech11y ago· 1 in thread

[From all the bad things I hear about PHP, the code is very readble without any previous experience - nice].

Here are some things a lexer for a programming language might have to deal with:

1. Comments (some even do nested - which means regular expressions are out for that).

2. Continuation lines.

3. Includes (if done at the lexical level).

4. Filename/line/column number for nice error messages (can really hurt with branch mispredictions).

5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.

6. Identifiers.

So matching keywords is mostly the straightforward part. However I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf - it would normally be much faster than any pointer-chasing trie. gperf mostly elminated keyword matching from the profile of any lexer I've done.

TheLoneWolfling11y ago

Another thing:

Some languages allow escapes before everything else. Looking at you, Java. So either you need to do a pass beforehand to unescape them, or unescape characters (and in the process do error handling / etc!) on-the-fly.

lindig11y ago· 1 in thread

A lexer for a language with a lot of keywords leads to a large representation as an automaton as the author has experienced. One way to deals with this is to only recognise in one rule all identifiers including keywords (something like "[a-z_][a-zA-Z0-9_]*" and to use a hash table of keywords to check whether a match is a keyword (and which one) or an identifier .

Edit: fixed the regexp to allow for single-char identifiers.

munificent11y ago

> One way to deals with this is to only recognise in one rule all identifiers including keywords (something like "[a-z_][a-zA-Z0-9_]+" and to use a hash table of keywords to check whether a match is a keyword or an identifier (and which one).

A lot of production parsers I've seen do this with two state machines. They'll have a top-level state machine (usually hand-implemented using a giant switch statement) that handles all of the main different token types—numbers, strings, punctuators, etc. That has a single type for all identifiers, including keywords.

Then, after it determines a token is a identifier, it runs another little tiny optimized state machine to see if it's a reserved word or not.

jakobegger11y ago

So I get that this optimized giant trie might be faster than a regex. But what about a normal lexer, either handwritten or generated? Shouldn't that be faster still than the trie? I mean, that giant amount of memory alone must cause lots of performance issues...

TheLoneWolfling11y ago

At least optimize the DFA before you run it...

The equivalent of a DAWG versus a Trie.

j / k navigate · click thread line to collapse

28 comments

25 comments · 8 top-level

lolptdr11y ago· 7 in thread

"Parsers work at the grammatical level, lexers work at the word level."

Is this correct to say?

gliese133711y ago

Yes. For a given definition of "word", anyway, but "word" is fiddly enough we can assume they mean something sensible.

A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.

munificent11y ago

Yes, as long as you include punctuation in your definition of "word". I like to think of it like this:

A lexer takes a sequence of characters and produces a sequence of tokens. List in, list out. It just chunks runs of characters together.

A parser takes in a sequence of tokens and produces a tree. It produces a nested data structure from a flat input.

ch11y ago

Right, a lexer is a mapping function. A parser is a fold.

1 more reply

jahewson11y ago

Lexers work at the lexeme level, which are generally understood to be words, assuming that a word is a non-decomposable unit of language.

agumonkey11y ago

Similar to the atom / compound forms described by McCarthy.

TheLoneWolfling11y ago

I've always understood it to be this:

A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.

The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.

eliben11y ago

bane11y ago· 5 in thread

Tries are amazing data structures, simple and extraordinarily fast -- O(m) on look ups. But they also eat memory at extraordinary rates as well. They're a classic speed vs. memory data structure.

However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting.

Your lookup then is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever.

But your memory usage will shrink dramatically. And you can improve it by fiddling with your branch length and choose say 80% reachable without hashing.

FreakLegion11y ago

Indeed there are a lot of good optimization options for tries.

bane11y ago

> boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit

Oh that's interesting - like a very fast cmp before bothering to check the rest of the branches.

1 more reply

bunderbunder11y ago

TheLoneWolfling11y ago

A DAWG can be even better, at the expense of considerable setup cost.

Though a naive Trie can be converted to a DAWG incrementally and on-the-fly relatively easily. If you're careful, you can even do this on another thread in the background.

TheLoneWolfling11y ago

One thing about tries that's really nice is that you can transparently combine nodes to turn it into a DAWG.

Especially as you can do it on-the-fly. Memory usage is getting excessive? Stop and do a suffix combination pass until it's decent again. Otherwise? Don't bother.

djoldman11y ago· 2 in thread

Want to build an ultra-fast lexer? Ragel is the way.

http://en.wikipedia.org/wiki/Ragel

http://www.colm.net/open-source/ragel/

JadeNB11y ago

djoldman11y ago

I'm not sure how widely it is used. However the documentation is pretty comprehensive[1], the interface is intuitive, and the compiled code is fast.

[1] http://www.colm.net/files/ragel/ragel-guide-6.9.pdf

haberman11y ago· 1 in thread

I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.

The right tool for this job, as another commenter mentioned, is Ragel. It's basically designed for exactly this. It doesn't appear to support PHP though...

ircmaxell11y ago

> I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.

Author here. The actual regex implementation uses a NFA. The start of it used a Trie, but it moved away.

The majority of what I wanted to get across here was the use of a minimal structure (single-character).

The next step was using a maximal radix implementation (as long of a prefix as possible). Then finally, throwing all of it away and going straight to parsing using a state machine.

bluetech11y ago· 1 in thread

[From all the bad things I hear about PHP, the code is very readble without any previous experience - nice].

Here are some things a lexer for a programming language might have to deal with:

1. Comments (some even do nested - which means regular expressions are out for that).

2. Continuation lines.

3. Includes (if done at the lexical level).

4. Filename/line/column number for nice error messages (can really hurt with branch mispredictions).

5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.

6. Identifiers.

TheLoneWolfling11y ago

Another thing:

lindig11y ago· 1 in thread

Edit: fixed the regexp to allow for single-char identifiers.

munificent11y ago

Then, after it determines a token is a identifier, it runs another little tiny optimized state machine to see if it's a reserved word or not.

jakobegger11y ago

TheLoneWolfling11y ago

At least optimize the DFA before you run it...

The equivalent of a DAWG versus a Trie.

j / k navigate · click thread line to collapse