Is this correct to say?
A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.
(Creds: lexing / wordbreaking is the topic of my masters research in linguistics. Tries are a good starting point for that, but it turns out that they are a special case of simple morphological analyzer, and you need more complicated ones to do a good job on natural languages.)
A lexer takes a sequence of characters and produces a sequence of tokens. List in, list out. It just chunks runs of characters together.
A parser takes in a sequence of tokens and produces a tree. It produces a nested data structure from a flat input.
A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.
Or, for a less formal description. A lexer's transformation is one that can be expressed as a function of <<input symbol or end of file>, position in file, state> -> <<some finite number of output symbols (including zero)>, <state or done>>, where state is finite. A lexer can work without backtracking or arbitrary memory usage. Whereas a parser potentially requires either arbitrary (but finite) backtracking or arbitrary (but finite) memory usage.
The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.
However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting.
One easy optimization to make with Tries is to set a maximum branch length (based on some statistical analysis of lexeme usage. For example, make 90% of your lookups reachable under that length), any lexeme longer than that length simply gets hung off of the end of the branch in a more space-efficient data structure (like a hash table).
Your lookup then is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever.
But your memory usage will shrink dramatically. And you can improve it by fiddling with your branch length and choose say 80% reachable without hashing.
At a little bit of a speed penalty you can use a bitmap for navigation instead of pointers, which obliterates the memory footprint. In some use cases the result is actually smaller than the equivalent Bloom filter with a sane FP rate (though generally not smaller than a Golomb-compressed set or cuckoo filter).
A compromise between the speed and memory bloat of pointers is to use Radix/Patricia tries, which are a solid go-to option. In some cases they'll be slower than naive tries, but in others they'll be faster.
The fastest and most memory-efficient implementation I've seen, though, boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit. Locality of reference can work wonders here -- we got a 20x speedup over an already very fast naive trie this way.
Oh that's interesting - like a very fast cmp before bothering to check the rest of the branches.
When you say pattern are you talking about specific letters/morphemes or subsequence? And were you pulling the [dq]words from the initial pattern or generating them some otherway (through some kind of hash or encoding function?).
Though a naive Trie can be converted to a DAWG incrementally and on-the-fly relatively easily. If you're careful, you can even do this on another thread in the background.
Especially as you can do it on-the-fly. Memory usage is getting excessive? Stop and do a suffix combination pass until it's decent again. Otherwise? Don't bother.
The Wikipedia page links two publications by (I suppose) the creator of Ragel, and two publications that do not seem to be specifically about Ragel; and the product page mentions nothing in the way of publications or references to uses in industry. Is Ragel actually widely used?
At a high level, if compiling a lexer for a run-of-the-mill language like JavaScript takes 5 minutes and 2.5GB of RAM, you are most likely doing it wrong. By "doing it wrong," I mean that there is almost certainly a better approach that is far superior in every measurable aspect (CPU, memory, code size, etc).
I don't fully understand what kind of algorithm the author was using, so I can't comment in detail on it, but in general lexers are better thought of as finite automata (NFAs and DFAs) than tries. The two are related, but unlike tries NFAs and DFAs can have cycles, which are essential for representing a lexing state machine efficiently.
Another observation: it's not too terribly surprising that you could beat preg_match_all() with the latter being given a huge regular expression. Most regex engines are backtracking (RE2 being a notable counterexample), which means that a regular expression with high-cardinality alternation (ie. A|B|C|D|E|F|G ...etc) is one of their worst cases. This isn't what they are designed for. A backtracking regex engine will basically try each alternative in sequence until that alternative no longer matches, then back up and try the next one. NFA/DFA based parsers will be much faster for this case.
The right tool for this job, as another commenter mentioned, is Ragel. It's basically designed for exactly this. It doesn't appear to support PHP though...
Author here. The actual regex implementation uses a NFA. The start of it used a Trie, but it moved away.
The majority of what I wanted to get across here was the use of a minimal structure (single-character).
The next step was using a maximal radix implementation (as long of a prefix as possible). Then finally, throwing all of it away and going straight to parsing using a state machine.
Here are some things a lexer for a programming language might have to deal with:
1. Comments (some even do nested - which means regular expressions are out for that).
2. Continuation lines.
3. Includes (if done at the lexical level).
4. Filename/line/column number for nice error messages (can really hurt with branch mispredictions).
5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.
6. Identifiers.
So matching keywords is mostly the straightforward part. However I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf - it would normally be much faster than any pointer-chasing trie. gperf mostly elminated keyword matching from the profile of any lexer I've done.
Some languages allow escapes before everything else. Looking at you, Java. So either you need to do a pass beforehand to unescape them, or unescape characters (and in the process do error handling / etc!) on-the-fly.
Edit: fixed the regexp to allow for single-char identifiers.
A lot of production parsers I've seen do this with two state machines. They'll have a top-level state machine (usually hand-implemented using a giant switch statement) that handles all of the main different token types—numbers, strings, punctuators, etc. That has a single type for all identifiers, including keywords.
Then, after it determines a token is a identifier, it runs another little tiny optimized state machine to see if it's a reserved word or not.
The equivalent of a DAWG versus a Trie.