I am at least currently not aware of any "production" and "general purpose" regex engine that is built on derivatives. And I'm not really sure how you'd build one. The biggest hurdle you'd have to over come as far as I can tell is that derivatives are usually used to build a DFA. In this case, the OP does matching while taking the derivative simultaneously. My guess is that you'll run into problems doing that which huge character classes, which are easy to get when Unicode is enabled.
Whether "production" and "general purpose" are the same as "practical" is unclear. To put away the vague words, my understanding is that with derivatives, you'll either get slow match times or slow compilation times. (To the point where "slow" becomes enough to notice and be a problem.)
With that said, the world is full of experts saying you can't do something. What I'm trying to say here is that there are some challenges I've faced in the course of building a regex engine that I simply don't know how I'd address with derivatives.
Another thing worth considering here is the match semantics of the regex. I haven't had time yet to try this particular matcher, but when I do, I'd check for how alternations are matched. For example, what does 'samwise|sam' match in the haystack 'samwise'? Either answer is correct, but one is typically found in POSIX engines and the other found in Perl-like engines. Can derivatives implement either?
It's also worth noting that I am not an expert on regex derivatives. I've never actually build a derivative oriented matcher. If I had, I'm sure I could be a lot more specific with my criticisms. :-)
PCRE semantics was indeed the big thing that required new techniques. Basically, you have to modify the derivation function such that the derivatives model what paths through the pattern a backtracking engine would consider before stopping at the current position.
The big thing derivatives buys you is the ability to apply rewrite rules lazily during the construction. For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it. These kinds of rewrites often result in the DFA that gets constructed being minimal or close to so. Of course, you do pay somewhat for the machinery to do this, so best-case construction times suffer compared to in traditional NFA+lazy-DFA engines like RE2 or Rust's, but a larger class of patterns get to stay in the DFA world with derivatives.
I hope our work ignites a new interest in regex matching with derivatives. I believe the ability to apply these syntactic rewrites on-the-fly is really powerful and I'd love to see how far folks like you with extensive experience in optimizing regex matchers can take this.
(There’s a thing called “partial derivatives”, which is to Brzozowski derivatives as NFAs are to DFAs, but it doesn’t interact well with intersection and negation if you want to support those.)
Though there is an implementation that relies only on Brzozowski derivatives: https://github.com/google/redgrep
ETA: Yup. Try (successfully) matching a long string of As against (A*)* and watch it die.
[1] http://ix.io/4qan/ (matching only), http://ix.io/4qap/ (adds parsing and printing).
I saw this and I was immediately sure that I could make this blow up exponentially. But then I got too lazy to think about exactly how, and instead waited for kind strangers on the Internet to provide a ready solution. Thank you :)
BTW, the concept is still cool.
My point was to illustrate the technique. I found no minimalistic python implementations and decided to write one with a short and gentle intro.
I wanted to add the test for 'syntactic' equivalence to reduce the number of terms. But the code became less readable and I rolled it back.
> DFA construction has a risk of exponential growth in the number of states.
Right, I may have phrased that badly. I meant that the length of the derivative can grow multiplicatively for each consumed character, without bound, so even with memoization you won’t end up with a DFA. For example, in your simple implementation each A fed into (A*)* gives you a new (longer) RE, forever, even though a DFA for it only has to have a starting state and a failure one (actual implementations may end up with more).
Brzozowski proves that a RE has only a finite number of derivatives, thus making memoized differentiation equivalent to lazy DFA construction, but only if you respect associativity, commutativity and idempotence of choice ( | in modern syntax, + in his). I’m not actually sure you need all of those, in all circumstances, or if it’s enough to restrict the equivalences to e.g. the vicinity of a repetition operator, and he doesn’t discuss this, but I’ve made a couple of simpler attempts and could still make them blow up after some tinkering.
I found the paper was fairly readable. YMMV.
One, that's a formalization[0] in Coq with big-step semantics, which uncommonly has the intersection operator, and includes several equivalence relations and a proof of the pumping lemma, excepting one case (more on that below).
As a learning exercise and for historical reasons, I've also mostly ported Rust Cox's re1 engine to Rust[1], which includes VM matchers in the style of Henry Spencer, Ken Thompson, and Rob Pike. I also plan to port Doug McIlroy's engine[2], which is interesting for having intersection and complement and special handling for sublanguages, all the way down to just concatenation matched with Knuth-Morris-Pratt. I also want to examine the Rust (thanks burntsushi!), RE2, and Plan 9 engines in more depth.
Once I have time to get back to the project, I want to get back to my regular expression crossword puzzle solver. For that, I'm converting the hint regexps to DFAs, that match strings of some fixed length, and concatenating and intersecting them, until a single regexp is yielded, which should be a string literal, if the puzzle has a single solution. For backreferences, it's more tricky, but I plan on rewriting backreferences to the captured expression, where the lengths of both match, then either executing it with a stack like a pushdown automata or constructing a set of constraints on the characters by index.
As an aside: In my proof of the pumping lemma[3], I got stuck on the case for intersection and I'd love insight. Regular languages are closed under intersection, so the pumping lemma should hold for my implementation. I need to prove that if s =~ re1 and s =~ re2 can be pumped, then so can s =~ And re1 re2. s is is split into different substrings for re1 and re2, s = s11 ++ s12 ++ s13 = s21 ++ s22 ++ s23, then repeated an arbitrary number of times, (forall n, s11 ++ repeat s12 n ++ s13 =~ re1) and (forall n, s21 ++ repeat s22 n ++ s23 =~ re2). My intuition is that s11 = s21, s12 = s22, and s13 = s23, because they both match for the intersection, but I'm not convinced of that and haven't been able to formulate a proof for that.
0: https://github.com/thaliaarchi/recross-coq
1: https://github.com/thaliaarchi/re1-rust
2: https://github.com/arnoldrobbins/mcilroy-regex
3: https://github.com/thaliaarchi/recross-coq/blob/main/theorie...
https://relaxng.org/jclark/design.html
>Unordered content
>SGML provides an & operator: A & B matches A followed by B or B followed by A. XML removed the & operator. RELAX NG reintroduces it with a twist. In SGML, a content model of A & B* requires all the B elements to be consecutive: the required A element cannot occur in between two B elements. Usually, users use the & operator because they want to allow child elements to occur in any order, so this restriction is undesirable. In RELAX NG, the corresponding operator has interleaving semantics. It matches any interleaving of a sequence containing a single A element and a sequence containing zero or more B elements; it thus allows the A element to occur anywhere, including between two B elements.
>XML removed the & operator mainly because of the & operator's reputation for implementation complexity. The most difficult part of implementing the & operator in SGML is detecting whether a content model including & is 1-unambiguous. Unlike SGML, XML and W3C XML Schema, RELAX NG does not restrict content models to be 1-unambiguous, so this implementation difficulty is removed. The classic implementation technique for SGML and XML content models is to construct a Glushkov automaton. The 1-unambiguity restriction is helpful for this technique because it ensures that the Glushkov automaton is deterministic. An interleaving operator causes difficulty with this technique. However, there is an alternative implementation technique available [17] based on derivatives of regular expressions [4]. This handles content models that are not 1-unambiguous without any additional effort and can deal with interleaving without difficulty. RELAX NG imposes restrictions on the use of interleave which are sufficient to ensure that a derivative-based implementation will not exhibit exponential behavior.
[4] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM, Volume 11, Issue 4, 1964.
https://dl.acm.org/doi/10.1145/321239.321249
https://dl.acm.org/doi/pdf/10.1145/321239.321249
[17] Joe English. How to validate XML. 1999. See http://www.flightlab.com/~joe/sgml/validate.html