Show HN: Regex Derivatives (Brzozowski Derivatives) (opens in new tab)

(github.com)

103 pointsc0nstantine3y ago35 comments

A Python sketch of a regex engine in less than 150 lines of code

35 comments

20 comments · 8 top-level

SomeoneOnTheWeb3y ago· 5 in thread

IIRC, that's what the Rust regex engine[1] currently uses

[1] https://github.com/rust-lang/regex

burntsushi3y ago

As the sibling commenter said, indeed, the regex crate does not use derivatives. If you wouldn't mind, could you share what led you to that conclusion? I'd love to fix it!

I am at least currently not aware of any "production" and "general purpose" regex engine that is built on derivatives. And I'm not really sure how you'd build one. The biggest hurdle you'd have to over come as far as I can tell is that derivatives are usually used to build a DFA. In this case, the OP does matching while taking the derivative simultaneously. My guess is that you'll run into problems doing that which huge character classes, which are easy to get when Unicode is enabled.

Whether "production" and "general purpose" are the same as "practical" is unclear. To put away the vague words, my understanding is that with derivatives, you'll either get slow match times or slow compilation times. (To the point where "slow" becomes enough to notice and be a problem.)

With that said, the world is full of experts saying you can't do something. What I'm trying to say here is that there are some challenges I've faced in the course of building a regex engine that I simply don't know how I'd address with derivatives.

Another thing worth considering here is the match semantics of the regex. I haven't had time yet to try this particular matcher, but when I do, I'd check for how alternations are matched. For example, what does 'samwise|sam' match in the haystack 'samwise'? Either answer is correct, but one is typically found in POSIX engines and the other found in Perl-like engines. Can derivatives implement either?

It's also worth noting that I am not an expert on regex derivatives. I've never actually build a derivative oriented matcher. If I had, I'm sure I could be a lot more specific with my criticisms. :-)

osaariki3y ago

We built the new engine behind .NET's RegexOptions.NonBacktracking with derivatives. We will have a paper at PLDI this year on the work that went into that.

PCRE semantics was indeed the big thing that required new techniques. Basically, you have to modify the derivation function such that the derivatives model what paths through the pattern a backtracking engine would consider before stopping at the current position.

The big thing derivatives buys you is the ability to apply rewrite rules lazily during the construction. For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it. These kinds of rewrites often result in the DFA that gets constructed being minimal or close to so. Of course, you do pay somewhat for the machinery to do this, so best-case construction times suffer compared to in traditional NFA+lazy-DFA engines like RE2 or Rust's, but a larger class of patterns get to stay in the DFA world with derivatives.

I hope our work ignites a new interest in regex matching with derivatives. I believe the ability to apply these syntactic rewrites on-the-fly is really powerful and I'd love to see how far folks like you with extensive experience in optimizing regex matchers can take this.

1 more reply

Drup3y ago

ocaml-re[1] uses a derivative-style construction to lazily build a DFA. The general idea is to use something similar to Owens et al's DFA construction, but doing it inline with some caching, to compile lazily (and building a thomson-like automaton, for group capture). In practice, it is fairly fast, although not as optimized as the Rust crate. :) Derivatives supports several match semantics very easily (ocaml-re does longest, shortest, greedy, non-greedy, first). It indeed doesn't handle unicode matching though (it's possible, there is a prototype implem, but nobody took the time to push it through). Note that it's not difficult to (lazily or not) build a NFA using derivatives as well (with Antimirov's construction).

[1]: https://github.com/ocaml/ocaml-re/

1 more reply

mananaysiempre3y ago

Nope, and its author (burntsushi) doesn’t think[1,2] you can do a practical regex matcher on top of the techinque. I’m not sure I agree with him, but, well, he’s built a production engine and I haven’t.

(There’s a thing called “partial derivatives”, which is to Brzozowski derivatives as NFAs are to DFAs, but it doesn’t interact well with intersection and negation if you want to support those.)

[1] https://news.ycombinator.com/item?id=31693927

[2] https://news.ycombinator.com/item?id=31693559

c0nstantineOP3y ago

I don't think Rust regex engine relies on this technique. I guess the main point is when you construct the DFA directly you still have the possibility of the exponential explosion of the number of states. That's why modern engines balance between NFA/DFA and lazy DFA.

Though there is an implementation that relies only on Brzozowski derivatives: https://github.com/google/redgrep

mananaysiempre3y ago· 4 in thread

From my own experiments[1], I suspect your engine might still be prone to exponential blowup: you only get proper linear matching if you reassociate and sort the choice expressions and merge identical terms (and Brzozowski’s original paper[2] does point this out). This makes the implementation quite a bit more annoying.

ETA: Yup. Try (successfully) matching a long string of As against (A*)* and watch it die.

[1] http://ix.io/4qan/ (matching only), http://ix.io/4qap/ (adds parsing and printing).

[2] https://doi.org/10.1145/321239.321249

H8crilA3y ago

This comment is not following HN guidelines, but:

I saw this and I was immediately sure that I could make this blow up exponentially. But then I got too lazy to think about exactly how, and instead waited for kind strangers on the Internet to provide a ready solution. Thank you :)

BTW, the concept is still cool.

c0nstantineOP3y ago

I agree. There is no magic. DFA construction has a risk of exponential growth in the number of states.

My point was to illustrate the technique. I found no minimalistic python implementations and decided to write one with a short and gentle intro.

I wanted to add the test for 'syntactic' equivalence to reduce the number of terms. But the code became less readable and I rolled it back.

mananaysiempre3y ago

>> I suspect your engine might still be prone to exponential blowup

> DFA construction has a risk of exponential growth in the number of states.

Right, I may have phrased that badly. I meant that the length of the derivative can grow multiplicatively for each consumed character, without bound, so even with memoization you won’t end up with a DFA. For example, in your simple implementation each A fed into (A*)* gives you a new (longer) RE, forever, even though a DFA for it only has to have a starting state and a failure one (actual implementations may end up with more).

Brzozowski proves that a RE has only a finite number of derivatives, thus making memoized differentiation equivalent to lazy DFA construction, but only if you respect associativity, commutativity and idempotence of choice ( | in modern syntax, + in his). I’m not actually sure you need all of those, in all circumstances, or if it’s enough to restrict the equivalences to e.g. the vicinity of a repetition operator, and he doesn’t discuss this, but I’ve made a couple of simpler attempts and could still make them blow up after some tinkering.

1 more reply

djoldman3y ago

try against this: https://gist.github.com/pervognsen/815b208b86066f6d7a00

evolveyourmind3y ago· 2 in thread

And here the formalized proof in less than 150 lines of code (in Agda) for Brzozowski Derivatives for regex matching (and additional regular languages theorems): https://github.com/desi-ivanov/agda-regexp-automata

c0nstantineOP3y ago

Thanks for sharing. I am not familiar with Agda. Will take a look. There is somewhat similar code in COQ: https://github.com/coq-community/regexp-Brzozowski

DonaldPShimoda3y ago

Just a small FYI, but the language's name (for now) is Coq, not COQ.

enricozb3y ago· 1 in thread

I think there is a typo when defining v(e), it says both v(e) = e and v(e) = empty set.

c0nstantineOP3y ago

Yes there's a typo. Thank you for following the README.md and your attention. It's fixed now.

noelwelsh3y ago

The next logical step is parsing with derivatives: https://matt.might.net/papers/might2011derivatives.pdf

I found the paper was fairly readable. YMMV.

djoldman3y ago

https://gist.github.com/pervognsen/815b208b86066f6d7a00

aarchi3y ago

I'm currently building a couple of regexp engines:

One, that's a formalization[0] in Coq with big-step semantics, which uncommonly has the intersection operator, and includes several equivalence relations and a proof of the pumping lemma, excepting one case (more on that below).

As a learning exercise and for historical reasons, I've also mostly ported Rust Cox's re1 engine to Rust[1], which includes VM matchers in the style of Henry Spencer, Ken Thompson, and Rob Pike. I also plan to port Doug McIlroy's engine[2], which is interesting for having intersection and complement and special handling for sublanguages, all the way down to just concatenation matched with Knuth-Morris-Pratt. I also want to examine the Rust (thanks burntsushi!), RE2, and Plan 9 engines in more depth.

Once I have time to get back to the project, I want to get back to my regular expression crossword puzzle solver. For that, I'm converting the hint regexps to DFAs, that match strings of some fixed length, and concatenating and intersecting them, until a single regexp is yielded, which should be a string literal, if the puzzle has a single solution. For backreferences, it's more tricky, but I plan on rewriting backreferences to the captured expression, where the lengths of both match, then either executing it with a stack like a pushdown automata or constructing a set of constraints on the characters by index.

As an aside: In my proof of the pumping lemma[3], I got stuck on the case for intersection and I'd love insight. Regular languages are closed under intersection, so the pumping lemma should hold for my implementation. I need to prove that if s =~ re1 and s =~ re2 can be pumped, then so can s =~ And re1 re2. s is is split into different substrings for re1 and re2, s = s11 ++ s12 ++ s13 = s21 ++ s22 ++ s23, then repeated an arbitrary number of times, (forall n, s11 ++ repeat s12 n ++ s13 =~ re1) and (forall n, s21 ++ repeat s22 n ++ s23 =~ re2). My intuition is that s11 = s21, s12 = s22, and s13 = s23, because they both match for the intersection, but I'm not convinced of that and haven't been able to formulate a proof for that.

0: https://github.com/thaliaarchi/recross-coq

1: https://github.com/thaliaarchi/re1-rust

2: https://github.com/arnoldrobbins/mcilroy-regex

3: https://github.com/thaliaarchi/recross-coq/blob/main/theorie...

DonHopkins3y ago

This "derivatives of regular expressions" technique is what James Clark used in the design of the efficient implementation of the Relax/NG tree regular expression based XML validator, based on Janusz A. Brzozowski, "Derivatives of Regular Expressions", Journal of the ACM, Volume 11, Issue 4, 1964.

https://relaxng.org/jclark/design.html

>Unordered content

>SGML provides an & operator: A & B matches A followed by B or B followed by A. XML removed the & operator. RELAX NG reintroduces it with a twist. In SGML, a content model of A & B* requires all the B elements to be consecutive: the required A element cannot occur in between two B elements. Usually, users use the & operator because they want to allow child elements to occur in any order, so this restriction is undesirable. In RELAX NG, the corresponding operator has interleaving semantics. It matches any interleaving of a sequence containing a single A element and a sequence containing zero or more B elements; it thus allows the A element to occur anywhere, including between two B elements.

>XML removed the & operator mainly because of the & operator's reputation for implementation complexity. The most difficult part of implementing the & operator in SGML is detecting whether a content model including & is 1-unambiguous. Unlike SGML, XML and W3C XML Schema, RELAX NG does not restrict content models to be 1-unambiguous, so this implementation difficulty is removed. The classic implementation technique for SGML and XML content models is to construct a Glushkov automaton. The 1-unambiguity restriction is helpful for this technique because it ensures that the Glushkov automaton is deterministic. An interleaving operator causes difficulty with this technique. However, there is an alternative implementation technique available [17] based on derivatives of regular expressions [4]. This handles content models that are not 1-unambiguous without any additional effort and can deal with interleaving without difficulty. RELAX NG imposes restrictions on the use of interleave which are sufficient to ensure that a derivative-based implementation will not exhibit exponential behavior.

[4] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM, Volume 11, Issue 4, 1964.

https://dl.acm.org/doi/10.1145/321239.321249

https://dl.acm.org/doi/pdf/10.1145/321239.321249

[17] Joe English. How to validate XML. 1999. See http://www.flightlab.com/~joe/sgml/validate.html

j / k navigate · click thread line to collapse

35 comments

20 comments · 8 top-level

SomeoneOnTheWeb3y ago· 5 in thread

IIRC, that's what the Rust regex engine[1] currently uses

[1] https://github.com/rust-lang/regex

burntsushi3y ago

As the sibling commenter said, indeed, the regex crate does not use derivatives. If you wouldn't mind, could you share what led you to that conclusion? I'd love to fix it!

It's also worth noting that I am not an expert on regex derivatives. I've never actually build a derivative oriented matcher. If I had, I'm sure I could be a lot more specific with my criticisms. :-)

osaariki3y ago

We built the new engine behind .NET's RegexOptions.NonBacktracking with derivatives. We will have a paper at PLDI this year on the work that went into that.

1 more reply

Drup3y ago

[1]: https://github.com/ocaml/ocaml-re/

1 more reply

mananaysiempre3y ago

[1] https://news.ycombinator.com/item?id=31693927

[2] https://news.ycombinator.com/item?id=31693559

c0nstantineOP3y ago

Though there is an implementation that relies only on Brzozowski derivatives: https://github.com/google/redgrep

mananaysiempre3y ago· 4 in thread

ETA: Yup. Try (successfully) matching a long string of As against (A*)* and watch it die.

[1] http://ix.io/4qan/ (matching only), http://ix.io/4qap/ (adds parsing and printing).

[2] https://doi.org/10.1145/321239.321249

H8crilA3y ago

This comment is not following HN guidelines, but:

BTW, the concept is still cool.

c0nstantineOP3y ago

I agree. There is no magic. DFA construction has a risk of exponential growth in the number of states.

My point was to illustrate the technique. I found no minimalistic python implementations and decided to write one with a short and gentle intro.

I wanted to add the test for 'syntactic' equivalence to reduce the number of terms. But the code became less readable and I rolled it back.

mananaysiempre3y ago

>> I suspect your engine might still be prone to exponential blowup

> DFA construction has a risk of exponential growth in the number of states.

1 more reply

djoldman3y ago

try against this: https://gist.github.com/pervognsen/815b208b86066f6d7a00

evolveyourmind3y ago· 2 in thread

c0nstantineOP3y ago

Thanks for sharing. I am not familiar with Agda. Will take a look. There is somewhat similar code in COQ: https://github.com/coq-community/regexp-Brzozowski

DonaldPShimoda3y ago

Just a small FYI, but the language's name (for now) is Coq, not COQ.

enricozb3y ago· 1 in thread

I think there is a typo when defining v(e), it says both v(e) = e and v(e) = empty set.

c0nstantineOP3y ago

Yes there's a typo. Thank you for following the README.md and your attention. It's fixed now.

noelwelsh3y ago

The next logical step is parsing with derivatives: https://matt.might.net/papers/might2011derivatives.pdf

I found the paper was fairly readable. YMMV.

djoldman3y ago