
0 points · mananaysiempre · 3y ago · 0 comments

>> I suspect your engine might still be prone to exponential blowup

> DFA construction has a risk of exponential growth in the number of states.

Right, I may have phrased that badly. I meant that the size of the derivative expression can grow multiplicatively with each consumed character, without bound, so even with memoization you won’t end up with a DFA. For example, in your simple implementation each A fed into (A*)* gives you a new (longer) RE, forever, even though a DFA for it only needs a starting state and a failure state (actual implementations may end up with more).
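To make the growth concrete, here is a minimal sketch of a derivative with no simplification at all. It is not the code from the repo under discussion: the tuple encoding and the names `deriv`, `size` and `sizes_after` are assumptions made up for illustration.

```python
# Regexes as nested tuples (an illustrative encoding, not the repo's):
# ("empty",), ("eps",), ("chr", c), ("cat", r, s), ("alt", r, s), ("star", r)
EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return False  # "empty" and "chr"

def deriv(r, c):
    """Textbook derivative of r by character c, with no simplification."""
    tag = r[0]
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "cat":
        left = ("cat", deriv(r[1], c), r[2])
        return ("alt", left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return ("cat", deriv(r[1], c), r)
    return EMPTY  # "empty" and "eps"

def size(r):
    """Number of nodes in the expression tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def sizes_after(r, text):
    """Sizes of r and of each successive derivative along text."""
    out = [size(r)]
    for c in text:
        r = deriv(r, c)
        out.append(size(r))
    return out

# (a*)* fed six 'a' characters: every derivative is larger than the last.
growth = sizes_after(("star", ("star", ("chr", "a"))), "aaaaaa")
print(growth)
```

Since every derivative here is a strictly larger term than the one before, memoizing on the term itself never hits the cache.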

Brzozowski proves that a RE has only a finite number of derivatives, thus making memoized differentiation equivalent to lazy DFA construction, but only if you respect associativity, commutativity and idempotence of choice ( | in modern syntax, + in his). I’m not actually sure you need all of those, in all circumstances, or if it’s enough to restrict the equivalences to e.g. the vicinity of a repetition operator, and he doesn’t discuss this, but I’ve made a couple of simpler attempts and could still make them blow up after some tinkering.


4 comments · 1 top-level

c0nstantine · 3y ago · 3 in thread

I think I got your point. It is valid. To construct the DFA and have a finite number of derivatives we have to keep track of the following equivalences:

r + r ~ r

r + s ~ s + r

(r + s) + t ~ r + (s + t)

In [1] the authors state this and refer to the proof in the original paper. They even add further simplification rules to reduce the number of terms (states) even more.
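As a sketch of what tracking those three equivalences could look like, the choice constructor below flattens nested choices, drops duplicates and sorts the alternatives. The encoding and the names `alt`, `deriv` and `reachable` are illustrative assumptions, not taken from [1] or from the repo.

```python
# Regexes as nested tuples; a choice is n-ary: ("alt", r1, r2, ...)
EMPTY, EPS = ("empty",), ("eps",)

def alt(*rs):
    """Choice normalised modulo the three equivalences above."""
    items = set()
    for r in rs:
        if r[0] == "alt":
            items.update(r[1:])  # (r + s) + t ~ r + (s + t): flatten
        else:
            items.add(r)         # r + r ~ r: the set drops duplicates
    if len(items) == 1:
        return next(iter(items))
    return ("alt", *sorted(items, key=repr))  # r + s ~ s + r: fixed order

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return any(nullable(x) for x in r[1:])
    return False  # "empty" and "chr"

def deriv(r, c):
    """Derivative of r by c; only choices are normalised, nothing else."""
    tag = r[0]
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "cat":
        left = ("cat", deriv(r[1], c), r[2])
        return alt(left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag == "alt":
        return alt(*(deriv(x, c) for x in r[1:]))
    if tag == "star":
        return ("cat", deriv(r[1], c), r)
    return EMPTY  # "empty" and "eps"

def reachable(r, alphabet):
    """All distinct derivatives of r; these are the lazy DFA's states."""
    seen, todo = {r}, [r]
    while todo:
        cur = todo.pop()
        for c in alphabet:
            nxt = deriv(cur, c)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

states = reachable(("star", ("star", ("chr", "a"))), "a")
print(len(states))
```

With this normalisation the exploration of (a*)* over a one-letter alphabet terminates, where the unnormalised version keeps producing new terms. The extra rules in [1] (such as dropping ∅ from a choice) would shrink the state set further but are left out here.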

Actually, the code for (lazy) DFA construction is not even committed yet. The repo contains just the sequential, per-character application of the derivative to a regex, which is obviously finite (though not efficient). Again, it's only there to demonstrate the concept.

[1] https://www.ccs.neu.edu/home/turon/re-deriv.pdf

carapace · 3y ago

Hey hey, FWIW I wrote this up too (also in Python but in a different style, and without as many cool glyphs): https://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expre...

One neat thing is that the "compaction" rules to avoid exponential blowup are symmetrical (they form like, a ring or semi-ring or whatever, sorry I'm not a mathematician.) https://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expre...

c0nstantine · 3y ago

Hi, thanks for sharing. I didn't know there was a Python implementation. Your article is broader and I like the functional flavor.

For the 'compaction' you mention, yes it is useful. But the code will be more complicated and optimization wasn't the point of the sketch.

mananaysiempre (OP) · 3y ago

Yes, and even if you aren’t constructing a DFA, only being able to produce a finite number of derivatives from a given RE is still useful:

As there’s only a finite number of derivatives, their length is obviously bounded by a constant for a fixed starting RE (though that constant is still exponential in the length of that RE). This implies your non-DFA-based matcher can only take a bounded time computing the next derivative, so takes a time proportional to the length of a string to process that string (even if the constant of proportionality is exponential in the RE length).
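Written out as a rough sketch rather than a proof: the symbols D(r), M_r and f are notation introduced here, and the assumption that one derivative step on a term d costs at most f(|d|) for some nondecreasing f is added for illustration.

```latex
\begin{align*}
D(r) &= \{\, \partial_w r \bmod \text{ACI} \;:\; w \in \Sigma^* \,\}, \qquad |D(r)| < \infty \\
M_r  &= \max_{d \in D(r)} |d| \\
T(w) &\le \sum_{i=1}^{|w|} f(M_r) \;=\; |w| \cdot f(M_r) \;=\; O(|w|) \quad \text{for fixed } r
\end{align*}
```

Here M_r exists because D(r) is finite, and the exponential dependence on the length of r is hidden inside f(M_r).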

(I’m not good at fitting all of my reasoning into a single comment today, am I?)
