
lorenzhs · 11y ago · 0 points · 0 comments

You implemented an automaton that computes Levenshtein distances. However, Levenshtein automata are quite different from what you describe. Your automaton executes the basic Wagner-Fischer / Needleman-Wunsch / ... algorithm.

Btw, see also https://news.ycombinator.com/item?id=9698785 for another discussion on basically the same problem.


3 comments · 1 top-level

jules · 11y ago · 2 in thread

This is not correct. The end result of the step() based automaton is the same: it prunes exactly the same search paths as any Levenshtein automaton would. And the part where I described how to build the DFA gives you exactly a Levenshtein automaton DFA. The approach is different, yes, that's the point: it's much simpler and still does the job.

lorenzhs (OP) · 11y ago

It's not nearly as efficient, though: your step function requires O(len(string)) time no matter how well you prune. Since you have len(query) many steps, that gets you to O(len(string)*len(query)), i.e. quadratic time if they're roughly the same length. Levenshtein automata can do this in linear time because they spend time building the automaton first (preprocessing). So yes, you implemented an algorithm using automata that computes the same result. But you didn't implement Levenshtein automata.

jules · 11y ago

In practice with Lucene, len(string) and len(query) are like 10. So it's totally irrelevant. Furthermore computing the step is extremely fast: you're just doing a handful of min()'s. Even a single cache miss is going to completely dominate that, let alone a disk seek. What matters is that you don't scan a 10 million word index, instead you want to prune your search to hit, say, 50 words in the index.
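For reference, here is a minimal Python sketch of what a step()-based automaton of this kind could look like. It is an illustration, not the code from the original post: the class name `LevenshteinStepper` and the methods `start`, `is_match` and `can_match` are assumed names. Each call to `step` computes one new row of the Wagner-Fischer table, which is where the "handful of min()'s" comes from.

```python
class LevenshteinStepper:
    """Row-by-row Wagner-Fischer, exposed as an automaton-like interface.

    A state is one row of the DP table: state[i] is the edit distance
    between the input consumed so far and the first i characters of query.
    """

    def __init__(self, query, max_edits):
        self.query = query
        self.max_edits = max_edits

    def start(self):
        # Empty input vs. each prefix of the query: i edits for prefix length i.
        return tuple(range(len(self.query) + 1))

    def step(self, state, ch):
        # Consume one input character; cost is O(len(query)).
        new = [state[0] + 1]
        for i, qc in enumerate(self.query):
            cost = 0 if qc == ch else 1
            new.append(min(new[i] + 1,         # skip a query character
                           state[i] + cost,    # match or substitute
                           state[i + 1] + 1))  # skip the input character
        return tuple(new)

    def is_match(self, state):
        # The whole input is within max_edits of the whole query.
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # The row minimum never decreases, so once it exceeds max_edits
        # no continuation of the input can match: prune here.
        return min(state) <= self.max_edits
```

Feeding a candidate word character by character and checking `can_match` after each step is what lets a search abandon a whole subtree of the index early.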

That's just the step() approach. After that I described how to build the DFA, which gets you the same optimal linear time that you want, independent of the query string size.

Reply to other comment:

> Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.

Why is it incomplete? You just follow the step() automaton and construct a DFA that does the same. Every time you hit a new state you create a new node in the DFA, and if you hit an old state you just point to a previously constructed DFA node. You can even do the DFA construction lazily.
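A rough sketch of that lazy construction, under assumptions that are not spelled out in the thread: the class name `LazyLevenshteinDFA` is hypothetical, row entries are capped at max_edits + 1 so that only finitely many distinct states can occur, and every character that does not appear in the query is collapsed into a single "other" symbol because such characters all produce the same next row.

```python
class LazyLevenshteinDFA:
    """Builds DFA nodes on demand by following the step() rows."""

    def __init__(self, query, max_edits):
        self.query = query
        self.max_edits = max_edits
        self.alphabet = set(query)
        self.rows = []          # state id -> capped DP row
        self.state_ids = {}     # capped DP row -> state id
        self.transitions = {}   # (state id, symbol) -> state id
        self.start = self._intern(range(len(query) + 1))

    def _intern(self, row):
        # Values above max_edits are all equally hopeless, so cap them.
        cap = self.max_edits + 1
        key = tuple(min(v, cap) for v in row)
        if key not in self.state_ids:
            # New state: create a new DFA node.
            self.state_ids[key] = len(self.rows)
            self.rows.append(key)
        # Old state: point to the previously constructed node.
        return self.state_ids[key]

    def _step_row(self, row, sym):
        new = [row[0] + 1]
        for i, qc in enumerate(self.query):
            cost = 0 if qc == sym else 1
            new.append(min(new[i] + 1, row[i] + cost, row[i + 1] + 1))
        return new

    def next_state(self, state, ch):
        # None stands for "any character that is not in the query".
        sym = ch if ch in self.alphabet else None
        key = (state, sym)
        if key not in self.transitions:
            row = self._step_row(self.rows[state], sym)
            self.transitions[key] = self._intern(row)
        return self.transitions[key]

    def accepts(self, word):
        state = self.start
        for ch in word:
            state = self.next_state(state, ch)
        return self.rows[state][-1] <= self.max_edits
```

Once a transition has been cached, following it is a single dictionary lookup, so the per-character cost no longer depends on the query length. Only the first visit to a (state, symbol) pair pays for a step().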

> But you didn't implement Levenshtein automata.

A Levenshtein automaton for a word W is a finite state automaton that accepts all words V within edit distance n of W. That's exactly what I built. The point here is that you turn a 30 second search over the whole index into a 0.02 second search by pruning. If you can then optimize that to 0.015 by making the automaton more efficient, that's nice, but you can hardly claim that what I did is not a Levenshtein automaton because it's a bit slower (and it's not even clear to me that it would be slower).
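To make the pruning concrete, here is a small self-contained sketch that walks a trie and the step() rows in lockstep. It is only an illustration: a real index such as Lucene's term dictionary is not a dict-based trie, and `build_trie`, `step_row` and `fuzzy_search` are hypothetical helper names.

```python
def build_trie(words):
    """Nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = w
    return root


def step_row(query, row, ch):
    """One Wagner-Fischer row update for input character ch."""
    new = [row[0] + 1]
    for i, qc in enumerate(query):
        cost = 0 if qc == ch else 1
        new.append(min(new[i] + 1, row[i] + cost, row[i + 1] + 1))
    return new


def fuzzy_search(trie, query, max_edits):
    """Return all indexed words within max_edits of query."""
    results = []
    stack = [(trie, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if "" in node and row[-1] <= max_edits:
            results.append(node[""])
        for ch, child in node.items():
            if ch == "":
                continue
            new_row = step_row(query, row, ch)
            # Prune: if every entry exceeds max_edits, nothing below
            # this trie node can ever match, so skip the whole subtree.
            if min(new_row) <= max_edits:
                stack.append((child, new_row))
    return sorted(results)


words = ["kitten", "sitten", "sitting", "mitten",
         "bitter", "kitchen", "zebra", "apple"]
index = build_trie(words)
print(fuzzy_search(index, "kitten", 1))
```

With this word list, subtrees such as the ones under "z" and "a" are abandoned after a couple of characters instead of being scanned to the end.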

