
0 points · jules · 11y ago

The difficulty of Levenshtein automata is highly overstated. When I read that Lucene blog post I wrote an implementation in an hour, and prior to reading that post I hadn't even heard of Levenshtein automata.


9 comments · 2 top-level

lorenzhs · 11y ago

Unless you are the mythical 100x programmer, I doubt that you wrote a full implementation of general Levenshtein automata in an hour. I read the paper that introduced them ( http://link.springer.com/article/10.1007/s10032-002-0082-8 ) and they are quite the complex beast. Not to mention that the paper is very technical and you need to keep a dozen definitions in your head.

That said, there seems to be a fairly readable implementation at https://github.com/universal-automata/liblevenshtein

I'm currently working on implementing fast Levenshtein queries in C++ with a friend, and we intend to implement the paper I linked in my original post. So far, our dynamic programming Levenshtein already beats Lucene++ (C++ implementation of Lucene), which is a bit embarrassing [1]. If you're interested, more advanced stuff will hit https://github.com/xhochy/libfuzzymatch when we get around to implementing it.

[1] Lucene++ spends more time converting strings between UTF-8 and UTF-32 than it does computing Levenshtein distances, says the profiler.

jules (OP) · 11y ago

I'm not a 100x programmer, I just did a couple of things that drastically reduced the time:

1. I didn't follow that paper. Even trying to understand that paper would have taken way more time, so after 5 minutes of trying to understand it I gave up on that approach. See this comment for what I did do: https://news.ycombinator.com/item?id=9699870 That saved maybe 20x.

2. I used Python instead of C++ or Java. This saved 5x.

3. The code was throwaway quality code. This saved 2x.

Together that's 200x, but I'm at least a 2x worse programmer than them, so that gives you the 100x ;-)

lorenzhs · 11y ago

(see my other comment as well)

An algorithmicist would say that all this saved you a constant factor of work for a linear slowdown ;)


jamra · 11y ago

I'd like to implement the same paper. Perhaps I'm missing something, but I'm not sure how the residual strings are created. Do you have a link to an implementation or a description of the residual strings?

I get that a residual string is the original string with a deletion, incrementing the deletions until you hit edit distance d. What I'm not sure about is if it's all permutations of possible deletions.

lorenzhs · 11y ago

The residual strings are all subwords where exactly d letters were deleted. For d=1 and the word "Levenshtein", that would be {"evenshtein", "Lvenshtein", "Leenshtein", "Levnshtein", "Leveshtein", "Levenhtein", "Levenstein", "Levenshein", "Levenshtin", "Levenshten", "Levenshtei"}.

The paper does not specify how to generate those efficiently, and I haven't given it any thought yet. I don't know of any implementations of the paper, but this aspect of it should be common enough.

EDIT: sorry, didn't read your comment fully. I'm not sure what you mean by "all permutations of possible deletions". The d-deletion neighbourhood of w contains all subwords of w that you obtain by deleting any d letters from w. For d=2, take any two letters and remove them. N₂(jamra) = {jam,jar,jaa,jmr,jma,jra,amr,ama,ara,mra} (hope I didn't forget any...)

Does that make it clearer?
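A naive way to enumerate such a neighbourhood, as a sketch only: this is not taken from the paper, and `deletion_neighbourhood` is a name made up for this example. It tries every choice of positions to keep, so it does nothing about the efficiency question and is only meant to pin down the definition.

```python
from itertools import combinations


def deletion_neighbourhood(word, d):
    """All distinct strings obtained by deleting exactly d letters from word."""
    keep = len(word) - d
    if keep < 0:
        return set()
    # combinations() yields index tuples in increasing order,
    # so the surviving letters stay in their original order.
    return {
        "".join(word[i] for i in idx)
        for idx in combinations(range(len(word)), keep)
    }


print(sorted(deletion_neighbourhood("jamra", 2)))
```

The set removes duplicates, which is why a word with repeated letters can have fewer neighbours than there are ways to choose the deleted positions.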


darklajid · 11y ago

I.. have a hard time believing this.

Unless you're talking about a very simplified case (N=1 or something, maybe even for a specific word)?

On the other hand: Maybe the Lucene guys and me are just bad. :/

jules (OP) · 11y ago

It was for the general case. The reason that I was able to do this is because I was less persistent than them. I tried to understand the paper but after 5 minutes I realized that even understanding the paper was going to take WAY more time than ignoring the paper and implementing it my own way. Here's how I did it.

If we are looking for

    string = "banana"

Then we can represent the state of the automaton as the last row of the matrix that you get when you compute the Levenshtein distance between two fixed strings. The initial state is (in Python):

    def init():
        return list(range(len(string)+1))

Then to take a step in the automaton:

    def step(state, char):
        newstate = [0 for x in state]
        newstate[0] = state[0]+1
        for i in range(len(state)-1):
            if i < len(string) and string[i] == char:
                newstate[i+1] = state[i]
            else:
                newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])
        return newstate

We step like this:

    s0 = init()
    s1 = step(s0, 'c')
    s2 = step(s1, 'a')
    s3 = step(s2, 'b')
    s4 = step(s3, 'a')
    s5 = step(s4, 'n')
    s6 = step(s5, 'a')

Now we can compute a lower bound on the Levenshtein distance by doing min(s6). In this case it's 2. This means that whatever comes after "cabana", the result will always have at least distance 2 to "banana". With this info we can prune a search path in the full text index as soon as that value is larger than our maximum edit distance.
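To make the pruning concrete, here is a rough sketch that walks a trie and drops any branch whose lower bound exceeds the limit. The trie layout (nested dicts with a `None` key marking the end of a word) and the names `make_matcher`, `build_trie` and `fuzzy_search` are assumptions made for this example, not how Lucene or any real index stores its terms.

```python
def make_matcher(word):
    """Return (init, step) for one query word, mirroring the functions above."""
    def init():
        return list(range(len(word) + 1))

    def step(state, char):
        new = [state[0] + 1]
        for i in range(len(word)):
            if word[i] == char:
                new.append(state[i])
            else:
                new.append(1 + min(new[i], state[i], state[i + 1]))
        return new

    return init, step


def build_trie(words):
    """Nested dicts; the key None marks the end of a stored word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def fuzzy_search(trie, word, max_dist):
    """Yield (stored_word, distance) for stored words within max_dist of word."""
    init, step = make_matcher(word)
    stack = [(trie, "", init())]
    while stack:
        node, prefix, state = stack.pop()
        # The last entry is the distance between the prefix and the whole word.
        if None in node and state[-1] <= max_dist:
            yield prefix, state[-1]
        for ch, child in node.items():
            if ch is None:
                continue
            nxt = step(state, ch)
            # min(nxt) is a lower bound for every word below this node.
            if min(nxt) <= max_dist:
                stack.append((child, prefix + ch, nxt))


trie = build_trie(["banana", "cabana", "bandana", "apple"])
print(sorted(fuzzy_search(trie, "banana", 2)))
```

Each trie edge costs one call to step(), so shared prefixes are only processed once no matter how many stored words start with them.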

That handful of lines is all you need to do fuzzy string search in practice. It represents the automaton as a step procedure. If you want, you can also generate a DFA from it (though that's probably not necessary in practice).

If your maximum edit distance is n, then once a number in the state is greater than n it doesn't matter what it is. In the above example s6 = [6, 5, 4, 3, 2, 3, 2]. If n = 3 then s6 = [4, 4, 4, 3, 2, 3, 2] is equivalent, because in the end it only matters whether a number is >3 or not. So you might as well cap the numbers at 4. Replace:

    newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])

with:

    newstate[i+1] = min(1 + min(newstate[i],state[i],state[i+1]), n+1)

where n is the maximum edit distance. newstate[0] = state[0]+1 needs a cap as well, otherwise the first entry keeps growing. Now the state space of the automaton is finite, and you can generate a DFA from it by just exploring all the states with the step() function.

One more optimization is to not generate the DFA for the full alphabet. If your search word is "banana" then for the purposes of the automaton the letter 'x' is equivalent to the letter 'y', because neither is equal to any letter in "banana". So instead of creating a DFA for the full ASCII alphabet (or worse, the full Unicode alphabet), you can work with the reduced 4-letter alphabet (b,a,n,X), where X represents any letter other than b,a,n.
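A sketch of that exploration, under a few assumptions: states are stored as tuples so they can be dict keys, every entry is capped at n+1, and `None` plays the role of X. The names `build_dfa` and `accepts` are invented for this example.

```python
from collections import deque


def build_dfa(word, n):
    """Explore all reachable capped states; return (start, transitions, accepting)."""
    cap = n + 1
    alphabet = sorted(set(word)) + [None]  # None stands for "any other letter"

    def step(state, char):
        new = [min(state[0] + 1, cap)]
        for i in range(len(word)):
            if word[i] == char:
                new.append(state[i])
            else:
                new.append(min(1 + min(new[i], state[i], state[i + 1]), cap))
        return tuple(new)

    start = tuple(min(i, cap) for i in range(len(word) + 1))
    transitions = {}
    accepting = set()
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state[-1] <= n:  # last entry = distance to the whole word
            accepting.add(state)
        for ch in alphabet:
            nxt = step(state, ch)
            transitions[state, ch] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, transitions, accepting


def accepts(dfa, word, candidate):
    """Run candidate through the DFA, mapping unknown letters to None."""
    start, transitions, accepting = dfa
    letters = set(word)
    state = start
    for ch in candidate:
        state = transitions[state, ch if ch in letters else None]
    return state in accepting


dfa = build_dfa("banana", 2)
print(len({state for state, _ in dfa[1]}), "states")
print(accepts(dfa, "banana", "cabana"))
```

The exploration terminates because each of the len(word)+1 entries can only take values from 0 to n+1.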

You could also do a hybrid where you generate the DFA lazily.
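One way the lazy hybrid could look, again only as a sketch: transitions are computed with the capped step the first time they are needed and then remembered in a dict. The class and method names (`LazyAutomaton`, `can_match`, `is_match`) are made up for this example.

```python
class LazyAutomaton:
    """Capped Levenshtein automaton whose transition table is filled on demand."""

    def __init__(self, word, n):
        self.word = word
        self.n = n
        self.cap = n + 1
        self.letters = set(word)
        self.start = tuple(min(i, self.cap) for i in range(len(word) + 1))
        self.cache = {}  # (state, letter class) -> next state

    def step(self, state, char):
        # All letters that do not occur in the word share one cache entry.
        key = (state, char if char in self.letters else None)
        nxt = self.cache.get(key)
        if nxt is None:
            new = [min(state[0] + 1, self.cap)]
            for i, c in enumerate(self.word):
                if c == char:
                    new.append(state[i])
                else:
                    new.append(min(1 + min(new[i], state[i], state[i + 1]), self.cap))
            nxt = self.cache[key] = tuple(new)
        return nxt

    def can_match(self, state):
        """True if some continuation could still end within distance n."""
        return min(state) <= self.n

    def is_match(self, state):
        """True if the input read so far is within distance n of the word."""
        return state[-1] <= self.n


automaton = LazyAutomaton("banana", 2)
state = automaton.start
for ch in "cabana":
    state = automaton.step(state, ch)
print(state, automaton.is_match(state))
```

Only the states that the searched index actually reaches get materialized, which avoids paying for the full DFA up front.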

I don't know if that made sense, it's a bit difficult to explain in a short HN comment.

darklajid · 11y ago

I'm not sure if I can follow - I'll give it some more thought and time. That said: You're doing something completely different as far as I can tell. You build an ~automaton~ based on an input word. That's not what the paper does/what I struggled with. The paper describes a general automaton and creating a 'vector' based on the input word, that you use as steps.

At the moment I don't see how you could handle transpositions either.

I'm not saying that your approach is bad. But I do think that the 'I did it in an hour' comment was quite a bit misleading, if you basically ignored the paper and did something that is different in most ways.

The tradeoffs are immensely different - the whole point of the paper is that you're precomputing a looot of stuff so that the lookup is fast.


lorenzhs · 11y ago

You implemented an automaton that computes Levenshtein distances. However, Levenshtein automata are quite different from what you describe. Your automaton executes the basic Wagner-Fischer / Needleman-Wunsch / ... algorithm.

Btw, see also https://news.ycombinator.com/item?id=9698785 for another discussion on basically the same problem.
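For reference, a textbook-style Wagner-Fischer sketch (the function name `wagner_fischer_rows` is ad hoc). It keeps every row of the table, which makes the correspondence visible: row j is the state the step() procedure reaches after reading the first j letters of the text.

```python
def wagner_fischer_rows(pattern, text):
    """Rows of the edit-distance table: rows[j][i] = distance(text[:j], pattern[:i])."""
    rows = [list(range(len(pattern) + 1))]
    for j, t in enumerate(text, start=1):
        prev = rows[-1]
        row = [j]
        for i, p in enumerate(pattern, start=1):
            cost = 0 if p == t else 1
            row.append(min(prev[i] + 1,           # drop letter t of the text
                           row[i - 1] + 1,        # drop letter p of the pattern
                           prev[i - 1] + cost))   # match or substitute
        rows.append(row)
    return rows


rows = wagner_fischer_rows("banana", "cabana")
print(rows[-1])      # final row, i.e. the state after the whole text
print(rows[-1][-1])  # the Levenshtein distance itself
```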

