undefined | Better HN

0 pointsjamra11y ago0 comments

There is a trade off even in the paper that was linked. You have to disallow distances (d) that are too large when the words are small. If you have an array of words that are within distance 3 of a 3 letter word and you have many 3 letter words, you're going to have every single 3 letter word in your result set.

This is a running issue with fuzzy matching in general, but automatons are especially slow and take forever to build.

0 comments

2 comments · 1 top-level

darklajid11y ago· 1 in thread

I originally read the paper and fought with the implementation for OCR use cases. And that was stupid / short-sighted. My problem wasn't a big result set though, it was the size of the automaton at N >= 3..

What I usually need: Fuzzy results, with a quality/rating. Given a database of all customer names, give me the best matches for the OCR result. Same for 'all streets in Germany'.

Since these input words can be quite long, N = 3 (or higher) might still provide decent quality (3 errors in 15 characters might still be interesting), but the automaton explodes in size.

On top of that, ignoring the ~high~ number of edit operations I am interested in: I usually want the real distance, not a short-cut/cut-off. So not 'Give me all results that have less then 3 errors', but 'Give me the 100 best results, ordered by quality (errors/length)'.

Or - and obviously that's always a valid option - I just failed to make that work in any decent fashion.

jamraOP11y ago

I did something similar to what you are looking for with my gocleo implementation. I create what I term the Jaccard(sp?) rating, which is the percent something is off by. github.com/jamra/gocleo

You can combine that with a thing like my LevenshteinTrie github.com/jamra/LevenshteinTrie (or if you can create a DAWG, it will be more compresssed)

j / k navigate · click thread line to collapse