Note that the worst case of complexity for this algorithm is much, much worse than the worst case complexity for Boyer Moore. Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.
Note that the author nicely characterizes it as of potential use for small alphabets and possibly multiple substrings (in a single search). That immediately made me think he might have devised it for genomics research. In most applications I would think you'd also want regexp features. Interestingly, DNA research and use in a regexp engine are exactly what he goes on to suggest. (If you are searching for a very large number of regexps in a big genome database, I would not use this algorithm. I found that some simple variants on classic NFA techniques work very well for a wide class of typical regexps: e.g., regexps modeling SNPs, small read-position errors, small numbers of read errors, etc. There probably isn't any one obviously right answer, though, and a lot depends on your particular hardware situation, data set sizes, etc.)
The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far from true in the general case. "Breakthrough" is a gross exaggeration: this is a technique that anyone with a good algorithms course or two under the belt should be able to think of and, for most applications, decide not to use because of its limitations. I can definitely see it being nice for some applications tolerant of those limitations, but... a breakthrough it ain't.
See http://en.wikipedia.org/wiki/Burrows–Wheeler_transform or http://bioinformatics.oxfordjournals.org/content/early/2009/...
Can you explain your reasoning for why the worst-case complexity is worse than the other's? Did you do any measurements? I believe you are incorrect. I am the author of this page. I just did a quick test with SS = "aaa...aaaBaaa...aaa" and SS_size=240. My algorithm is faster than 3 of the 4 BM implementations tested.
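For anyone wanting to see why inputs of this shape are the interesting test, here is a small sketch (my own construction, not the author's benchmark; the text length, pattern length, and mismatch position are arbitrary choices): a text of all `a`s against a long needle whose only odd character sits mid-pattern forces a naive scanner into quadratic-class work, which is the DoS concern being discussed.

```python
def naive_search_comparisons(text, pattern):
    """Naive substring search; returns (match index or -1, total char comparisons)."""
    comparisons = 0
    for i in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
        else:
            return i, comparisons
    return -1, comparisons

# Pathological case in the spirit of SS = "aaa...aaaBaaa...aaa" (240 chars):
text = "a" * 10_000
pattern = "a" * 119 + "B" + "a" * 120
idx, cmps = naive_search_comparisons(text, pattern)
# Every alignment compares 120 chars before hitting the 'B' mismatch,
# so total work is ~ (N - M + 1) * M/2 comparisons, not O(N).
```

With these numbers the scan does (10000 - 240 + 1) * 120 = 1,171,320 character comparisons for a text of only 10,000 characters, which is the quadratic blow-up in miniature.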
> Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.
The same can be said about all BM variants, naive search, and BSD's memmem/strstr. The possibility of a DoS against a substring-search algorithm has been known for a long time, but it never materialized. The cure is trivial: limit the substring size.
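A minimal sketch of that cure (the cap value and function name are mine, purely illustrative): reject oversized needles before handing them to the search routine, so an attacker cannot supply the long pathological patterns that trigger the quadratic worst case.

```python
MAX_NEEDLE = 4096  # hypothetical limit; tune for your service

def safe_find(haystack, needle):
    """Cap needle length so adversarial patterns can't inflate search cost."""
    if len(needle) > MAX_NEEDLE:
        raise ValueError("needle exceeds %d bytes" % MAX_NEEDLE)
    return haystack.find(needle)
```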
> That immediately made me think he might have devised it for genomics research.
No, it wasn't. It was devised while I was preparing for a Google interview (which I failed). And of course, algorithms which pre-index the haystack will always be faster. I myself do not consider haystack-pre-indexing algorithms to be "text search" algorithms (maybe incorrectly).
> gross exaggeration
If you count pre-indexing algorithms and edge cases, then you are correct. For the most common case, can you show me something faster (from a student or even yourself)?
You write on the page that your algorithm has "O(NM) worst case complexity." Compare Boyer-Moore's time complexity. The grandparent was interested in asymptotics, not in a few examples.
> Same can be said about all BM, naive and BSD's memmem/strstr.
Red herring.
> Possibility of DOS for substring search algorithm was known for a long time - but it never materialized.
What are you talking about?
> If you count pre-indexing algorithms and edge cases - then you are correct. For most common case - can you show me something faster (from a student or even yourself)?
What do you mean by edge cases? All the cases that make your program run slow?
Using your terminology, KMP has complexity O(N + M). That's better than O(MN).
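For concreteness, here is a textbook KMP sketch (my own, not taken from the linked page): the failure table costs O(M) to build and the scan costs O(N), regardless of how adversarial the input is, which is exactly why its worst case beats O(MN).

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: O(M) preprocessing + O(N) scan, M=len(pattern), N=len(text)."""
    if not pattern:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text; on mismatch, fall back via the failure table
    # instead of re-examining text characters.
    k = 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

For example, `kmp_search("abcabcabd", "abcabd")` returns 3 without ever rescanning a text character.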
EDIT: Am I missing something???
Complexity analysis according to the author: m = search term length, n = text length.
O(m) preprocessing -- that's right, O(search-term length). And O(n*m) worst-case search, so in the worst case the whole text is rescanned against the pattern.
Now compare that to suffix trees:
O(n) preprocessing, O(m) string search,
where the worst-case search complexity is linear in the length of the search term.
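Those suffix-tree bounds can be illustrated with a deliberately naive suffix trie (a sketch of mine for intuition only: this construction is O(n^2) time and space; Ukkonen's algorithm and compressed suffix trees achieve the O(n) preprocessing quoted above). The point is the query side: once built, membership for a pattern of length m costs O(m), independent of n.

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix character by character.
    O(n^2) construction -- Ukkonen's algorithm gets this down to O(n)."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def contains(trie, pattern):
    """O(m) substring query: walk one trie edge per pattern character."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True
```

Usage: build once over the corpus, then each of many queries pays only for its own length, which is what makes the offline/online comparison apples-to-oranges.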
I concur with the other commenters that it's silly for you to complain about his not comparing his online string-search algorithm against offline string-search algorithms that search an index of the text, such as suffix-tree algorithms.
As for the online algorithm issue, see my reply below.
Dual-Pivot Quicksort is a demonstration that someone didn't read the existing literature; nothing more.
If, on the other hand, you have a fixed "corpus" and a dynamic query, O(n) search time (this algo, purportedly) is terrible.
"This algorithm especially well suited for long S, multi-substrings, small alphabet, regex/parsing and search for common substrings."
Long source text: the O(n*m) worst-case time per search kills it. Even for a single search, it's O(n*m) worst case here versus O(n + m) for suffix trees.
Multi-substrings: suffix trees handle each substring of length m in O(m), but are not compared.
Search for common substrings: again, suffix trees would be more appropriate.
The algorithm itself looks very similar to the one used in agrep proposed by Wu and Manber [1].
I also found the book "Flexible Pattern Matching in Strings" to be a very good reference on all things related to pattern matching [2].
[1] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, 1994.
[2] http://www.amazon.com/Flexible-Pattern-Matching-Strings-Line...
In the experiment he used patterns of different lengths on the same text collection. As you can see in the graph, each algorithm performs best at a certain alphabet size.
He describes the text collection as "text corpus taken from wikipedia text dump" so I'm guessing the alphabet size is around 90?
It's also probably not a good thing that all the strings he is searching for are prefixes of the same pattern.
References:
Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...
BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...
BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...
A quick review of the code makes me think that its worst case is actually on the order of O((s * (s + 1)) / 2), where s = m / 2.
The Achilles heel is the hash function. It's trivial to create collisions and turn the insertion time for a word w from O(1) into O(w).
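A toy illustration of that failure mode (the hash and table here are mine, not the page's actual code): an order-insensitive hash means all rotations of a string collide, so every insertion must linearly scan the growing bucket, and the per-insertion cost climbs from O(1) toward the bucket size.

```python
BUCKETS = 64

def toy_hash(word):
    """Weak hash: sum of character codes is order-insensitive,
    so all rotations/anagrams of a word collide."""
    return sum(map(ord, word)) % BUCKETS

table = [[] for _ in range(BUCKETS)]

def insert(word):
    """Returns how many existing entries were scanned before inserting."""
    bucket = table[toy_hash(word)]
    probes = 0
    for w in bucket:  # linear scan: degrades as collisions pile up
        probes += 1
        if w == word:
            return probes
    bucket.append(word)
    return probes

# All rotations of one string hash identically and pile into one bucket:
base = "abcdefgh"
words = [base[i:] + base[:i] for i in range(len(base))]
probe_counts = [insert(w) for w in words]
# probe_counts grows 0, 1, 2, ... -- each insertion scans all prior collisions.
```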
But - and I don't want to sound pedantic - how is m^2 not exponential growth?
Edit: mea culpa guys, I carelessly translated from Dutch. You're all right: quadratic, not exponential growth.
>Preprocessing phase in O(M) space and time complexity. Searching phase average O(N) time complexity and O(N*M) worst case complexity.
I don't trust the analysis of someone referring to "average O(N) time"; Big O notation refers to boundary times.
Edit: Okay, based on arguments here and on [1], I'm going to accept that maybe he's just bastardizing the notation.
[1] http://stackoverflow.com/questions/3905355/meaning-of-averag...
No it doesn't. You can have an O(N) amortized time. Big-O is a bounding function up to a constant factor, not necessarily a boundary (as in worst-case) time.
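The standard textbook example of an amortized bound (nothing specific to this thread, just for intuition): a doubling dynamic array occasionally pays O(n) to resize, yet the total copying over n appends stays under 2n, so each append is amortized O(1) even though its worst case is O(n).

```python
class DynArray:
    """Doubling array. Individual appends may copy O(n) items on resize,
    but total copies over n appends stay below 2n: amortized O(1)."""

    def __init__(self):
        self.cap = 1
        self.size = 0
        self.data = [None]
        self.copies = 0  # total elements moved during all resizes

    def append(self, x):
        if self.size == self.cap:
            self.cap *= 2
            new = [None] * self.cap
            for i in range(self.size):
                new[i] = self.data[i]
                self.copies += 1
            self.data = new
        self.data[self.size] = x
        self.size += 1

a = DynArray()
n = 1000
for i in range(n):
    a.append(i)
# Resizes copied 1 + 2 + 4 + ... + 512 = 1023 elements total: linear in n.
```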
(You could also use the O(n) time median algorithm to construct a truly O(n lg n) Quicksort, but that's just a fun theoretical side-note here.)
I think I may have misunderstood what you meant by this. If I have, I apologise; if I haven't...
Talking about the average-case complexity of algorithms with Big-Oh notation is in no way an "abuse" of that notation. If the average time complexity of an algorithm is `O(f(n))`, then its running time averaged over all inputs of size `n` is bounded above (in some sense) by `f(n)`.
I guess that statement doesn't say that the number of comparisons or swaps or "steps" the algorithm must make to complete is `O(n lg n)`. Perhaps that's what you meant.