Note that the worst case of complexity for this algorithm is much, much worse than the worst case complexity for Boyer Moore. Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.
Note that the author nicely characterizes it as of potential use for small alphabets and possibly multiple substrings (in a single search). That immediately made me think he might have devised it for genomics research. In most applications I would think you'd also want regexp features. Interestingly, DNA research and use in a regexp engine are exactly what he goes on to suggest. (If you are searching for a very large number of regexps in a big genome database, I would not use this algorithm. I found that some simple variants on classic NFA techniques work very well for a wide class of typical regexps: e.g., regexps modeling SNPs, small read-position errors, small numbers of read errors, etc. There probably isn't any one obviously right answer, though, and a lot depends on your particular hardware situation, data set sizes, etc.)
The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far from true in the general case. "Breakthrough" is a gross exaggeration: this is a technique that anyone with a good algorithms course or two under the belt should be able to think of and, for most applications, decide not to use because of its limitations. I can definitely see it being nice for some applications tolerant of those limitations, but... a breakthrough it ain't.
See http://en.wikipedia.org/wiki/Burrows–Wheeler_transform or http://bioinformatics.oxfordjournals.org/content/early/2009/...
Can you explain your reasoning for why the worst-case complexity is worse than the other's? Did you do any measurements? I believe you are incorrect. I am the author of this page. I just did a quick test with SS = "aaa...aaaBaaa...aaa" and SS_size=240. My algorithm is faster than 3 of the 4 BM implementations tested.
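For anyone wanting to see why inputs of this shape are the interesting test, here is a small sketch (my own construction, not the author's benchmark; the text length, pattern length, and mismatch position are arbitrary choices): a text of all `a`s against a long needle whose only odd character sits mid-pattern forces a naive scanner into quadratic-class work, which is the DoS concern being discussed.

```python
def naive_search_comparisons(text, pattern):
    """Naive substring search; returns (match index or -1, total char comparisons)."""
    comparisons = 0
    for i in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
        else:
            return i, comparisons
    return -1, comparisons

# Pathological case in the spirit of SS = "aaa...aaaBaaa...aaa" (240 chars):
text = "a" * 10_000
pattern = "a" * 119 + "B" + "a" * 120
idx, cmps = naive_search_comparisons(text, pattern)
# Every alignment compares 120 chars before hitting the 'B' mismatch,
# so total work is ~ (N - M + 1) * M/2 comparisons, not O(N).
```

With these numbers the scan does (10000 - 240 + 1) * 120 = 1,171,320 character comparisons for a text of only 10,000 characters, which is the quadratic blow-up in miniature.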
> Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.
The same can be said about all BM variants, naive search, and BSD's memmem/strstr. The possibility of a DoS against a substring-search algorithm has been known for a long time, but it never materialized. The cure is trivial: limit the substring size.
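A minimal sketch of that cure (the cap value and function name are mine, purely illustrative): reject oversized needles before handing them to the search routine, so an attacker cannot supply the long pathological patterns that trigger the quadratic worst case.

```python
MAX_NEEDLE = 4096  # hypothetical limit; tune for your service

def safe_find(haystack, needle):
    """Cap needle length so adversarial patterns can't inflate search cost."""
    if len(needle) > MAX_NEEDLE:
        raise ValueError("needle exceeds %d bytes" % MAX_NEEDLE)
    return haystack.find(needle)
```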
> That immediately made me think he might have devised it for genomics research.
No, it wasn't. It was devised while I was preparing for a Google interview (which I failed). And of course, algorithms which pre-index the haystack will always be faster. I myself do not consider haystack-pre-indexing algorithms to be "text search" algorithms (maybe incorrectly).
> gross exaggeration
If you count pre-indexing algorithms and edge cases, then you are correct. For the most common case, can you show me something faster (from a student or even yourself)?
You write on the page that your algorithm has "O(NM) worst case complexity." Compare Boyer-Moore's time complexity. The grandparent was interested in asymptotics, not in a few examples.
> Same can be said about all BM, naive and BSD's memmem/strstr.
Red herring.
> Possibility of DOS for substring search algorithm was known for a long time - but it never materialized.
What are you talking about?
> If you count pre-indexing algorithms and edge cases - then you are correct. For most common case - can you show me something faster (from a student or even yourself)?
What do you mean by edge cases? All the cases that make your program run slow?
Using your terminology, KMP has complexity O(N + M). That's better than O(MN).
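For concreteness, here is a textbook KMP sketch (my own, not taken from the linked page): the failure table costs O(M) to build and the scan costs O(N), regardless of how adversarial the input is, which is exactly why its worst case beats O(MN).

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: O(M) preprocessing + O(N) scan, M=len(pattern), N=len(text)."""
    if not pattern:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text; on mismatch, fall back via the failure table
    # instead of re-examining text characters.
    k = 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

For example, `kmp_search("abcabcabd", "abcabd")` returns 3 without ever rescanning a text character.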
EDIT: Am I missing something???
Complexity analysis according to the author: m = search term length, n = text length.
O(m) preprocessing -- that's right, O(search-term length). And O(n*m) worst-case search, so in the worst case the whole text is rescanned against the pattern.
Now compare that to suffix trees:
O(n) preprocessing, O(m) string search,
where the worst-case search complexity is linear in the length of the search term.
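Those suffix-tree bounds can be illustrated with a deliberately naive suffix trie (a sketch of mine for intuition only: this construction is O(n^2) time and space; Ukkonen's algorithm and compressed suffix trees achieve the O(n) preprocessing quoted above). The point is the query side: once built, membership for a pattern of length m costs O(m), independent of n.

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix character by character.
    O(n^2) construction -- Ukkonen's algorithm gets this down to O(n)."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def contains(trie, pattern):
    """O(m) substring query: walk one trie edge per pattern character."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True
```

Usage: build once over the corpus, then each of many queries pays only for its own length, which is what makes the offline/online comparison apples-to-oranges.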
I concur with the other commenters that it's silly for you to complain about his not comparing his online string-search algorithm against offline string-search algorithms that search an index of the text, such as suffix-tree algorithms.
As for the online algorithm issue, see my reply below.
Dual-Pivot Quicksort is a demonstration that someone didn't read the existing literature; nothing more.
If, on the other hand, you have a fixed "corpus" and a dynamic query, O(n) search time (this algo, purportedly) is terrible.
"This algorithm especially well suited for long S, multi-substrings, small alphabet, regex/parsing and search for common substrings."
Long source text: the O(n*m) worst-case time per search kills it. Even for a single search, it's O(n*m) worst case here versus O(n + m) for suffix trees.
Multi-substrings: suffix trees handle each substring of length m in O(m), but are not compared.
Search for common substrings: again, suffix trees would be more appropriate.
The algorithm itself looks very similar to the one used in agrep proposed by Wu and Manber [1].
I also found the book "Flexible Pattern Matching in Strings" to be a very good reference on all things related to pattern matching [2].
[1] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, 1994.
[2] http://www.amazon.com/Flexible-Pattern-Matching-Strings-Line...
In the experiment he used patterns of different lengths on the same text collection. As you can see in the graph, each algorithm performs best at a certain alphabet size.
He describes the text collection as "text corpus taken from wikipedia text dump" so I'm guessing the alphabet size is around 90?
It's also probably not a good thing that all the strings he is searching for are prefixes of the same pattern.
References:
Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...
BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...
BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...
A quick review of the code makes me think that its worst case is actually on the order of O((s * (s + 1)) / 2), where s = m / 2.
The Achilles heel is the hash function. It's trivial to create collisions and turn the insertion time for a word w from O(1) into O(w).
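A toy illustration of that failure mode (the hash and table here are mine, not the page's actual code): an order-insensitive hash means all rotations of a string collide, so every insertion must linearly scan the growing bucket, and the per-insertion cost climbs from O(1) toward the bucket size.

```python
BUCKETS = 64

def toy_hash(word):
    """Weak hash: sum of character codes is order-insensitive,
    so all rotations/anagrams of a word collide."""
    return sum(map(ord, word)) % BUCKETS

table = [[] for _ in range(BUCKETS)]

def insert(word):
    """Returns how many existing entries were scanned before inserting."""
    bucket = table[toy_hash(word)]
    probes = 0
    for w in bucket:  # linear scan: degrades as collisions pile up
        probes += 1
        if w == word:
            return probes
    bucket.append(word)
    return probes

# All rotations of one string hash identically and pile into one bucket:
base = "abcdefgh"
words = [base[i:] + base[:i] for i in range(len(base))]
probe_counts = [insert(w) for w in words]
# probe_counts grows 0, 1, 2, ... -- each insertion scans all prior collisions.
```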
But - and I don't want to sound pedantic - how is m^2 not exponential growth?
Edit: mea culpa guys, I carelessly translated from Dutch. You're all right: quadratic, not exponential growth.
>Preprocessing phase in O(M) space and time complexity. Searching phase average O(N) time complexity and O(N*M) worst case complexity.
I don't trust the analysis of someone referring to "average O(N) time"; Big O notation refers to boundary times.
Edit: Okay, based on arguments here and on [1], I'm going to accept that maybe he's just bastardizing the notation.
[1] http://stackoverflow.com/questions/3905355/meaning-of-averag...
No it doesn't. You can have an O(N) amortized time. Big-O is a bounding function up to a constant factor, not necessarily a boundary (as in worst-case) time.
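The standard textbook example of an amortized bound (nothing specific to this thread, just for intuition): a doubling dynamic array occasionally pays O(n) to resize, yet the total copying over n appends stays under 2n, so each append is amortized O(1) even though its worst case is O(n).

```python
class DynArray:
    """Doubling array. Individual appends may copy O(n) items on resize,
    but total copies over n appends stay below 2n: amortized O(1)."""

    def __init__(self):
        self.cap = 1
        self.size = 0
        self.data = [None]
        self.copies = 0  # total elements moved during all resizes

    def append(self, x):
        if self.size == self.cap:
            self.cap *= 2
            new = [None] * self.cap
            for i in range(self.size):
                new[i] = self.data[i]
                self.copies += 1
            self.data = new
        self.data[self.size] = x
        self.size += 1

a = DynArray()
n = 1000
for i in range(n):
    a.append(i)
# Resizes copied 1 + 2 + 4 + ... + 512 = 1023 elements total: linear in n.
```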
(You could also use the O(n) time median algorithm to construct a truly O(n lg n) Quicksort, but that's just a fun theoretical side-note here.)
I think I may have misunderstood what you meant by this. If I have, I apologise; if I haven't...
Talking about the average-case complexity of algorithms with Big-Oh notation is in no way an "abuse" of that notation. If the average time complexity of an algorithm is `O(f(n))`, then its running time averaged over all inputs of size `n` is bounded above (in some sense) by `f(n)`.
I guess that statement doesn't say that the number of comparisons or swaps or "steps" the algorithm must make to complete is `O(n lg n)`. Perhaps that's what you meant.