Fix Boyer-Moore searcher with the Rytter correction (opens in new tab)

(github.com)

193 pointsHenryR5y ago53 comments

53 comments

What is a minimal test case (needle + haystack) for which the old code yields a different result than the new code?

The smallest test case appears to be a needle of "aaa" and a haystack of "xxaaa", where VS 2019 16.5's boyer_moore_searcher will report "not found", while the corrected code finds the needle at offset 2 in the haystack.

jameshart5y ago

Seems astonishing that nobody has run into this in the wild. How does it do on finding ‘www’ in ‘http://www.example.com/, for example?

2 more replies

nostrademons5y ago

There are a bunch of test cases here:

https://github.com/microsoft/STL/pull/724/commits/a7da5cce8c...

Looks like it can happen with quite small needles, eg. "aa".

StephanTLavavej5y ago

After comparing the outputs of the old and new algorithms, I believe that a needle needs more than two repeats in order to trigger the bug. That is, "aa" and "abab" don't trigger the bug, but "aaa" and "ababa" do, the last one having a partial third repeat (needle "ababa" with haystack "WXYZababa" was definitely wrong). I tested the table output for "aa" for completeness (and just in case we manage to damage it in the future).

Disclaimer: I don't understand the algorithms deeply enough to write them from scratch, and I'm writing this at 6 AM.

oefrha5y ago

The papers:

- Boyle, Moore (1977) https://www.cs.utexas.edu/users/moore/publications/fstrpos.p...

- Knuth, Morris, Pratt (1977) https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f...

- Rytter (1980) https://doi.org/10.1137/0209037 (paywalled without subscription, but you know where to find it cough scihub cough).

dang5y ago

Submitted title was "A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix", which is nice additional context, but can now go here rather than there.

tacitusarc5y ago

What I really enjoy about this merge request are these comments:

> Rename _Suffix_fn to _Fx, again following the papers. Note that in the usage below, there is a semantic change: _Suffix_fn stored 0-based values, while _Fx stores 1-based values. This undoes a micro-optimization (_Suffix_fn avoided unnecessarily translating between the 0-based and 1-based domains), but with the increased usage of f in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.

And

> Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it confusing that _Idx was 0-based in other loops but 1-based in this loop.)

Naming is one of the most strangely difficult aspects of programming, but if there is one rule it's be consistent. I really hate code that uses a variable name to mean one thing in one place and other somewhere else. I just learned what that meant! Why change it?! And yet this is common, and sometimes a comment about it in code review might get a response like "the code works, why does it matter?" This. This is why it matters. It causes confusion, which makes code difficult to verify, which causes bugs.

ahelwer5y ago

I've implement a few string search algorithms and translating between 1- and 0- based offset functions was the source of a huge number of bugs, as you might expect. For some reason a lot of sources present the algorithms as 1-based.

ivanbakel5y ago

>For some reason a lot of sources present the algorithms as 1-based.

Because that's a natural mathematical model. "the n-th element" being at position n is handy and natural. What's nuts is that array indexing based on pointer offsets has made its way into nearly all high-level languages.

3 more replies

040919485y ago

maybe it's because of Pascal. I remember even Cormen used to have pascal style pseudo-code.

a13692099935y ago

Speaking of things that make code needlessly difficult to verify, we really need to start insisting on correct (zero-based) indexing in academic algorithms.

dvt5y ago

What I find most interesting about this bugfix is simply how terrible of a source Wikipedia is and, at the same time, how valuable professional insight can be (e.g. someone that knows what they're talking about). And to add insult to injury, someone quickly edited Wikipedia (for the "street cred" I'm sure):

> The original paper contained errors for computing the table of pattern shifts which was corrected by Wojciech Rytter in 1980 [1]

But that's still actually wrong. Computing the table was not even discussed in the "original paper" -- see the papers linked by @oefrha in this thread. They're two different 1977 papers!

[1] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-sea...

ahelwer5y ago

Yeah I wrote that Wikipedia article, as much as a single person can. Used the textbook Strings, Trees, and Sequences by Gusfield; it never made mention of Rytter. I wouldn't say Wikipedia is a "terrible" source, though; compared to what? Papers? Textbooks? During the course of my career I've found major technical errors in about half of CS papers. Now the error is being documented & fixed in the Wikipedia article.

My main takeaway was that people should avoid BM in favor of KMP as much as possible if you're writing a string search function, because you're guaranteed to have some bugs when implementing BM. Writing your own string search function seems nearly on the same level as implementing your own cryptographic functions, though.

Interestingly someone reported a bug in my BM sample code on the wikipedia article talk page some years ago (still not yet fixed) which might be related to this Rytter issue. And someone forked my wikipedia article code repo yesterday, which I thought was very weird; must be related to this.

I guess coming to a further defence of wikipedia: as you read technical books & papers (or philosophy!) you see knowledge is extremely unevenly-distributed in society. Much of it is locked up inside these dense, expensive books (if you're lucky!) and even more of it is sitting undigested in papers written for people with five years of postgrad education, and even more is in expert's brains and never put to paper. These barriers stay up for a long time, decades even. We just witnessed one such barrier falling. Wikipedia is an optimal resting place for such liberated knowledge, and I would go so far as to say knowledge that is not on Wikipedia is not yet liberated.

StephanTLavavej5y ago

> you're guaranteed to have some bugs when implementing BM. Writing your own string search function seems nearly on the same level as implementing your own cryptographic functions, though.

And that’s the Curse Of The Standard Library Implementers: we’re the ones that have to implement Boyer-Moore, floating-point string conversions, etc. so everyone else can use them :-)

1 more reply

billyoneal5y ago

He meant the original paper for calculating delta2 which was merely referenced by the original BM paper.

Sadly this seems to be the pattern for most everything coming from 'academia' ever.

praptak5y ago

Lol, I was taught by Rytter like 20 years ago. He did much more in text algorithms than just correcting B-M. The two books by Crochemore & Rytter are basically the books on text algorithms.

burntsushi5y ago

> the books on text algorithms.

And also "Flexible Pattern Matching in Strings" by Navarro and Raffinot.

skyde5y ago

This guy is a hero! I have so much respect for his work.

By i’m still wondering if this bug could have been detected by automatic fuzzing testing ?

StephanTLavavej5y ago

I believe that fuzz testing would have found it, yes. I'm not sure if fuzz testing would have been guided towards highly repetitive patterns, given the relative lack of branches in the table construction code, but the incorrectly handled patterns can be very short so fuzz testing should have stumbled upon them (given a pattern that generated an incorrect table, finding a haystack which doesn't work is merely a matter of having the needle occur at a specific offset).

Currently we don't fuzz test MSVC's STL, but that is an extremely interesting area to explore in the future. I played around with this a little while working on std::from_chars() (and found no bugs), but it involved actually porting the code to Linux to be compiled with American Fuzzy Lop, so I didn't permanently add such test coverage to our repo.

marvy5y ago

It seems AFL might have been ported to Windows:

https://github.com/ivanfratric/winafl

(don't know; just found this two minutes ago)

1 more reply

hyperpape5y ago

A technique that I've used while fuzz testing string algorithms is to run all tests first with small alphabets (perhaps 4-6 characters), then with full alphabets. It seems to increase the probability of finding bugs with relatively small runs.

My work is on a personal project, not algorithms that have been in production for years, so it's possible that this wouldn't be that productive.

1 more reply

typon5y ago

I know him as the cpp pirate

zvrba5y ago

> However, the published algorithm for dd' was incorrect! This was discovered and fixed by Rytter in 1980, which we weren't aware of until we received a bug report.

Fits what Knuth said: "Beware of bugs in the above code; I have only proved it correct, not tried it."

karma_fountain5y ago

Nice fix! It's been a while since I've done c++ so does the following code

  const auto elapsed = steady_clock::now() - start;
  if (elapsed > 10s) {

actually use a units quantifier on the 10, i.e. does 10s mean 10 seconds? Is that a library thing? Cool if so.

sgn5y ago

It’s called user-defined literal suffix in C++11. https://en.cppreference.com/w/cpp/language/user_literal The “s” suffix come with C++14 https://en.cppreference.com/w/cpp/chrono/operator%22%22s

andijcr5y ago

that's a user defined literal, a language feature: https://en.cppreference.com/w/cpp/language/user_literal

the std::chrono library reserved some suffixes (under std::literals::chrono_literals) for time units https://en.cppreference.com/w/cpp/symbol_index/chrono_litera...

StephanTLavavej5y ago

And this works because I have `using namespace std;` in the test, which makes all of the Standard Library's "user"-defined literals available. So does `using namespace std::literals;` (without dragging in other names). There's also the extremely fine-grained `using namespace std::chrono_literals;` (no need to type `using namespace std::literals::chrono_literals;` due to how inline namespaces work) which drags in just the chrono duration literals. Finally, `using namespace std::chrono;` drags in all of the chrono names and the chrono literals.

The reason that a using-directive of some kind is necessary is that UDLs are provided by operators, and directly spelling out the operator in order to namespace-qualify it would be self-defeating. (Many people dislike `using namespace std;` and I understand their point, but when one works on the STL all day every day, this using-directive is sure nice for test code.)

My exhaustive test case for this is (demonstrating all the different ways that chrono literals can be accessed): https://github.com/microsoft/STL/blob/31419650d472932dd24b9c...

rurban5y ago

Nice, but using decade old and overly slow string search algorithms still is questionable. There are hundreds of faster algos, the fastest and most stable currently being epsm (2013). https://www.dmi.unict.it/~faro/smart/algorithms.php?algorith...

mcbain5y ago

A while back I filed an issue on the MSVC std::vector implementation as the behaviour of a weird corner case differed from GCC/Clang.

The reply I got from STL himself was fantastic, pointing out that it was [paraphrasing here] a wart in the spec, and that neither were actually incorrect.

To my lasting regret, I didn't keep an ahem offsite backup of some of those corp emails. Hence I can't recall the exact details.

StephanTLavavej5y ago

This sounds like the vector<list<unique_ptr<T>>> fiasco, where a vector contains elements that are movable-only (you can't copy a list<unique_ptr<T>>, only move it), yet with throwing move constructors (as MSVC's list, set, and other node-based containers dynamically allocate sentinel nodes when default/move-constructed - this is a behavior difference from GCC's libstdc++ and Clang's libc++, where both dynamically-allocated and container-internal sentinel nodes are permitted by the Standard). The doom begins when STL containers like list (also vector itself) don't have "constrained copy constructors" - that is, list(const list&) always appears to be available, as far as type traits like is_copy_constructible are concerned, even if that copy constructor can't actually compile. The doom accelerates when vector<Elem> needs to ask the question "are you nothrow-movable? If so, I can move you efficiently. Otherwise, are you copyable? If so, I can fall back to copying you during reallocation, in order to preserve the strong exception guarantee. Otherwise, you're throwing movable-only, so I'll move you and you get the basic guarantee". The doom is complete when list<unique_ptr<T>> correctly reports that it isn't nothrow-move-constructible (it would throw due to allocating a sentinel), yet incorrectly claims to be is_copy_constructible. Therefore the vector tries to copy the list<unique_ptr<T>> and that fails to instantiate, resulting in an epic compiler error.

This situation is a fiasco because there is no good way out of it (constraining container copy constructors isn't possible because the Standard permits incomplete element types for many containers, and users use them for other containers even though they technically shouldn't, weakening EH guarantees is potentially problematic, changing the nature of sentinel nodes is ABI-breaking and affects iterator invalidation guarantees). It is also highly memorable, which is why I can guess what it was :-)

mcbain5y ago

I only just noticed this reply - many apologies.

Yes, that's the one! Our fix was explicitly deleting the copy constructor. In essence a one-liner for us, but you can't see the iceberg of an issue beneath the surface.

Thanks once again. :)

skyde5y ago

Any one know why the Turbo Boyer-Moore algorithm is not more popular. Any case when Boyer Moore would be better?

MattPalmer10865y ago

There are literally hundreds of string search algorithms published now (including a large number of Boyer Moore variants).

There's a great resource called SMART for those interested in exploring this, with implementations of many of them and a simple benchmarking tool.

https://smart-tool.github.io/smart/

Although Boyer Moore is well known, there are generally faster algorithms available these days.

For short patterns, SHIFTOR will tend to be a lot faster than BM. Even the much simpler variant of BM, Horspool is usually faster than BM.

This highlights an interesting tension in search algorithm design. Often a simpler algorithm outperforms one with theoretically higher performance, but not always!

rurban5y ago

Fastest and best is the latest, EPSM. For all cases, given you are on x86_64 or aarch64 with sse4.2

The cited implementation has only minor problems, which I fixed in my fork.

billyoneal5y ago

1. What empirical data we have suggests that the 'good suffix' rule engages fairly rarely; that's why the -Horspool variant wins most tests. (It isn't just that you don't have to build Delta2, it also saves a lot of comparisons in the algorithm itself for most inputs.) 2. Because the standard says this is the Boyer-Moore algorithm :D

bjoli5y ago

because BM or BM-horspool is generally very good over all and very few algorithms are always better.

kazinator5y ago

> I find it very curious that it isn't constantly mentioned when explaining Boyer-Moore

Papers need "superseded by ..." type forward references, like IETF RFC's. Or rather "corrected by", in this case.

Maybe important algorithms need to be summarized in RFC-like documents.

buckminster5y ago

Exactly the same problem occurs with legislation. For years this was solved by third-parties publishing enormous annotated legal references. Nowadays the government (in the UK anyway) publishes legislation online incorporating all the edits made by later legislation.

Knuth's TAOCP serves as an annotated third-party reference, except that in this case he's not a third-party. I don't have a copy to hand to see if it has the correct version of the algorithm.

NCG_Mike5y ago

I remember Dr Dobbs, way back when, having an article on the search algorithm. I think they had C code for it.

I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was working on.

kzrdude5y ago

So there is potentially a simpler fix than the Rytter correction, is that also yet to discover in published papers somewhere, or do we need to ask Knuth himself?

StephanTLavavej5y ago

Apparently there are multiple fixes, some of which are said to be simpler. See https://github.com/microsoft/STL/issues/727 and the cited comments within, which quote Knuth. (I'm learning a lot by having this PR appear on HN!)

fithisux5y ago

Is Dlang affected?

j / k navigate · click thread line to collapse

53 comments

bumbledraven5y ago

What is a minimal test case (needle + haystack) for which the old code yields a different result than the new code?

StephanTLavavej5y ago

jameshart5y ago

Seems astonishing that nobody has run into this in the wild. How does it do on finding ‘www’ in ‘http://www.example.com/, for example?

2 more replies

nostrademons5y ago

There are a bunch of test cases here:

https://github.com/microsoft/STL/pull/724/commits/a7da5cce8c...

Looks like it can happen with quite small needles, eg. "aa".

StephanTLavavej5y ago

Disclaimer: I don't understand the algorithms deeply enough to write them from scratch, and I'm writing this at 6 AM.

oefrha5y ago

The papers:

- Boyle, Moore (1977) https://www.cs.utexas.edu/users/moore/publications/fstrpos.p...

- Knuth, Morris, Pratt (1977) https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f...

- Rytter (1980) https://doi.org/10.1137/0209037 (paywalled without subscription, but you know where to find it cough scihub cough).

dang5y ago

Submitted title was "A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix", which is nice additional context, but can now go here rather than there.

tacitusarc5y ago

What I really enjoy about this merge request are these comments:

And

> Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it confusing that _Idx was 0-based in other loops but 1-based in this loop.)

ahelwer5y ago

ivanbakel5y ago

>For some reason a lot of sources present the algorithms as 1-based.

3 more replies

040919485y ago

maybe it's because of Pascal. I remember even Cormen used to have pascal style pseudo-code.

a13692099935y ago

Speaking of things that make code needlessly difficult to verify, we really need to start insisting on correct (zero-based) indexing in academic algorithms.

dvt5y ago

> The original paper contained errors for computing the table of pattern shifts which was corrected by Wojciech Rytter in 1980 [1]

But that's still actually wrong. Computing the table was not even discussed in the "original paper" -- see the papers linked by @oefrha in this thread. They're two different 1977 papers!

[1] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-sea...

ahelwer5y ago

StephanTLavavej5y ago

> you're guaranteed to have some bugs when implementing BM. Writing your own string search function seems nearly on the same level as implementing your own cryptographic functions, though.

And that’s the Curse Of The Standard Library Implementers: we’re the ones that have to implement Boyer-Moore, floating-point string conversions, etc. so everyone else can use them :-)

1 more reply

billyoneal5y ago

He meant the original paper for calculating delta2 which was merely referenced by the original BM paper.

Sadly this seems to be the pattern for most everything coming from 'academia' ever.

praptak5y ago

Lol, I was taught by Rytter like 20 years ago. He did much more in text algorithms than just correcting B-M. The two books by Crochemore & Rytter are basically the books on text algorithms.

burntsushi5y ago

> the books on text algorithms.

And also "Flexible Pattern Matching in Strings" by Navarro and Raffinot.

skyde5y ago

This guy is a hero! I have so much respect for his work.

By i’m still wondering if this bug could have been detected by automatic fuzzing testing ?

StephanTLavavej5y ago

marvy5y ago

It seems AFL might have been ported to Windows:

https://github.com/ivanfratric/winafl

(don't know; just found this two minutes ago)

1 more reply

hyperpape5y ago

My work is on a personal project, not algorithms that have been in production for years, so it's possible that this wouldn't be that productive.

1 more reply

typon5y ago

I know him as the cpp pirate

zvrba5y ago

> However, the published algorithm for dd' was incorrect! This was discovered and fixed by Rytter in 1980, which we weren't aware of until we received a bug report.

Fits what Knuth said: "Beware of bugs in the above code; I have only proved it correct, not tried it."

karma_fountain5y ago

Nice fix! It's been a while since I've done c++ so does the following code

  const auto elapsed = steady_clock::now() - start;
  if (elapsed > 10s) {

actually use a units quantifier on the 10, i.e. does 10s mean 10 seconds? Is that a library thing? Cool if so.

sgn5y ago

It’s called user-defined literal suffix in C++11. https://en.cppreference.com/w/cpp/language/user_literal The “s” suffix come with C++14 https://en.cppreference.com/w/cpp/chrono/operator%22%22s

andijcr5y ago

that's a user defined literal, a language feature: https://en.cppreference.com/w/cpp/language/user_literal

the std::chrono library reserved some suffixes (under std::literals::chrono_literals) for time units https://en.cppreference.com/w/cpp/symbol_index/chrono_litera...

StephanTLavavej5y ago

My exhaustive test case for this is (demonstrating all the different ways that chrono literals can be accessed): https://github.com/microsoft/STL/blob/31419650d472932dd24b9c...

rurban5y ago

mcbain5y ago

A while back I filed an issue on the MSVC std::vector implementation as the behaviour of a weird corner case differed from GCC/Clang.

The reply I got from STL himself was fantastic, pointing out that it was [paraphrasing here] a wart in the spec, and that neither were actually incorrect.

To my lasting regret, I didn't keep an ahem offsite backup of some of those corp emails. Hence I can't recall the exact details.

StephanTLavavej5y ago

mcbain5y ago

I only just noticed this reply - many apologies.

Yes, that's the one! Our fix was explicitly deleting the copy constructor. In essence a one-liner for us, but you can't see the iceberg of an issue beneath the surface.

Thanks once again. :)

skyde5y ago

Any one know why the Turbo Boyer-Moore algorithm is not more popular. Any case when Boyer Moore would be better?

MattPalmer10865y ago

There are literally hundreds of string search algorithms published now (including a large number of Boyer Moore variants).

There's a great resource called SMART for those interested in exploring this, with implementations of many of them and a simple benchmarking tool.

https://smart-tool.github.io/smart/

Although Boyer Moore is well known, there are generally faster algorithms available these days.

For short patterns, SHIFTOR will tend to be a lot faster than BM. Even the much simpler variant of BM, Horspool is usually faster than BM.

This highlights an interesting tension in search algorithm design. Often a simpler algorithm outperforms one with theoretically higher performance, but not always!

rurban5y ago

Fastest and best is the latest, EPSM. For all cases, given you are on x86_64 or aarch64 with sse4.2

The cited implementation has only minor problems, which I fixed in my fork.

billyoneal5y ago

bjoli5y ago

because BM or BM-horspool is generally very good over all and very few algorithms are always better.

kazinator5y ago

> I find it very curious that it isn't constantly mentioned when explaining Boyer-Moore

Papers need "superseded by ..." type forward references, like IETF RFC's. Or rather "corrected by", in this case.

Maybe important algorithms need to be summarized in RFC-like documents.

buckminster5y ago

Knuth's TAOCP serves as an annotated third-party reference, except that in this case he's not a third-party. I don't have a copy to hand to see if it has the correct version of the algorithm.

NCG_Mike5y ago

I remember Dr Dobbs, way back when, having an article on the search algorithm. I think they had C code for it.

I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was working on.

kzrdude5y ago

So there is potentially a simpler fix than the Rytter correction, is that also yet to discover in published papers somewhere, or do we need to ask Knuth himself?

StephanTLavavej5y ago

fithisux5y ago

Is Dlang affected?

j / k navigate · click thread line to collapse