https://github.com/microsoft/STL/pull/724/commits/a7da5cce8c...
Looks like it can happen with quite small needles, eg. "aa".
Disclaimer: I don't understand the algorithms deeply enough to write them from scratch, and I'm writing this at 6 AM.
- Boyle, Moore (1977) https://www.cs.utexas.edu/users/moore/publications/fstrpos.p...
- Knuth, Morris, Pratt (1977) https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f...
- Rytter (1980) https://doi.org/10.1137/0209037 (paywalled without subscription, but you know where to find it cough scihub cough).
> Rename _Suffix_fn to _Fx, again following the papers. Note that in the usage below, there is a semantic change: _Suffix_fn stored 0-based values, while _Fx stores 1-based values. This undoes a micro-optimization (_Suffix_fn avoided unnecessarily translating between the 0-based and 1-based domains), but with the increased usage of f in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.
And
> Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it confusing that _Idx was 0-based in other loops but 1-based in this loop.)
Naming is one of the most strangely difficult aspects of programming, but if there is one rule it's be consistent. I really hate code that uses a variable name to mean one thing in one place and other somewhere else. I just learned what that meant! Why change it?! And yet this is common, and sometimes a comment about it in code review might get a response like "the code works, why does it matter?" This. This is why it matters. It causes confusion, which makes code difficult to verify, which causes bugs.
Because that's a natural mathematical model. "the n-th element" being at position n is handy and natural. What's nuts is that array indexing based on pointer offsets has made its way into nearly all high-level languages.
> The original paper contained errors for computing the table of pattern shifts which was corrected by Wojciech Rytter in 1980 [1]
But that's still actually wrong. Computing the table was not even discussed in the "original paper" -- see the papers linked by @oefrha in this thread. They're two different 1977 papers!
[1] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-sea...
My main takeaway was that people should avoid BM in favor of KMP as much as possible if you're writing a string search function, because you're guaranteed to have some bugs when implementing BM. Writing your own string search function seems nearly on the same level as implementing your own cryptographic functions, though.
Interestingly someone reported a bug in my BM sample code on the wikipedia article talk page some years ago (still not yet fixed) which might be related to this Rytter issue. And someone forked my wikipedia article code repo yesterday, which I thought was very weird; must be related to this.
I guess coming to a further defence of wikipedia: as you read technical books & papers (or philosophy!) you see knowledge is extremely unevenly-distributed in society. Much of it is locked up inside these dense, expensive books (if you're lucky!) and even more of it is sitting undigested in papers written for people with five years of postgrad education, and even more is in expert's brains and never put to paper. These barriers stay up for a long time, decades even. We just witnessed one such barrier falling. Wikipedia is an optimal resting place for such liberated knowledge, and I would go so far as to say knowledge that is not on Wikipedia is not yet liberated.
And that’s the Curse Of The Standard Library Implementers: we’re the ones that have to implement Boyer-Moore, floating-point string conversions, etc. so everyone else can use them :-)
Sadly this seems to be the pattern for most everything coming from 'academia' ever.
And also "Flexible Pattern Matching in Strings" by Navarro and Raffinot.
By i’m still wondering if this bug could have been detected by automatic fuzzing testing ?
Currently we don't fuzz test MSVC's STL, but that is an extremely interesting area to explore in the future. I played around with this a little while working on std::from_chars() (and found no bugs), but it involved actually porting the code to Linux to be compiled with American Fuzzy Lop, so I didn't permanently add such test coverage to our repo.
https://github.com/ivanfratric/winafl
(don't know; just found this two minutes ago)
My work is on a personal project, not algorithms that have been in production for years, so it's possible that this wouldn't be that productive.
Fits what Knuth said: "Beware of bugs in the above code; I have only proved it correct, not tried it."
const auto elapsed = steady_clock::now() - start;
if (elapsed > 10s) {
actually use a units quantifier on the 10, i.e. does 10s mean 10 seconds? Is that a library thing? Cool if so.the std::chrono library reserved some suffixes (under std::literals::chrono_literals) for time units https://en.cppreference.com/w/cpp/symbol_index/chrono_litera...
The reason that a using-directive of some kind is necessary is that UDLs are provided by operators, and directly spelling out the operator in order to namespace-qualify it would be self-defeating. (Many people dislike `using namespace std;` and I understand their point, but when one works on the STL all day every day, this using-directive is sure nice for test code.)
My exhaustive test case for this is (demonstrating all the different ways that chrono literals can be accessed): https://github.com/microsoft/STL/blob/31419650d472932dd24b9c...
The reply I got from STL himself was fantastic, pointing out that it was [paraphrasing here] a wart in the spec, and that neither were actually incorrect.
To my lasting regret, I didn't keep an ahem offsite backup of some of those corp emails. Hence I can't recall the exact details.
This situation is a fiasco because there is no good way out of it (constraining container copy constructors isn't possible because the Standard permits incomplete element types for many containers, and users use them for other containers even though they technically shouldn't, weakening EH guarantees is potentially problematic, changing the nature of sentinel nodes is ABI-breaking and affects iterator invalidation guarantees). It is also highly memorable, which is why I can guess what it was :-)
Yes, that's the one! Our fix was explicitly deleting the copy constructor. In essence a one-liner for us, but you can't see the iceberg of an issue beneath the surface.
Thanks once again. :)
There's a great resource called SMART for those interested in exploring this, with implementations of many of them and a simple benchmarking tool.
https://smart-tool.github.io/smart/
Although Boyer Moore is well known, there are generally faster algorithms available these days.
For short patterns, SHIFTOR will tend to be a lot faster than BM. Even the much simpler variant of BM, Horspool is usually faster than BM.
This highlights an interesting tension in search algorithm design. Often a simpler algorithm outperforms one with theoretically higher performance, but not always!
The cited implementation has only minor problems, which I fixed in my fork.
Papers need "superseded by ..." type forward references, like IETF RFC's. Or rather "corrected by", in this case.
Maybe important algorithms need to be summarized in RFC-like documents.
Knuth's TAOCP serves as an annotated third-party reference, except that in this case he's not a third-party. I don't have a copy to hand to see if it has the correct version of the algorithm.
I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was working on.