Out of curiosity: how much optimizing potential is in the software that was programmed by Burntsushi in Rust?
In the literature, there are generally two different variants of Aho-Corasick. The traditional formulation follows failure transitions during search, which comes with overhead. In effect, it's an NFA since search can traverse through multiple states for each byte of input that it sees. The other variant is the "advanced" form, which is just effectively turning the NFA into a DFA by precomputing all of the failure transitions into a single table. This can be quite a bit faster depending on the workload, but obviously uses more memory and takes more time to build. My library supports the DFA version, but it had a bug (which has now been fixed) which prevented the OP from effectively benchmarking it.
The OP hints that some substantial number of samples use a very small number of patterns. Sometimes only one. In that case, it probably wouldn't be best to use Aho-Corasick at all. Instead, it should use an optimized single substring search algorithm, such as Two Way. Even for multiple patterns, if there aren't that many of them, it might be better to use Teddy from Hyperscan.
This article is fairly timely. I'm almost done a rewrite of my Aho-Corasick library. It will bring some small performance improvements (but more importantly a simpler API), and I hope to move the Teddy algorithm into it at some point.
As always, specific workloads can often have their performance improved quite a bit.
The DFA approach is something that I had not heard about before (I never looked into what "full" meant), but we should definitely try it; we reuse the automaton against possibly millions of haystacks, so extra memory or preprocessing time is likely worth it. Thanks for pointing this out!
An adaptive approach would indeed work better for fewer needles, but the threshold depends a lot on the data. We compared Alfred–Margaret against ICU regexes for case-insensitive replaces, and we found that even for a single needle, ours was faster in half of the cases. (It is a bit unfair because of FFI call overhead.)
many thanks for your contribution. I was not expecting to see such an extensive elaboration. Looking forward to see your rewrite, and learn something from it!
Cheers!
AC sucks. It's simple enough, but it's always going to be either slow or big - sometimes both. I'd say that's a source of frustration to me, but it's not really, as we made a lot of money selling string matching to customers back when Hyperscan was closed-source. It's so much easier to sell string matching than regex matching as regexes are full of surprises, but strings are quite tractable.
I'm not delighted with where we left string matching in Hyperscan - "FDR" was wheezing under load in a bunch of ways, and I had some better things in the pipeline. But it still should generally murder AC barring some corner cases.
Unfortunately, Hyperscan has moved away from having a pure-play literal matcher in it (what remains is really more of a specialist 1- to 8-char literal matcher), which I regard as a semi-mistake (so there may be more corner cases than there used to be).
I've read masters theses that were significantly less thorough and well-researched.
I don’t have much to add to BurntSushi’s reply. I studied the implementation briefly to take inspiration from when we decided to write our own, and there is no obvious low-hanging fruit. One small thing is that it uses a Vec of transitions per state. In Alfred–Margaret we pack all transition tables together in one array (with a separate array that maps state to start index), so it can use the cache more efficiently.
There are ways to do parallel automaton traversal with SIMD, but it only works for a very small number of states.
Yeah, the DFA packs everything into a single table. The NFA keeps them distinct so that it can support a dense representation near the root of the trie (for speed) but a sparse representation farther away from the root (for smaller memory footprint). (Confusingly, "dense" and "sparse" are flipped in the source code, much to my chagrin.) It would be possible to convert this into contiguous memory, but I'm not sure it's worth it.
Beyond that, the other possible (maybe micro) optimizations in this realm are:
1. Use equivalence classes of bytes for your state transitions instead of all possible byte values. However, this is only applicable in a dense representation, and even then, if your alphabet is 2^16 instead of 2^8, it's not clear how much this helps. Byte classes require an extra lookup at search time, but the decreased size of the automaton can be dramtic, which has means better locality and overall better performance. But again, this is compared to the typical dense representation which is probably not relevant in your case.
2. Premultiply state identifiers. Again, probably only pertinent for a dense representation. Although, it sounds like this could eliminate your extra table that maps state identifiers to start indices. The only real downside of this is that it potentially increases the minimum required integer size to represent state identifiers. But that's only relevant if you support using smaller integer sizes for state identifiers in the first place.
3. Rearrange the states such that all match states precede all non-match states (with perhaps an exception or two for the fail/dead states, depending on how you implement Aho-Corasick). In the core match loop, a match state can be determined with a comparison against a fixed integer that's probably in a register instead of a memory access.
With all that said, the best possible next step for y'all is to probably switch to a dense representation. But that might imply finding a way to efficiently use UTF-8 since a dense representation with a 2^16 sized alphabet is tricky.
All of this stuff should be available in my rewrite of the crate. It's also in my regex-automata crate.
based on a very readable paper https://www.microsoft.com/en-us/research/wp-content/uploads/...
It also seemed to require a lot of work to construct the code to make tail call optimization happen, but that’s par for the course in Haskell.
Either way, optimization in any language often requires annotations. The main benefit of Haskell though is still equational reasoning, which this does not obviate. You can get rid of all the annotations and your code will still run, be correct, and can be reasoned about
Similarly, the existence of generators and zero-argument functions in Python doesn't undermine the idea that strictness by default could have benefits.
Though, I suppose you could then say that if the rules are different for low level optimization, they may as well have imported an external library written in Rust.
> A naive Python script that loops over all needles and calls str.find in a loop to find all matches.
> working through all 3.5 gigabytes of test data once
> Time is ~6.5s.
I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C/Rust/hand-written Haskell implementations. In many-many more cases, well, even Python with a naive implementation is good enough.
Not to diss Haskell or Rust. Only saying: chose your tools as you see fit, and benchmark.
Also kudos to the authors to use mean instead of average in the graphs.
I'm not going to outright call this "cheating", but most of the time, virtually unconditionally, when Python is fast it's because it isn't Python, it's C.
It definitely feels worth noting that Python is radically slower than other languages, and the main "tool" it provides for improving performance is to rely on another language entirely.
But I don't disagree that the Python owe's it's performance to C.. it owes everything to C :)
What do you mean? "Mean" with no qualification usually denotes arithmetic average (or mean).
For benchmarks much better option would be geometric mean. Because the relation of the results values not the difference.
Just because it isnt identical in implementation to the other languages doesn't mean its much simpler. Its fast because enough people cared to make it that way - it wasn't by some accident
Once you start working on that data using procedures actually written I python performance is usually far from stellar.
I have had that happen to me many times. Back in the guile 2.0 days I tried porting some utils to python based on preliminary benchmarks that were often an order of magnitude faster than my guile ones, but when the logic was implemented the difference was gone. Then guile 2.2 (and now guile 3) happened and everything got magically faster, even though less and less of the runtime is written in C.
[1]: http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf [2]: http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pg...
https://stackoverflow.com/questions/23765903/why-is-text-utf...
It's a shame Haskell copies strings across FFI boundaries, because binding the Rust version would have been almost as fast, with (presumably) significantly less effort.
Presumably creating lots of small strings is a typical use case for Text so losing compacting gc hurts. ByteString on the other hand is more used with binary data and is designed for ffi use.
The other common fusion technique used in haskell is stream fusion which is very similar to the mutual-recursion-encoded-as-sum-type version in the blog post.
There is also indexing-based fusion which the Repa library uses. It allows more control over traversal order (think optimizing stencil codes) but requires regular arrays, e.g. no filtering.
There are some other research projects like specialized compiler passes via hermit but afaik nothing really production ready.
1) i see they are doing fully qualified imports. please add major bounds to your version deps, stack doesn't excuse not tracking what versions you used!
2) they have
"data Next a = Done !a | Step !a " in their code,
this tells me they could have possibly benefited using the stream abstraction in Vector!
https://hackage.haskell.org/package/vector-0.12.0.2/docs/Dat... is a link to the latter
The exact versions they used are specified by the stack.yaml file.
In this case, see https://stackage.org/lts-13.10
Stack has a feature that lets you automatically added bounds to a hackage upload based on the stack.yaml. (However, I personally do not recommend using this feature, as those bounds usually end up being overly restrictive. Manually-written bounds are better.)
I do agree human judgement based bounds are best, but stackage snapshots aren't that. Furthermore, stackage/fpo LTS is not LTS in the same sense as linux-distributions and their BSD-unix cousins. Proper long term support in any meaningful sense is a commitment by the distro provider to backport security/bug fixes to the LTS. Stackage LTS doesnt do that.
stack tricks its users into not thinking about their relationship with their dependencies and poisons library code from being usable by haskell devs/users/contributors who dont use stack. :)
the more you know! :)
To do this right you need a fast prefilter and a reasonable fast string match. Our string matcher in Hyperscan blew AC away on size and speed, typically. Unfortunately, I never had a chance to build its successor. But AC is a disaster of one dependent load after another and is mediocre at best in terms of performance.
1. There is no write up of the Hyperscan algorithm of which you speak. I wrote up a part of Teddy, but I still don't understand the full algorithm. Hyperscan's code takes serious effort to understand. I spent days staring at Teddy and I still didn't get everything.
2. Aho-Corasick is reasonably simple to implement and does fairly decently for a large number of tasks.
3. Aho-Corasick is trivially portable.
The fact that AC is portable and simple and that everyone feels good about themselves as a result of implementing a classical-sounding algorithm (why AC and not Rabin-Karp? who knows?) is why is keeps cropping up, like kudzu. It's still either (a) huge, (b) slow, (c) both and (d) a poor fit for modern architectures - it has a way of turning the problem into a latency problem for whichever level of memory its structures fit into (rather than a throughput problem).
It's a solution. Big deal. There are half-a-dozen better ones just floating around in the literate, but this shitty one seems to be the one people fixate on. Maybe it's because Aho and Corasick sort earlier in the dictionary than Rabin and Karp?
> Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double.
This would scare we about doing any real performance optimization in a language as high-level as Haskell: am I just fiddling with compiler internals now?
I'm the author of pyahocrasick, a python C extension that implements the AC. I'd love to see it in the comparison. :)