How we made Haskell search strings as fast as Rust

(tech.channable.com)

246 points · duijf · 7y ago · 87 comments

wuschel7y ago· 11 in thread

Interesting article.

Out of curiosity: how much optimizing potential is in the software that was programmed by Burntsushi in Rust?

burntsushi7y ago

There is definitely some room. As usual, it depends. I'm away from the computer, so I haven't had a chance to look and see if the OP's benchmarks are published. So it's hard for me to say anything specific at the moment.

In the literature, there are generally two different variants of Aho-Corasick. The traditional formulation follows failure transitions during search, which comes with overhead. In effect, it's an NFA since search can traverse through multiple states for each byte of input that it sees. The other variant is the "advanced" form, which is just effectively turning the NFA into a DFA by precomputing all of the failure transitions into a single table. This can be quite a bit faster depending on the workload, but obviously uses more memory and takes more time to build. My library supports the DFA version, but it had a bug (which has now been fixed) which prevented the OP from effectively benchmarking it.

The OP hints that some substantial number of samples use a very small number of patterns. Sometimes only one. In that case, it probably wouldn't be best to use Aho-Corasick at all. Instead, it should use an optimized single substring search algorithm, such as Two Way. Even for multiple patterns, if there aren't that many of them, it might be better to use Teddy from Hyperscan.

This article is fairly timely. I'm almost done with a rewrite of my Aho-Corasick library. It will bring some small performance improvements (but more importantly a simpler API), and I hope to move the Teddy algorithm into it at some point.

As always, specific workloads can often have their performance improved quite a bit.
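A minimal sketch of the two Aho-Corasick variants described in this comment, written for illustration only. It is not taken from the Rust crate or from Alfred–Margaret, and every function name in it is hypothetical. The first search follows failure links at search time (the "NFA" form); the second precomputes them into one table per state (the "DFA" form), assuming the alphabet passed in covers every character used by the patterns.

```python
from collections import deque

def build_nfa(patterns):
    """Build the trie, failure links, and output lists (the "NFA" form)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # indices of the patterns that end at each state
    for index, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(index)
    # Breadth-first pass: a state's failure link points at a shallower state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def search_nfa(nfa, text):
    goto, fail, out = nfa
    state, matches = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # possibly several hops per character
        state = goto[state].get(ch, 0)
        matches.extend((pos, i) for i in out[state])
    return matches

def build_dfa(nfa, alphabet):
    """Precompute every failure hop: one row per state, one entry per symbol."""
    goto, fail, out = nfa
    table = [None] * len(goto)
    table[0] = {c: goto[0].get(c, 0) for c in alphabet}
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        fallback = table[fail[state]]    # already built: it is shallower
        table[state] = {c: goto[state].get(c, fallback[c]) for c in alphabet}
        queue.extend(goto[state].values())
    return table, out

def search_dfa(dfa, text):
    table, out = dfa
    state, matches = 0, []
    for pos, ch in enumerate(text):
        state = table[state].get(ch, 0)  # exactly one lookup per character
        matches.extend((pos, i) for i in out[state])
    return matches
```

Both searches report `(end_position, pattern_index)` pairs. The DFA table has one entry per state per symbol, which is where the extra memory and build time mentioned above come from.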

ruuda7y ago

Thanks for the elaborate reply! Our benchmark input data contains customer data which we cannot share, but we could share the benchmark program.

The DFA approach is something that I had not heard about before (I never looked into what "full" meant), but we should definitely try it; we reuse the automaton against possibly millions of haystacks, so extra memory or preprocessing time is likely worth it. Thanks for pointing this out!

An adaptive approach would indeed work better for fewer needles, but the threshold depends a lot on the data. We compared Alfred–Margaret against ICU regexes for case-insensitive replaces, and we found that even for a single needle, ours was faster in half of the cases. (It is a bit unfair because of FFI call overhead.)

wuschel7y ago

Dear burntsushi,

many thanks for your contribution. I was not expecting to see such an extensive elaboration. Looking forward to seeing your rewrite and learning something from it!

Cheers!

glangdale7y ago

You should borrow "FDR" from Hyperscan as well. :-)

AC sucks. It's simple enough, but it's always going to be either slow or big - sometimes both. I'd say that's a source of frustration to me, but it's not really, as we made a lot of money selling string matching to customers back when Hyperscan was closed-source. It's so much easier to sell string matching than regex matching as regexes are full of surprises, but strings are quite tractable.

I'm not delighted with where we left string matching in Hyperscan - "FDR" was wheezing under load in a bunch of ways, and I had some better things in the pipeline. But it still should generally murder AC barring some corner cases.

Unfortunately, Hyperscan has moved away from having a pure-play literal matcher in it (what remains is really more of a specialist 1- to 8-char literal matcher), which I regard as a semi-mistake (so there may be more corner cases than there used to be).

sevensor7y ago

I recommend reading https://blog.burntsushi.net/ripgrep/

I've read master's theses that were significantly less thorough and well-researched.

wuschel7y ago

I just had a quick look - indeed, the level of detail in the ripgrep blog post is impressive. Many thanks for the link!

ruuda7y ago

Let me start by saying that I am a big fan of BurntSushi’s work, and this post was by no means intended to detract from that. If anything, it set a high bar for us.

I don’t have much to add to BurntSushi’s reply. I studied the implementation briefly to take inspiration from when we decided to write our own, and there is no obvious low-hanging fruit. One small thing is that it uses a Vec of transitions per state. In Alfred–Margaret we pack all transition tables together in one array (with a separate array that maps state to start index), so it can use the cache more efficiently.

There are ways to do parallel automaton traversal with SIMD, but it only works for a very small number of states.
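A rough sketch of the packed layout described above: all transitions in one contiguous array, plus a second array mapping each state to its start index. This is an illustration under assumptions, not Alfred–Margaret's actual code. The names `pack_transitions` and `lookup` are made up, and symbols are assumed to be 16-bit code units.

```python
from array import array

def pack_transitions(per_state):
    """per_state: one dict {code_unit: next_state} per state.

    Returns (offsets, symbols, targets). State s owns the slice
    offsets[s]:offsets[s + 1] of the two parallel arrays.
    """
    offsets = array("I", [0])
    symbols = array("H")   # 16-bit code units
    targets = array("I")
    for transitions in per_state:
        for symbol, target in sorted(transitions.items()):
            symbols.append(symbol)
            targets.append(target)
        offsets.append(len(symbols))
    return offsets, symbols, targets

def lookup(packed, state, symbol):
    """Linear scan over the state's slice; None means "no transition"."""
    offsets, symbols, targets = packed
    for i in range(offsets[state], offsets[state + 1]):
        if symbols[i] == symbol:
            return targets[i]
    return None
```

Compared with one separately allocated list per state, the transitions of neighbouring states sit next to each other in memory, which is the cache argument made in the comment.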

burntsushi7y ago

Thanks for the kind words. :)

Yeah, the DFA packs everything into a single table. The NFA keeps them distinct so that it can support a dense representation near the root of the trie (for speed) but a sparse representation farther away from the root (for smaller memory footprint). (Confusingly, "dense" and "sparse" are flipped in the source code, much to my chagrin.) It would be possible to convert this into contiguous memory, but I'm not sure it's worth it.

Beyond that, the other possible (maybe micro) optimizations in this realm are:

1. Use equivalence classes of bytes for your state transitions instead of all possible byte values. However, this is only applicable in a dense representation, and even then, if your alphabet is 2^16 instead of 2^8, it's not clear how much this helps. Byte classes require an extra lookup at search time, but the decreased size of the automaton can be dramatic, which means better locality and overall better performance. But again, this is compared to the typical dense representation which is probably not relevant in your case.

2. Premultiply state identifiers. Again, probably only pertinent for a dense representation. Although, it sounds like this could eliminate your extra table that maps state identifiers to start indices. The only real downside of this is that it potentially increases the minimum required integer size to represent state identifiers. But that's only relevant if you support using smaller integer sizes for state identifiers in the first place.

3. Rearrange the states such that all match states precede all non-match states (with perhaps an exception or two for the fail/dead states, depending on how you implement Aho-Corasick). In the core match loop, a match state can be determined with a comparison against a fixed integer that's probably in a register instead of a memory access.
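The three ideas in this list can be sketched together on a toy automaton. The code below is an assumption-laden illustration, not the layout used by the aho-corasick or regex-automata crates: it starts from a hand-built dense DFA for the single needle `ab`, and all names are hypothetical.

```python
ALPHABET = 256

def make_dense_dfa():
    """Hand-built dense DFA for the needle b"ab" (illustration only).
    State 0: start, state 1: seen "a", state 2: seen "ab" (match)."""
    dense = [0] * (3 * ALPHABET)
    for state in range(3):
        dense[state * ALPHABET + ord("a")] = 1
    dense[1 * ALPHABET + ord("b")] = 2
    return dense, [False, False, True]

def byte_classes(dense, num_states):
    """(1) Bytes whose columns are identical in every state share a class."""
    classes, seen = [0] * ALPHABET, {}
    for b in range(ALPHABET):
        column = tuple(dense[s * ALPHABET + b] for s in range(num_states))
        classes[b] = seen.setdefault(column, len(seen))
    return classes, len(seen)

def compile_dfa(dense, is_match):
    num_states = len(is_match)
    classes, stride = byte_classes(dense, num_states)
    # (3) Renumber states so that match states get the smallest identifiers.
    order = sorted(range(num_states), key=lambda s: not is_match[s])
    new_id = {old: new for new, old in enumerate(order)}
    table = [0] * (num_states * stride)
    for old in range(num_states):
        for b in range(ALPHABET):
            target = dense[old * ALPHABET + b]
            # (2) Store premultiplied identifiers: new_id * stride.
            table[new_id[old] * stride + classes[b]] = new_id[target] * stride
    match_limit = sum(is_match) * stride
    return table, classes, new_id[0] * stride, match_limit

def find_ends(compiled, haystack):
    """Return the end position of every match in a bytes haystack."""
    table, classes, state, match_limit = compiled
    ends = []
    for pos, byte in enumerate(haystack):
        state = table[state + classes[byte]]   # no multiplication needed
        if state < match_limit:                # one integer comparison
            ends.append(pos)
    return ends
```

For this toy automaton the table shrinks from 3 × 256 entries to 3 × 3, at the cost of the extra `classes[byte]` lookup per input byte.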

With all that said, the best possible next step for y'all is to probably switch to a dense representation. But that might imply finding a way to efficiently use UTF-8 since a dense representation with a 2^16 sized alphabet is tricky.

All of this stuff should be available in my rewrite of the crate. It's also in my regex-automata crate.

unhammer7y ago

Cf. parallel automaton traversal, there's a talk on doing that in Haskell at https://www.youtube.com/watch?v=b4bb8EP_pIE&feature=youtu.be... slides: https://github.com/Gabriel439/slides/blob/master/zurihac/sli...

based on a very readable paper https://www.microsoft.com/en-us/research/wp-content/uploads/...

dswalter7y ago

Nobody's perfect, but BurntSushi isn't someone who typically does things by half measures.

biesnecker7y ago

That would be ToastedSushi.

olliej7y ago· 8 in thread

This does seem to actively circumvent a number of the things that are always held up as why Haskell is a great language: it uses strict types in a number of places, has explicit inline and no-inline annotations everywhere.

It also seemed to require a lot of work to construct the code to make tail call optimization happen, but that’s par for the course in Haskell.

tathougies7y ago

I don't think that's true at all. The goal here was to make it as fast as rust (which basically compiles down to 'straight' native code by default). Haskell, while also compiled, kind of runs on a virtual machine (the STG). It should be expected that a language with a different execution model will need some work to fit another one. Of course, one would expect the same work if you were trying to make rust run as fast as Haskell on a processor designed to execute STG expressions (of which there are none, but there could be, like the lisp machines of the past).

Either way, optimization in any language often requires annotations. The main benefit of Haskell though is still equational reasoning, which this does not obviate. You can get rid of all the annotations and your code will still run, be correct, and can be reasoned about

T-R7y ago

The thesis of laziness by default isn't "everything should be lazy", just that there's some benefit to having strictness be something you add as necessary - maybe that it's easier to add strictness to something lazy than add laziness to something strict, that it gives you more abstraction power, or maybe increases surface area for optimizations.

Similarly, the existence of generators and zero-argument functions in Python doesn't undermine the idea that strictness by default could have benefits.

j88439h847y ago

What's the significance of zero-argument functions in Python?

orblivion7y ago

I'm not that versed in Haskell myself, but is this really the case? I would think that expectations for low level optimization and general code are different.

Though, I suppose you could then say that if the rules are different for low level optimization, they may as well have imported an external library written in Rust.

monocasa7y ago

Optimized Rust feels more idiomatic at least.

mseri7y ago

They explain in the post that they did consider it but decided to avoid it due to the way the FFI and GC interact (requiring the string to be copied).

ruuda7y ago

The inline/noinline annotations on the inner functions we only had to add for GHC 8.6 though; on GHC 8.2 it was fast without these. (That turned out to be a downside, relying on a long chain of optimizations is fragile.)

gowld7y ago

No one says "Haskell is great because it's fast". They say variants of "Haskell is great because it offers ultra-high-level abstraction, mathematically precise data types, static type-checking and inference, while still giving you the freedom to write nasty low-level preprocessor line noise when you need to be as fast as C."

dmitriid7y ago· 7 in thread

What actually amazes me is that... Python is good enough:

> A naive Python script that loops over all needles and calls str.find in a loop to find all matches.

> working through all 3.5 gigabytes of test data once

> Time is ~6.5s.

I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C/Rust/hand-written Haskell implementations. In many-many more cases, well, even Python with a naive implementation is good enough.

Not to diss Haskell or Rust. Only saying: choose your tools as you see fit, and benchmark.

Also kudos to the authors to use mean instead of average in the graphs.
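The article's script is not reproduced in this thread. As a guess at what "loops over all needles and calls str.find in a loop" could look like, here is a minimal sketch; the function name and the returned tuple shape are assumptions.

```python
def find_all(needles, haystack):
    """Return (start_index, needle) for every occurrence, overlaps included."""
    matches = []
    for needle in needles:
        start = haystack.find(needle)
        while start != -1:
            matches.append((start, needle))
            # Resume one position later so overlapping matches are found too.
            start = haystack.find(needle, start + 1)
    return matches
```

The running time grows with the number of needles, since every needle triggers its own pass over the haystack, whereas Aho-Corasick makes a single pass regardless of needle count.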

staticassertion7y ago

> Firstly, the program spends most of its time in the heavily-optimized str.find function, which is implemented in C.

I'm not going to outright call this "cheating", but most of the time, virtually unconditionally, when Python is fast it's because it isn't Python, it's C.

It definitely feels worth noting that Python is radically slower than other languages, and the main "tool" it provides for improving performance is to rely on another language entirely.

zie7y ago

Well Python is written IN C, at least the standard version is, so I really wouldn't call it anywhere near cheating... :).

But I don't disagree that Python owes its performance to C... it owes everything to C :)

kachnuv_ocasek7y ago

> Also kudos to the authors to use mean instead of average in the graphs.

What do you mean? "Mean" with no qualification usually denotes arithmetic average (or mean).

dmitriid7y ago

Argh! Confused it with median. No kudos then!

balodja7y ago

> Also kudos to the authors to use mean instead of average in the graphs.

For benchmarks, a much better option would be the geometric mean, because what matters is the ratio between result values, not their difference.

https://dl.acm.org/citation.cfm?id=5673
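A small worked example of why the geometric mean suits ratios, using made-up numbers rather than anything from the article. Suppose implementation A takes twice as long as B on one input and half as long on another:

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Exponential of the mean of the logarithms; inputs must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical running-time ratios A/B on two inputs, and the inverse B/A.
a_over_b = [2.0, 0.5]
b_over_a = [0.5, 2.0]

print(arithmetic_mean(a_over_b), arithmetic_mean(b_over_a))
print(geometric_mean(a_over_b), geometric_mean(b_over_a))
```

The arithmetic mean is 1.25 in both directions, so each implementation appears 25% slower than the other. The geometric mean is 1.0 both ways, which matches the intuition that the two are tied.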

jzoch7y ago

> even Python with a naive implementation is good enough.

There is nothing naive about their implementation. 1) It uses C FFI (as many low-level Python functions do). 2) It uses a suboptimal but still not naive algo.

Just because it isn't identical in implementation to the other languages doesn't mean it's much simpler. It's fast because enough people cared to make it that way; it wasn't by some accident.

bjoli7y ago

That is a fragile benchmark. For simple scripts where you rely exclusively on Python procedures that are written in C, Python is often very fast.

Once you start working on that data using procedures actually written in Python, performance is usually far from stellar.

I have had that happen to me many times. Back in the guile 2.0 days I tried porting some utils to python based on preliminary benchmarks that were often an order of magnitude faster than my guile ones, but when the logic was implemented the difference was gone. Then guile 2.2 (and now guile 3) happened and everything got magically faster, even though less and less of the runtime is written in C.

whalesalad7y ago· 6 in thread

I can't speak to the content or Haskell at all but I really like the clean formatting of the post, particularly the charts and images. Does anyone know what tool might have been used to produce them? So clean and crisp.

duijfOP7y ago

It's LaTeX, tikz, and pgfplots for the images. Compiled with xelatex to pdf and pdf2svg to svg. The rest is custom HTML and CSS. We use the Hakyll static site generator.

ruuda7y ago

Yes, can confirm. Check out [1] and [2] to learn more about TikZ and PGFPlots.

[1]: http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf

[2]: http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pg...

josephg7y ago

I particularly appreciate how the text in the diagrams matches the font (& font size) of the body text. Almost nobody does that and it looks great.

seanmcdirmid7y ago

The graph looks a lot like GraphViz output. Not sure about the charts, however.

andrepd7y ago

As always, I will bitch about the colour of the text being grey (#333) instead of black. Hurts readability, doesn't improve design.

whalesalad7y ago

#333 is a great color for text on the web.

sevensor7y ago· 4 in thread

Why UTF-16? Seems like an odd choice of character encodings.

theoh7y ago

It's what Haskell's 'text' type uses internally. This stackoverflow question seems to discuss why that is the case:

https://stackoverflow.com/questions/23765903/why-is-text-utf...

marcosdumay7y ago

Just to add, there are a few use cases in which UTF-8 makes a large difference, and there is a UTF-8 version of Text on Hackage for those cases.

spullara7y ago

Sadly every language developed in the 90s has this flaw. Stuck somewhere between bytes and actual UNICODE.

steveklabnik7y ago

> Because we use the Text type for strings in Haskell, which is an array of UTF-16 code units in native endianness, the benchmarks would take UTF-16 data as input.

timerol7y ago· 3 in thread

> Clearly there was plenty of room for speeding up our string matching. We considered binding the Rust library, but due to the way that Haskell’s foreign function interface interacts with the garbage collector, passing a string to a foreign function requires copying it. That copy alone puts any foreign library at a significant disadvantage, so we decided to see how close we could get in Haskell alone.

It's a shame Haskell copies strings across FFI boundaries, because binding the Rust version would have been almost as fast, with (presumably) significantly less effort.

chrisdone7y ago

As I wrote elsewhere that depends on the string type used: https://www.reddit.com/r/haskell/comments/b0l93l/how_we_made...

Tarean7y ago

I wondered about this and poked at the internals of ByteString (suitable for ffi) and Text. Turns out Text could allow ffi by changing a single function (newByteArray to newAlignedPinnedByteArray).

Presumably creating lots of small strings is a typical use case for Text so losing compacting gc hurts. ByteString on the other hand is more used with binary data and is designed for ffi use.

whateveracct7y ago

Hmm ByteString has the opposite issue and solved it with ShortByteString. Seems like PinnedText could be the analogue.

eridius7y ago· 3 in thread

Converting the function to act as a fold instead of just returning a lazy list, is that necessary because all the unboxing and strictness internally meant the matches couldn't actually be lazy?

Tarean7y ago

Lazy data structures require allocation. GHC's lists use foldr/build fusion to remove intermediate lists, but that's basically just a less reliable way of converting to folds: http://hackage.haskell.org/package/base-4.12.0.0/docs/src/GH...

The other common fusion technique used in haskell is stream fusion which is very similar to the mutual-recursion-encoded-as-sum-type version in the blog post.

There is also indexing-based fusion which the Repa library uses. It allows more control over traversal order (think optimizing stencil codes) but requires regular arrays, e.g. no filtering.

There are some other research projects like specialized compiler passes via hermit but afaik nothing really production ready.

eridius7y ago

Ok so not using a lazy list for matches is basically just about avoiding the requisite allocation? I suppose if you consume the list immediately at the call-site the compiler could in theory after inlining convert it to the closure-based approach for you, but probably wouldn't.

ruuda7y ago

We only did this to fix the performance regression we observed after upgrading to GHC 8.6, but in hindsight I think it is cleaner anyway. With GHC 8.2 the list construction was eliminated, but the optimization turned out to be fragile.
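To make the interface shape concrete, here is a loose Python analogy of a fold-style match API with early termination, modelled on the `Done`/`Step` type quoted elsewhere in this thread. Python has neither GHC's fusion nor its laziness, so this shows only the calling convention, not the performance effect; `fold_matches` and its signature are hypothetical.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Step:
    """Continue searching with this accumulator value."""
    value: Any

@dataclass(frozen=True)
class Done:
    """Stop searching and return this value."""
    value: Any

def fold_matches(callback, seed, needle, haystack):
    """Feed each match position to callback(acc, pos) instead of building a list."""
    acc = seed
    pos = haystack.find(needle)
    while pos != -1:
        result = callback(acc, pos)
        if isinstance(result, Done):
            return result.value
        acc = result.value
        pos = haystack.find(needle, pos + 1)
    return acc

# Count all matches without materialising them.
count = fold_matches(lambda n, _pos: Step(n + 1), 0, "a", "banana")
# Stop at the first match.
first = fold_matches(lambda _acc, pos: Done(pos), None, "a", "banana")
```

The caller decides what to accumulate and when to stop, so no intermediate collection of matches has to exist.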

carterschonwald7y ago· 2 in thread

as maintainer of two of the haskell libraries they use, a member of the core libraries committee, and a hackage trustee, i'd like to throw in my two cents:

1) i see they are doing fully qualified imports. please add major bounds to your version deps, stack doesn't excuse not tracking what versions you used!

2) they have

"data Next a = Done !a | Step !a " in their code,

this tells me they could have possibly benefited using the stream abstraction in Vector!

https://hackage.haskell.org/package/vector-0.12.0.2/docs/Dat... is a link to the latter

drb2267y ago

About (1).

The exact versions they used are specified by the stack.yaml file.

In this case, see https://stackage.org/lts-13.10

Stack has a feature that lets you automatically add bounds to a hackage upload based on the stack.yaml. (However, I personally do not recommend using this feature, as those bounds usually end up being overly restrictive. Manually-written bounds are better.)

carterschonwald7y ago

yet it's off by default and most users don't use it.

I do agree human judgement based bounds are best, but stackage snapshots aren't that. Furthermore, stackage/fpo LTS is not LTS in the same sense as linux-distributions and their BSD-unix cousins. Proper long term support in any meaningful sense is a commitment by the distro provider to backport security/bug fixes to the LTS. Stackage LTS doesn't do that.

stack tricks its users into not thinking about their relationship with their dependencies and poisons library code from being usable by haskell devs/users/contributors who don't use stack. :)

the more you know! :)

glangdale7y ago· 2 in thread

It's a pity that Aho-Corasick always seems to be the starting point for this kind of work, because it's as slow as a wet week and/or the tables are huge (pick your poison).

To do this right you need a fast prefilter and a reasonably fast string matcher. Our string matcher in Hyperscan blew AC away on size and speed, typically. Unfortunately, I never had a chance to build its successor. But AC is a disaster of one dependent load after another and is mediocre at best in terms of performance.

burntsushi7y ago

It's really not a pity at all, considering that:

1. There is no write up of the Hyperscan algorithm of which you speak. I wrote up a part of Teddy, but I still don't understand the full algorithm. Hyperscan's code takes serious effort to understand. I spent days staring at Teddy and I still didn't get everything.

2. Aho-Corasick is reasonably simple to implement and does fairly decently for a large number of tasks.

3. Aho-Corasick is trivially portable.

glangdale7y ago

Ha, there's a paper about FDR now.

The fact that AC is portable and simple and that everyone feels good about themselves as a result of implementing a classical-sounding algorithm (why AC and not Rabin-Karp? who knows?) is why it keeps cropping up, like kudzu. It's still either (a) huge, (b) slow, (c) both and (d) a poor fit for modern architectures - it has a way of turning the problem into a latency problem for whichever level of memory its structures fit into (rather than a throughput problem).

It's a solution. Big deal. There are half-a-dozen better ones just floating around in the literature, but this shitty one seems to be the one people fixate on. Maybe it's because Aho and Corasick sort earlier in the dictionary than Rabin and Karp?

paulddraper7y ago

Impressive work! The clincher is at the end

> Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double.

This would scare me about doing any real performance optimization in a language as high-level as Haskell: am I just fiddling with compiler internals now?

wmu7y ago

"A naive Python script that loops over all needles and calls str.find in a loop to find all matches. This is not Aho–Corasick [...]".

I'm the author of pyahocorasick, a Python C extension that implements AC. I'd love to see it in the comparison. :)
