That's extremely interesting, as I still favor range and binary search for most cases, with normalization lookup handled via two-step tables.
(not meant as a complaint that this might have been submitted before already, I'm just curious about what might have already been said about it)
If so, regular languages are one of those 'surprising things that can be parallelized'. It doesn't seem possible at first glance, though.
[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...
That said, there are known techniques for evaluating context-free grammars (and thus also regular grammars) in parallel. One way to do it is as follows:
Break up the input string into n separate substrings. Processing of the first substring starts at a single state, the starting state. Each other substring starts from the set of all possible states, with the output being a mapping from the state before its first character to the state after its last.
Then at the end the output is stitched together.
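A minimal sketch of the chunk-and-stitch idea, using a toy two-state DFA (the DFA and names here are my own illustration, not from the comment above): each chunk independently computes the state-to-state mapping it induces, and the mappings are then threaded together from the start state.

```python
# Toy DFA over {'a','b'}: accepts strings with an even number of 'b's.
DELTA = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 0}
STATES = (0, 1)
START, ACCEPT = 0, {0}

def chunk_map(chunk):
    """Compute the state -> state mapping this chunk induces, for every state."""
    mapping = {}
    for s in STATES:
        cur = s
        for ch in chunk:
            cur = DELTA[(cur, ch)]
        mapping[s] = cur
    return mapping

def run_chunked(text, n_chunks=4):
    size = max(1, len(text) // n_chunks)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # The chunk_map calls are independent of each other, so on a real
    # parallel machine each one could run on its own core.
    maps = [chunk_map(c) for c in chunks]
    # Stitch: thread the start state through the per-chunk mappings in order.
    state = START
    for m in maps:
        state = m[state]
    return state in ACCEPT

print(run_chunked("abbaabab"))  # four 'b's -> True
```

The stitching step is sequential, but it costs only O(n_chunks) mapping lookups rather than one transition per character.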
Yes, it is a surprising result. But that is what makes them worthy of study.
Hillis / Steele even did it using the standard technique of prefix sums over an associative operator. So it's the sort of thing that a parallel programmer should be able to invent on their own (if they are expert enough with the prefix-sum pattern).
----
EDIT: For regular languages, here's a "really simple" method that achieves 2x parallelism.
1. Thread#1 parses from the start.
2. Reverse the language / FSM, and Thread#2 parses from the end, working toward the beginning using the reversed FSM.
3. When Thread#1 and Thread#2 meet in the middle, ensure that there is a valid state-transition between the two states. If so, accept the string. Otherwise, reject it.
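A sketch of the meet-in-the-middle idea on the same toy two-state DFA (my own illustrative example, not from the comment): the forward half runs the DFA normally, while the backward half processes the suffix right-to-left, computing the set of states from which the remaining input would reach acceptance. The string is accepted iff the two halves agree in the middle.

```python
# Toy DFA over {'a','b'}: accepts strings with an even number of 'b's.
DELTA = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 0}
STATES = (0, 1)
START, ACCEPT = 0, {0}

def forward_half(prefix):
    """Thread #1: run the DFA forward over the first half of the input."""
    s = START
    for ch in prefix:
        s = DELTA[(s, ch)]
    return s

def backward_half(suffix):
    """Thread #2: walk the suffix right-to-left, tracking which states
    could still lead to an accepting state (the 'reversed FSM' view)."""
    good = set(ACCEPT)
    for ch in reversed(suffix):
        good = {s for s in STATES if DELTA[(s, ch)] in good}
    return good

def accepts(text):
    mid = len(text) // 2
    # These two calls are independent and could run on two threads.
    s = forward_half(text[:mid])
    good = backward_half(text[mid:])
    # Meet in the middle: valid iff the forward state can still reach accept.
    return s in good

print(accepts("abba"))  # two 'b's -> True
```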
--------
This concept can be generalized to "starting in the middle" of any regular language, but that's the tricky part. The first and last characters are the "obvious" starting points. Once you can "start at any location", and learn to match up states, you've got general parallelism across the entire set of characters in the whole string.
> Each other string starts at the set of possible states
A good start.
The Hillis / Steele prefix sum methodology is a bit more nuanced.
Compute the following in parallel:
* a[0] + a[1]
* a[2] + a[3]
* a[4] + a[5]
...
* a[n-2] + a[n-1]
Where + denotes composing the state-transition functions. This results in n/2 composed state-to-state mappings. Once you've done this, then compute:
* (a[0] + a[1]) + (a[2] + a[3])
* (a[4] + a[5]) + (a[6] + a[7])
...
* (a[n-4] + a[n-3]) + (a[n-2] + a[n-1])
Which represents "taking 4 steps of the finite-state-machine, starting at a[0], a[4], a[8], etc. etc.".
Each time you do this, you double the length of the substrings computed. After O(log2(n)) steps, you'll have processed the entire string with O(n) processors.
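The doubling steps above can be sketched as a Hillis/Steele-style inclusive scan, where the associative "+" is composition of state-transition maps. The DFA below is my own toy example (even number of 'b's); in each pass, all the updates at a given stride are independent, which is exactly what gives O(log2(n)) steps on O(n) processors.

```python
# Toy DFA over {'a','b'}: accepts strings with an even number of 'b's.
DELTA = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 0}
STATES = (0, 1)
START, ACCEPT = 0, {0}

def char_map(ch):
    """Each character becomes a state -> state transition map."""
    return {s: DELTA[(s, ch)] for s in STATES}

def compose(f, g):
    """Apply f first, then g (left-to-right over the string); associative."""
    return {s: g[f[s]] for s in STATES}

def scan_accepts(text):
    if not text:
        return START in ACCEPT
    maps = [char_map(ch) for ch in text]
    stride = 1
    while stride < len(maps):
        # All updates in this pass read the old array and are independent,
        # so a parallel machine runs them simultaneously: O(log n) passes.
        new = list(maps)
        for i in range(stride, len(maps)):
            new[i] = compose(maps[i - stride], maps[i])
        maps = new
        stride *= 2
    # maps[-1] is now the transition map of the whole string.
    return maps[-1][START] in ACCEPT

print(scan_accepts("abba"))  # two 'b's -> True
```

This sequential simulation does O(n log n) total work; the point is that each pass is embarrassingly parallel, so the span (critical path) is only O(log n).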
---
I don't think the Hillis / Steele algorithm is practical. But it proves that the problem is in Nick's Class (aka NC complexity), and therefore worthy of parallelism study.