https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
AVX-512 was never really supported in newer consumer CPUs with heterogeneous architecture. These CPUs have a mix of powerful cores and efficiency cores. The AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the purpose of efficiency cores.
There was previously a hidden option to disable the efficiency cores and enable AVX-512 on the remaining power cores, but the number of workloads that would warrant turning off a lot of your cores to speed up AVX-512 calculations is virtually non-existent in the consumer world (where these cheap CPUs are targeted).
The whole journalism controversy around AVX-512 has been a bit of a joke, because many of the same journalists tried to generate controversy when AVX-512 was first introduced, once they realized that AVX-512 code would reduce the CPU clock speed. There were numerous articles about turning off AVX-512 on previous-generation CPUs to avoid this downclocking and to make overclocks more stable.
And this is why scalable vector ISAs like the RISC-V vector extension are superior to fixed-size SIMD: you can support both kinds of microarchitecture while running the exact same code.
Isn't the purpose of efficiency cores to be more power efficient? It's more power efficient to vectorize instructions and minimize pipeline re-ordering.
It is true that randomly scattered AVX-512 instructions can cause a slight clock speed reduction; the proper way to use libraries like this is within specific hot loops, where the mild downclocking is more than offset by the huge increase in parallelism.
This doesn’t make sense for a consumer who is multitasking while some background process invokes the AVX-512 penalty, but it usually would make sense in a server scenario.
Forget about 512 bit vectors or FMAs.
The more generous interpretation is that Intel fixed that issue a while back, although the CPUs with that problem are still in circulation, so you have to think about it when compiling your code.
I just got through doing some work with vectorization.
On the simplest workload I have, splitting a 3 MByte text file into lines and writing a pointer to each line into an array, GCC will not vectorize the naive loop, though I guess ICC might.
With simple vectorization to AVX512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, so a little better than a 10x speedup, still just on the one core, which keeps things simple.
I was using Agner Fog's VCL 2, Apache licensed C++ Vector Class Library. It's super easy.
Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested its AVX transistor budget into simply making the existing REP-prefixed string instructions a lot faster?
NN-512 is cool. I think the Go code is pretty ugly but I like the concept of the compiler a lot.
My question is whether Intel investing in AVX-512 is wise, given that:
- Most existing code is not aware of AVX anyway;
- Developers are especially wary of AVX-512, since they expect it to be discontinued soon.
Consequently, wouldn't Intel be better off by using the silicon dedicated to AVX-512 to speed up instruction patterns that are actually used?
AVX is just the SIMD unit. I would argue the transistors were well spent on SIMD, and the open question is simply the best way to feed string operations to the SIMD hardware.
I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:
__m512i spaces = _mm512_set1_epi8(' ');
size_t pos = 0; // output cursor (missing from my first sketch)
size_t i = 0;
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
// 4 input regs, 4 output regs; you can actually do up to 8 because there are 8 mask registers
__m512i in0 = _mm512_loadu_si512(bytes + i);
__m512i in1 = _mm512_loadu_si512(bytes + i + 64);
__m512i in2 = _mm512_loadu_si512(bytes + i + 128);
__m512i in3 = _mm512_loadu_si512(bytes + i + 192);
// signed byte compare: keep bytes strictly greater than ' '
__mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
__mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
__mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
__mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
// VBMI2 compress: pack the kept bytes to the front of each register
__m512i reg0 = _mm512_maskz_compress_epi8(mask0, in0);
__m512i reg1 = _mm512_maskz_compress_epi8(mask1, in1);
__m512i reg2 = _mm512_maskz_compress_epi8(mask2, in2);
__m512i reg3 = _mm512_maskz_compress_epi8(mask3, in3);
_mm512_storeu_si512(bytes + pos, reg0);
pos += _popcnt64(mask0);
_mm512_storeu_si512(bytes + pos, reg1);
pos += _popcnt64(mask1);
_mm512_storeu_si512(bytes + pos, reg2);
pos += _popcnt64(mask2);
_mm512_storeu_si512(bytes + pos, reg3);
pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and for ML via TVM, to some extent) and it was a pretty significant success.
https://ark.intel.com/content/www/us/en/ark/search/featurefi...
The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has a quite good catalog.
The GPU is incredible at raw throughput, and this particular problem can actually be implemented fairly straightforwardly (it's a stream compaction, which in turn can be expressed in terms of a prefix sum). However, where the GPU absolutely falls down is when you want to interleave CPU and GPU computations. To give round numbers, the round-trip latency is on the order of 100µs, and even aside from that, the memcpy back and forth between host and device memory might actually be slower than just solving the problem on the CPU. So you only win when the strings are very large, again in round numbers about a megabyte.
Things change if you are able to pipeline a lot of useful computation on the GPU. This is an area of active research (including my own). Aaron Hsu has been doing groundbreaking work implementing an entire compiler on the GPU, and there's more recent work[1], implemented in Futhark, that suggests that this approach is promising.
I have a paper in the pipeline that includes an extraordinarily high performance (~12G elements/s) GPU implementation of the parentheses matching problem, which is the heart of parsing. If anyone would like to review a draft and provide comments, please add a comment to the GitHub issue[2] I'm using to track this. It's due very soon and I'm on a tight timeline to get all the measurements done, so actionable suggestions on how to improve the text would be most welcome.
[1]: https://theses.liacs.nl/pdf/2020-2021-VoetterRobin.pdf
[2]: https://github.com/raphlinus/raphlinus.github.io/issues/66#i...
I can't help but notice that, at least in my experience on Windows, this is the same order of magnitude as for inter-process communication on the local machine. Tangent: That latency was my nemesis as a Windows screen reader developer; the platform accessibility APIs weren't well designed to take it into account. Windows 11 finally has a good solution for this problem (yes, I helped implement that while I was at Microsoft).