https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
AVX-512 was never really supported in newer consumer CPUs with heterogeneous architecture. These CPUs have a mix of powerful cores and efficiency cores. The AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the purpose of efficiency cores.
There was previously a hidden option to disable the efficiency cores and enable AVX-512 on the remaining power cores, but the number of workloads that would warrant turning off a lot of your cores to speed up AVX-512 calculations is virtually non-existent in the consumer world (where these cheap CPUs are targeted).
The whole journalism controversy around AVX-512 has been a bit of a joke, because many of the same journalists tried to generate controversy when AVX-512 was first introduced, once they realized that AVX-512 code would reduce the CPU clock speed. There were numerous articles about turning off AVX-512 on previous-generation CPUs to avoid this downclocking and to make overclocks more stable.
And this is why scalable vector ISAs like the RISC-V vector extension are superior to fixed-size SIMD: you can support both kinds of microarchitecture while running the exact same code.
Isn't the purpose of efficiency cores to be more power efficient? It's more power efficient to vectorize instructions and minimize pipeline re-ordering.
It is true that randomly scattered AVX-512 instructions can cause a slight clock speed reduction; the proper way to use libraries like this is within specific hot loops, where the mild downclocking is more than offset by the huge increase in parallelism.
This doesn’t make sense for a consumer who is multitasking while some background process invokes the AVX-512 penalty, but it usually would make sense in a server scenario.
Forget about 512 bit vectors or FMAs.
The more generous interpretation is that Intel fixed that issue a while back, although the CPUs with that problem are still in circulation, so you have to think about it when compiling your code.
I just got through doing some work with vectorization.
On the simplest workload I have, splitting a 3 MByte text file into lines and writing a pointer to each line into an array, GCC will not vectorize the naive loop, though I guess ICC might.
With simple vectorization to AVX512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, so a little better than a 10x speedup, still just on the one core, which keeps things simple.
I was using Agner Fog's VCL 2, Apache licensed C++ Vector Class Library. It's super easy.
Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested its AVX transistor budget into simply making the existing REP-prefixed string instructions a lot faster?
NN-512 is cool. I think the Go code is pretty ugly but I like the concept of the compiler a lot.
My question is whether Intel investing in AVX-512 is wise, given that:
- Most existing code is not aware of AVX anyway;
- Developers are especially wary of AVX-512, since they expect it to be discontinued soon.
Consequently, wouldn't Intel be better off by using the silicon dedicated to AVX-512 to speed up instruction patterns that are actually used?
AVX is just the SIMD unit. I would argue the transistors were well spent on SIMD, and the open question is simply the best way to feed string operations to the SIMD hardware.
I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:
__m512i spaces = _mm512_set1_epi8(' ');
size_t pos = 0; // output cursor (missing from my first sketch)
size_t i = 0;
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
// 4 input regs, 4 output regs; you can actually do up to 8 because there are 8 mask registers
__m512i in0 = _mm512_loadu_si512(bytes + i);
__m512i in1 = _mm512_loadu_si512(bytes + i + 64);
__m512i in2 = _mm512_loadu_si512(bytes + i + 128);
__m512i in3 = _mm512_loadu_si512(bytes + i + 192);
// signed byte compare: keep bytes strictly greater than ' '
__mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
__mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
__mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
__mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
// VBMI2 compress: pack the kept bytes to the front of each register
__m512i reg0 = _mm512_maskz_compress_epi8(mask0, in0);
__m512i reg1 = _mm512_maskz_compress_epi8(mask1, in1);
__m512i reg2 = _mm512_maskz_compress_epi8(mask2, in2);
__m512i reg3 = _mm512_maskz_compress_epi8(mask3, in3);
_mm512_storeu_si512(bytes + pos, reg0);
pos += _popcnt64(mask0);
_mm512_storeu_si512(bytes + pos, reg1);
pos += _popcnt64(mask1);
_mm512_storeu_si512(bytes + pos, reg2);
pos += _popcnt64(mask2);
_mm512_storeu_si512(bytes + pos, reg3);
pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and for ML via TVM, to some extent) and it was a pretty significant success.
https://ark.intel.com/content/www/us/en/ark/search/featurefi...
The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has a quite good catalog.
The GPU is incredible at raw throughput, and this particular problem can actually be implemented fairly straightforwardly (it's a stream compaction, which in turn can be expressed in terms of a prefix sum). However, where the GPU absolutely falls down is when you want to interleave CPU and GPU computations. To give round numbers, the round-trip latency is on the order of 100µs, and even aside from that, the memcpy back and forth between host and device memory might actually be slower than just solving the problem on the CPU. So you only win when the strings are very large, again in round numbers about a megabyte.
Things change if you are able to pipeline a lot of useful computation on the GPU. This is an area of active research (including my own). Aaron Hsu has been doing groundbreaking work implementing an entire compiler on the GPU, and there's more recent work[1], implemented in Futhark, that suggests that this approach is promising.
I have a paper in the pipeline that includes an extraordinarily high performance (~12G elements/s) GPU implementation of the parentheses matching problem, which is the heart of parsing. If anyone would like to review a draft and provide comments, please add a comment to the GitHub issue[2] I'm using to track this. It's due very soon and I'm on a tight timeline to get all the measurements done, so actionable suggestions on how to improve the text would be most welcome.
[1]: https://theses.liacs.nl/pdf/2020-2021-VoetterRobin.pdf
[2]: https://github.com/raphlinus/raphlinus.github.io/issues/66#i...
I can't help but notice that, at least in my experience on Windows, this is the same order of magnitude as for inter-process communication on the local machine. Tangent: That latency was my nemesis as a Windows screen reader developer; the platform accessibility APIs weren't well designed to take it into account. Windows 11 finally has a good solution for this problem (yes, I helped implement that while I was at Microsoft).