Here’s how I would do that in AVX2: https://gist.github.com/Const-me/eed10bfe690b5804d2fc8266e02...
I wonder how the performance compares to your version.
Using srand(0) as in your gist, changing int8_t to char, and using int for the sum:
BM_hn 10155300 ns
BM_avx512 9218652 ns
-----
BM_avx512 9372319 ns
BM_hn 10428792 ns
Your code is getting ~18-19 GB/s of read bandwidth, compared to ~21 GB/s for my code.
I wonder how much of that is due to interleaved vs. non-interleaved summing, and how much to AVX512 vs. AVX2.
Assuming neither of us screwed up too badly, that gives us about a 10% gain from AVX512 compared to AVX2. Not sure that’s a good enough win to justify vendor lock-in to Intel.
P.S. int8_t is typically a typedef for signed char, which has the same size and representation as char on x86, so that change was stylistic and should not affect performance at all. I tend to avoid the char data type for numbers, as opposed to characters.
Update: you can try to make another AVX512 version with _mm512_sad_epu8, like I did for AVX2, instead of that integer dot product. I won’t be surprised to find out vpsadbw is faster than vpdpbusd; fundamentally, addition is simpler than multiplication.
sum(byteVals) + sum(intVals) + 128 * len(intVals)?
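A minimal sketch of why that formula works, under an assumption about the encoding that is not spelled out in this thread: every -128 in byteVals is an escape marker whose real value sits in intVals, so byteVals contains exactly len(intVals) markers. Summing all bytes blindly then adds -128 once per marker, and the 128 * len(intVals) term cancels it. The function names are made up for illustration.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Assumed encoding (hypothetical): -128 in byteVals means "look in intVals".
// Sum every byte without branching, then undo the -128 added per marker.
int64_t sum_with_correction(const std::vector<int8_t>& byteVals,
                            const std::vector<int32_t>& intVals) {
    int64_t bytes = std::accumulate(byteVals.begin(), byteVals.end(), int64_t{0});
    int64_t ints = std::accumulate(intVals.begin(), intVals.end(), int64_t{0});
    return bytes + ints + 128 * static_cast<int64_t>(intVals.size());
}

// Reference version: skip the markers explicitly while summing the bytes.
int64_t sum_skipping_markers(const std::vector<int8_t>& byteVals,
                             const std::vector<int32_t>& intVals) {
    int64_t total = 0;
    for (int8_t b : byteVals) {
        if (b != -128) total += b;
    }
    return total + std::accumulate(intVals.begin(), intVals.end(), int64_t{0});
}
```

Both functions return the same result as long as the number of -128 markers matches the length of intVals. The first one has no data-dependent branch in the byte loop, which is what makes it attractive for vectorization.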
" UPDATE: see https://www.realworldtech.com/forum/?threadid=200693&curpost... for a dramatic simplification. Not catching this is an oversight on my part. This post will be updated to include numbers with the mentioned strategy.
UPDATE: To my surprise and after much fiddling, I did not manage to write a version that was measurably faster (indeed they were at least a percent slower) than the hand written sum_avx512 shown below. There is almost certainly something that I am doing wrong but I can’t seem to figure out what it is. I will take this opportunity to leave this as an exercise for the reader :). "
sum(byteVals != -128) + sum(intVals)
should vectorize more nicely and be at least as cache friendly.
It might make sense to resubmit a completely rewritten and pared-down version of the article. The dot product trick is neat.
The main focus of the article is to show a worked example of how to analyze the behavior of simple programs. The circuitous path taken in the article is similar to that which you might face when analyzing your own programs. The article is only incidentally about AVX512 and summing numbers. Thus any specific technique used or even the precise runtimes measured should not be given too much weight. Forgive me, but the journey is more important than the destination.
The article is also meant to inspire others to learn and spend time deeply understanding (by looking at data that the hardware makes available) what actually happens to their code when it runs. It is too easy to lose sight of that in a professional setting, where business needs/requirements leave little time for deep analysis.
In [8]: vec = np.random.randint(-200, 200, (100_000_000,))
In [9]: %timeit vec.sum()
63 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The branchless C++ version took 125 ms, and the AVX512 version took ~9 ms.
My AVX version.