On Summing Integers

(unomerite.medium.com)

26 points · electricshampo1 · 5y ago · 13 comments

13 comments · 6 top-level

Const-me · 5y ago · 3 in thread

I don’t believe in the bright future of AVX512 tech, and I don’t have the hardware either: my desktop PC has an AMD Zen 2 CPU.

Here’s how I would do that in AVX2: https://gist.github.com/Const-me/eed10bfe690b5804d2fc8266e02...

I wonder how the performance compares to your version.

electricshampo1 (OP) · 5y ago

(author here)

Using srand(0) as in your gist, changing int8_t to char, and using int for the sum:

    BM_hn      10155300 ns
    BM_avx512   9218652 ns
    -----
    BM_avx512   9372319 ns
    BM_hn      10428792 ns

Your code is getting ~18-19GB/s read bandwidth, compared to ~21GB/s for my code.

I wonder how much of that difference comes from interleaved versus non-interleaved summing, and how much from AVX512 versus AVX2.

Const-me · 5y ago

Interesting, thanks for the benchmark.

Assuming neither of us screwed up too badly, that gives us about a 10% gain from AVX512 compared to AVX2. Not sure that’s a good enough win to justify vendor lock-in to Intel.
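As a quick sanity check on the "about 10%" figure, the ratio can be recomputed from the timings quoted above. This is only a sketch: it assumes the four numbers pair up run by run, in the order they were posted, and the variable and function names are made up for illustration.

```python
# Timings in nanoseconds, copied from the benchmark output quoted above.
# Assumption: results are paired by run, in the order they were posted.
bm_hn_ns = [10155300, 10428792]      # AVX2 version from the gist
bm_avx512_ns = [9218652, 9372319]    # author's AVX512 version


def relative_gain(slow_ns, fast_ns):
    """Fraction of extra time the slower run needs relative to the faster one."""
    return slow_ns / fast_ns - 1.0


gains = [relative_gain(hn, avx) for hn, avx in zip(bm_hn_ns, bm_avx512_ns)]
for run, gain in enumerate(gains, start=1):
    print(f"run {run}: the AVX2 version takes {gain:.1%} longer than AVX512")
```

Both runs come out between 10% and 12%, which is consistent with the rough estimate.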

P.S. On mainstream x86 compilers int8_t is a typedef for signed char, and plain char is signed there too, so that change was stylistic and should not affect performance whatsoever. I tend to avoid the char data type for numbers, as opposed to characters.

Update: you can try to make another AVX512 version with _mm512_sad_epu8 like I did for AVX2, instead of that integer dot product. I won’t be surprised to find out vpsadbw is faster than vpdpbusd, since fundamentally addition is simpler than multiplication.
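For readers unfamiliar with the vpsadbw idea, here is a scalar Python model of one common way to sum signed bytes with a sum-of-absolute-differences instruction. It illustrates the arithmetic only and is not the code from the gist: the helper names are hypothetical, and the real instruction operates on unsigned bytes in groups of eight inside a vector register.

```python
def sad_against_zero(unsigned_group):
    """Model of one SAD lane: sum of |b - 0| over up to 8 unsigned bytes."""
    return sum(abs(b - 0) for b in unsigned_group)


def sum_signed_bytes_via_sad(values):
    """Sum signed bytes (-128..127) using only unsigned byte sums.

    Flipping the top bit of a two's-complement byte maps a signed value v
    to the unsigned value v + 128, so the bias is removed once at the end.
    """
    total = 0
    for start in range(0, len(values), 8):
        group = values[start:start + 8]
        biased = [(v & 0xFF) ^ 0x80 for v in group]  # v + 128, in 0..255
        total += sad_against_zero(biased)
    return total - 128 * len(values)


data = [-128, 127, -1, 0, 5, -77, 100, 33, 12, -12, 64]
print(sum_signed_bytes_via_sad(data), sum(data))
```

The two printed numbers should match; the point is that the per-element work reduces to an XOR and an unsigned add, with no multiply.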

schmide · 5y ago

I learned a lot from your code. Kudos.

plesner · 5y ago · 3 in thread

Could you not just do

sum(byteVals) + sum(intVals) + 128 * len(intVals)?
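To see why the formula works, here is a small sketch. It assumes the encoding implied by the thread: each element occupies one entry in byteVals, and a byte of -128 is a sentinel meaning the real value is the next entry of intVals. The encode helper and the function names are hypothetical.

```python
SENTINEL = -128  # assumed marker byte: "the real value lives in int_vals"


def encode(values):
    """Split values into a byte stream plus an overflow list of wide ints."""
    byte_vals, int_vals = [], []
    for v in values:
        if -127 <= v <= 127:
            byte_vals.append(v)
        else:
            byte_vals.append(SENTINEL)
            int_vals.append(v)
    return byte_vals, int_vals


def sum_with_branch(byte_vals, int_vals):
    """Reference version: branch on the sentinel for every element."""
    total, next_int = 0, 0
    for b in byte_vals:
        if b == SENTINEL:
            total += int_vals[next_int]
            next_int += 1
        else:
            total += b
    return total


def sum_closed_form(byte_vals, int_vals):
    """The formula from the comment: no per-element branch needed."""
    return sum(byte_vals) + sum(int_vals) + 128 * len(int_vals)


byte_vals, int_vals = encode([3, -7, 1000, 127, -128, -100000, 0])
print(sum_with_branch(byte_vals, int_vals), sum_closed_form(byte_vals, int_vals))
```

Each sentinel contributes -128 to sum(byteVals), and there is exactly one sentinel per entry of intVals, so adding 128 * len(intVals) cancels them.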

electricshampo1 (OP) · 5y ago

That is essentially the approach mentioned in the article's updates:

"UPDATE: see https://www.realworldtech.com/forum/?threadid=200693&curpost... for a dramatic simplification. Not catching this is an oversight on my part. This post will be updated to include numbers with the mentioned strategy.

UPDATE: To my surprise and after much fiddling, I did not manage to write a version that was measurably faster (indeed they were at least a percent slower) than the hand written sum_avx512 shown below. There is almost certainly something that I am doing wrong but I can’t seem to figure out what it is. I will take this opportunity to leave this as an exercise for the reader :)."

schmide · 5y ago

There are many ways to solve every problem. Sometimes the factor is easy to see and it drops right out of the equation.

tom_mellior · 5y ago

And even if you didn't, I wonder what the motivation is for trying to do it all in one loop.

    sum(byteVals != -128) + sum(intVals)

should vectorize more nicely and be at least as cache friendly.
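A minimal sketch of that two-pass form, reading `byteVals != -128` as "keep only the bytes that are not the sentinel" (an interpretation, since the expression is shorthand). The byte pass multiplies by a 0/1 mask instead of branching, which is the shape that tends to map onto vector compare-and-mask operations. The function name is hypothetical.

```python
SENTINEL = -128  # assumed marker byte for values stored in int_vals


def sum_two_passes(byte_vals, int_vals):
    """Sum the two arrays in separate loops, with no cross-array indexing."""
    # Pass 1: (b != SENTINEL) is True/False, which acts as 1/0 when multiplied.
    byte_sum = sum(b * (b != SENTINEL) for b in byte_vals)
    # Pass 2: a plain reduction over the wide values.
    int_sum = sum(int_vals)
    return byte_sum + int_sum


print(sum_two_passes([1, -128, 5, -128, -3], [1000, -500]))
```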

tom_mellior · 5y ago · 1 in thread

After the updates to the article, the takeaway seems to be "you can use AVX-512 dot product instructions to sum an array of bytes to int and get a 15% speedup over more straightforward vector code". That's an interesting point, but it's now well-hidden among irrelevant things like the compressed representation that was only relevant to the article's original point.

It might make sense to resubmit a completely rewritten and pared-down version of the article. The dot product trick is neat.

electricshampo1 (OP) · 5y ago

(Author here)

The main focus of the article is to show a worked example of how to analyze the behavior of simple programs. The circuitous path taken in the article is similar to that which you might face when analyzing your own programs. The article is only incidentally about AVX512 and summing numbers. Thus any specific technique used or even the precise runtimes measured should not be given too much weight. Forgive me, but the journey is more important than the destination.

The article is also meant to inspire others to learn and spend time deeply understanding (by looking at data that the hardware makes available) what actually happens to their code when it runs. It is too easy to lose sight of that in a professional setting, where business needs/requirements leave little time for deep analysis.

orf · 5y ago

numpy, for comparison:

   In [8]: vec = np.random.randint(-200, 200, (100_000_000,))

   In [9]: %timeit vec.sum()
   63 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The branchless C++ version took 125ms, and the AVX 512 version took ~9ms.

schmide · 5y ago

Ehh. I like playing with vectors and have a weird coding style.

My AVX version.

https://github.com/schmide/sumint/blob/main/sumint.cpp

electricshampo1 (OP) · 5y ago

NOTE: The article has been updated and expanded since the initial post.
