The Parallelism Blues: when faster code is slower (opens in new tab)

(pythonspeed.com)

36 pointsbibyte6y ago23 comments

23 comments

17 comments · 7 top-level

longemen30006y ago· 5 in thread

In Julia, where the paralleization options are explicit (SIMD, AVX, threads or multiprocessing), it always depends on the load, for small operation (around 10000 elements) a single thread is faster only for the thread spawning time (around 1 microsecond). And there is the issue of the independent Blas threaded model, where the Blas threads sometimes interfere with Julia threads... In a nutshell, parallelization is not a magical bullet, but is a good bullet to have at your disposal anyway

ChrisRackauckas6y ago

> And there is the issue of the independent Blas threaded model, where the Blas threads sometimes interfere with Julia threads

Julia has composible multithreading, and using that model fixed composing FFTW threads with Julia's. This can be done to OpenBLAS as well, and IIRC there is a PR open for it.

longemen30006y ago

Yeah, I'm waiting for that PR haahah

The_rationalist6y ago

Do you know if Julia will add OpenMP support? It's clearly the way to go for offloading to hardware in a productive way.

dagw6y ago

Julia is actually initially had OpenMP backed parallelism (ParallelAccelerator.jl), but they're moving away from OpenMP towards a novel and native task parallelism framework more inspired by things like Cilk[0].

[0] https://julialang.org/blog/2019/07/multithreading/

1 more reply

eigenspace6y ago

I don't know about "clearly the way to go". I think Julia's parallelism models have proven themselves to be very robust, performant and composeable, moreso than OpenMP as far as I'm aware.

1 more reply

maweki6y ago· 3 in thread

"It would be extremely surprising, then, if running with N threads actually gave ×N performance."

Basically impossible by Ahmdal's law.

twoodfin6y ago

Of course in the general case Amdahl's law is inescapable, but some tasks on modern systems can show > ×N speedup over single-threaded performance if, for example, a single thread can only exploit at maximum some fraction of the total memory bandwidth or some level of the cache hierarchy.

jfim6y ago

Do you have examples where running n threads instead of 1 results in a speedup greater than n?

The only thing I can think of would be that the additional threads would kick the CPU into using a higher frequency, but a single thread using 100% of the CPU should already do that.

1 more reply

convolvatron6y ago

well before that kicks in, if your code requires any coordination at all (not Monte Carlo) then those overheads can scale with the number of processes. so rather than hitting an asymptote as you'd get with just Ahmdal, you actually start to go down in absolute terms as the number of processes increases.

_bxg16y ago· 1 in thread

None of this is surprising, right? Unless your system has fewer threads than cores (which it probably doesn't even without your program) there will always be some context-switching overhead. It's worth keeping in mind I guess - especially the fact that numpy parallelizes transparently - but generally these results are to be expected.

The title is also misleading; it suggests that the wall clock time might be longer for parallel code in certain cases. While not impossible, that isn't what the article covers.

winkeltripel6y ago

> the wall clock time might be longer for parallel code

That is exactly the case, if CPU is the bottleneck in your already-parallel application. It's a case where we really shouldn't be layering different parallel bits together in one codebase, but might be doing it naively.

wtracy6y ago· 1 in thread

This is perfectly normal behavior when Intel Hyperthreading is involved.

I'm on my phone, so rather than trying to type out an explanation, I'm going to link to Wikipedia: https://en.wikipedia.org/wiki/Hyper-threading

itamarst6y ago

The benchmarks were run on a computer with hyperthreading disabled.

AzN1337c0d3r6y ago

There's also the problem of Turbo Boost.

My laptop's 9980HK will boost to ~4.5 GHz when only loaded to a single core.

However, when I load up all 8 cores, it might only sustain ~3.5 GHz.

Therefore the 8 cores might not actually result in the work being completed 8 times as fast, only 6.2x (8*[3.5/4.5]) real-time due to the lowered clock rate of each individual core.

This will show up as additional user time, since each individual core is able to do less work for each unit of time (seconds) compared to the single-core case.

ncmncm6y ago

The article uses the term "parallelism" when it is talking, instead, about concurrency.

Parallelism is specifically the stuff that actually does happen completely independently on all processing units, that actually goes Nx as fast on N units (clock depression aside). Concurrency refers to the overhead of coordinating activity of those units, that keeps you from getting your Nx. It is overhead on top of any actually serial parts of the computation, which Amdahl's law addresses.

In other words: Parallelism giveth, and concurrency taketh away.

The distinction gets more useful the more you think about the subject.

yaroslavvb6y ago

On 64 core Xeon E5 this example gives 1.8x increase in user time, but 8x decrease in wall-clock time

j / k navigate · click thread line to collapse

23 comments

17 comments · 7 top-level

longemen30006y ago· 5 in thread

ChrisRackauckas6y ago

> And there is the issue of the independent Blas threaded model, where the Blas threads sometimes interfere with Julia threads

Julia has composible multithreading, and using that model fixed composing FFTW threads with Julia's. This can be done to OpenBLAS as well, and IIRC there is a PR open for it.

longemen30006y ago

Yeah, I'm waiting for that PR haahah

The_rationalist6y ago

Do you know if Julia will add OpenMP support? It's clearly the way to go for offloading to hardware in a productive way.

dagw6y ago

[0] https://julialang.org/blog/2019/07/multithreading/

1 more reply

eigenspace6y ago

I don't know about "clearly the way to go". I think Julia's parallelism models have proven themselves to be very robust, performant and composeable, moreso than OpenMP as far as I'm aware.

1 more reply

maweki6y ago· 3 in thread

"It would be extremely surprising, then, if running with N threads actually gave ×N performance."

Basically impossible by Ahmdal's law.

twoodfin6y ago

jfim6y ago

Do you have examples where running n threads instead of 1 results in a speedup greater than n?

The only thing I can think of would be that the additional threads would kick the CPU into using a higher frequency, but a single thread using 100% of the CPU should already do that.

1 more reply

convolvatron6y ago

_bxg16y ago· 1 in thread

The title is also misleading; it suggests that the wall clock time might be longer for parallel code in certain cases. While not impossible, that isn't what the article covers.

winkeltripel6y ago

> the wall clock time might be longer for parallel code

wtracy6y ago· 1 in thread

This is perfectly normal behavior when Intel Hyperthreading is involved.

I'm on my phone, so rather than trying to type out an explanation, I'm going to link to Wikipedia: https://en.wikipedia.org/wiki/Hyper-threading

itamarst6y ago

The benchmarks were run on a computer with hyperthreading disabled.

AzN1337c0d3r6y ago

There's also the problem of Turbo Boost.

My laptop's 9980HK will boost to ~4.5 GHz when only loaded to a single core.

However, when I load up all 8 cores, it might only sustain ~3.5 GHz.

Therefore the 8 cores might not actually result in the work being completed 8 times as fast, only 6.2x (8*[3.5/4.5]) real-time due to the lowered clock rate of each individual core.

This will show up as additional user time, since each individual core is able to do less work for each unit of time (seconds) compared to the single-core case.

ncmncm6y ago

The article uses the term "parallelism" when it is talking, instead, about concurrency.

In other words: Parallelism giveth, and concurrency taketh away.

The distinction gets more useful the more you think about the subject.

yaroslavvb6y ago

On 64 core Xeon E5 this example gives 1.8x increase in user time, but 8x decrease in wall-clock time

j / k navigate · click thread line to collapse