I’ve been writing parallel code at the largest scales for most of my career. The state-of-the-art architectures are all, effectively, single-threaded with latency-hiding. This model has a lot of mechanical sympathy with real silicon, which is why it is used. It is also pleasantly simple in practice.
It works pretty well for ordinary heavy crunch code too; I originally designed code like this on supercomputers. You do need a practical mechanism for efficiently decomposing workloads at a fine granularity, but you rarely see compute-bound applications that don’t have this property.
I've (badly) written code like you describe -- usually by abusing fork to do my initial data structure setup, then using C pipes to pass jobs around. I suspect there are much better ways of doing it, but that parallelised well enough for the stuff I was doing. I'd be interested to know if there are good libraries (or best practices) for doing this kind of parallelism.
Latency hiding is a technique that applies when 1) most workloads decompose trivially into independent components that can be executed separately, and 2) you can infer or directly determine when any particular operation will stall. There are many ways to execute these high-latency operations asynchronously and without blocking, such that you can immediately work on some other part of the workload. The “latency-hiding” part is that the CPU is rarely stalled: whenever possible it switches to a part of the workload that is immediately runnable, so it is always doing real, constructive work. Latency hiding optimizes for throughput, maximizing utilization of the CPU, but potentially increases the latency of specific sub-operations by virtue of reordering the execution schedule to “hide” the latency of operations that would stall the processor. For many workloads, the latency of the sub-operations doesn’t matter, only the throughput of the total operation. The real advantage of latency-hiding architectures is that you can approach the theoretical IPC of the silicon in real software.
There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing. The Cray/Tera MTA architecture is probably the canonical example, along with the original Xeon Phi. As a class, latency-hiding CPU architectures are sometimes referred to as “barrel processors”. In the case of the Cray MTA, the CPU can track 128 separate threads of execution in hardware and automatically switches to a thread that is immediately runnable at each clock cycle; thread coordination is effectively “free”. In software, switching between logical threads of execution is driven much more by inference, but it often yields huge gains in throughput. The only caveat is that you can’t ignore tail latencies in the design: a theoretically optimal latency-hiding architecture may defer execution of an operation indefinitely.
I don’t know much about supercomputers, but what you described is precisely how all modern GPUs deal with VRAM latency. Each core runs multiple threads, and the count is bounded by the resources used: the more registers and group-shared memory a shader uses, the fewer threads of that shader can be scheduled on the same core. The GPU then switches threads instead of waiting out that latency.
That’s how GPUs can saturate their RAM bandwidth, which exceeds 500 GB/second in modern high-end GPUs.
"Threads" are nothing but kernel-mode coroutines with purpose-built schedulers in the kernel.
Redoing the same machinery except in usermode is not the way to get performance.
The problem is that scripting languages don't let you access the kernel APIs cleanly due to various braindead design decisions - global locks in the garbage collector, etc.
But the solution isn't to rewrite the kernel in every scripting language, the solution is to learn to make scripting languages that aren't braindead.