Const-me · 2y ago · 0 points

> 32-bit only atomics, I don't think it's a very serious solution in practice

Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.

However, I'm not sure that's necessarily a huge deal. It's probably not an issue for machine learning or BLAS stuff, since these use cases don't need fine-grained thread synchronization.

For applications which would benefit from such synchronization, traditional lock-free techniques ported from CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge number of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
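As a rough illustration of the CAS-on-shared-memory pattern, here is a minimal CPU-side sketch in C++ (not HLSL or CUDA) of a lock-free insert into an open-addressing hash set that uses only 32-bit atomics. The `LockFreeSet` name, the multiplicative hash constant, and the convention that 0 marks an empty slot are all assumptions made for this example; it is not the hash map mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity set of non-zero 32-bit keys (open addressing).
// Assumption: 0 marks an empty slot, and capacity is a power of two.
struct LockFreeSet {
    std::vector<std::atomic<uint32_t>> slots;

    explicit LockFreeSet(std::size_t capacity) : slots(capacity) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
        for (auto& s : slots) s.store(0, std::memory_order_relaxed);
    }

    // Returns true if the key was inserted by this call,
    // false if it was already present or the table is full.
    bool insert(uint32_t key) {
        assert(key != 0);  // 0 is reserved for "empty"
        const std::size_t mask = slots.size() - 1;
        std::size_t i = (key * 2654435761u) & mask;  // simple multiplicative hash
        for (std::size_t probes = 0; probes < slots.size(); ++probes) {
            uint32_t expected = 0;
            // Claim the slot only if it is still empty.
            if (slots[i].compare_exchange_strong(expected, key)) return true;
            // On failure, `expected` holds the key that occupies the slot.
            if (expected == key) return false;
            i = (i + 1) & mask;  // linear probing
        }
        return false;  // every slot is taken by some other key
    }
};
```

Every thread that hashes to the same region hits the same memory locations with CAS, which is where the cost comes from once the number of concurrent threads gets large.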

> so much API-crap you need to even get Hello World / SAXPY up.

I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is IMO reasonable. Especially for ML or similar GPGPU stuff which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.

However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, both of which are relatively hard to implement on top of D3D11 queries. I think Microsoft should ship them both in the Windows SDK.

adrian_b · 2y ago

Lock-free techniques are not "obviously better than locks".

Lock-free techniques offer a different trade-off than locks. Lock-free techniques are a little faster than locks in the typical case, but this advantage is paid for by being much slower in the worst case (because they may need a very large number of retries to succeed). When a great number of threads contend for access, the worst case can be very frequent.
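To make the "retries" concrete, here is a minimal C++ sketch of a CAS retry loop that implements an atomic max and counts how often it had to retry. The function name `atomicMaxCountingRetries` is made up for this example; the retry counter exists only to show where the worst-case cost comes from.

```cpp
#include <atomic>
#include <cstdint>

// Raises `target` to at least `value`.
// Returns how many times the compare-and-swap failed and had to be retried.
uint32_t atomicMaxCountingRetries(std::atomic<uint32_t>& target, uint32_t value) {
    uint32_t retries = 0;
    uint32_t seen = target.load(std::memory_order_relaxed);
    while (seen < value) {
        // Succeeds only if nobody changed `target` since we read `seen`.
        if (target.compare_exchange_strong(seen, value)) break;
        // Another thread won the race; `seen` now holds its value.
        ++retries;
    }
    return retries;
}
```

Without contention the loop succeeds on the first attempt. With many threads updating the same location, each failed attempt is wasted work, and the number of retries per thread has no fixed upper bound.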

The best application for lock-free techniques is in read-only access to shared data. In this case they are almost always the best solution.

On the other hand, for write access to shared data, whether optimistic access control with lock-free techniques or deterministic serialization of the accesses with locks is better depends on the application. It cannot be said in general that one method is preferable to the other.

Const-me (OP) · 2y ago

> Lock-free techniques are not "obviously better than locks".

On GPUs, they are. GPUs don't have any locks, but they can be emulated on top of these global memory atomics. Because the number of active threads is often in the thousands, the performance of that approach is much worse than that of lock-free techniques.
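A minimal CPU-side C++ sketch of what "a lock emulated on top of atomics" looks like, next to the direct atomic version of the same update. The names `SpinLock`, `countWithLock` and `countWithAtomicAdd` are hypothetical, and `std::thread` stands in for GPU threads purely for illustration; this is not GPU code and says nothing about real GPU timings.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// A lock built from a single 32-bit atomic: 0 = free, 1 = held.
struct SpinLock {
    std::atomic<uint32_t> flag{0};
    void lock() {
        // Every waiting thread keeps hammering the same memory location.
        while (flag.exchange(1, std::memory_order_acquire) != 0) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag.store(0, std::memory_order_release); }
};

// Counter protected by the emulated lock: two atomics per increment, plus spinning.
uint32_t countWithLock(unsigned threads, uint32_t perThread) {
    SpinLock lock;
    uint32_t counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (uint32_t i = 0; i < perThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}

// Same result with one atomic add per increment and no waiting.
uint32_t countWithAtomicAdd(unsigned threads, uint32_t perThread) {
    std::atomic<uint32_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (uint32_t i = 0; i < perThread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Both functions produce the same count. The locked version serializes every increment behind one flag, so the waiting grows with the number of threads contending for it.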

dragontamer · 2y ago

I missed this earlier.

> For applications which would benefit from such synchronization, traditional lock-free techniques ported from CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge number of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.

I don't think traditional CAS can be optimized. But a fair number of atomic operations seem to be coalesced into a prefix sum. So, with regard to your latest post:

> On GPUs, they are. GPUs don't have any locks, but they can be emulated on top of these global memory atomics. Because the number of active threads is often in the thousands, the performance of that approach is much worse than that of lock-free techniques.

Those "thousands of atomic" operations can become coalesced 32-at-a-time (prefix sum) and turned into just "dozens of atomics" in practice. Automatically, mind you, by the compiler.

Don't discount the brute-force code because it's simple. Don't assume ~1000+ atomic operations will actually be physically executed as 1000+ atomics. The compiler can "fix" a lot of this code in practice.

Not always, but the compiler can fix it often enough that it's beneficial to write the simple brute-force "thousands-of-atomics" code and check the compiled output.

Probably never with "CAS", but it's pretty often that "atomic_add" written in a brute-force manner (tracking a parallel counter) will compile into a prefix sum + 1x atomic from one lane, rather than execute as 32x atomics. And even if it is executed as 32x atomics, there are atomic accelerators on the GPUs that may make the operation faster than you might think. You know, as long as it isn't a compare-and-swap loop.
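The prefix-sum transformation can be sketched on the CPU. The C++ below simulates a single 32-lane wave with a plain loop: it computes each lane's exclusive prefix sum, issues one atomic add for the whole wave, and hands every lane the value its own atomic add would have returned if the lanes had executed in order. `kWaveSize` and `waveAggregatedAdd` are names invented for this example; real GPU code would rely on the compiler or on wave/warp intrinsics rather than a loop.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

constexpr int kWaveSize = 32;  // assumption: 32 lanes per wave

// Simulates one wave doing "old = atomic_add(counter, perLane[lane])" in every lane,
// but touches the shared counter only once.
std::array<uint32_t, kWaveSize> waveAggregatedAdd(
        std::atomic<uint32_t>& counter,
        const std::array<uint32_t, kWaveSize>& perLane) {
    // Exclusive prefix sum: offsets[lane] = sum of increments from lower lanes.
    std::array<uint32_t, kWaveSize> offsets{};
    uint32_t total = 0;
    for (int lane = 0; lane < kWaveSize; ++lane) {
        offsets[lane] = total;
        total += perLane[lane];
    }

    // One atomic for the whole wave, as if issued by a single lane.
    const uint32_t base = counter.fetch_add(total, std::memory_order_relaxed);

    // Each lane reconstructs the value its own atomic add would have seen.
    std::array<uint32_t, kWaveSize> result{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        result[lane] = base + offsets[lane];
    }
    return result;
}
```

This only works because addition is associative, so the per-lane results can be rebuilt from one combined update. A CAS loop has no such structure, which fits the point that CAS is unlikely to get this treatment.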

dragontamer · 2y ago

After evaluating all the options, I think I've more or less decided that DX12 is gonna be what I focus on for my hobby work. Vulkan is a close 2nd.

Gobs of boilerplate code is annoying, but I honestly can live with it. The tooling available for DirectX is really good.
