Good question! There are two separate issues with putting the GPU in the same package as the CPU. One is the memcpy bandwidth issue, which unified memory does indeed entirely mitigate (assuming the app is smart enough to exploit it). But the round-trip times seem to be dominated by context switches rather than data movement. I have an M1 Max here, and just measured ~200µs for a very simple dispatch (just clearing 16k of memory).
I personally believe it may be possible to reduce latency using techniques similar to io_uring, but it may not be simple. A likely major reason for the round trips is that a trusted process (part of the GPU driver) must validate inputs from untrusted user code before the work is presented to the GPU hardware.
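To make the io_uring analogy concrete, here is a toy sketch (not any real driver API; all names are made up) of the shape such a design could take: untrusted code appends commands to a submission ring without a syscall per dispatch, and a trusted validator drains the ring in batches, bounds-checking each command before anything reaches the hardware. The latency win would come from amortizing the context switch over a batch instead of paying it per dispatch.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Dispatch:
    """Hypothetical GPU command: touch `length` bytes of `buffer_id` at `offset`."""
    buffer_id: int
    offset: int
    length: int

class SubmissionRing:
    """Stand-in for a shared-memory ring; user code writes here syscall-free."""
    def __init__(self):
        self.ring = deque()

    def submit(self, cmd: Dispatch):
        self.ring.append(cmd)  # no context switch on the submit path

class TrustedValidator:
    """Driver-side component; the only party allowed to talk to hardware."""
    def __init__(self, buffer_sizes):
        self.buffer_sizes = buffer_sizes  # buffer_id -> allocated size

    def drain(self, ring: SubmissionRing):
        accepted = []
        while ring.ring:
            cmd = ring.ring.popleft()
            size = self.buffer_sizes.get(cmd.buffer_id)
            # Reject out-of-range accesses from untrusted user code.
            if size is not None and 0 <= cmd.offset and cmd.offset + cmd.length <= size:
                accepted.append(cmd)
        return accepted  # would be handed to the GPU as one batch

ring = SubmissionRing()
ring.submit(Dispatch(buffer_id=1, offset=0, length=16 * 1024))       # in bounds
ring.submit(Dispatch(buffer_id=1, offset=60_000, length=16 * 1024))  # overflows
validator = TrustedValidator({1: 64 * 1024})
ok = validator.drain(ring)
print(len(ok))  # → 1 (the overflowing dispatch is rejected)
```

The hard part on real hardware is doing that validation without the per-dispatch kernel round trip, which is exactly where the "may not be simple" caveat lives.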