dragontamer 5y ago

NVidia Volta: https://arxiv.org/pdf/1804.06826.pdf

Each machine instruction on NVidia Volta has the following information:

* Reuse Flags

* Wait Barrier Mask (6-bit bitmask, one bit per dependency barrier)

* Read/Write barrier index

* Read Dependency barriers

* Stall Cycles (4-bit)

* Yield Flag (1-bit software hint: NVidia CU will select new warp, load-balancing the SMT resources of the compute unit)
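
As a rough illustration of how fields like these could be packed into a per-instruction control word, here is a minimal Python sketch. The bit positions are an assumption for illustration only (stall in the low bits, reuse flags in the high bits, with a 6-bit wait mask and two 3-bit barrier indices); they are not a confirmed Volta encoding. The names `ControlInfo`, `decode_control` and `encode_control` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControlInfo:
    stall: int          # 4-bit stall cycle count
    yield_flag: bool    # 1-bit scheduler hint
    write_barrier: int  # 3-bit barrier index
    read_barrier: int   # 3-bit barrier index
    wait_mask: int      # 6-bit bitmask, one bit per barrier
    reuse: int          # 4-bit reuse flags


def decode_control(word: int) -> ControlInfo:
    """Split a 21-bit control word into fields (ASSUMED layout, see lead-in)."""
    return ControlInfo(
        stall=word & 0xF,
        yield_flag=bool((word >> 4) & 0x1),
        write_barrier=(word >> 5) & 0x7,
        read_barrier=(word >> 8) & 0x7,
        wait_mask=(word >> 11) & 0x3F,
        reuse=(word >> 17) & 0xF,
    )


def encode_control(c: ControlInfo) -> int:
    """Inverse of decode_control under the same assumed layout."""
    return (
        (c.stall & 0xF)
        | (int(c.yield_flag) << 4)
        | ((c.write_barrier & 0x7) << 5)
        | ((c.read_barrier & 0x7) << 8)
        | ((c.wait_mask & 0x3F) << 11)
        | ((c.reuse & 0xF) << 17)
    )


if __name__ == "__main__":
    info = ControlInfo(stall=2, yield_flag=True, write_barrier=0,
                       read_barrier=7, wait_mask=0b000001, reuse=0)
    word = encode_control(info)
    print(hex(word), decode_control(word))
```

The point of the sketch is only that the whole scheduling contract fits in about 21 bits per instruction, which the hardware can read without doing any dependency analysis of its own.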

Itanium's idea of VLIW was commingled with other ideas; in particular, the idea of a compiler static-scheduler to minimize hardware work at runtime.

To my eyes, the benefits of Itanium are implemented in NVidia's GPUs. The compiler that emits NVidia's static-scheduling flags already exists and has proven effective.

Itanium itself, with its crazy "bundling" of instructions and such, seems too complex. The explicit bitmasks / barriers of NVidia Volta seem more straightforward and clear in describing the dependency graph of code (and therefore the potential parallelism).
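
To make the "barriers describe the dependency graph" point concrete, here is a toy in-order issue model in Python. It is a sketch under simplifying assumptions (one warp, one instruction issued at a time, six barriers, made-up latencies), not a model of real Volta hardware. `Instr` and `issue_cycles` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_BARRIERS = 6  # assumed number of dependency barriers


@dataclass
class Instr:
    name: str
    latency: int = 1                   # cycles until the result is available
    stall: int = 1                     # cycles before the next instruction may issue
    set_barrier: Optional[int] = None  # barrier signalled when the result lands
    wait_mask: int = 0                 # bit b set => wait for barrier b


def issue_cycles(program: List[Instr]) -> List[int]:
    """Return the cycle at which each instruction issues in this toy model."""
    barrier_ready = [0] * NUM_BARRIERS  # cycle at which each barrier clears
    cycle = 0
    issued = []
    for ins in program:
        start = cycle
        # The "hardware" only checks the bits the compiler set; it never
        # compares register names.
        for b in range(NUM_BARRIERS):
            if ins.wait_mask & (1 << b):
                start = max(start, barrier_ready[b])
        if ins.set_barrier is not None:
            barrier_ready[ins.set_barrier] = start + ins.latency
        issued.append(start)
        cycle = start + max(ins.stall, 1)
    return issued


program = [
    Instr("load r1", latency=20, set_barrier=0),
    Instr("load r2", latency=20, set_barrier=1),
    Instr("add r3 = r4 + r5"),                  # independent, issues right away
    Instr("add r6 = r1 + r2", wait_mask=0b11),  # waits on both loads
]

if __name__ == "__main__":
    print(issue_cycles(program))  # [0, 1, 2, 21]
```

The independent add slips in behind the loads, while the dependent add waits until both barriers clear. All of that ordering information lives in the instruction stream.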

----------

Clearly, having a static compiler mark what is, and what isn't, parallelizable is useful. NVidia Volta+ architectures have proven this. Furthermore, compilers that can emit such information already exist. I do await the day when other architectures wake up to this fact.

StillBored 5y ago

GPUs aren't general-purpose compute. EPIC also did fairly well with HPC-style applications; it was everything else that was problematic. So yes, there are a fair number of similarities in workloads and microarchitectural decisions. But right now, those workloads tend to be better handled by a GPU-style offload engine (or, as the industry appears to be slowly moving, possibly a lot of fat vector units attached to a normal core).

dragontamer (OP) 5y ago

I'm not talking about the SIMD portion of Volta.

I'm talking about Volta's ability to detect dependencies. Which is null: the core itself probably can't detect dependencies at all. It's entirely left up to the compiler (or at least... that seems to be the case).

AMD's GCN and RDNA architectures are still scanning for read/write hazards like any ol' pipelined architecture you learned about in college. The NVidia Volta thing is new, and probably should be studied from an architectural point of view.

Yeah, it's a GPU feature on NVidia Volta. But it's pretty obvious to me that this explicit dependency-barrier thing could be part of a future ISA, even one for traditional CPUs.
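
For what "left up to the compiler" could look like, here is a minimal Python sketch of a pass that assigns barriers from register dependencies. It is a toy under stated assumptions (straight-line code, one barrier per pending variable-latency result, six barriers, no handling of barrier exhaustion beyond raising an error) and is not NVidia's actual algorithm. `Op` and `assign_barriers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NUM_BARRIERS = 6  # assumed number of dependency barriers


@dataclass
class Op:
    dst: str
    srcs: Tuple[str, ...]
    variable_latency: bool = False       # e.g. a memory load
    write_barrier: Optional[int] = None  # filled in by the pass
    wait_mask: int = 0                   # filled in by the pass


def assign_barriers(ops: List[Op]) -> List[Op]:
    """Annotate ops in place so an in-order core needs no hazard detection."""
    pending: Dict[str, int] = {}      # register -> barrier guarding its write
    free = list(range(NUM_BARRIERS))  # barrier indices not currently in use
    for op in ops:
        # Read-after-write on sources, write-after-write on the destination.
        for reg in (*op.srcs, op.dst):
            if reg in pending:
                b = pending.pop(reg)
                op.wait_mask |= 1 << b
                free.append(b)  # once waited on, the barrier can be reused
                free.sort()
        if op.variable_latency:
            if not free:
                raise RuntimeError("out of barriers; a real pass would insert a wait")
            b = free.pop(0)
            op.write_barrier = b
            pending[op.dst] = b
    return ops


ops = assign_barriers([
    Op("r1", ("a0",), variable_latency=True),  # load
    Op("r2", ("a1",), variable_latency=True),  # load
    Op("r3", ("r4", "r5")),                    # independent add
    Op("r6", ("r1", "r2")),                    # consumes both loads
])

if __name__ == "__main__":
    for op in ops:
        print(op.dst, op.write_barrier, format(op.wait_mask, "06b"))
```

The hazard scan happens once, at compile time, and the result is frozen into each instruction's barrier index and wait mask.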

rrss 5y ago

FWIW, this article suggests the static software scheduling you are describing was introduced in Kepler, so it's probably at least not entirely new in Volta:

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-r...

> NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling

and I think this is describing more or less the same thing in Maxwell: https://github.com/NervanaSystems/maxas/wiki/Control-Codes
