Each machine instruction on NVidia Volta carries the following scheduling control information:
* Reuse Flags
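* Wait Barrier Mask (6-bit bitmask, one bit per dependency barrier)
* Read/Write barrier index (3 bits each, naming which of the six barriers the instruction sets)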
* Read Dependency barriers
* Stall Cycles (4-bit)
* Yield Flag (1-bit software hint: the NVidia CU may select a new warp, load-balancing the SMT resources of the compute unit)
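
To make the list above concrete, here is a rough Python sketch of packing and unpacking those fields. The 4-bit stall count and 1-bit yield flag come from the list; the 6-bit wait mask, 3-bit read/write barrier indices and 4 reuse bits are what the public reverse-engineering I've seen reports, so treat them as assumptions. The bit order, the field names, and the `pack` / `unpack` helpers are made up for illustration and are not NVidia's documented encoding.

```python
# Hypothetical layout: widths are assumptions (see text), bit ORDER is invented.
FIELDS = [              # (name, width in bits), lowest bits first
    ("stall", 4),         # cycles to wait before issuing the next instruction
    ("yield_flag", 1),    # hint that the scheduler may switch warps
    ("write_barrier", 3), # barrier index signalled when the result is written
    ("read_barrier", 3),  # barrier index signalled when operands are read
    ("wait_mask", 6),     # one bit per barrier that must clear before issue
    ("reuse", 4),         # operand reuse flags
]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(**values):
    """Pack named field values into one integer control word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Split a control word back into a dict of named fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


if __name__ == "__main__":
    # An instruction that stalls 4 cycles, sets barrier 2 on write,
    # and waits on barriers 0 and 2.
    word = pack(stall=4, write_barrier=2, wait_mask=0b000101)
    print(TOTAL_BITS, unpack(word))
```

The point is only that the whole scheduling contract fits in a couple dozen bits per instruction, which the compiler fills in and the hardware merely obeys.
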
Itanium's idea of VLIW was commingled with other ideas; in particular, the idea of having the compiler schedule instructions statically to minimize hardware work at runtime.
To my eyes: the benefits of Itanium are implemented in NVidia's GPUs. A compiler that emits NVidia's scheduling flags has been built and has proven effective.
Itanium itself, with its crazy "bundling" of instructions and such, seems too complex. The explicit bitmasks / barriers of NVidia Volta seem more straightforward and clearer in describing the dependency graph of code (and therefore the potential parallelism).
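
As a toy illustration of how barrier flags spell out a dependency graph, here is a small Python sketch. The `Instr` record, the `dependency_edges` helper, and the instruction names are all hypothetical; this models only "instruction X signals barrier i, instruction Y waits on a mask of barriers" and is not how the real hardware scoreboard is implemented.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    name: str
    sets: Optional[int]  # barrier index signalled on completion, if any
    waits: int           # bitmask of barriers that must clear before issue


def dependency_edges(program, num_barriers=6):
    """Return (producer, consumer) pairs implied by the barrier flags."""
    last_setter = {}  # barrier index -> name of the most recent instruction using it
    edges = []
    for ins in program:
        for barrier in range(num_barriers):
            if ins.waits & (1 << barrier) and barrier in last_setter:
                edges.append((last_setter[barrier], ins.name))
        if ins.sets is not None:
            last_setter[ins.sets] = ins.name
    return edges


program = [
    Instr("load_a", sets=0, waits=0),
    Instr("load_b", sets=1, waits=0),
    Instr("add", sets=None, waits=0b000011),  # needs both loads
    Instr("load_c", sets=2, waits=0),         # independent of everything above
]

if __name__ == "__main__":
    for producer, consumer in dependency_edges(program):
        print(f"{producer} -> {consumer}")
```

Anything without an edge between it and another instruction (here the three loads) is free to overlap, and that information was decided by the compiler rather than rediscovered by the hardware at runtime.
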
----------
Clearly, having static compilers mark what is and isn't parallelizable is useful. NVidia's Volta+ architectures have proven this. Furthermore, compilers that can emit such information already exist. I await the day when other architectures wake up to this fact.