Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to the execution mask, of course). This is what I mean by "in lockstep". In my understanding, the role of the operand collector is to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution stalls.
So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).
> But, yes (warp) instruction is already scheduled, but (ALU) operation are re-scheduled by the operand-collector and it's dispatch. In the Nvidia patent they mention the possibility to dispatch operation in an order that prevent write collision for example.
Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent. That is a very smart idea.
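To check my own understanding, here is a toy model of that idea. It is only a sketch under my own assumptions, not how any real GPU or the patent does it: the register file has a few banks that each serve one read per cycle, a register maps to bank `reg % num_banks`, every instruction comes from a different warp (so there are no data dependencies between them), and at most one ready collector entry is dispatched per cycle. The names `simulate`, `num_banks` and `num_collectors` are made up for illustration.

```python
from collections import deque

def simulate(instructions, num_banks=4, num_collectors=1):
    """Return the cycles needed to dispatch all instructions (toy model)."""
    pending = deque(instructions)  # instructions waiting for a collector entry
    collectors = []                # each entry: set of source registers still unread
    cycles = 0
    dispatched = 0
    total = len(instructions)
    while dispatched < total:
        cycles += 1
        # Fill free collector entries in program order.
        while pending and len(collectors) < num_collectors:
            collectors.append(set(pending.popleft()))
        # Each bank serves at most one read per cycle, oldest entry first.
        busy_banks = set()
        for needed in collectors:
            for reg in sorted(needed):
                bank = reg % num_banks
                if bank not in busy_banks:
                    busy_banks.add(bank)
                    needed.discard(reg)
        # Dispatch the oldest entry whose operands are all collected.
        for i, needed in enumerate(collectors):
            if not needed:
                del collectors[i]
                dispatched += 1
                break
    return cycles

# Hypothetical workload: (0, 4, 8) and (12, 16, 20) all hit bank 0,
# while (1, 2, 3) spreads over three different banks.
program = [(0, 4, 8), (1, 2, 3), (12, 16, 20), (1, 2, 3)]
print(simulate(program, num_collectors=1))
print(simulate(program, num_collectors=2))
```

With this workload the single-entry version needs 8 cycles and the two-entry version needs 6, because the conflict-free instruction gets dispatched while the other one is still waiting on bank 0. That is the "limited reordering" effect as I understand it.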
> But it's possible that Nvidia enforce that all threads from a warp should complete before sending the next warp instruction (I would call it something like "instruction lock-step"). This can simplify data dependency hazard check. But that an implementation detail, it's not required by the SIMT scheme.
Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., the warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that refers either to interleaved execution from different programs (Nvidia, Apple), to a form of SIMD-within-SIMT, or maybe even to a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same warp after the previous one has started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor, but I have no idea whether it is intended for the GPU or something else.
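To make the "scheduling information in the instruction" idea concrete, here is how I picture it, as a rough sketch. Everything in it is my assumption: each instruction carries a stall count (cycles before the next instruction from the same warp may issue), the scheduler issues one instruction per cycle with fixed warp priority, and the opcode names and stall values are invented. It is not Nvidia's actual encoding.

```python
def issue_schedule(warps):
    """warps maps a warp name to a list of (opcode, stall) pairs.

    Returns a list of (cycle, warp, opcode) in issue order.
    """
    pc = {w: 0 for w in warps}        # next instruction index per warp
    ready_at = {w: 0 for w in warps}  # earliest cycle the warp may issue again
    trace = []
    cycle = 0
    while any(pc[w] < len(prog) for w, prog in warps.items()):
        # Fixed priority: the first warp that is allowed to issue wins the cycle.
        for w, prog in warps.items():
            if pc[w] < len(prog) and ready_at[w] <= cycle:
                opcode, stall = prog[pc[w]]
                trace.append((cycle, w, opcode))
                pc[w] += 1
                ready_at[w] = cycle + stall  # static, compiler-provided delay
                break
        cycle += 1
    return trace

# Hypothetical programs: w0 must wait 4 cycles after its first instruction,
# so the scheduler fills the gap with instructions from w1.
warps = {
    "w0": [("FMUL", 4), ("FADD", 1)],
    "w1": [("IADD", 1), ("IADD", 1)],
}
for entry in issue_schedule(warps):
    print(entry)
```

In this model no hardware dependency check is needed at all: the scheduler just trusts the stall counts, and the issue slots are kept busy by switching to another warp, not by issuing two instructions from the same one.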
> you have multiple multiple operand-collector entry to minimize the probability that no entry is ready. I should have say "to minimize bubbles".
I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.