I think a modern compiler could likely do a good job with Itanium nowadays. However, when it first came out, there simply wasn't the ability to keep those instruction bundles full. Compiler technology was too far behind to work well with the hardware.
With VLIW, the complexity lives in the compiler while the machine stays simpler: the compiler unrolls the code and tries to find the parallelism itself, but it has no control over runtime stalls, and the static scheduling results in larger code size.
With an OOO superscalar machine, you have to dedicate a significant chunk of hardware to work that could easily be done by the compiler. The advantage is reduced code size and better performance on non-linear code.
There are real gains to be had with SIMD, but they tend to come from massively parallel data-processing workloads with specially written SIMD code or even hand-tuned assembly (image/video processing, neural networks). You don't just feed in a source file, compile with the SIMD flag, and realize meaningful gains.
SIMD is harder because you have to have a uniform operation across a set of data.
Imagine a for loop that looks like this:
int[] x, y, z;
int[] p, d, q;
for (int i = 0; i < size; ++i) {
p[i] = x[i] / z[i];
d[i] = z[i] * x[i];
q[i] = y[i] + z[i];
}
For SIMD, this is a complicated mess for the compiler to unravel. What the compiler would LIKE to do is turn this into 3 for loops and use SIMD instructions to perform those operations in parallel. The Itanium optimization, however, is a lot easier. The compiler can see that none of p, d, or q depends on the result of a previous statement (that is, q[i] doesn't depend on p[i]). As a result, the entire body can be packed into a single operation.
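The "3 for loops" the compiler would like to produce is classic loop fission (a sketch of the transform described above; the function name is made up for illustration):

```c
/* Fissioned form of the original loop: three independent loops,
   each applying one uniform operation across its arrays, which is
   exactly the shape packed SIMD arithmetic wants. */
void fissioned(const int *x, const int *y, const int *z,
               int *p, int *d, int *q, int size) {
    for (int i = 0; i < size; ++i) p[i] = x[i] / z[i];
    for (int i = 0; i < size; ++i) d[i] = z[i] * x[i];
    for (int i = 0; i < size; ++i) q[i] = y[i] + z[i];
}
```

Each of the three loops now touches only one output array with one operation, so the vectorizer can handle each independently.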
Now, of course, modern OOO processors can do the same optimization, so maybe it's not a huge win? Still, it would have been something worth exploring more (IMO), but market forces killed it. Moving that sort of optimization out of the processor hardware and into the compiler software seems like it could lead to some nice power/performance benefits.
All of the array accesses are uniform, so the resulting vector code is roughly:
for (i = 0 .. size by vector width) {
r0 = vector load x[i..i + vw]
r1 = vector load y[i..i + vw]
r2 = vector load z[i..i + vw]
r3 = r0 / r2
r4 = r2 * r0
r5 = r1 + r2
vector store r3 to p[i..i + vw]
vector store r4 to d[i..i + vw]
vector store r5 to q[i..i + vw]
}
(and probably unroll the loop for good measure). No need to fission the loop to vectorize here. This is trivially vectorizable for SIMD and would fit nicely in a VLIW packet too. The only issue is that if there were a runtime memory stall on any access, the entire pipeline would stall.
With predication, modern SIMD can even parallelize if conditions like the one below.
int[] x, y, z;
int[] p, d, q;
for (int i = 0; i < size; ++i) {
p[i] = x[i] / z[i];
d[i] = z[i] * x[i];
if (i > n) {
q[i] = y[i] + z[i];
} else {
q[i] = y[i];
}
}

AMD then followed Nvidia into the world of SIMD/SIMT because it offered better real-world performance for the majority of applications.
VLIW has been tried repeatedly only to be replaced with something that worked better.