I think a modern compiler could likely do a good job with Itanium nowadays. However, when it first came out, there simply wasn't the ability to keep those instruction bundles full. Compiler technology was too far behind to work well with the hardware.
With VLIW, the complexity lives in the compiler while the machine stays simpler: the compiler unrolls the code and tries to find the parallelism itself, but it has no control over runtime stalls, and the static scheduling results in larger code size.
With an OOO superscalar machine, you have to dedicate a significant chunk of hardware to work that could easily be done by the compiler. The advantage is reduced code size and better performance on non-linear code.
There are real gains to be had with SIMD, but they tend to come from massively parallel data-processing workloads with specially written SIMD code or even hand-tuned assembly (image/video processing, neural networks). You don't just feed in a source file, compile with the SIMD flag, and realize meaningful gains.
SIMD is harder because you have to have a uniform operation across a set of data.
Imagine a for loop that looks like this:
int[] x, y, z;
int[] p, d, q;
for (int i = 0; i < size; ++i) {
p[i] = x[i] / z[i];
d[i] = z[i] * x[i];
q[i] = y[i] + z[i];
}
For SIMD, this is a complicated mess for the compiler to unravel. What the compiler would LIKE to do is turn this into 3 for loops and use SIMD instructions to perform those operations in parallel. The Itanium optimization, however, is a lot easier. The compiler can see that none of p, d, or q depends on the result of a previous statement (that is, q[i] doesn't depend on p[i]). As a result, the entire body can be packed into a single operation.
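The "3 for loops" the compiler would like to produce is classic loop fission (a sketch of the transform described above; the function name is made up for illustration):

```c
/* Fissioned form of the original loop: three independent loops,
   each applying one uniform operation across its arrays, which is
   exactly the shape packed SIMD arithmetic wants. */
void fissioned(const int *x, const int *y, const int *z,
               int *p, int *d, int *q, int size) {
    for (int i = 0; i < size; ++i) p[i] = x[i] / z[i];
    for (int i = 0; i < size; ++i) d[i] = z[i] * x[i];
    for (int i = 0; i < size; ++i) q[i] = y[i] + z[i];
}
```

Each of the three loops now touches only one output array with one operation, so the vectorizer can handle each independently.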
Now, of course, modern OOO processors can do the same optimization, so maybe it's not a huge win? Still, it would have been something worth exploring more (IMO), but market forces killed it. Moving that sort of optimization out of the processor hardware and into the compiler software seems like it could lead to some nice power/performance benefits.
All of the array accesses are uniform, so the resulting vector code is roughly:
for (i = 0 .. size by vector width) {
r0 = vector load x[i..i + vw]
r1 = vector load y[i..i + vw]
r2 = vector load z[i..i + vw]
r3 = r0 / r2
r4 = r2 * r0
r5 = r1 + r2
vector store r3 to p[i..i + vw]
vector store r4 to d[i..i + vw]
vector store r5 to q[i..i + vw]
}
(and probably unroll the loop for good measure). No need to fission the loop to vectorize here. This is trivially vectorizable for SIMD and would fit nicely in a VLIW packet too. The only issue is that if there were a runtime memory stall on any access, the entire pipeline would stall.
With predication, modern SIMD can even parallelize if conditions like the one below.
int[] x, y, z;
int[] p, d, q;
for (int i = 0; i < size; ++i) {
p[i] = x[i] / z[i];
d[i] = z[i] * x[i];
if (i > n) {
q[i] = y[i] + z[i];
} else {
q[i] = y[i];
}
}

AMD then followed Nvidia into the world of SIMD/SIMT because it offered better real-world performance for the majority of applications.
VLIW has been tried repeatedly only to be replaced with something that worked better.