undefined | Better HN

0 pointsdarksaints5y ago0 comments

OpenPOWER is pretty awesome but would be nowhere near as awesome as an OpenItanium. IMHO, Itanium was always mismarketed and misoptimized. It made a pretty good server processor, but not so good that enterprises were willing to migrate 40 year old software to run on it.

In mobile form, it would have made a large leap in both performance and battery life. And it would have been a fairly easy market to break into: the average life of a mobile device is a few years, not a few decades. Recompilation and redistribution of software is the status quo.

0 comments

19 comments · 2 top-level

anarazel5y ago· 11 in thread

IMO VLIW is an absurdly bad choice for a general purpose processor. It requires baking in a huge amount of low level micro-architectural details into the compiler / generated code. Which obviously leads to problems with choosing what hardware generation to optimize for / not being able to generate good code for future architectures.

And the compiler doesn't even come close to having as much information as the CPU has. Which basically means that most of the VLIW stuff just ends up needing to be broken up inside the CPU for good performance.

dragontamer5y ago

VLIW was the best implementation (20 years ago) of instruction level parallelism.

But what have we learned in these past 20 years?

* Computers will continue to become more parallel -- AMD Zen2 has 10 execution pipelines, supporting 4-way decode and 6-uop / clock tick dispatch per core, with somewhere close to 200 registers for renaming / reordering instructions. Future processors will be bigger and more parallel, Ice Lake is rumored to have over 300-renaming registers.

* We need assembly code that scales to all different processors of different sizes. Traditional assembly code is surprisingly good (!!!) at scaling, thanks to "dependency cutting" with instructions like "xor eax, eax".

* Compilers can understand dependency chains, "cut them up" and allow code to scale. The same code optimized for Intel Sandy Bridge (2011-era chips) will continue to be well-optimized for Intel Icelake (2021 era) ten years later, thanks to these dependency-cutting compilers.

I think a future VLIW chip can be made that takes advantage of these facts. But it wouldn't look like Itanium.

----------

EDIT: I feel like "xor eax, eax" and other such instructions for "dependency cutting" are wasting bits. There might be a better way for encoding the dependency graph rather than entire instructions.

Itanium's VLIW "packages" is too static.

I've discussed NVidia's Volta elsewhere, which has 6-bit dependency bitmasks on every instruction. That's the kind of "dependency graph" information that a compiler can provide very easily, and probably save a ton on power / decoding.

jabl5y ago

I agree there is merit in the idea of encoding instruction dependencies in the ISA. There have been a number of research projects in this area, e.g. wavescalar, EDGE/TRIPS, etc.

It's not only about reducing the need for figuring out dependencies at runtime, but you could also partly reduce the need for the (power hungry and hard to scale!) register file to communicate between instructions.

floatboth5y ago

Main lesson: we failed to make all the software JIT-compiled or AOT-recompiled-on-boot or something, that would allow retargeting the optimizations for the new generation of a VLIW CPU. Barely anyone even tried. Well I guess in the early 2000s there was this vision that everything would be Java, which is JIT, but lol

1 more reply

ATsch5y ago

All of this hackery with hundreds of registers just to continue to make a massively parallel computer look like an 80s processor is what something like Itanium would have prevented. Modern processors ended up becoming basically VLIW anyway, Itanium just refused to lie to you.

1 more reply

anarazel5y ago

I don't understand how an increase, including the implied variability, of CPU internal parallelism and VLIW benefits go together?

1 more reply

moonchild5y ago

Have you seen the mill cpu?

1 more reply

KMag5y ago

> And the compiler doesn't even come close to having as much information as the CPU has.

Unless your CPU has a means for profiling where your pipeline stalls are coming from, combined with dynamic recompilation/reoptimization similar to IBM's project DAISY or HP's Dynamo.

It's not going to do well as out-of-order CPUs that make instruction re-optimization decisions for every instruction, but I wouldn't rule out software-controlled dynamic re-optimization getting most of the performance benefits of out-of-order execution with a much smaller power budget, due to not re-doing those optimization calculations for every instruction. There are reasons most low-power implementations are in-order chips.

csharptwdec195y ago

I feel like what you describe is possible. When I think of what Transmeta was able to accomplish in the early 2000s just with CMS, certainly so.

darksaintsOP5y ago

Traditional compiler techniques may have struggled with maintaining code for different architectures, but a lot has changed in the last 15 years. The rise of widely used IR languages has led to compilers that support dozens of architectures and hundreds of instruction sets. And they are getting better all the time.

The compiler has nearly all of the information that the CPU has, and it has orders of magnitude more. At best, your CPU can think a couple dozen cycles ahead of what it is currently executing. The compiler can see the whole program, can analyze it using dozens of methodologies and models, and can optimize accordingly. Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to be able to implement in hardware.

dragontamer5y ago

> At best, your CPU can think a couple dozen cycles ahead of what it is currently executing.

The 200-sized reorder buffer says otherwise.

Loads/stores can be reordered for 200+ different concurrent objects on modern Intel skylake (2015 through 2020) CPUs. And its about to get a bump to 300+ sized reorder buffers in Icelake.

Modern CPUs are designed to "think ahead" almost the entirety of DDR4 RAM Latency, allowing reordering of instructions to keep the CPU pipes as full as possible (at least, if the underlying assembly code has enough ILP to fill the pipelines while waiting for RAM).

> Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to be able to implement in hardware.

You might be surprised at what the modern Branch predictor is doing.

If your "call rax" indirect call constantly calls the same location, the branch predictor will remember that location these days.

2 more replies

formerly_proven5y ago

> The compiler has nearly all of the information that the CPU has, and it has orders of magnitude more.

The CPU has something the compiler can never have.

Runtime information.

That's why VLIW works great for DSP which is 99.9 % fixed access patterns, while being bad for general purpose code.

drivebycomment5y ago· 6 in thread

Itanuim deserved its fiery death and resurrection doesn't make any sense whatsoever. It's a dead end architecture, and humanity gained (by freeing up valuable engineering power to other more useful endeavors) when it died.

darksaintsOP5y ago

Itanium was an excellent idea that needed investment in compilers. Nobody wanted to make that investment because speculative execution got them 80% of the way there without the investment in compilers. But as it turns out, speculative execution was a phenomenally bad idea, and patching its security vulnerabilities has set back processor performance to the point where VLIW seems like a good idea again. We should have made those compiler improvements decades ago.

dragontamer5y ago

NVidia Volta: https://arxiv.org/pdf/1804.06826.pdf

Each machine instruction on NVidia Volta has the following information:

* Reuse Flags

* Wait Barrier Mask

* Read/Write barrier index (6-bit bitmask)

* Read Dependency barriers

* Stall Cycles (4-bit)

* Yield Flag (1-bit software hint: NVidia CU will select new warp, load-balancing the SMT resources of the compute unit)

Itanium's idea of VLIW was commingled with other ideas; in particular, the idea of a compiler static-scheduler to minimize hardware work at runtime.

To my eyes: the benefits of Itanium are implemented in NVidia's GPUs. The compiler for NVidia's compiler-scheduling flags has been made and is proven effective.

Itanium itself: the crazy "bundling" of instructions and such, seems too complex. The explicit bitmasks / barriers of NVidia Volta seems more straightforward and clear in describing the dependency graph of code (and therefore: the potential parallelism).

----------

Clearly, static-compilers marking what is, and what isn't, parallelizable, is useful. NVidia Volta+ architectures have proven this. Furthermore, compilers that can emit such information already exist. I do await the day when other architectures wake up to this fact.

1 more reply

StillBored5y ago

I think your conflating OoO and speculative execution. It was OoO which the itanium architects (apparently) didn't think would work as well as it did. OoO and being able to build wide superscaler machines, which could dynamically determine instruction dependency chains is what killed EPIC.

Speculative execution is something you would want to do with the itanium as well, otherwise the machine is going to be stalling all the time waiting for branches/etc. Similarly, later itaniums went OoO (dynamically scheduled) because it turns out, the compiler can't know runtime state..

https://www.realworldtech.com/poulson/

Also while googling for that, ran across this:

https://news.ycombinator.com/item?id=21410976

PS: speculative execution is here to stay, it might be wrapped in more security domains and/or its going to just be one more nail in the business model of selling shared compute (something that was questionably from the beginning).

1 more reply

Quequau5y ago

I dimly recall reading an interview with one of Intel's Sr. Managers on the Itanium project where he explained his thoughts on why Itanium failed.

His explanation centred on the fact that Intel decided early on that Itanium would only ever be an ultra high end niche product and only built devices which Intel could demand very high prices for. This in turn meant that almost no one outside of the few companies who were supporting Itanium development and certainly not most of the people who were working on other compilers and similar developer tools at the time, had any interest in working on Itanium because they simply could not justify the expense of obtaining the hardware.

So all the organic open source activity that goes on for all the other platforms which are easily obtainable by pedestrian users simply did not go on for Itanium. Intel did not plan on that up front (though in hindsight it seemed obvious) and by the time that was widely recognised within the management team no one was willing to devote the sort of scale of resources that were required for serious development of developer tools on a floundering project.

jabl5y ago

> Itanium was an excellent idea that needed investment in compilers.

ISTR that Intel & HP spent well over a $billion on VLIW compiler R&D, with crickets to show for it all.

How much are you suggesting should be spent this time for a markedly different result?

drivebycomment5y ago

By late 2000s, instruction scheduling research was largely considered done and dusted, with papers like:

https://dl.acm.org/doi/book/10.5555/923366 https://dl.acm.org/doi/10.1145/349299.349318

and many, many others (it produced so many PhDs in 90s). And, needless to say, HP and Intel hired so many excellent researchers during the heydays of Itanium. So I don't know on what basis you think there wasn't enough investment. So I have no choice but to assume you're ignorant of the actual history here, both in academics and industry.

It turns out instruction scheduling can not overcome the challenge of variable memory and cache latency, and branch prediction, because all of those are dynamic and unpredictable, for "integer" application (i.e. bulk of the code running on the CPUs of your laptop and cell phones). And, predication, which was one of the "solutions" to overcome branch misprediction penalties, turns out to be not very efficient, and is limited in its application.

For integer applications, it turns out the instruction level parallelism isn't really the issue. It's about how to generate and maintain as many outstanding cache misses at a time. VLIW turns out to be insufficient and inefficient for that. Some minor attempts are addressing that through prefetches and more elaborate markings around load/store all failed to give good results.

For HPC type workload, it turns out data parallelism and thread-level parallelism are much more efficient way to improve the performance, and also makes ILP on a single instruction stream play only a very minor role - GPUs and ML accelerators demonstrate this very clearly.

As for the security and the speculative execution, speculative execution is not going anywhere. Naturally, there are many researches around this like:

https://ieeexplore.ieee.org/abstract/document/9138997 https://dl.acm.org/doi/abs/10.1145/3352460.3358306

and while it will take a while before the real pipeline implements ideas like above thus we may continue to see some smaller and smaller vulnerabilities as the industry collectively plays whack-a-mole game, I don't see a world where the top of the line general-purpose microprocessor giving up on speculative execution, as the performance gain is simply too big.

I have yet to meet any academics or industry processor architects or compiler engineer who think VLIW / Itanium is the way to move forward.

This is not to say putting as much work to the compiler is a bad idea, as nVidia has demonstrated. But what they are doing is not VLIW.

j / k navigate · click thread line to collapse

0 comments

19 comments · 2 top-level

anarazel5y ago· 11 in thread

dragontamer5y ago

VLIW was the best implementation (20 years ago) of instruction level parallelism.

But what have we learned in these past 20 years?

I think a future VLIW chip can be made that takes advantage of these facts. But it wouldn't look like Itanium.

----------

EDIT: I feel like "xor eax, eax" and other such instructions for "dependency cutting" are wasting bits. There might be a better way for encoding the dependency graph rather than entire instructions.

Itanium's VLIW "packages" is too static.

jabl5y ago

I agree there is merit in the idea of encoding instruction dependencies in the ISA. There have been a number of research projects in this area, e.g. wavescalar, EDGE/TRIPS, etc.

floatboth5y ago

1 more reply

ATsch5y ago

1 more reply

anarazel5y ago

I don't understand how an increase, including the implied variability, of CPU internal parallelism and VLIW benefits go together?

1 more reply

moonchild5y ago

Have you seen the mill cpu?

1 more reply

KMag5y ago

> And the compiler doesn't even come close to having as much information as the CPU has.

Unless your CPU has a means for profiling where your pipeline stalls are coming from, combined with dynamic recompilation/reoptimization similar to IBM's project DAISY or HP's Dynamo.

csharptwdec195y ago

I feel like what you describe is possible. When I think of what Transmeta was able to accomplish in the early 2000s just with CMS, certainly so.

darksaintsOP5y ago

dragontamer5y ago

> At best, your CPU can think a couple dozen cycles ahead of what it is currently executing.

The 200-sized reorder buffer says otherwise.

Loads/stores can be reordered for 200+ different concurrent objects on modern Intel skylake (2015 through 2020) CPUs. And its about to get a bump to 300+ sized reorder buffers in Icelake.

> Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to be able to implement in hardware.

You might be surprised at what the modern Branch predictor is doing.

If your "call rax" indirect call constantly calls the same location, the branch predictor will remember that location these days.

2 more replies

formerly_proven5y ago

> The compiler has nearly all of the information that the CPU has, and it has orders of magnitude more.

The CPU has something the compiler can never have.

Runtime information.

That's why VLIW works great for DSP which is 99.9 % fixed access patterns, while being bad for general purpose code.

drivebycomment5y ago· 6 in thread

darksaintsOP5y ago

dragontamer5y ago

NVidia Volta: https://arxiv.org/pdf/1804.06826.pdf

Each machine instruction on NVidia Volta has the following information:

* Reuse Flags

* Wait Barrier Mask

* Read/Write barrier index (6-bit bitmask)

* Read Dependency barriers

* Stall Cycles (4-bit)

* Yield Flag (1-bit software hint: NVidia CU will select new warp, load-balancing the SMT resources of the compute unit)

Itanium's idea of VLIW was commingled with other ideas; in particular, the idea of a compiler static-scheduler to minimize hardware work at runtime.

To my eyes: the benefits of Itanium are implemented in NVidia's GPUs. The compiler for NVidia's compiler-scheduling flags has been made and is proven effective.

----------

1 more reply

StillBored5y ago

https://www.realworldtech.com/poulson/

Also while googling for that, ran across this:

https://news.ycombinator.com/item?id=21410976

1 more reply

Quequau5y ago

I dimly recall reading an interview with one of Intel's Sr. Managers on the Itanium project where he explained his thoughts on why Itanium failed.

jabl5y ago

> Itanium was an excellent idea that needed investment in compilers.

ISTR that Intel & HP spent well over a $billion on VLIW compiler R&D, with crickets to show for it all.

How much are you suggesting should be spent this time for a markedly different result?

drivebycomment5y ago

By late 2000s, instruction scheduling research was largely considered done and dusted, with papers like:

https://dl.acm.org/doi/book/10.5555/923366 https://dl.acm.org/doi/10.1145/349299.349318

As for the security and the speculative execution, speculative execution is not going anywhere. Naturally, there are many researches around this like:

https://ieeexplore.ieee.org/abstract/document/9138997 https://dl.acm.org/doi/abs/10.1145/3352460.3358306

I have yet to meet any academics or industry processor architects or compiler engineer who think VLIW / Itanium is the way to move forward.

This is not to say putting as much work to the compiler is a bad idea, as nVidia has demonstrated. But what they are doing is not VLIW.

j / k navigate · click thread line to collapse