If they thought Itanium was bad, they should have looked into the i860. Itanium was an attempt to fix a bunch of the i860 ideas. i860 quickly went from a supercomputer chip to a cheap DSP alternative (where it had at least the hope of hitting more than 10% of its theoretical performance).
Intel iAPX 432 was preached as the second coming back in the 80s, but failed spectacularly. The i960 was take 2 and their joint venture called BiiN also shuttered. Maybe Rekursiv would be worthy of a mention here too.
We now know that Core 2 dropped all kinds of safety features, resulting in the Meltdown vulnerabilities. It also partially explains why AMD couldn't keep up, as these shortcuts gave a big advantage (though security papers at the time predicted that Meltdown-style attacks existed due to the changes).
Rather than an "honorable mention", the Cell processor should have easily topped the list of designs they mentioned. It was terrible in the PS3 (with few games if any able to make full use of it) and it was terrible in the couple supercomputers that got stuck with it.
I'd also note that Bulldozer is also maligned more than it should be. There's a lot to like about the concept of CMT and for the price, they weren't the worst. I'd even go so far as to say that if AMD wasn't so starved for R&D money during that period, they may have been able to make it work. ARM's latest A510 shares more than a few similarities. A big/little or big/little/little CMT architecture seems like a very interesting approach to explore in the future.
As for Bulldozer, I was saddled with one for a while. Where it really fell down was (surprise!) its floating point performance. That FPU shared between two integer units makes for some "interesting" performance characteristics when trying to run multiple FP-heavy tasks, but overall, it was merely mediocre rather than terrible. I'm glad AMD hit it out of the park with Zen.
The wrong expectations and false advertising centered on the fact that the first Bulldozer was described as an 8-core CPU which would easily crush its 4-core competition from Intel (Sandy Bridge).
What the AMD bloggers forgot to mention was that the new Bulldozer cores were much weaker than the cores of the previous CPU generations: they could execute only 2 instructions per cycle, while an Intel core could execute 4 (and the previous AMD cores could execute 3). So for multi-threaded tasks a Bulldozer core had only the performance of one of the 2 threads of an Intel core, with the additional disadvantage that the resources of 2 AMD cores could not be allocated to a single thread when the second core of a module was idle.
So an 8-core Bulldozer could barely match the multi-threaded performance of a 4-core Sandy Bridge, while being much slower on single-thread tasks.
Had it been known from the beginning that the Bulldozer cores were intentionally designed to be much weaker than both the old AMD cores and the Intel cores, this would not have been a surprise, and everybody who cared more about the price/performance ratio than about absolute performance would have been happy to buy Bulldozer CPUs.
However, after many months during which AMD claimed that their supposedly 8-core CPU would be better than any other CPU with fewer cores, there was huge disappointment at the first tests after launch, which immediately revealed the pathetic performance of the new cores, which on single-threaded tasks were much slower than the previous AMD CPUs.
So all the hate was caused by the stupid actions of AMD management and marketing, who lied continuously about Bulldozer even though they should have realized it was pointless, because independent benchmarks would reveal the truth immediately after launch.
To set expectations correctly about Bulldozer vs. Sandy Bridge, what AMD called a 4-module, 8-core CPU should have been called a 4-core, 8-thread CPU, one with dynamic allocation inside a core ("module" in AMD jargon) only for the FPU, while the integer resources are allocated statically. With this correct description there would have been no surprise about Bulldozer's behavior.
Part of the hate is also due to some engineering decisions whose reasons remain a mystery even now: if you had randomly queried a thousand logic design engineers before 2011, all or almost all would have called them bad decisions, so it is hard to understand how they were proposed and approved inside the AMD design teams.
For example, from the Opteron launch in 2003 until Intel launched Sandy Bridge in 2011, the largest performance advantage of the AMD CPUs was in computations with large numbers, because the AMD CPUs could do integer multiplications much faster than the Intel CPUs.
Intel's designers recognized that this was a problem, and during the 2006-2011 interval they decreased every year the number of clock cycles required for operations like multiplication and division, so that Penryn began to approach AMD's throughput per clock cycle, Nehalem & Westmere matched it, and Sandy Bridge achieved double the throughput of the old AMD CPUs.
While Intel worked diligently to improve the performance of their cores, what did AMD do?
Someone at AMD decided, for unknown reasons, that Bulldozer did not need to keep the existing computational performance; it was deemed enough to have integer multipliers with half the previous throughput, and only a quarter of that of the Sandy Bridge competition. (Intel had announced well in advance, more than a year before launch, that Sandy Bridge would double the integer multiplication throughput over Nehalem, and it was in any case an obvious trend in the evolution of their previous cores, so the higher performance of the competition could not have been a surprise for the AMD designers.)
The downgraded integer multipliers crippled the performance of the new AMD CPUs in exactly those applications where their previous CPUs had been the best, while enabling only a negligible reduction in core area.
Nobody cuts prices more than they have to, but everyone adjusts prices to where they need to go to sell the product. Bulldozer was priced low because it was genuine garbage; it was actually slower than Phenom in a lot of cases (which blows the "it was about price to performance!" argument out of the water - nobody regresses performance on purpose).
(and before people wind up about the obvious counterexample: Ryzen was priced low because a 1800X was genuinely a lot slower than a 5960X in productivity tasks due to latency and poor AVX performance, and got completely smoked in gaming. If they had tried to go head-to-head with Intel at $1000 pricing they wouldn't have sold anything because it would have been a far inferior package to what Intel offered, they had to cut prices by around half to make it a compelling offering. And even then it was not that appealing compared to, say, a 5820K.)
Companies need to make enough of a showing to attract consumers but if a company prices something super aggressively, there's often a catch. And that's bulldozer in a nutshell. Oh shit the product sucks. What can we charge for a mediocre "8-core" (sorta) that underperforms the 4-core i7? Offer it at i5 pricing and see if anyone bites. If they had managed to achieve good performance, they would have priced it appropriately.
(the other thing is - people prefer to make the comparison about the FX-8350, but that's not Bulldozer, that's Piledriver. Bulldozer was the FX-8150/FX-6100, which actually did outright regress performance vs a Phenom X6, and was priced relatively steeply due to "8 real cores". Bulldozer went up against Sandy Bridge, Piledriver was more of an Ivy Bridge/Haswell competitor, and that's where prices really started to drop. It isn't a huge difference but Intel was making some progress too in those days.)
Price chart: https://www.anandtech.com/show/4955/the-bulldozer-review-amd...
- The Intel i432 - too far ahead of its time, an Itanium for the 1980s. https://en.wikipedia.org/wiki/Intel_iAPX_432
- The TI TMS320 series of DSPs. So full of silicon bugs it hurt TI badly.
- The Transputer T9000 - very ambitious, but vapourware for so long it killed its parent company. https://en.wikipedia.org/wiki/Transputer#T9000
Sony gave you 6 of the 8 SPE cores to use (I think they reserved two, but it's been ages). They are indeed very fast; however, they have no cache-coherent access to main RAM and only 256k of memory for each element. So, you have to meticulously write DMA scheduling code to keep them fed. If you're a simpleton like me, you double buffer your SPE memory, cutting it in half: 128k to work with, 128k for paging into, and you hope to be done paging before it's needed. Latency to memory is on the order of 2,000 cycles to first byte, but then the data arrives fast.
So, what you do is decompose your problem into data streams that can be crunched through, but in such a way that you minimize the need to randomly access much memory. It's often cheaper to recompute things locally than to fetch them from RAM. Random access into your RAM is pointless, so you have to marshal all your input into DMA buffers, do some work, marshal all your output into other DMA buffers, and send it back to the host CPU.
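Roughly, the inner loop ends up looking like the sketch below - written from memory against the Cell SDK's spu_mfcio.h MFC intrinsics, so treat the buffer size and the process() kernel as placeholders rather than anything from an actual project:

    /* Double-buffered SPE streaming sketch (from memory, not production code).
     * While one 16 KB chunk is being processed, the DMA for the next chunk is
     * already in flight, hiding the ~2,000-cycle latency to main RAM. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK (16 * 1024)

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *data, unsigned size);   /* hypothetical per-chunk kernel */

    void stream(uint64_t src_ea, unsigned nchunks)
    {
        int cur = 0;

        /* Prime the pipeline: start fetching chunk 0 (tag 0). */
        mfc_get(buf[cur], src_ea, CHUNK, cur, 0, 0);

        for (unsigned i = 0; i < nchunks; i++) {
            int next = cur ^ 1;

            /* Kick off the DMA for the next chunk before touching this one. */
            if (i + 1 < nchunks)
                mfc_get(buf[next], src_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

            /* Wait only for the current buffer's tag, then crunch it while the
             * other transfer overlaps with the work. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);

            cur = next;
        }
    }

Output goes back the same way: marshal results into another pair of buffers and DMA them to the host while the next input chunk is being consumed.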
Anyhow, I got this working. Meshes were being skinned at a very high rate, but it was very frustrating. The PPE was really slow, so you had to offload as much as you could to those SPEs. But hey, I may be complaining, but it sure beats dealing with the "Emotion Engine" on the PS2. I can tell you which emotion that engine brings up.
If the chip were so wonderful to work on, then it would still be in use today as the theoretical performance per area beats everything else by a wide margin.
Roadrunner was built in 2008. It would still be just barely off the top 500 list in 2021, but was decommissioned just FIVE years later in 2013. Its x86 replacement was already underway in 2010 TWO years after its launch.
I'm glad you got to work with the architecture you loved for so many years, but I think the rest of the world disagrees with your assessment.
Until today, I've never once seen someone "singing its praises" who had actually written code for one. At best, they'd curse it under their breath while admitting it had its benefits. Usually, however, it was a full-throated rant about how bad the experience was.
Cell (for example) was an asymmetric/hybrid multicore CPU; Apple Silicon is perhaps a modern example of asymmetric performance vs. efficiency cores, and also features special-purpose accelerator cores such as the neural engine.
The 432 had capability-based addressing. Speed-over-security has had a good run, but with some disastrous consequences. We may be seeing the return of capabilities with CHERI/ARM.
The 960 was an early superscalar design, supported tag bits, and was also a successful product.
"RISC instruction sets I have known and disliked."
https://www.jwhitham.org//2016/02/risc-instruction-sets-i-ha...
https://news.ycombinator.com/item?id=11607119
I might also say that Sun's UltraSPARC was constantly beaten by Fujitsu's SPARC64. It would have been better to outsource.
As for the Cell, it was an overly complex architecture that had remarkable performance under very optimized code. The hope was that hand-tuned libraries would address this and that compiler optimizations would take care of the rest. Neither happened in a meaningful way. We did two major projects with the Cell, using it for real-time HDTV compression/direct broadcast applications.
Another one not on the list was the inmos Transputer. Again, similar to the Cell: very complex and fast for its time, but not easy to extract that performance. That was my first job as an EE - we used it on a GPS receiver ISA card in the early days of GPS. It was a good choice because it was very fast and could keep up with the signal processing, which let us roll out code updates to add major features as various changes to the GPS signals arrived (P-code on L2, SA being turned off, and later the unencrypted CA code on L2). Our competitors had to redesign ASICs to get these new features, which meant long product cycles and hardware replacement.
Today I find myself doing a lot on the M1 series, as well as Epyc. Now you can give zero shits about clean optimized code and it still runs amazingly fast. Last time I had to do assembler or intrinsics was many many years ago - and I sort of miss that intimacy with the hardware to get the most out of it.
It runs rings around workstations!
Curiously, every other out-of-order chip designer except AMD also designed CPUs with Meltdown flaws. That's, per their own documentation, ARM, IBM (both POWER and mainframe), and SPARC - and I think MIPS too, but they weren't entirely clear about it.
I have an old X11 terminal that I believe has an i960 in it. I'm shocked that thing was capable of running CDE desktops, when it stutters on FVWM even over a network much faster than anything it was ever intended to see.
SGI, Compaq and HP mothballed development of their own CPUs (MIPS/Alpha/PA-RISC) as they all settled on Itanium for future products.
After Itanium turned out to be a flop, those companies adopted x86-64 - Intel killed off 3 competing ISAs by shipping a bad product.
I think a modern compiler could likely do a good job with Itanium nowadays. However, when it first came out, there simply wasn't the ability to keep those instruction bundles full. Compiler tech was too far behind to work well with the hardware.
Signetics made the 2650, a nice processor with a highly regular architecture and a condition code register. After every arithmetic operation, including loads and stores, the ALU updated the condition code register.
The National 32032 processor was a wonderful part with a clarity of design that made it a great choice for a workhorse processor. Unix running on the machine was stable and efficient, except that every few weeks there would be a disastrous crash. With a tremendous amount of effort the source of the problem was found: a race condition in the interrupt control logic that returned from the wrong stack and scribbled over memory.
The Intel i860 exposed the internal computational pipeline to the programmer. Context switching was complicated by the conflict between real-time operating performance requirements and a deep pipeline with no way to grab the context and drain the pipeline. Eventually a dedicated team got a Unix OS running on the part, but it performed poorly.
The Maspar MP-1 was a SIMD machine. It was cool to test new library functions by seeing if, say, sqrt(x)*sqrt(x)==x for all floating point numbers. Customers wanted the Maspar machine to be timeshared, but the architecture made it difficult to do since the CPU state was very large and memory was not mapped.
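These days you could brute-force the same kind of check serially - there are only four billion single-precision bit patterns - though on the MP-1 the point was that SIMD across the whole machine made it quick. A rough C sketch of the idea (illustrative only, not the MasPar code; because of rounding, sqrt(x)*sqrt(x)==x fails for many inputs, so you count mismatches rather than assert):

    /* Walk every 32-bit float bit pattern and count how often
     * sqrtf(x)*sqrtf(x) reproduces x exactly. Illustrates the "test the
     * entire input space" idea; takes a while on one core. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint64_t exact = 0, tested = 0;
        for (uint64_t bits = 0; bits <= 0xFFFFFFFFu; bits++) {
            uint32_t b = (uint32_t)bits;
            float x;
            memcpy(&x, &b, sizeof x);      /* reinterpret the bit pattern as a float */
            if (isnan(x) || x < 0.0f)
                continue;                  /* sqrt isn't comparable for these */
            tested++;
            if (sqrtf(x) * sqrtf(x) == x)
                exact++;
        }
        printf("%llu of %llu non-negative floats round-trip exactly\n",
               (unsigned long long)exact, (unsigned long long)tested);
        return 0;
    }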
Intel's 8048 (and simplified versions like the 8021 and enhanced versions like the 8051) did not perform as well in terms of speed or code size as many of the competing microcontrollers. The competition offered very simple, asymmetric architectures which could be programmed (possibly with external hardware assists) to accomplish embedded tasks, but only with significant effort over several days or weeks. The Intel part was not quite as efficient in memory use and speed, but could be programmed in an afternoon. And another engineer/programmer could look at the code and understand it without much deep thought.
The Motorola 68000 was a wonderful machine with a clear instruction set. But the original 68000 could not support virtual memory.
There have been all sorts of different architectures tried which seem strange today but came about because the architecture was thought to provide an engineering solution to an immediate problem. There was a time when register machines were thought to be a bad architecture, far inferior to a simple stack architecture.
I know Intel wanted Itanium to succeed for the same reasons, but the PIV came very close to home since it actually shipped for consumers. Oddly enough, Extreme Tech was a huge shill for Intel back in those days. Funny they don't mention that in this article.
It's a nifty little CPU. There's a lot of hidden little features once you dig in. It can actually address multiple separate 64k memory namespaces: data memory, instruction memory, macroinstruction memory, and mapped memory with the assistance of a then-standard chip. Normally these are all the same space and just need external logic to differentiate them. There's also a completely separate serial and parallel hardware interface bus.
The macroinstruction ("Macrostore") feature is pretty fun. There are sets of opcodes that decode as illegal instructions but, instead of immediately erroring out, go looking for a new PC and workspace pointer (the "registers") in memory and jump there. Their commercial systems like the 990/12 used this feature to add floating point and other features like stack operations.
Yup, there's no stack. Just the 16 "registers," which live in main memory. There are specific branch and return instructions that store the previous PC and register pointer in the top registers of the new "workspace," allowing you direct access to the context of the caller. The assembly language is simple and straightforward with few surprises, but it's also clearly an abstraction over the underlying mechanisms of the CPU. I believe this then classifies this CPU as CISC incarnate.
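If it helps, here's a toy C model of what that workspace mechanism does (BLWP/RTWP on the 9900, if memory serves; the names below are invented and this is purely a mental model, not emulator code): the call fetches a new WP and PC from a vector and drops the caller's WP/PC/ST into R13-R15 of the new workspace, and the return walks back out.

    /* Toy C model of the TMS9900 workspace idea described above. */
    #include <stdint.h>

    typedef struct {
        uint16_t wp;   /* workspace pointer: address of 16 words of RAM acting as R0-R15 */
        uint16_t pc;
        uint16_t st;   /* status register */
    } cpu_t;

    static uint16_t ram[32768];   /* 64 KB of memory, indexed by word here for simplicity */

    /* BLWP @vec: vec holds a new WP and PC; the old WP/PC/ST land in the new
     * workspace's R13/R14/R15, so the callee can reach straight into the
     * caller's "registers" through the saved WP. */
    static void blwp(cpu_t *cpu, uint16_t vec)
    {
        uint16_t new_wp = ram[vec / 2];
        uint16_t new_pc = ram[vec / 2 + 1];

        ram[new_wp / 2 + 13] = cpu->wp;   /* R13 = old WP */
        ram[new_wp / 2 + 14] = cpu->pc;   /* R14 = old PC */
        ram[new_wp / 2 + 15] = cpu->st;   /* R15 = old ST */

        cpu->wp = new_wp;
        cpu->pc = new_pc;
    }

    /* RTWP: restore the caller's context from R13-R15 of the current workspace. */
    static void rtwp(cpu_t *cpu)
    {
        cpu->st = ram[cpu->wp / 2 + 15];
        cpu->pc = ram[cpu->wp / 2 + 14];
        cpu->wp = ram[cpu->wp / 2 + 13];
    }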
There are some brilliant and insane people on the Atari Age forums! One of them managed to extract and post the data for a subset of those floating point instructions, and then broke down how it all worked. Some are building new generations of previous TMS9900 systems. One of them is replicating the CPU in an FPGA. A few others are building things like a full-featured text editor and, of course, an operating system.
I've learned a hell of a lot during this project. I've been documenting what I'm doing and am planning to eventually make it into a pretty build log. I think this is a beautiful dead platform that deserved better.
So, first, it generally had a higher IPC than anything else available (ignoring the P6). So the smart marketing people at Cyrix decided they were going to sell it based on a PR rating, which was the average performance on a number of benchmarks vs a similar Pentium. AKA a Cyrix PR166 (clocked at 133 MHz) was roughly the same perf as a 166 MHz Pentium. Now, had they actually been selling it for an MSRP similar to a Pentium 166 that might have seemed a bit shady, but they were selling it closer to the price of a Pentium 75/90.
Then along comes Quake, which is hand-optimized for the Pentium's U/V pipeline architecture and happens to use floating point too. And since a number of people had pointed out that the 6x86's floating point perf was closer to its actual clock speed than to its "PR" rating, suddenly you have a chip performing at much less than its PR rating, and certain people then proceeded to bring up, at every chance they got, the fact that in Quake it was more like a 90 MHz Pentium than a 166 MHz Pentium (something I'm sure made, say, Intel really happy).
So, yah, here we are 20 years later putting a chip with what was generally a higher IPC than its competitors on a "shit" list mostly because of one benchmark. While hopefully all being aware that these shenanigans continue to this day: a certain company will be more than happy to cherry-pick a benchmark and talk up their product while ignoring all the benchmarks that make it look worse.
Now, as far as motherboard compatibility goes, that was true to a certain extent if you didn't bother to ensure your motherboard was certified for the higher bus rates required by the Cyrix; the other issue was that it tended to require more sustained current than the Intel chips the motherboards were initially designed for. So, yah, the large print said "compatible with Socket 7," the fine print later added that boards needed to be qualified, and the whole thing paved the way for the Super Socket 7 specs which AMD made use of. And of course lots of people didn't put large enough heatsinks/fans on them, which they needed to be stable.
So, people are shitting on a product that gets a bad rep because they were mostly ignorant of what we have all come to accept as normal business when you're talking about differing microarchitectural implementations.
PS: Proud owner of a 6x86 that cost me about the same as a Pentium 75, and not once do I think it actually performed worse than that, while for the most part (compiling code, and running everything else including Unreal) it was significantly better than my roommate's Pentium 75.
IIRC the official excuse when this became public was that an MS engineer turned it off because one of their test machines couldn't complete a stress test with it enabled, but later it turned out the root cause was a bad motherboard. The curious part is that it didn't result in MS immediately issuing a hotfix to turn the cache back on.
edit: found one of the articles mentioning this. https://www.tomshardware.com/reviews/bananas,9.html
Apparently it was just write-back mode that got disabled; either way, that link mentions a 30% perf hit.
The common thread was Intel marketing pushing something that was a dog for marketing reasons
1. It is very amazing, and not in a good way, when you think you have enough inventory but someone from HQ calls up the warehouse and has the older CPUs crushed by a bulldozer (you don't want to throw them out, they are quite usable)
2. It was amazing that sucker ran so hot that tech support got a call about test boxes catching on fire
That didn't last long. Like what, one generation?
Good.
(saying that, but I remember purchasing a dual Pentium II motherboard for two 400 MHz CPUs to speed up 3D Studio 4 renderings under Windows NT4... xD)
Cache was still external at that point. There would be performance benefits from bringing it on-die, but larger chips are more expensive to make, and using two smaller dies (one for the CPU and one for cache, like the Pentium Pro) is still quite expensive.
The middle ground was to put the CPU and cache on a single PCB, so you end up with a cartridge form factor. By the time the next generation rolled around it was possible to put the CPU and cache on the same die at a reasonable cost (Moore's law), making the cartridge form factor obsolete.
I thought it was cool at the time, made me think of a NES cartridge.
(Single use analog pocket cameras)
They were all awful.
It was without a doubt the fastest CPU I had ever had at the time, but boy did it generate heat and need cooling.
That machine sounded like an always-on vacuum cleaner.
[0] https://www.cnet.com/culture/pcs-plagued-by-bad-capacitors/
https://obits.dallasnews.com/us/obituaries/dallasmorningnews...
Worked on Itanium too. It was even more amazing that Microsoft actually had support for it.
The fact that the fault was tiny and that few people were affected is definitely NOT the point.
The so-called Pentium 'bug' was the result of fundamentally terrible engineering on Intel's part in that the underlying design wasn't fit for purpose - it wasn't just a bug.
It seems to me the authors of this story do not understand the implications: what Intel did was fundamentally wrong in that its math processing was flawed by design from the outset, otherwise they would have included the Pentium in their list.
To achieve increased math processing speed, Intel broke the mathematics down into part algorithm and part lookup table - that is, instead of having the algorithm complete the whole task (which is the logical way of doing things). If the algorithm itself were wrong then every calculation would also be wrong and the problem would be obvious from the outset. Adding a lookup table makes calculations faster, but one would then have had to test every entry in the lookup table - and Intel didn't.
Look at the problem like this - think of a set of log or trig tables, now think of the implications if one of those table entries is incorrect. What Intel did was deliberate cheating and it failed to get away with it. Intel would have known this from the outset and thus the problem was an integral design fault rather than a bug.
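To make the failure mode concrete, here's a toy table-seeded divider (nothing like Intel's actual SRT hardware; the sizes and names are made up): corrupt a single entry and only the narrow slice of divisors that index that entry come out silently wrong, in roughly the 4th significant digit, while every other input looks fine - which is exactly why spot-check testing misses it.

    /* Toy table-seeded divider (NOT the Pentium's SRT scheme; sizes made up).
     * One corrupted entry only affects divisors whose leading bits index it,
     * so random spot checks are very unlikely to find the problem. */
    #include <stdio.h>

    #define TBL_BITS 6
    static double recip_seed[1 << TBL_BITS];

    static void build_table(int corrupt_entry)
    {
        for (int i = 0; i < (1 << TBL_BITS); i++)
            recip_seed[i] = 1.0 / (1.0 + (double)i / (1 << TBL_BITS));  /* divisor in [1, 2) */
        if (corrupt_entry >= 0)
            recip_seed[corrupt_entry] *= 0.9;   /* one entry is 10% off */
    }

    static double divide(double n, double d)   /* assumes 1 <= d < 2 */
    {
        int idx = (int)((d - 1.0) * (1 << TBL_BITS));   /* leading fraction bits pick the entry */
        double r = recip_seed[idx];
        r = r * (2.0 - d * r);   /* two Newton-Raphson refinements of the reciprocal: */
        r = r * (2.0 - d * r);   /* a 10% seed error shrinks to ~1e-4, not to zero    */
        return n * r;
    }

    int main(void)
    {
        build_table(37);   /* corrupt a single entry */
        double d_ok = 1.25, d_bad = 1.0 + 37.5 / 64.0;   /* the second one hits entry 37 */
        printf("clean entry: %.12g (exact %.12g)\n", divide(1.0, d_ok),  1.0 / d_ok);
        printf("bad entry:   %.12g (exact %.12g)\n", divide(1.0, d_bad), 1.0 / d_bad);
        return 0;
    }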
Intel knowingly implemented a design that had flawed data integrity at its most fundamental level. What Intel did was so nasty that it's hard to think of how it could have made matters worse than if it had deliberately tried to introduce a fault.
In my opinion, any company that would stoop to such low ethical tactics as Intel did with the Pentium's design would have demonstrated that it cannot be trusted - and I've never trusted Intel from that point onward.
If anyone ever needs a reason for why processors should have open design architectures that are subject to third-party scrutiny then this is the quintessential example.
There's a great writeup with the results of Intel's internal investigation [2], which outlines the challenge in testing production chips for this sort of bug. A key point:
> The fraction of the total input number space that is prone to failure is 1.14 x 10^-10.
So around 1 in 9 billion possible numerator/denominator pairs exhibit the bug. Testing 9 billion double-precision FDIV divides on a 60MHz Pentium would take almost four days, if my math checks out and the CPU could do 2.5 billion divides per 24 hours.
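(Sanity check on that: 9 x 10^9 divides / 2.5 x 10^9 divides per day ≈ 3.6 days just to expect one failing pair from random sampling - and that is nowhere near exhaustively covering the full space of numerator/denominator pairs.)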
[1]: https://en.wikipedia.org/wiki/Division_algorithm#SRT_divisio...
[2]: https://users.fmi.uni-jena.de/~nez/rechnerarithmetik_5/fdiv_...
I'm aware of most of those details as I took a keen interest in the matter at the time. I'm also aware of the argument for the use of said algorithm.
Whether one adopts this approach or not is a philosophical argument, and I just happen to believe it's bad (and ugly) engineering - and in this case witness the outcome: it cost Intel dearly in both monetary and PR terms.
Can you expand on this? I thought all FPUs used lookup tables? Even the 8087 had them.
This is nonsense. There's no functional difference between "lookup table" and "algorithm" (whatever that means) when it comes to a circuit design. Both are perfectly valid ways, nothing inherently wrong with either.
ah well
And the whole thing is built for a world where everybody is writing code in Ada. I bet some compiler makers were salivating at the prospect of collecting all of those huge license fees from developers.
Itanium held the idea that we could accurately predict ILP at compile time (when the halting problem clearly states that we cannot).
Transmeta said VLIW has the best theoretical PPA possible, so let's wrap that in a large, programmable JIT to analyze/optimize stuff to take advantage.
Modern CPUs run quite a bit closer to transmeta, but they largely use fixed-function hardware rather than being able to improve performance at a later time.
If we could nail down that ideal VLIW architecture, we could sell a given chip at various process sizes and then offer various paid "software" upgrades or compatibility packs for various ISAs to run legacy code.
At least it's a pipe dream worth looking into.
I don't know where these notions are coming from.
Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.
Further, most CPUs nowadays are doing instruction reordering to try and extract as much instruction-level parallelism as possible.
Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.
Side note: the instruction reordering actually poses a problem for parallel code. Language writers and compiler writers have to be extra careful about putting up "fences" to make sure a read or write isn't happening outside a critical section when it shouldn't be.
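A minimal C11 sketch of that kind of fence, using <stdatomic.h> (the flag/payload names are made up): without the release/acquire pairing, either the compiler or the CPU would be free to let the payload accesses drift across the flag.

    /* Publish a payload behind a flag with release/acquire ordering, so
     * neither the compiler nor the CPU reorders the payload write/read
     * across the flag. Names are illustrative. */
    #include <stdatomic.h>
    #include <stdbool.h>

    int payload;                   /* ordinary data */
    atomic_bool ready = false;     /* the "flag" */

    void producer(void)
    {
        payload = 42;                                   /* 1: write the data */
        atomic_store_explicit(&ready, true,
                              memory_order_release);    /* 2: publish; the store above can't sink below this */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                           /* spin; the read below can't hoist above this */
        return payload;                                 /* guaranteed to see 42 */
    }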
The weak memory model:
https://devblogs.microsoft.com/oldnewthing/20170817-00/?p=96...
Inability to address low-power designs:
https://en.m.wikipedia.org/wiki/StrongARM
"According to Allen Baum, the StrongARM traces its history to attempts to make a low-power version of the DEC Alpha, which DEC's engineers quickly concluded was not possible."
The other major problem with the Alpha was the high license costs of DEC operating systems, which greatly helped put it in the grave.