This matters in the way it interacts with the i-cache. On AArch64 with 64-byte cache lines, one cache line is always 16 instructions. On x86, that cache line could contain as few as 3 whole instructions. So unless your core can ingest more than one i-cache line per cycle (Intel cores currently cannot), you are limited.
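To make that arithmetic concrete, here is a rough sketch (mine, not from the article) that counts how many whole instructions fit in one 64-byte line. It assumes AArch64's fixed 4-byte encoding and x86-64's architectural 15-byte maximum instruction length; the function name and the brute-force loop over alignments are purely illustrative.

```python
LINE_BYTES = 64

def whole_instrs_in_line(instr_len, first_start):
    """Count same-length instructions that lie entirely inside one cache
    line, given the in-line offset where the first whole instruction
    starts (earlier bytes belong to an instruction straddling the line)."""
    return (LINE_BYTES - first_start) // instr_len

# AArch64: fixed 4-byte instructions, always aligned.
aarch64 = whole_instrs_in_line(4, 0)

# x86-64 worst case: every instruction is the 15-byte maximum, and the
# line may begin partway through an instruction (offsets 0..14).
x86_worst = min(whole_instrs_in_line(15, off) for off in range(15))

print(aarch64, x86_worst)  # 16 3
```

This is only the pathological bound. Ordinary x86 code uses much shorter instructions, so how often the limit bites in practice is a separate question.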
>Another oft-repeated truism is that x86 has a significant ‘decode tax’ handicap. ARM uses fixed length instructions, while x86’s instructions vary in length. Because you have to determine the length of one instruction before knowing where the next begins, decoding x86 instructions in parallel is more difficult. This is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs because in Jim Keller’s words:
>For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. … So fixed-length instructions seem really nice when you’re building little baby computers, but if you’re building a really big computer, to predict or to figure out where all the instructions are, it isn’t dominating the die. So it doesn’t matter that much.
>...
>Researchers agree too. In 2016, a study supported by the Helsinki Institute of Physics[2] looked at Intel’s Haswell microarchitecture. There, Hirki et al. estimated that Haswell’s decoder consumed 3-10% of package power. The study concluded that “the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.”
I always understood micro-ops to be fixed length.
Sure, you have to decode the variable length instructions at some point. But that extra work, relative to aarch64, is in practice amortized over the lifetime of that cache line.
Do Intel cores no longer have a μop cache in front of the L1i cache?
x86's approach to variable-length instructions is unfortunate.
In contrast, RISC-V leverages variable-length encoding to get the best code density among 64bit ISAs while sidestepping the instruction boundary problem.
(I digress, but note that while for the 32bit ISA RISC-V code density was competitive yet bested by ARM thumb2, it has since improved; RISC-V has the best density overall)
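For anyone wondering what "sidestepping the instruction boundary problem" looks like, here is a minimal sketch, assuming only the standard 16-bit (C extension) and 32-bit encodings: the low bits of the first 16-bit parcel give the length. The function names are made up for illustration, and the longer encodings the spec reserves are deliberately not handled.

```python
def rv_instr_len(first_parcel):
    """Length in bytes of a RISC-V instruction, from its first 16-bit parcel.
    Only covers 16-bit and 32-bit encodings (an assumption of this sketch)."""
    if first_parcel & 0b11 != 0b11:
        return 2                      # compressed (C extension)
    if (first_parcel >> 2) & 0b111 != 0b111:
        return 4                      # standard 32-bit encoding
    raise ValueError("longer-than-32-bit encoding; not handled here")

def boundaries(code):
    """Walk little-endian code bytes and return each instruction's offset."""
    offsets, pc = [], 0
    while pc < len(code):
        parcel = int.from_bytes(code[pc:pc + 2], "little")
        offsets.append(pc)
        pc += rv_instr_len(parcel)
    return offsets

# c.nop (0x0001), nop i.e. addi x0,x0,0 (0x00000013), c.nop
code = bytes.fromhex("0100" "13000000" "0100")
print(boundaries(code))  # [0, 2, 6]
```

The length check only looks at a handful of fixed bit positions, which is the property that makes it cheap to evaluate at many candidate offsets at once. An x86 decoder has to look at prefixes, opcode, ModRM and so on before it knows where the next instruction starts.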
Note that RISC-V's code density with the C extension is in bytes, not in number of instructions. The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it. High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.
As long as Intel can produce fast CPUs, with new features and while maintaining support for the existing binaries, everything is OK. Fixed or variable length, that's a matter for Intel engineers: users could, and should, care less.
It's kind of like saying x64 is limited by only having 16 registers - there are only names for 16ish in the ISA, but there are loads more registers in the machine as part of hiding latency.
Why not have just one, then?
After all, there's loads more registers in the machine as part of hiding latency.
The ISA either matters or it does not. Pick one.
How many CPUID flags exist? There are so many interdependencies, it’s even hard to say what even makes sense, without detailed knowledge. SSE without MMX? The reuse of floating point HW for other stuff is also a mess.
An x86 system is a witches’ brew of MSRs, I/O ports, and chipset-specific PCI devices. And that’s just across Intel CPUs…
How much code has to execute before even a bootloader can run?
Why do we need a damn ACPI interpreter?
Why do we still deal with legacy PCI routing (on all devices) when none actually use it?
The PCI configuration space is a bit of a mess. We should just make a new standard where everything is 64-bit and memory-mapped only.
Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!
Imagine if 1980s systems were shackled with 1940s-1960s compatibility concerns.
We need to start afresh — take all the learnings from the past decades and cast off the legacy crap.
> Why do we need a damn ACPI interpreter?
Because ACPI is still used and for good reason. The lack of an equivalent in virtually every other ISA ecosystem is enough to laugh them out of the room whenever anyone suggests they're a viable alternative.
No thanks. Keep your ridiculous blackbox that's actively hostile to having new software run on it.
Exactly. It does not matter what core architecture is being used. What matters is that each system usually has a different memory model, which completely defeats any compatibility.
Note that we would probably end up with something which has just as many problems, or maybe even worse ones. Rewrites are very hard, and you really need to know what you are doing to get everything right or even to make things better.
Does any modern x86 wake up knowing it's a modern x86 or do they all still wake up thinking they are 8086's and progressively wake up from a series of nested nightmares until it realizes the truth that it has more registers, 64-bits and SIMD instructions?
Yesterday I was looking at a server board and it had the two-digit POST display. I assume it is updated with those ancient OUT instructions and attached to the last vestiges of an 8-bit ISA PC bus running at 5v TTL levels.
Option ROMs still start in real mode (at least for non-UEFI). The system management (SMI) handlers are still launched in real mode too!
When the OS gets launched, it brings up the other CPUs in real mode (INIT/SIPI), and has to do all the same gymnastics again…
Even PCI configuration space access is still supported with IO ports, and that mode is still required for some aspects.
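For reference, the IO-port path here is the legacy "configuration mechanism #1": you write a 32-bit address to port 0xCF8 and then read or write the selected register through port 0xCFC. Below is a small sketch of how that address word is packed. The function name is mine, and the actual port I/O (privileged `out`/`in` instructions) is deliberately left out.

```python
CONFIG_ADDRESS = 0xCF8   # 32-bit I/O port: selects bus/device/function/register
CONFIG_DATA = 0xCFC      # 32-bit I/O port: accesses the selected register

def pci_config_address(bus, device, function, offset):
    """Build the 32-bit value that would be written to CONFIG_ADDRESS."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not (0 <= offset < 256) or offset % 4:
        raise ValueError("offset must be a dword-aligned value below 256")
    # Bit 31 is the enable bit; the rest is bus, device, function, register.
    return (1 << 31) | (bus << 16) | (device << 11) | (function << 8) | offset

# Vendor/device ID register (offset 0) of bus 0, device 31, function 3.
print(hex(pci_config_address(0, 31, 3, 0)))  # 0x8000fb00
```

Note the 8-bit register offset: this mechanism only reaches the first 256 bytes of a function's configuration space.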
Even today, the serial port still uses IO ports and interrupts just as it did 40+ years ago…
Yep, IO based POST codes are still a thing…
This imo is one of the biggest advantages of x86 currently, at least as a hobbyist. Compare ARM-based computers (like the Raspberry Pi, for example), where the boot process is different for each device and usually involves proprietary binaries that the user has no clue how they work.
In comparison, you could re-use, update, and repurpose any old x86 machine to do whatever you need.
Personally I think ditching the old real mode systems will be a big boon to hobbyists, not a hindrance (sorry mode 13h users)! Linux/x86 Boot Protocol docs tentatively support this assertion (https://www.kernel.org/doc/html/v6.9-rc1/arch/x86/boot.html). What is helpful is having ACPI and UEFI and other conventions/standards in place.
ARM does have Base Boot Requirements (https://arm-software.github.io/ebbr/), which build to something vaguely x86-like, but wow, there are so many systems that still use hardcoded device trees. I haven't spent that much time, but just a couple hours is all it takes to figure out that uncompressing your dtb, applying an overlay, and recompiling a new dtb is awful & terrible & no way to compute. ARM is used so heavily in consumer devices that it's hard to see what would compel the greater ecosystem to do the right thing, to reform. I also can go read a deck like https://uefi.org/sites/default/files/resources/UEFI%20and%20... (UEFI and ACPI in Arm System Architecture) and appreciate, yeah, well, trying to be compatible & a good citizen is hard; there are specifications on top of specifications on top of specifications (Wei lists 17) to make it happen. x86 has benefited from a history of everyone tending towards intercompatibility, but there's nothing else in computing that's ever had such a strong overriding cooperation motive before.
OK- there were some SPARC servers but that was a while ago - and they were honestly never any fun.
I haven't paid any attention to new PC hardware other than the RasPi in the last 5 years or so, and I've always ignored Apple, so I was really not expecting that much progress!
[Mine is an M2 16-inch MBP from last year, perhaps the M3's are somewhat better?]
In comparison, my new Dell work laptop with an Intel chip gives me about 4 hours. It's not an apples <-> apples comparison but they're in the ballpark.
It's impressive. Nothing on the market comes even close.
An ISA is the interface between hardware and software. Thus a complex ISA does impose complexity upon both the hardware and the software.
Complexity is inherently (very) bad, and thus needs strong justification.
RISC embodies this idea by recognizing the value of simplicity and requiring any ISA addition be weighted against its complexity cost.
Implementations of RISC philosophy ISAs demonstrate (by achieving or even surpassing parity) the complexity in x86 is not justified, and this is why there hasn't been any tabula rasa CISC architecture worth noting in several decades.
My guess is the most important thing for chip performance is the manufacturing process. After that, it's things like pipelining, branch predictors, superscalar design, etc. (I am not an expert and this is just a guess). I don't think instruction set really matters that much when chips have billions of transistors.
RISC was a great idea in the 1970s because a more complex instruction set meant fewer transistors for performance improvements. The same was also true in the 1980s. By 1995-1996, the Pentium Pro was the fastest 32-bit chip. At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip. They never did. Instead, they keep on banging on the "RISC is better" drum without supplying better chips.
It’s also hard to separate that from other factors: was the Pentium more successful than the PowerPC because of CISC or because Intel had much better fabs than Motorola? If Motorola, IBM, DEC, or HP had had less incompetent management at the time it’s possible that we might remember this period very differently.
Citation needed.
>Why aren't we using them today?
Because we are using better chips made recently, not the ones from the 80s and 90s.
>Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips?
Because performance vs cost in the current market, as well as access to x86 software moat.
But this is changing. Notably, Amazon has Graviton, Microsoft has Windows for ARM with grease for x86 software, and Google has a digital design team, which is already iterating on RISC-V based accelerators.
Facebook, the FAANG member you have not mentioned, has its own RISC-V server effort.
>I don't think instruction set really matters that much
And yet, you're writing this very opinionated comment about ISAs.
>RISC was a great idea
Yes, it was. This is why the industry never again made a tabula rasa CISC ISA.
>At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip.
The RISC chips actually were faster. But this did not matter, as Intel had the better fabs, and the cash.
So Intel was able to brute-force its way into enough performance, cheaply enough, that the market would not bother going through the pain of switching ISAs.
>without supplying better chips.
The chips were better despite Intel's fab advantage. But they were not cheaper, nor did they run the software the market wanted to run.
They sure sold these Pentiums, and were able to buy (and kill) Alpha later.
The one and only reason x86 survives to date is this software moat.
This moat advantage is in danger now, thanks to Microsoft's efforts to detach Windows from x86 and provide emulation to handle the transition like Apple did.
For a long time everything else was stuck with custom built images for every device.
Like today with ARM boards, where each OS is custom-bent to each board.
And because there weren't any new names, everyone got stuck in this never-ending RISC vs CISC debate.
As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar. And the real answer is that both designs are neither RISC nor CISC (the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).
So what is this unnamed uarch pattern?
Mitch Alsup (who dwells on the comp.arch newsgroup) calls them GBOoO (Great Big Out-of-Order). And I quite like that name, guess I just need to convince everyone else in the computer software and computer hardware industry to adopt it too.
The GBOoO design pattern focuses on Out-of-Order execution to a somewhat insane degree.
They have massive reorder buffers (or similar structures) which allow hundreds of instructions to be in flight at once, with complex schedulers tracking dependencies so they can dispatch uops to their execution units as soon as possible. Most designs today can dispatch at least 8 uops per cycle, and I've seen one design capable of reaching peaks of 14 uops dispatched per cycle.
To feed this out-of-order monster, GBOoO designs have complex frontends. Even the smallest GBOoO designs can decode at least three instructions per cycle. Apple's latest CPUs in the M1/M2 can decode eight instructions per cycle. Alternatively, uop caches are common (especially on x86 designs, but some ARM cores have them too), bypassing any instruction decoding bottlenecks.
GBOoO designs are strongly reliant on accurate branch predictors. With hundreds of instructions in flight, the frontend is often miles ahead of the finalised instruction pointer. That in-flight instruction window might cross hundreds of loop iterations, or cover a dozen function calls/returns. Not only do these branch predictors reach high levels of accuracy (usually well above 90%) and track and predict complex patterns and indirect patterns, but they can also predict multiple branches per cycle (for really short loops).
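To give a feel for where a number like 90% comes from, here is a sketch of the classic textbook 2-bit saturating counter. This is not what modern GBOoO predictors use (they are far more elaborate); it is the simplest scheme that already does well on loop branches. The workload below is hypothetical.

```python
def simulate_two_bit(outcomes, state=2):
    """Run a single 2-bit saturating counter over a list of branch outcomes
    (True = taken). States 0-1 predict not-taken, 2-3 predict taken.
    Returns the number of correct predictions."""
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += predicted_taken == taken
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# Hypothetical workload: a loop branch taken 9 times, then falling through,
# repeated 100 times.
pattern = ([True] * 9 + [False]) * 100
hits = simulate_two_bit(pattern)
print(f"{hits}/{len(pattern)} correct")  # 900/1000 correct
```

The only miss is the loop exit each time round. Getting well above this means also predicting the exit, which is where history-based predictors come in.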
Why do GBOoO designs aim for such insane levels of out-of-order execution? Partly it's about doing more work in parallel. But the primary motivation is memory latency hiding. GBOoO designs want to race forwards and find memory load instructions as soon as possible, so they can be sent off to the complex multi-level cache hierarchy that GBOoO designs are always paired with.
If an in-order uarch ever misses the L1 cache, then the CPU pipeline is guaranteed to stall. Even if an L2 cache exists, it's only going to minimise the length of the stall.
But because GBOoO designs issue memory requests so early, there is a decent chance the L2 cache (or even L3 cache) can service the miss before the execution unit even needed that data (though I really doubt any GBOoO design can completely bridge a last-level cache miss).
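A back-of-the-envelope model of that difference, with made-up numbers. It assumes every load is independent of the others and ignores everything except miss latency, so treat it as an illustration of memory-level parallelism rather than a model of any real core.

```python
import math

def cycles_blocking(n_misses, miss_latency):
    """In-order, blocking core: every miss stalls the pipeline in turn."""
    return n_misses * miss_latency

def cycles_overlapped(n_misses, miss_latency, max_in_flight):
    """Idealised out-of-order core: up to max_in_flight independent
    misses are outstanding at once, so they are served in batches."""
    return math.ceil(n_misses / max_in_flight) * miss_latency

# Made-up numbers: 64 independent loads, 40-cycle miss, 8 misses in flight.
print(cycles_blocking(64, 40), cycles_overlapped(64, 40, 8))  # 2560 320
```

Dependent loads (pointer chasing) get none of this benefit, since the next address is not known until the previous load returns.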
Where did GBOoO come from?
From what I can tell, the early x86 out-of-order designs (Intel's Pentium Pro/Pentium II, AMD's K6/K7) were the first to stumble on this GBOoO uarch design pattern, or at least the first mass-market designs. I'm not 100% sure these early examples fully qualify as GBOoO: they only had reorder buffers large enough for a few dozen instructions, and the designers were drawn to the pattern because GBOoO's decoupled frontend and backend allowed them to push through bottlenecks caused by x86's legacy CISC instruction set.
But as the designs evolved (let's just ignore Intel's misadventures with NetBurst), the x86 designs of the mid-2000s (like the Core 2 Duo) were clearly GBOoO, taking full advantage of GBOoO's ability to hide memory latency. By 2010, we were starting to see ARM cores that were clearly taking notes and switching to GBOoO-style designs.
So far so good.
>And the real answer is that both designs are neither RISC nor CISC
This is... Not even wrong.
>(the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).
Exactly. CISC and RISC are characteristics of the ISA, not the microarchitecture.
But note (and I can't stress this enough), this does not mean ISA doesn't matter.
The ISA is the interface between software and hardware. A well designed ISA will e.g.:
- Not restrict the actual design of the microarchitecture.
- Not expose microarchitecture artifacts.
- Not force unjustified complexity into the microarchitecture nor the software.
It’s worth remembering that the surviving CISC architectures (x86 and z/370) are less CISCy than VAX and 68k were, in terms of number of address operands and complexity of addressing modes. And ARM is not a classically RISCy RISC. Instruction sets seem to have converged on a pragmatic middle ground — except for RISCV :-)
Apple did an excellent job with Rosetta 2 in most cases. It has its limitations since it's not 100% or sufficiently general as to replicate a Windows PC.
One approach that didn't work so well was Transmeta with VLIW and pouring resource-costly optimizations into the compiler.
All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of the macro ISA running on some particular micro ISA. We don't have single-cycle, non-pipelined RISC or low cycle efficiency, hyper deep, 6 GHz PC processors anymore for good reason... they've been supplanted by a series of incremental computer organizational design approaches due to healthy competition. Now we have low energy ARM, blinding-fast laptops from the top 3 vendors, and ridiculous server metal like the 9754 and the 9474F for 6 TiB 2P systems.
This ignores that performance is not everything.
Complexity doesn't just mean more effort; it also translates into an increase of bugs.
In truth ARM doesn't actually present any real gains (efficiency or otherwise) over x86 in pretty much any space as an inherent consequence of its ISA. The narrative has its origins in the principal market where ARM found widespread success: microprocessors and extremely low-end processors like those found in handheld gaming devices and eventually phones. Naturally these processors were designed to sip voltage by way of not actually pushing a whole lot of numbers. The market matured and so did the architecture, and we started to see cellphones that could really sling their weight! It's all smoke and mirrors though, as even in TYOOL 2024 the moment you do something intensive that your phone does not have a hardware accelerator for (e.g., compiling software) it becomes apparent the thing you're holding in your hand is about as good as a Core 2 Duo when it comes to crunching numbers with a lot of branches.
Then of course Apple came along and brought ARM back to the desktop space after decades of being relegated to power-sippers. People's jaws hit the floor over a chip that doesn't actually perform any better than its AMD counterparts, because oh hey! 20 hour battery life! Well, actually it's 11 hours if you're doing really light web browsing, and only in Safari. Well hey, that's slightly better than the comparable laptop chips in the x86 family released around the same time, right? And indeed, that's a few hours more than I got in my first gen AMD T14, which by all metrics is close enough to the M1 chip in my MBP. But you know, the further I dug, the more I found out that the battery life was about comparable when the system actually has to start doing work. Long video calls in Zoom? That was about 4 hours on both. Heavy use of Firefox? 6 hours each. Lots of compiling and a resource-heavy dev environment with a fat C++ language server? Again, about 4 hours on both.
The battery gains weren't from some mystical discrepancy between how instructions are decoded on the two chips. In the end it just came down to the fact that, as with all operating systems, the great M1 battery life was owed to really cute power management drivers in Apple's operating system (as well as a fair number of extensions that make things like video decode and JavaScript execution draw less power). It's a fact that becomes all the more apparent when, while following the Asahi devblog, during the watershed moment of actually getting Linux to bootstrap itself on Apple silicon, I read that the kernel's main loop doing absolutely nothing chewed through the whole battery in less than 3 hours. That sounds about right to me, given every single power draw experiment I've read between 2020 and now indicates the M1's power draw isn't really as great as the hype machine has made it out to be.
I'm sure we're all at the edge of our seats hoping to experience an efficiency revolution, and that this can be given to us with this one weird trick of changing ISAs... but it's not happening. At least not with ARM, or RISC-V or any other contemporary architecture that isn't x86.
Far be it from me to defend Intel or their Frankenstein's monster architecture, but I've gotten a bit tired of this dream that we're on the cusp of performing supertasks in an instant, at the cost of mere picowatts. Especially when it inadvertently pushes us towards a future of SoC lock-in where hobbyist bare metal development becomes an even bigger waking nightmare. Until then, I sincerely hope x86 never dies, even though saying so kills a piece of me inside.
You seem to miss that RISC-V being better is beside the point.
RISC-V's massive success was unavoidable, due to its open specification and free license.
The ISA did not need to even be good, just decent.
A current x64 chip is a dozen or so separate dies with eight or so x64 cores per die, with a couple of those in different sockets. When one thread on one core decides to write to a cache line, the memory model makes really strong guarantees about cores on some other socket noticing that change.
Arm doesn't have to go with total store order. GPUs involve distinct blocks of memory with their own invariants on when caches are invalidated at potentially very coarse granularity (like no change will be seen until after a kernel has finished executing, where a kernel is essentially a process that sprung to life and then did arbitrary amounts of maths).
Fast x64 code is prone to carefully partitioning the problem across different cores and trying not to hit a cache from another core but even then you still have something like MOESI sitting in the background waiting just in case some thread mutates the instructions executing on another one.
the problem is that the article relies too much on bad arguments such as "ISA differences were swept aside by the resources a company could put behind designing a chip". this is a dangerously bad argument. the fact that a company can afford to and is willing to keep x86 afloat and competitive with massive resources is not an argument for dismissal, but for its economic usefulness.
the value to a legacy ISA is real. but the cost is complexity. this complexity drives $, silicon area, power. period. in an even drag race, the simpler more efficient design would definitely win.
and a load-store architecture is simpler than a complex addressing scheme and will have more throughput per clock with less resources (design work, validation, $, area, power). a fixed or simple variable length opcode is always going to be simpler to implement than x86!
but on the other hand, a lot of those massive costs (NRE) are sunk costs. others are not.
so there's nothing wrong with x86 at the moment - it's still clearly the cheapest (due to scale) and fastest (definitely per/$) and excepting M2/M3 also fastest absolute per-core.
it certainly doesn't need to "die".
part of the advantage is inertia. and that's part of the disadvantage too. it's just barely starting to look like the trajectory is changing. but by the time the overall economics of ARM, RISC-V, etc. begin to overtake x86, the inertia and cruft will be negatively affecting them too.
things get old and die on their own. there's no stopping it. but in this case it doesn't need to be hastened and it won't be a "happy" moment when the changing of the guard does happen.