Why x86 doesnt need to die

(chipsandcheese.com)

78 points · ylk1 · 2y ago · 67 comments

45 comments · 12 top-level

dmitrygr2y ago· 12 in thread

This misses an important bit: parallel decoding of instructions. It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4.

This matters in the way it interacts with i-cache. In aarch64 with 64-byte cache lines, one cache line is 16 instrs. always. In x86 that cache line could contain only 3 whole instrs. So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.
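To make the cache-line arithmetic above concrete, here is a rough Python sketch. It assumes a 64-byte line, uniform instruction lengths, and x86's architectural maximum of 15 bytes per instruction as the worst case; the function name and the `start_offset` parameter are made up for illustration.

```python
def whole_instructions_in_line(instr_len, line_size=64, start_offset=0):
    """Count uniform-length instructions that lie entirely inside one cache line.

    start_offset is the byte at which the first instruction beginning in this
    line starts (0 means it is aligned to the start of the line). Hypothetical
    helper, not taken from any real tool.
    """
    usable = line_size - start_offset
    return usable // instr_len


print(whole_instructions_in_line(4))                    # aarch64, fixed 4-byte instrs: 16
print(whole_instructions_in_line(15))                   # x86 worst case, line-aligned: 4
print(whole_instructions_in_line(15, start_offset=5))   # x86 worst case, misaligned: 3
```

The "3 whole instrs" figure appears to correspond to the misaligned case, where an instruction from the previous line spills into this one.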

tester7562y ago

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

>Another oft-repeated truism is that x86 has a significant ‘decode tax’ handicap. ARM uses fixed length instructions, while x86’s instructions vary in length. Because you have to determine the length of one instruction before knowing where the next begins, decoding x86 instructions in parallel is more difficult. This is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs because in Jim Keller’s words:

>For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. … So fixed-length instructions seem really nice when you’re building little baby computers, but if you’re building a really big computer, to predict or to figure out where all the instructions are, it isn’t dominating the die. So it doesn’t matter that much.

>...

>Researchers agree too. In 2016, a study supported by the Helsinki Institute of Physics[2] looked at Intel’s Haswell microarchitecture. There, Hirki et al. estimated that Haswell’s decoder consumed 3-10% of package power. The study concluded that “the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.”

dmitrygr2y ago

I did not talk about power - i talked about perf. No modern x86 chip can decode 6 or 7 of these long instrs per cycle. there are aarch64 chips that can

watersb2y ago

I believe that modern x86 processors store decoded micro-ops in the I$ instruction cache.

I always understood micro-ops to be fixed length.

Sure, you have to decode the variable length instructions at some point. But that extra work, relative to aarch64, is in practice amortized over the lifetime of that cache line.

LegionMammal9782y ago

> So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.

Do Intel cores no longer have a μop cache in front of the L1i cache?

snvzz2y ago

>It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4

x86's approach to variable-length instructions is unfortunate.

In contrast, RISC-V leverages variable-length encoding to get the best code density among 64bit ISAs while sidestepping the instruction boundary problem.

(I digress, but note that while for the 32bit ISA RISC-V code density was competitive yet bested by ARM thumb2, it has since improved; RISC-V has the best density overall)

Findecanor2y ago

The length of a RISC-V instruction is in the first byte though, not the tenth.

Note that RISC-V's code density with the C extension is in bytes, not in number of instructions. The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it. High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.
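As a rough illustration of the point about the first byte, here is a minimal Python sketch of length detection for RISC-V. It assumes the standard encoding rule (low two bits not `11` means a 16-bit compressed instruction; otherwise bits [4:2] not `111` means a 32-bit instruction) and deliberately does not decode the longer encoding space. The function name is hypothetical.

```python
def riscv_instr_length(first_byte):
    """Return the instruction length in bytes, given the lowest-addressed byte.

    Returns None for the longer-than-32-bit encoding space, which this
    sketch does not attempt to decode.
    """
    if (first_byte & 0b11) != 0b11:
        return 2      # 16-bit compressed instruction (C extension)
    if (first_byte & 0b11100) != 0b11100:
        return 4      # standard 32-bit instruction
    return None       # longer encodings: out of scope for this sketch


print(riscv_instr_length(0x13))  # low byte of an ADDI-style opcode
print(riscv_instr_length(0x01))  # low byte of a compressed instruction
```

Because the answer depends only on a few bits of one byte, a decoder can work out candidate instruction boundaries for every 2-byte slot in parallel.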

pif2y ago

I think you are missing the only point of the article: performance and compatibility are important; everything else is just aesthetics.

As long as Intel can produce fast CPUs, with new features and while maintaining support for the existing binaries, everything is OK. Fixed or variable length, that's a matter for Intel engineers: users could, and should, care less.

aurareturn2y ago

Most important applications have an ARM version now. Especially true since Apple Silicon and AWS Graviton. Windows will force developers to compile both x86 and ARM versions.

JonChesterfield2y ago

It's a nice theory but I don't think it holds up. X64 executes from a micro op cache and there's no particular reason to expect the ops in that to be variable length encoded. Thus it only goes to the i-cache when that misses, at which point you've spent long enough digging around in the cache that the extra decoding probably doesn't matter.

It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

snvzz2y ago

>It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

Why not have just one, then?

After all, there's loads more registers in the machine as part of hiding latency.

The ISA either matters or it does not. Pick one.

account4mypc2y ago

usually the really fat instructions take over 1 cycle anyway, right? so the decoder should be able to keep up

dmitrygr2y ago

pipelining...

they are usually pipelineable

BobbyTables22y ago· 5 in thread

The problem is not the ISA — it’s the whole ecosystem.

How many CPUID flags exist? There are so many interdependencies, it’s even hard to say what even makes sense, without detailed knowledge. SSE without MMX? The reuse of floating point HW for other stuff is also a mess.

An x86 system is a witches’ brew of MSRs, I/O ports, and chipset-specific PCI devices. And that’s just across Intel CPUs…

How much code has to execute before even a bootloader can run?

Why do we need a damn ACPI interpreter?

Why do we still deal with legacy PCI routing (on all devices) when none actually use it?

The PCI configuration space is a bit of a mess. We should just make a new standard where everything is 64-bit and memory-mapped only.

Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

Imagine if 1980s systems were shackled with 1940s-1960s compatibility concerns.

We need to start afresh — take all the learnings from the past decades and cast off the legacy crap.

johnnyjeans2y ago

Most of these gripes exist in all modern ecosystems as a consequence of heterogeneous and expansive markets of computer hardware. SoC-land is much, much worse. As an exercise, I'd recommend you try porting 9front to a random unsupported SBC. Personally, after a month of reverse engineering and running into really fun and cool hardware bugs, I gave up.

> Why do we need a damn ACPI interpreter?

Because ACPI is still used and for good reason. The lack of an equivalent in virtually every other ISA ecosystem is enough to laugh them out of the room whenever anyone suggests they're a viable alternative.

No thanks. Keep your ridiculous blackbox that's actively hostile to having new software run on it.

TheLoafOfBread2y ago

> SoC-land is much, much worse

Exactly. It does not matter what core architecture is being used. What matters is that each system usually has a different memory model, which completely defeats any compatibility.

StressedDev2y ago

We will when you are willing to pay for the massive cost of redoing everything which already works well. Also, are you going to buy everyone new hardware for your new better architecture?

Note that we would probably end up with something which has just as many problems or maybe even worse. Rewrites are very hard and you really need to know what you are doing to get everything right or even to make things better.

rbanffy2y ago

> Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

Does any modern x86 wake up knowing it's a modern x86 or do they all still wake up thinking they are 8086's and progressively wake up from a series of nested nightmares until it realizes the truth that it has more registers, 64-bits and SIMD instructions?

Yesterday I was looking at a server board and it had the two-digit POST display. I assume it is updated with those ancient OUT instructions and attached to the last vestiges of an 8-bit ISA PC bus running at 5v TTL levels.

BobbyTables22y ago

Today they still boot in 16-bit real mode. Very quickly, it usually switches to 32bit mode for the BIOS, maybe eventually 64bit mode.

Option ROMs still start in real mode (at least for non-UEFI). The system management (SMI) handlers are still launched in real mode too!

When the OS gets launched, it brings up the other CPUs in real mode (INIT/SIPI), and has to do all the same gymnastics again…

Even PCI configuration space access is still supported with IO ports and that mode is still required for some aspects.

Even today, the serial port still uses IO ports and interrupts just as it did 40+ years ago…

Yep, IO based POST codes are still a thing…

robotnikman2y ago· 4 in thread

> x86-64 CPUs keep real mode around so that operating systems can keep booting in the same way ... It’s part of the PC compatibility ecosystem that gives x86 CPUs unmatched compatibility and longevity.

This imo is one of the biggest advantages of x86 currently, at least as a hobbyist. Compare ARM based computers (like the raspberry pi for example), where the boot process is different for each device and usually involves proprietary binaries whose workings the user has no clue about.

In comparison, you could re-use, update, and repurpose any old x86 machine to do whatever you need.

yjftsjthsd-h2y ago

The really annoying thing is that we're so close to doing better - openfirmware is decades old, and if we must throw that away UEFI is in fact portable; we could have UEFI ARM machines with nice normal busses that the OS can enumerate and boot just like x86. But, y'know, that would cost another 10 cents a board so we get to live with the current trash. (I mean, this is even a thing that we do use to boot VMs and Windows on ARM, and AIUI ex. https://libre.computer/ does use UEFI firmware, the adoption is just super limited)

the_panopticon2y ago

https://www.intel.com/content/www/us/en/developer/articles/t...

jauntywundrkind2y ago

Opened comments to make sure this was mentioned, X86S (formerly X86-S or X86-Simplified). Getting rid of all the old compat modes & booting straight to 64-bit. See: Intel Continues Prepping The Linux Kernel For X86S, https://www.phoronix.com/news/Linux-6.9-More-X86S . Also mentioned by chipsandcheese: Of course, compatibility can’t be maintained forever. ISAs have to evolve. AMD and Intel probably want to save some money by reducing the validation work needed to support real mode. Intel is already planning to drop real mode.

Personally I think ditching the old real mode systems will be a big boon to hobbyists, not a hindrance (sorry mode 13h users)! Linux/x86 Boot Protocol docs tentatively support this assertion (https://www.kernel.org/doc/html/v6.9-rc1/arch/x86/boot.html). What is helpful is having ACPI and UEFI and other conventions/standards in place.

ARM does have Base Boot Requirements (https://arm-software.github.io/ebbr/), that builds to something vaguely x86 like, but wow there's so many systems that still use hardcoded device trees. I haven't spent that much time, but just a couple hours is all it takes to figure out that uncompressing your dtb, applying an overlay, and recompiling a new dtb is awful & terrible & no way to compute. ARM is used so heavily in consumer devices that it's hard to see what would compel the greater ecosystem to do the right thing, to reform. I also can go read a deck like https://uefi.org/sites/default/files/resources/UEFI%20and%20... (UEFI and ACPI in Arm System Architecture) and appreciate, yeah, well, trying to be compatible & a good citizen is hard; there's specifications on top of specifications on top of specifications (Wei lists 17) to make it happen. x86 has benefited from a history of everyone tending towards intercompatibility, but there's nothing else in computing that's ever had such a strong overriding cooperation motive before.

hackeraccount2y ago

That's the best part of x86-64. From a laptop to a large number of enterprise servers (or at least any I use) you're essentially dealing with the same stuff. I think Pi's were the first non x86-64 architecture I'd dealt with to any degree.

OK- there were some SPARC servers but that was a while ago - and they were honestly never any fun.

korginator2y ago· 3 in thread

Ignoring the power-guzzling data centres running Xeons at work, and talking as a layperson using a laptop, my older Intel MacBook Pro gives me 2-3 hours of battery life and heats up like a toaster, while my M2 MacBook Pro runs cool, and lasts a couple of days under moderate to heavy use before going flat. That's a huge win for me.

eternityforest2y ago

days of heavy use??? That's really impressive!

I haven't paid any attention to new PC hardware other than the RasPi in the last 5 years or so, and I've always ignored Apple, so I was really not expecting that much progress!

korginator2y ago

Moderate to heavy use. Not 100% heavy use. The official specs claim 15-18 hours battery life. Practically, I'd say I do about 8 hours a day and the Mac lasts a couple of days before the battery goes to single-digit percentages. With lighter use, I see nearly a week of battery life.

[Mine is an M2 16-inch MBP from last year, perhaps the M3's are somewhat better?]

In comparison, my new Dell work laptop with an Intel chip gives me about 4 hours. It's not an apples <-> apples comparison but they're in the ballpark.

rbanffy2y ago

Same here. Disconnected the charger when moving stuff in the office, worked a full day and halfway through the next one the laptop started complaining.

It's impressive. Nothing on the market comes even close.

snvzz2y ago· 3 in thread

Most CISC proponents (armchair digital architects) entirely miss the point.

An ISA is the interface between hardware and software. Thus a complex ISA does impose complexity upon both the hardware and the software.

Complexity is inherently (very) bad, and thus needs strong justification.

RISC embodies this idea by recognizing the value of simplicity and requiring any ISA addition be weighed against its complexity cost.

Implementations of RISC philosophy ISAs demonstrate (by achieving or even surpassing parity) the complexity in x86 is not justified, and this is why there hasn't been any tabula rasa CISC architecture worth noting in several decades.

StressedDev2y ago

I am going to be blunt. If CISC is so bad, why did almost all of the RISC chips from the 1980s and 1990s fail? Why aren't we using them today? The closest you will get is ARM chips. If you are going to claim RISC is fundamentally better, why aren't the fastest and most power efficient chips RISC chips? Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips? It's not because they love the architecture. It's because x64 chips are the best in terms of cost, power usage, and performance.

My guess is the most important thing for chip performance is the manufacturing process. After that, it's things like pipelining, branch predictors, superscalar design, etc. (I am not an expert and this is just a guess). I don't think instruction set really matters that much when chips have billions of transistors.

RISC was a great idea in the 1970s because a more complex instruction set meant fewer transistors for performance improvements. The same was also true in the 1980s. By 1995-1996, the Pentium Pro was the fastest 32-bit chip. At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip. They never did. Instead, they keep on banging on the "RISC is better" drum without supplying better chips.

acdha2y ago

One thing to remember is that chips are complex and defy simple binary classification. Even Intel thought that CISC was on the way out, although they were going down a somewhat extreme path with EPIC, but the most successful approach turned out to be a hybrid where complex CISC instructions were broken into RISC-like micro-ops. That got Intel back in the game with the Pentium Pro getting close enough to the DEC Alpha’s performance lead and with the advantage of not having to recompile everything in an era where that was orders of magnitude harder than it is now. I wouldn’t say either side won since that has been going back and forth for decades now.

It’s also hard to separate that from other factors: was the Pentium more successful than the PowerPC because of CISC or because Intel had much better fabs than Motorola? If Motorola, IBM, DEC, or HP had had less incompetent management at the time it’s possible that we might remember this period very differently.

snvzz2y ago

>why did almost all of the RISC chips from the 1980s and 1990s fail?

Citation needed.

>Why aren't we using them today?

Because we are using better chips made recently, not the ones from the 80s and 90s.

>Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips?

Because performance vs cost in the current market, as well as access to x86 software moat.

But this is changing. Notably, Amazon has Graviton, Microsoft has Windows for ARM with grease for x86 software, and Google has a digital design team, which is already iterating RISC-V based accelerators.

Facebook, the FAANG member you have not mentioned, has its own RISC-V server effort.

>I don't think instruction set really matters that much

And yet, you're writing this very opinionated comment about ISAs.

>RISC was a great idea

Yes, it was. This is why the industry never again made a tabula rasa CISC ISA.

>At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip.

The RISC chips actually were faster. But this did not matter, as Intel had the better fabs, and the cash.

So Intel was able to brute-force its way into enough performance for cheap enough that the market would then not bother going through the pain of switching ISAs.

>without supplying better chips.

The chips were better despite Intel's fab advantage. But they were not cheaper, nor did they run the software the market wanted to run.

They sure sold these Pentiums, and were able to buy (and kill) Alpha later.

The one and only reason x86 survives to date is this software moat.

This moat advantage is in danger now, thanks to Microsoft's efforts to detach Windows from x86 and provide emulation to handle the transition like Apple did.

eternityforest2y ago· 2 in thread

The nice thing about x86 is that it's so standardized. Everyone has been copying each other since the IBM PC era.

For a long time everything else was stuck with custom built images for every device.

TheLoafOfBread2y ago

> For a long time everything else was stuck with custom built images for every device.

Like today with ARM boards, where each OS is custom-bent to each board.

rbanffy2y ago

I gather many of those are mainlined now. Bootloaders might still be needed, but I would imagine the bulk of the customisations for a couple "major" platforms is sorted out.

phire2y ago· 2 in thread

One of the reasons why the RISC vs CISC debate keeps popping up every few years, is that we kind of stopped naming CPU uarches after the CISC and RISC terminology was introduced in the late 70s.

And because there weren't any new names, everyone got stuck in this never ending RISC vs CISC debate.

As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar. And the real answer is that both designs are neither RISC nor CISC (the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).

So what is this unnamed uarch pattern?

Mitch Alsup (who dwells on the comp.arch newsgroup) calls them GBOoO (Great Big Out-of-Order). And I quite like that name, guess I just need to convince everyone else in the computer software and computer hardware industry to adopt it too.

The GBOoO design pattern focuses on Out-of-Order execution to a somewhat insane degree.

They have massive reorder buffers (or similar structures) which allow hundreds of instructions to be in-flight at once, with complex schedulers tracking dependencies so they can dispatch uops to their execution units as soon as possible. Most designs today can dispatch at least 8 uops per cycle, and I've seen one design capable of reaching peaks of 14 uops dispatched per cycle.

To feed this out-of-order monster, GBOoO designs have complex frontends. Even the smallest GBOoO designs can decode at least three instructions per cycle. Apple's latest CPUs in the M1/M2 can decode eight instructions per cycle. Alternatively, uop caches are common (especially on x86 designs, but some ARM cores have them too), bypassing any instruction decoding bottlenecks.

GBOoO designs are strongly reliant on accurate branch predictors. With hundreds of instructions in flight, the frontend is often miles ahead of the finalised instruction pointer. That in-flight instruction window might cross hundreds of loop iterations, or cover a dozen function calls/returns. Not only do these branch predictors reach high levels of accuracy (usually well above 90%) and track and predict complex patterns and indirect patterns, but they can actually predict multiple branches per cycle (for really short loops).

Why do GBOoO designs aim for such insane levels of Out-of-Order execution? Partly it's about doing more work in parallel. But the primary motivation is memory latency hiding. GBOoO designs want to race forwards and find memory load instructions as soon as possible, so they can be sent off to the complex multi-level cache hierarchy that GBOoO designs are always paired with.

If an in-order uarch ever misses the L1 cache, then the CPU pipeline is guaranteed to stall. Even if an L2 cache exists, it's only going to minimise the length of the stall.

But because GBOoO designs issue memory requests so early, there is a decent chance the L2 cache (or even L3 cache) can service the miss before the execution unit even needed that data (though I really doubt any GBOoO design can completely bridge a last-level cache miss).
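A toy Python model of the latency-hiding argument above, for illustration only. It assumes every load is independent, every load misses with the same latency, and nothing else limits the core; `max_outstanding` is a made-up stand-in for how many misses the out-of-order window can keep in flight at once.

```python
import math


def in_order_stall_cycles(num_misses, miss_latency):
    # An in-order pipeline stalls for the full latency of each miss in turn.
    return num_misses * miss_latency


def overlapped_stall_cycles(num_misses, miss_latency, max_outstanding):
    # Independent misses issued early overlap in batches of max_outstanding,
    # so each batch costs roughly one miss latency.
    return math.ceil(num_misses / max_outstanding) * miss_latency


print(in_order_stall_cycles(8, 100))        # 800 cycles of stall
print(overlapped_stall_cycles(8, 100, 4))   # 200 cycles of stall
```

Real cores are far messier (dependent loads, limited miss buffers, prefetchers), so treat the numbers as a picture of the idea rather than a prediction.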

Where did GBOoO come from?

From what I can tell, the early x86 Out-of-order designs (Intel's Pentium Pro/Pentium II, AMD's K6/K7) were the first to stumble on this GBOoO uarch design pattern. Or at least the first mass-market designs. I'm not 100% sure these early examples fully qualify as GBOoO; they only had reorder buffers large enough for a few dozen instructions, and the designers were drawn to the pattern because GBOoO's decoupled frontend and backend allowed them to push through bottlenecks caused by x86's legacy CISC instruction set.

But as the designs evolved (let's just ignore Intel's misadventures with netburst), the x86 designs of the mid 2000's (like the Core 2 Duo) were clearly GBOoO, and taking full advantage of GBOoO's abilities to hide memory latency. By 2010, we were starting to see ARM cores that were clearly taking notes and switching to GBOoO style designs.

snvzz2y ago

>As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar.

So far so good.

>And the real answer is that both designs are neither RISC nor CISC

This is... Not even wrong.

>(the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).

Exactly. CISC and RISC are characteristics of the ISA, not the microarchitecture.

But note (and I can't stress this enough), this does not mean ISA doesn't matter.

The ISA is the interface between software and hardware. A well designed ISA will e.g.:

- Not restrict the actual design of the microarchitecture.

- Not expose microarchitecture artifacts.

- Not force unjustified complexity into the microarchitecture nor the software.

fanf22y ago

Yes, good points.

It’s worth remembering that the surviving CISC architectures (x86 and z/370) are less CISCy than VAX and 68k were, in terms of number of address operands and complexity of addressing modes. And ARM is not a classically RISCy RISC. Instruction sets seem to have converged on a pragmatic middle ground — except for RISCV :-)

1letterunixname2y ago· 1 in thread

Long-term, correct and fast emulation of x86 and ARM on other platforms is going to be damn important, features and bugs especially, for investment utilization and for long-term archival and historical purposes.

Apple did an excellent job with Rosetta 2 in most cases. It has its limitations since it's not 100% or sufficiently general as to replicate a Windows PC.

One approach that didn't work so well was Transmeta with VLIW and pouring resource-costly optimizations into the compiler.

All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of the macro ISA running on some particular micro ISA. We don't have single-cycle, non-pipelined RISC or low cycle efficiency, hyper deep, 6 GHz PC processors anymore for good reason... they've been supplanted by a series of incremental computer organizational design approaches due to healthy competition. Now we have low energy ARM, blinding-fast laptops from the top 3 vendors, and ridiculous server metal like the 9754 and the 9474F for 6 TiB 2P systems.

snvzz2y ago

>All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of

This ignores that performance is not everything.

Complexity doesn't just mean more effort; it also translates into an increase of bugs.

johnnyjeans2y ago· 1 in thread

x86 doesn't need to die because there's nothing wrong with it (or at least, many of its issues are either not talked about or misrepresented.) Marketing hype and people who buy into that have established and cemented a narrative where the despotic x86 chains us to higher power draws and unreasonable architecture choices that we could otherwise do away with if only we could adopt the noble ARM magic silicon.

In truth ARM doesn't actually present any real gains (efficiency or otherwise) over x86 in pretty much any space as an inherent consequence of its ISA. The narrative has its origins seeded by way of the principal market that ARM found widespread success in being... microprocessors and extremely low-end processors like those found in handheld gaming devices and eventually phones. Naturally these processors were designed to sip voltage by way of not actually pushing a whole lot of numbers. The market matured and so did the architecture, and we started to see cellphones that could really sling their weight! It's all smoke and mirrors though, as even in TYOOL 2024 the moment you do something intensive that your phone does not have a hardware accelerator for (e.g., compiling software) it becomes apparent the thing you're holding in your hand is about as good as a core 2 duo when it comes to crunching numbers with a lot of branches.

Then of course Apple came along and brought ARM back to the desktop space after decades of being relegated to power-sippers. People's jaws hit the floor over a chip that doesn't actually perform any better than its AMD counterparts, because oh hey! 20 hour battery life! Well, actually it's 11 hours if you're doing really light web browsing, and only in Safari. Well hey, that's slightly better than the comparable laptop chips in the x86 family released around the same time, right? And indeed, that's a few hours more than I got in my first gen AMD T14 which by all metrics is close enough to the M1 chip in my mbp. But you know, the further I dug, the more I found out that the battery life was about comparable when the system actually has to start doing work. Long video calls in Zoom? That was about 4 hours on both. Heavy use of Firefox? 6 hours each. Lots of compiling and a resource heavy dev environment with a fat C++ language server? Again, about 4 hours on both.

The battery gains weren't from some mystical discrepancy between how instructions are decoded on the two chips. In the end it just came down to the fact that as all operating systems do, the great M1 battery life was owed to really cute power management drivers in Apple's operating system (as well as implementing a fair amount of extensions to make things like video decode and javascript execution draw less power.) It's a fact that becomes all the more apparent when, while following the Asahi devblog, during the watershed moment of actually getting Linux to bootstrap itself on Apple silicon, I read the kernel's main loop doing absolutely nothing chewed through the whole battery in less than 3 hours. That sounds about right to me, given every single power draw experiment I've read between 2020 and now indicates the M1's power draw isn't really as great as the hype machine has made it out to be.

I'm sure we're all at the edge of our seats hoping to experience an efficiency revolution, and that this can be given to us with this one weird trick of changing ISAs... but it's not happening. At least not with ARM, or RISC-V or any other contemporary architecture that isn't x86.

Far be it from me to defend Intel or their frankenstein's monster architecture, but I've gotten a bit tired of this dream that we're on the cusp of performing supertasks in an instant, at the cost of mere picowatts. Especially not when it inadvertently pushes us towards a future of SoC lock-in and hobbyist bare metal development becomes an even bigger waking nightmare. Until then, I sincerely hope x86 never dies, even though saying so kills a piece of me inside.

snvzz2y ago

>I'm sure we're all at the edge of our seats hoping to experience an efficiency revolution, and that this can be given to us with this one weird trick of changing ISAs... but it's not happening. At least not with ARM, or RISC-V or any other contemporary architecture that isn't x86.

You seem to miss that RISC-V being better is beside the point.

RISC-V's massive success was unavoidable, due to its open specification and free license.

The ISA did not need to even be good, just decent.

JonChesterfield2y ago

I don't think the instruction encoding is a significant problem. Cache coherency really might be.

A current x64 chip is a dozen or so separate dies with eight or so x64 cores per die, with a couple of those in different sockets. When one thread on one core decides to write to a cache line, the memory model makes really strong guarantees about cores on some other socket noticing that change.

Arm doesn't have to go with total store order. GPUs involve distinct blocks of memory with their own invariants on when caches are invalidated at potentially very coarse granularity (like no change will be seen until after a kernel has finished executing, where a kernel is essentially a process that sprung to life and then did arbitrary amounts of maths).

Fast x64 code is prone to carefully partitioning the problem across different cores and trying not to hit a cache from another core but even then you still have something like MOESI sitting in the background waiting just in case some thread mutates the instructions executing on another one.

fargle2y ago

x86 doesn't "need to die". long-lasting designs that have earned their keep and proven themselves are valuable. so from that standpoint, i agree with the premise.

the problem is that the article relies too much on bad arguments such as "ISA differences were swept aside by the resources a company could put behind designing a chip". this is a dangerously bad argument. the fact that a company can afford to and is willing to keep x86 afloat and competitive with massive resources is not an argument for dismissal, but for its economic usefulness.

the value to a legacy ISA is real. but the cost is complexity. this complexity drives $, silicon area, power. period. in an even drag race, the simpler more efficient design would definitely win.

and a load-store architecture is simpler than a complex addressing scheme and will have more throughput per clock with less resources (design work, validation, $, area, power). a fixed or simple variable length opcode is always going to be simpler to implement than x86!

but on the other hand, a lot of those massive costs (NRE) are sunk costs. others are not.

so there's nothing wrong with x86 at the moment - it's still clearly the cheapest (due to scale) and fastest (definitely per/$) and excepting M2/M3 also fastest absolute per-core.

it certainly doesn't need to "die".

part of the advantage is inertia. and that's part of the disadvantage too. it's just barely starting to look like the trajectory is changing. but by the time the overall economics of ARM, RISC-V, etc. begin to overtake x86, the inertia and cruft will be negatively affecting them too.

things get old and die on their own. there's no stopping it. but in this case it doesn't need to be hastened and it won't be a "happy" moment when the changing of the guard does happen.

DerekL2y ago

Title is misspelled, should be “doesn't”.

dmitrygr2y ago· 12 in thread

tester7562y ago

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

>...

dmitrygr2y ago

I did not talk about power - I talked about perf. No modern x86 chip can decode 6 or 7 of these long instrs per cycle. There are aarch64 chips that can.

watersb2y ago

I believe that modern x86 processors store decoded micro-ops in the I$ instruction cache.

I always understood micro-ops to be fixed length.

Sure, you have to decode the variable length instructions at some point. But that extra work, relative to aarch64, is in practice amortized over the lifetime of that cache line.

LegionMammal9782y ago

> So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.

Do Intel cores no longer have a μop cache in front of the L1i cache?

snvzz2y ago

x86's approach to variable-length instructions is unfortunate.

In contrast, RISC-V leverages variable-length encoding to get the best code density among 64bit ISAs while sidestepping the instruction boundary problem.

(I digress, but note that while for the 32bit ISA RISC-V code density was competitive yet bested by ARM thumb2, it has since improved; RISC-V has the best density overall)

Findecanor2y ago

The length of a RISC-V instruction is in the first byte though, not the tenth.
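To make the point about the first byte concrete, here is a minimal sketch in Python. It assumes the length-encoding convention from the RISC-V unprivileged spec (little-endian, so the low-order bits live in the first byte); the function names `rv_insn_length` and `rv_insn_offsets` are made up for illustration, and the 48-bit and 64-bit cases follow the spec's extended-length scheme rather than anything commonly deployed.

```python
def rv_insn_length(first_byte: int) -> int:
    """Return the instruction length in bytes, using only the first byte."""
    if first_byte & 0b11 != 0b11:
        return 2  # 16-bit compressed instruction (C extension)
    if first_byte & 0b11100 != 0b11100:
        return 4  # standard 32-bit instruction
    if first_byte & 0b111111 == 0b011111:
        return 6  # 48-bit format in the extended-length scheme
    if first_byte & 0b1111111 == 0b0111111:
        return 8  # 64-bit format in the extended-length scheme
    raise ValueError("longer or reserved encoding, not handled in this sketch")


def rv_insn_offsets(code: bytes) -> list[int]:
    """Walk a byte stream and return the start offset of each instruction."""
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        pos += rv_insn_length(code[pos])
    return offsets
```

For example, `rv_insn_offsets(bytes.fromhex("0100130000000100"))` walks a `c.nop`, a 32-bit `nop`, and another `c.nop`. The boundary of each instruction depends on a couple of bits at a known position, which is the property being contrasted with x86 here.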

pif2y ago

I think you are missing the only point of the article: performance and compatibility are important; everything else is just aesthetics.

aurareturn2y ago

Most important applications have an ARM version now. Especially true since Apple Silicon and AWS Graviton. Windows will force developers to compile both x86 and ARM versions.

JonChesterfield2y ago

It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

snvzz2y ago

>It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

Why not have just one, then?

After all, there's loads more registers in the machine as part of hiding latency.

The ISA either matters or it does not. Pick one.

account4mypc2y ago

usually the really fat instructions take over 1 cycle anyway, right? so the decoder should be able to keep up

dmitrygr2y ago

pipelining...

they are usually pipelineable

BobbyTables22y ago· 5 in thread

The problem is not the ISA — it’s the whole ecosystem.

An x86 system is a witches’ brew of MSRs, I/O ports, and chipset-specific PCI devices. And that’s just across Intel CPUs…

How much code has to execute before even a bootloader can run?

Why do we need a damn ACPI interpreter?

Why do we still deal with legacy PCI routing (on all devices) when none actually use it?

The PCI configuration space is a bit of a mess. We should just make a new standard where everything is 64-bit and memory-mapped only.

Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

Imagine if 1980s systems were shackled with 1940s-1960s compatibility concerns.

We need to start afresh — take all the learnings from the past decades and cast off the legacy crap.

johnnyjeans2y ago

> Why do we need a damn ACPI interpreter?

No thanks. Keep your ridiculous blackbox that's actively hostile to having new software run on it.

TheLoafOfBread2y ago

> SoC-land is much, much worse

Exactly. It does not matter what core architecture is being used. What matters is that each system usually has a different memory model, which completely defeats any compatibility.

StressedDev2y ago

We will when you are willing to pay for the massive cost of redoing everything which already works well. Also, are you going to buy everyone new hardware for your new better architecture?

rbanffy2y ago

> Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

BobbyTables22y ago

Today they still boot in 16-bit real mode. Very quickly, it usually switches to 32bit mode for the BIOS, maybe eventually 64bit mode.

Option ROMs still start in real mode (at least for non-UEFI). The system management (SMI) handlers are still launched in real mode too!

When the OS gets launched, it brings up the other CPUs in real mode (INIT/SIPI), and has to do all the same gymnastics again…

Even PCI configuration space access is still supported with IO ports and that mode is still required for some aspects.

Even today, the serial port still uses IO ports and interrupts just as it did 40+ years ago…

Yep, IO based POST codes are still a thing…
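
For readers who have not met the port-based PCI configuration access mentioned above, here is a rough Python sketch of the legacy mechanism's address word: software writes a 32-bit value to the address port (0xCF8) and then reads or writes the data port (0xCFC). The helper name `pci_config_address` is hypothetical, and the sketch only computes the value; it performs no actual port I/O.

```python
def pci_config_address(bus: int, device: int, function: int, offset: int) -> int:
    """Build the 32-bit word written to the 0xCF8 address port."""
    if not (0 <= bus < 256 and 0 <= device < 32
            and 0 <= function < 8 and 0 <= offset < 256):
        raise ValueError("bus/device/function/offset out of range")
    return (
        (1 << 31)            # enable bit
        | (bus << 16)        # 8-bit bus number
        | (device << 11)     # 5-bit device number
        | (function << 8)    # 3-bit function number
        | (offset & 0xFC)    # register offset, aligned down to 4 bytes
    )
```

For instance, `hex(pci_config_address(0, 31, 7, 0x10))` gives `0x8000ff10`. Only 256 bytes per function are reachable this way, which is part of why the memory-mapped mechanism exists alongside it.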

robotnikman2y ago· 4 in thread

In comparison, you could re-use, update, and repurpose any old x86 machine to do whatever you need.

yjftsjthsd-h2y ago

the_panopticon2y ago

https://www.intel.com/content/www/us/en/developer/articles/t...

jauntywundrkind2y ago

hackeraccount2y ago

OK- there were some SPARC servers but that was a while ago - and they were honestly never any fun.

korginator2y ago· 3 in thread

eternityforest2y ago

days of heavy use??? That's really impressive!

I haven't paid any attention to new PC hardware other than the RasPi in the last 5 years or so, and I've always ignored Apple, so I was really not expecting that much progress!

korginator2y ago

[Mine is an M2 16-inch MBP from last year, perhaps the M3's are somewhat better?]

In comparison, my new Dell work laptop with an Intel chip gives me about 4 hours. It's not an apples <-> apples comparison but they're in the ballpark.

rbanffy2y ago

Same here. Disconnected the charger when moving stuff in the office, worked a full day and halfway through the next one the laptop started complaining.

It's impressive. Nothing on the market comes even close.

snvzz2y ago· 3 in thread

Most CISC proponents (armchair digital architects) entirely miss the point.

An ISA is the interface between hardware and software. Thus a complex ISA does impose complexity upon both the hardware and the software.

Complexity is inherently (very) bad, and thus needs strong justification.

RISC embodies this idea by recognizing the value of simplicity and requiring any ISA addition be weighted against its complexity cost.

StressedDev2y ago

acdha2y ago

snvzz2y ago

>why did almost all of the RISC chips from the 1980s and 1990s fail?

Citation needed.

>Why aren't we using them today?

Because we are using better chips made recently, not the ones from the 80s and 90s.

>Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips?

Because of performance vs cost in the current market, as well as access to the x86 software moat.

Facebook, the FAANG member you have not mentioned, has its own RISC-V server effort.

>I don't think instruction set really matters that much

And yet, you're writing this very opinionated comment about ISAs.

>RISC was a great idea

Yes, it was. This is why the industry never again made a tabula rasa CISC ISA.

>At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip.

The RISC chips actually were faster. But this did not matter, as Intel had the better fabs, and the cash.

So Intel was able to brute-force its way into enough performance for cheap enough that the market would then not bother going through the pain of switching ISAs.

>without supplying better chips.

The chips were better despite Intel's fab advantage. But they were not cheaper, nor did they run the software the market wanted to run.

They sure sold these Pentiums, and were able to buy (and kill) Alpha later.

The one and only reason x86 survives to date is this software moat.

This moat advantage is in danger now, thanks to Microsoft's efforts to detach Windows from x86 and provide emulation to handle the transition like Apple did.

eternityforest2y ago· 2 in thread

The nice thing about x86 is that it's so standardized. Everyone has been copying each other since the IBM PC era.

For a long time everything else was stuck with custom built images for every device.

TheLoafOfBread2y ago

> For a long time everything else was stuck with custom built images for every device.

Like today with ARM boards, where each OS is custom-bent to each board.

rbanffy2y ago

I gather many of those are mainlined now. Bootloaders might still be needed, but I would imagine the bulk of the customisations for a couple "major" platforms is sorted out.

phire2y ago· 2 in thread

One of the reasons why the RISC vs CISC debate keeps popping up every few years, is that we kind of stopped naming CPU uarches after the CISC and RISC terminology was introduced in the late 70s.

And because there weren't any new names, everyone got stuck in this never-ending RISC vs CISC debate.

So what is this unnamed uarch pattern?

The GBOoO design pattern focuses on Out-of-Order execution to a somewhat insane degree.

If an in-order uarch ever misses the L1 cache, then the CPU pipeline is guaranteed to stall. Even if an L2 cache exists, it's only going to minimise the length of the stall.

Where did GBOoO come from?

snvzz2y ago

>As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar.

So far so good.

>And the real answer, is that both designs are neither RISC or CISC

This is... Not even wrong.

>(the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchtecture).

Exactly. CISC and RISC are characteristics of the ISA, not the microarchitecture.

But note (and I can't stress this enough), this does not mean ISA doesn't matter.

The ISA is the interface between software and hardware. A well designed ISA will e.g.:

- Not restrict the actual design of the microarchitecture.

- Not expose microarchitecture artifacts.

- Not force unjustified complexity into the microarchitecture nor the software.

fanf22y ago

Yes, good points.

1letterunixname2y ago· 1 in thread

Apple did an excellent job with Rosetta 2 in most cases. It has its limitations, since it's not 100% complete or general enough to replicate a Windows PC.

One approach that didn't work so well was Transmeta with VLIW and pouring resource-costly optimizations into the compiler.

snvzz2y ago

>All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of

This ignores that performance is not everything.

Complexity doesn't just mean more effort; it also translates into an increase of bugs.
