The Past, Present, and Future of the CPU, According to Intel and AMD

(gamespot.com)

53 points · 0xb0 · 11y ago · 49 comments

34 comments · 8 top-level

zanny · 11y ago · 10 in thread

> could run 64-bit operating systems, which could address more than 4GB of RAM

I just want to vent about how insane it is that the x86 ecosystem moved to 64 bit because Microsoft refused to support physical address extensions in XP. It was completely artificial; it was not a technical limitation. Hell, we really have no excuse to be on 64 bit even today - the improved integer and floating-point precision is nice, but few people cared about that. With PAE, you are only limited by the per-application address space... which you are still limited by today if you are running a 32 bit program. And in the Windows ecosystem, since there is so much less free software and no package management, everything is distributed as a 32 bit binary to be compatible across the board.

Which means in practice nobody ever needed 64 bit. And those that did had business reasons to do so. It is true that Itanium failed because Intel was ignorant of the massive inertial resistance to trying to bring tools across architectures - there is a reason OS/2 still exists, and that is just cross-OS, you can still use the same ASM at least. And you would always be surprised how many enterprise programs are ASM bound somewhere because some unholy desecration of coding practices goes on in the bowels of that non-source-controlled shared repository of suffering.

But I'm talking about the consumer space here - if AMD64 took off as the Xeon class of 2004 that would have been fine, but we made this transition for pretty much no reason at all. I have no idea why Microsoft thought making Windows 64 bit for the Vista release (there was 64 bit XP, but that was really rare and business focused) was easier than just supporting PAE.

But that is happening again today. Kind of in reverse, though. Or maybe it is just a precedent that was set? Apple has now opened the floodgates for ARMv8, and now everyone and their mother wants 64 bit buzzwords on their products while their phones still ship with 3GB or less of memory, and their architecture again supports PAE without issue.

Because there are performance ramifications here. You can fit less into your cache lines when every address takes up twice the space. There is a reason to try fitting your data in the smallest format possible. You end up with more pages in aggregate from all the wasted space, and thus consume more memory implicitly. And we never even got real 64 bit - x86 chips still implement only 48 bit virtual addresses, because somewhere along the line it was realized "64 bits is ludicrous amounts of memory".
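
A rough way to put numbers on the cache-line point. This is only a sketch: the 64-byte line size and the linked-list node layout are assumptions picked for illustration, not anything stated in the comment.

```python
CACHE_LINE_BYTES = 64  # assumed line size; common on x86 but not universal


def pointers_per_line(pointer_bytes: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """How many pointers of the given width fit in one cache line."""
    return line_bytes // pointer_bytes


def node_bytes(pointer_bytes: int, payload_bytes: int = 4) -> int:
    """Size of a hypothetical doubly linked list node: a small payload plus
    two pointers, rounded up to pointer alignment."""
    raw = payload_bytes + 2 * pointer_bytes
    return (raw + pointer_bytes - 1) // pointer_bytes * pointer_bytes


print(pointers_per_line(4), pointers_per_line(8))  # 16 8
print(node_bytes(4), node_bytes(8))                # 12 24
```

In this made-up layout the pointer-heavy node doubles in size, which is the effect being described; structures that hold mostly non-pointer data grow far less.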

I think the best part is that memory limitations are also solvable problems. If your program hits its memory limit (i.e., a 32 bit binary with 4GB addressing) you can just fork a process instead of a thread, and suddenly you double your working space. It is insanely rare to have an active working set of more than 4GB where that data needs to be address-local available for access or else you suffer huge slowdowns, and even more so in routines where you cannot load balance to delegate processes that manage ranges of values that get that big.
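
The "delegate ranges to processes" idea can be sketched like this. It is a minimal illustration using Python's standard `multiprocessing.Pool`; `shard_ranges`, `sum_range`, and `run_sharded` are hypothetical names, and summing integers stands in for whatever the real per-range work would be.

```python
from multiprocessing import Pool


def shard_ranges(total: int, workers: int) -> list[tuple[int, int]]:
    """Split [0, total) into `workers` contiguous half-open ranges."""
    base, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def sum_range(bounds: tuple[int, int]) -> int:
    """Stand-in for real work: a worker only touches its own slice."""
    lo, hi = bounds
    return sum(range(lo, hi))


def run_sharded(total: int, workers: int) -> int:
    """Run each slice in a separate process, each with its own address space."""
    with Pool(workers) as pool:
        return sum(pool.map(sum_range, shard_ranges(total, workers)))
```

Call `run_sharded(1_000_000, 4)` from under an `if __name__ == "__main__":` guard. The trade-off is that anything shared between slices now has to cross a process boundary instead of being a pointer dereference.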

Likewise, if you hit a kernel address limit (remember, with PAE, you can get anything from an order of magnitude to a thousandfold increase in available physical memory) your problem is so huge it makes sense to have it worked on by a server farm compute cluster. Unless you tried to voodoo all that circuitry together into some supermassive NUMA system (sounds painful), which wouldn't make any sense anyway: abstracting away network interconnects between disparate nodes at that scale adds so much latency that treating it like memory, when it is as slow as flash storage, is pointless.
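
For a sense of scale, here is the arithmetic behind the address widths in play. The 36-bit figure is the physical address width of the original x86 PAE and is supplied here as background, not taken from the comment; 48 is the width the comment mentions.

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with an address of the given width."""
    return 1 << address_bits


def human(n: int) -> str:
    """Format a byte count using binary units (exact for powers of two)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"


for bits in (32, 36, 48, 64):
    print(bits, human(addressable_bytes(bits)))
# 32 4 GiB
# 36 64 GiB
# 48 256 TiB
# 64 16 EiB
```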

....

Ok, tangential rant over. I just think 64 bit is such a stupid buzzword waste of time, and it's crazy that the industry has gone off the deep end on it for so long because it exploits some cultural tic in people that bigger is better. No, in practice it really does not "hurt" to have, but we (at least the consumer class, and 99% of business use cases) never needed it in the first place, really.

joenathan · 11y ago

>Microsoft refused to support physical address extensions in XP

Completely wrong

"The original releases of Windows XP and Windows XP SP1 used PAE mode to allow RAM to extend beyond the 4 GB address limit. However, it led to compatibility problems with 3rd party drivers which led Microsoft to remove this capability in Windows XP Service Pack 2. Windows XP SP2 and later, by default, on processors with the no-execute (NX) or execute-disable (XD) feature, runs in PAE mode in order to allow NX.[14] The no execute (NX, or XD for execution disable) bit resides in bit 63 of the page table entry and, without PAE, page table entries on 32-bit systems have only 32 bits; therefore PAE mode is required in order to exploit the NX feature. However, "client" versions of 32-bit Windows (Windows XP SP2 and later, Windows Vista, Windows 7) limit physical address space to the first 4 GB for driver compatibility via the licensing limitation mechanism, even though these versions do run in PAE mode if NX support is enabled.

Windows 8 will only run on processors which support PAE, in addition to NX and SSE2."

- http://en.wikipedia.org/wiki/Physical_Address_Extension#Micr...
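
The bit-63 point in the quote can be checked with a few lines. This is only an illustration of the bit arithmetic; the entry value is made up and the helper names are hypothetical.

```python
NX_BIT = 1 << 63  # bit 63 of a 64-bit (PAE-format) page table entry


def is_no_execute(pte: int) -> bool:
    """True if the no-execute bit is set in a 64-bit entry."""
    return bool(pte & NX_BIT)


def fits_in_entry(bit_index: int, entry_bits: int) -> bool:
    """Whether a given bit position exists in an entry of that width."""
    return 0 <= bit_index < entry_bits


pte = 0x0000_0000_1234_5007  # made-up entry value, for illustration only
print(is_no_execute(pte), is_no_execute(pte | NX_BIT))  # False True
print(fits_in_entry(63, 32), fits_in_entry(63, 64))     # False True
```

A 32-bit entry simply has no bit 63, which is why the quote says NX requires running in PAE mode.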

zanny · 11y ago

The driver instability was only their fault. The fact is the reason AMD64 "won" was because in the late 2004 era, with SP2, Microsoft killed PAE (the concept, not the implementation, by artificially limiting addressable memory), and the artificial memory limit on 32 bit systems doomed the architecture.

I speak only of the time period from 2004 - 2008 where we went from Pentium 4 to Core 2 (and Athlon64 to Phenom), where 64 bit became ubiquitous because Microsoft failed to provide its consumer grade OSes the ability to access the amounts of memory the hardware supported.

2 more replies

jeffreyrogers · 11y ago

My understanding is that even with PAE (Physical Address Extensions) a 32bit OS can only provide 4GB of RAM to any individual application (but more than 4GB of RAM can be used in total). In which case there is a benefit to 64bit OSes, particularly with memory intensive applications such as games, CAD tools, or audio/video processing.

Edit: Actually you already addressed this in your post, with forking a process, which does seem like a potentially better solution in many situations.

Also, not sure why you're being downvoted, since your post seems intelligent and well thought out.

zanny · 11y ago

> games

Almost all games are distributed as 32 bit binaries, albeit more recent titles are coming out 64 only. Point is, only one ever really comes out, because the vendors don't want to support two independent binaries, and in practice games really have no excuse to use more than 4GB of ram - you really should be caching to disk any scene or environment data that is that large, and all your texture data is already resident on the GPU. GPU drivers would have to be PAE aware, though, if you wanted to address more than 4GB of video memory (such as on the Titan).

And all the media creation tools do fall under that "business class use case". It is not that I think 64 bit is stupid - if you need 64 bit addressing, you need 64 bit addressing - just that the average consumer did not need it, at all. Business applications that needed 64 bit could have been shipped as such, hopefully for Itanium (read some docs on it, the ASM and pipeline model were way ahead of their time - to its detriment, sadly) which could have been the Xeon-esque business CPUs.

1 more reply

wmf · 11y ago

Both iOS and Android are reporting significant performance gains from 64-bit, so ARMv8 is not exactly pointless.

runeks · 11y ago

But are the performance gains a result of a completely new CPU architecture (ARMv8-A, that happens to be 64 bit), or the result of addressing memory using 64 bits?

In reality, I think your statement should be "Both iOS and Android are reporting significant performance gains from ARMv8 [...]", not necessarily from going "64 bit".

1 more reply

maximilianburke · 11y ago

The extra registers are really nice, as is the deprecation of x87.

zanny · 11y ago

In hindsight the best thing about AMD64 is the ability to assume all the extensions found in the 686 and such up to that point. Programs compiled today for ia32 often still restrict themselves to the 386 instruction set.

1 more reply

AnimalMuppet · 11y ago

I wonder if some of the reason for the 32-bit-to-64-bit transition is because there are still a lot of people who remember the last one. 16-bit-to-32-bit was in fact a very big deal. People who remember that one may have it coloring their expectations for 32 -> 64.

pjc50 · 11y ago

Memory space issues aside, the AMD64 instruction set is much nicer than 32-bit x86. The performance benefits of more registers often outweigh the cost of larger pointers taking up more cache.

KerrickStaley · 11y ago · 4 in thread

Mantle still seems somewhat far out—the article mentions that the spec won't be published until the end of this year, and there's no word on when Linux support will come (though AMD has said it will happen).

jeffreyrogers · 11y ago

Do you know of a good explanation of Mantle's goals? What sets it apart from DirectX or OpenGL?

corysama · 11y ago

OpenGL and D3D are both very good APIs that have worked well for 20+ years. However, they have a few ideas about the hardware fundamentally baked into them that are not aging well.

The main issue is that they are based on a model of continuously modifying a very large, monolithic body of state representing fine details about what the next draw should do. At any moment a draw call may be issued to enact the current state and produce a result.

In the past, that state was represented in hardware mostly using a large collection of physical registers. Nothing else could possibly be fast enough. The API model of "set BlendStateSourceOp, set BlendStateDestOp, etc..." mapped very well to the hardware. You literally were continuously mutating a large block of registers.

In the present, programmable hardware has become capable of largely taking over for fixed-function hardware. Modern GPUs have been increasingly cutting out special-purpose silicon to make room for more multi-purpose ALUs. These general-purpose ALUs represent how to draw using fairly large, allocated structures instead of single-purpose registers. These structures are not trivial to construct and modifying them continuously is not advised. However, switching between them is as trivial as moving a pointer from one to the other.

Fortunately, most games don't actually use a continuum of states when drawing. In practice, they switch repeatedly between a small number of states with very little variation between frames. Therefore, modern drivers do a lot of work to implicitly infer what state setups are heavily repeated within each run of each application. These states are baked into structures under the hood on the fly. Odd variants are expensive in this mode. But, they are also rare, so they are lower priority.
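
The driver-side inference described here can be pictured as a cache keyed on a snapshot of the mutable state. This is a toy model under that assumption, not how any real driver is written; `LegacyContext`, `set`, and `draw` are made-up names.

```python
class LegacyContext:
    """Toy model of a mutable draw-state machine with on-the-fly baking."""

    def __init__(self):
        self.state = {}      # the big mutable state block
        self.baked = {}      # state snapshot -> baked structure
        self.bake_count = 0  # how often the expensive path ran

    def set(self, key: str, value: str) -> None:
        self.state[key] = value  # cheap: just mutate

    def draw(self) -> str:
        # Snapshot the whole state at draw time and look it up.
        snapshot = tuple(sorted(self.state.items()))
        if snapshot not in self.baked:
            self.bake_count += 1  # expensive: first time this combo is seen
            self.baked[snapshot] = f"baked#{self.bake_count}"
        return self.baked[snapshot]


ctx = LegacyContext()
for _ in range(100):  # 100 frames alternating between two states
    ctx.set("blend", "opaque")
    ctx.draw()
    ctx.set("blend", "alpha")
    ctx.draw()
print(ctx.bake_count)  # 2
```

Note that even on a cache hit the model has to build and hash a snapshot on every draw, which is the kind of hidden per-draw work the rest of the comment is about.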

Mantle, Metal and DX12 all seek to reboot the idea of graphics APIs from scratch based on how hardware actually works today. You set up an explicit set of draw state structures at init time. You switch between them explicitly and trivially at run time.

A second issue baked into OGL/D3D is that, in the past, the monolithic draw state was stratified into quite nicely orthogonal chunks dealing with separate issues such as: how to load a vertex from memory vs. how to operate on a vertex vs. how to pass data from the vertex shader to the fragment shader vs. how to operate on a fragment (sample) vs. how to blend the fragment into the framebuffer. This model made the APIs quite nice to learn and to use.

Unfortunately, it is simply not representative of how the hardware actually operates today. Today, most of those operations are actually handled by general purpose ALUs. These ALUs are running the vertex and fragment programs you wrote. But, they are also running more code to handle what used to be done in fixed-function silicon. Actually, it's worse than that. What used to be a register flip that was completely orthogonal to your vertex/fragment programs is now actually implemented by modifying code interleaved into the guts of the programs you compiled back at init time. These changes are done under the hood and on the fly.

Modifying the code under the hood is expensive. Worse, the draw state is so large and complicated that it is easy to accidentally request an invalid state. Validating each given state is expensive. Because the classic model lets you make draw state changes at any time preceding a draw and the state changes are no longer stratified, the state validation can no longer be done incrementally. Instead, every time you draw a significant amount of work is done just to make sure the request makes sense.

Again, by declaring draw states up front, compilation and validation can be done once up front. Switching between pre-compiled, pre-validated states is trivial.
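
A minimal sketch of the "validate once, then just switch" model, with the validation cost made visible as a counter. `Device`, `PipelineState`, and the blend mode names are hypothetical and do not correspond to any real API.

```python
from dataclasses import dataclass

VALID_BLEND = {"opaque", "alpha", "additive"}  # made-up set of legal modes


@dataclass(frozen=True)
class PipelineState:
    """Immutable draw state; it cannot be mutated after creation."""
    blend: str
    depth_test: bool


class Device:
    def __init__(self):
        self.validations = 0
        self.bound = None

    def create_state(self, blend: str, depth_test: bool) -> PipelineState:
        """Expensive path: validate (and, in a real API, compile) once."""
        self.validations += 1
        if blend not in VALID_BLEND:
            raise ValueError(f"invalid blend mode: {blend}")
        return PipelineState(blend, depth_test)

    def bind(self, state: PipelineState) -> None:
        self.bound = state  # cheap: a reference swap, no re-validation

    def draw(self) -> PipelineState:
        if self.bound is None:
            raise RuntimeError("no pipeline state bound")
        return self.bound


device = Device()
opaque = device.create_state("opaque", True)   # init time
blended = device.create_state("alpha", False)  # init time
for _ in range(1000):                          # run time
    device.bind(opaque)
    device.draw()
    device.bind(blended)
    device.draw()
print(device.validations)  # 2
```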

A third issue is that OGL/D3D have the genuinely great goal of preventing and/or detecting synchronization errors in the usage of the API. In other words, you really shouldn't try to have the CPU modify a given block of memory while the GPU is simultaneously reading that same memory in an uncoordinated fashion. OGL and D3D have an interface and implementation designed to prevent/detect/allow-at-a-huge-cost these usage errors as much as possible. In practice, serious programs cannot ship with these errors. That means that in practice, all serious, shipping programs do not have these errors to any significant degree, but the driver is still always doing a large amount of work checking for them all of the time.

The new-style APIs seem more inclined to declare this category of usage errors to be undefined behavior rather than pay the cost to handle them. "Here's how to avoid them. So... avoid them."

A fourth issue is that multi-core computing is much more common and important than it was in the past. OpenGL has never had an interface to issue draw commands from multiple threads of a single process. D3D11 had an interface to record commands on multiple threads and dispatch them on a primary thread, but the consensus is that D3D11's implementation did not work as well as was expected in practice.

Mantle, Metal and DX12 all have new, multi-threaded interfaces that they are quite confident will work well in practice.
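
The "record on many threads, dispatch on one" shape can be sketched as below. It is purely structural: Python threads will not show the performance benefit, and `record` and `build_frame` are hypothetical names rather than anything from Mantle, Metal, or DX12.

```python
import threading


def record(commands_out: list, object_ids: range) -> None:
    """Worker: record draw commands into a list owned by this thread only."""
    for oid in object_ids:
        commands_out.append(("draw", oid))


def build_frame(num_objects: int, num_threads: int) -> list:
    """Record per-thread command lists in parallel, then submit in order."""
    lists = [[] for _ in range(num_threads)]
    chunk = -(-num_objects // num_threads)  # ceiling division
    threads = []
    for i, cmds in enumerate(lists):
        ids = range(i * chunk, min((i + 1) * chunk, num_objects))
        t = threading.Thread(target=record, args=(cmds, ids))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Submission happens on one thread, in a fixed order, so the frame is
    # deterministic no matter how the recording threads were scheduled.
    submitted = []
    for cmds in lists:
        submitted.extend(cmds)
    return submitted


print(len(build_frame(10, 4)))  # 10
```

No locks are needed during recording because no two threads ever write to the same list.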

Much of what I'm describing here is covered in this presentation from Microsoft "DirectX 12 API Preview" https://www.youtube.com/watch?v=m0QkjKGZQzI

An alternative approach has been proposed by a multi-vendor group of OpenGL driver developers. It was presented in the "Approaching Zero Driver Overhead" (AZDO) talk at GDC 2014. http://gdcvault.com/play/1020791/ and https://www.khronos.org/assets/uploads/developers/library/20...

In the AZDO approach, instead of tossing out the legacy state machine of OpenGL, they demonstrate how some current (fairly cutting edge) features that have recently been added allow a draw state to be set up that is so expressive and so extensive that it can pretty effectively represent a whole, fairly complicated scene of a modern game in a single draw state. Once you set this up, you can pretty much issue a single request to draw much-if-not-all of the current frame as an atomic operation. Further, common frame-to-frame modifications (such as moving objects around) are very cheap in this setup.
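
One ingredient of this, as far as I recall from the talk, is indirect multi-draw: the per-object draw parameters live in a buffer and a single call consumes all of them. The sketch below only packs such a buffer on the CPU and makes no GL calls. The five-field record follows OpenGL's indexed indirect draw command layout as I remember it (count, instanceCount, firstIndex, baseVertex, baseInstance), so treat the layout as an assumption; the mesh numbers are made up.

```python
import struct

# count, instanceCount, firstIndex, baseVertex (signed), baseInstance
COMMAND_FORMAT = "<IIIiI"


def pack_commands(meshes) -> bytes:
    """Pack one draw record per mesh: (index_count, first_index, base_vertex)."""
    buf = bytearray()
    for instance, (index_count, first_index, base_vertex) in enumerate(meshes):
        # One instance per record; baseInstance doubles as a per-object id
        # that a shader could use to look up its transform.
        buf += struct.pack(COMMAND_FORMAT, index_count, 1, first_index,
                           base_vertex, instance)
    return bytes(buf)


meshes = [(36, 0, 0), (6, 36, 24), (300, 42, 28)]  # made-up scene
buf = pack_commands(meshes)
print(len(buf) // struct.calcsize(COMMAND_FORMAT))  # 3 records, one submission
```

Moving an object between frames then means rewriting a few bytes in a buffer rather than issuing more API calls, which fits the "very cheap" frame-to-frame changes described above.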

AZDO is an interesting and perfectly workable approach. I am less of a fan of that approach than I am the DX12 approach.

I should make this into a blog post... I should start a blog...

2 more replies

Narishma · 11y ago

https://www.amd.com/Documents/Mantle_White_Paper.pdf

1 more reply

sn0v · 11y ago

Given their atrocious track record on Linux, I wouldn't hold my breath.

jeffreyrogers · 11y ago · 4 in thread

The idea that Moore's law is still continuing is a bit misleading. For all practical purposes it's over. See the graph here: http://www.extremetech.com/computing/116561-the-death-of-cpu...

Yes, we can still cram more transistors on a chip, but we can't get the clock speed any faster because we run into power/heat limitations. And actually if you look at the graph in the link I posted you'll see that clock speed peaked around 2005. Around that time AMD and Intel focused more on multiple cores and clock speed became less important.

Tangent: Personally, I like ARM much better than x86/x86-64. RISC based architectures just seem so much nicer. Plus, they are easier to optimize for and more predictable in terms of clock cycles per instruction.

m0th87 · 11y ago

Moore's law is about the number of transistors, not clock speed.

"The number of transistors incorporated in a chip will approximately double every 24 months." - Gordon Moore: http://www.intel.com/content/www/us/en/history/museum-gordon...

jeffreyrogers · 11y ago

That's true, but the reason people care about it is because in the past more transistors led to higher clock speeds, which is what I was getting at by saying we can still put more transistors on a chip, but that power and heat limitations prevent that from turning into increased clock speeds.
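
For scale, the 24-month doubling in the Moore quote above works out as follows. The starting count is an arbitrary round number chosen for illustration, not a real chip.

```python
def projected_transistors(start_count: float, years: float,
                          doubling_years: float = 2.0) -> float:
    """Transistor count after `years`, doubling every `doubling_years`."""
    return start_count * 2 ** (years / doubling_years)


# Ten years at one doubling per two years is five doublings, i.e. 32x.
print(projected_transistors(1_000_000, 10))  # 32000000.0
```

The formula says nothing about clock speed, which is the distinction the two comments above are arguing over.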

2 more replies

maximilianburke · 11y ago

Modern RISC architectures, such as the ARM Cortex-A9 or any 64-bit ARM variant, aren't any easier to optimize for or more predictable than Intel/AMD processors as they all feature out of order execution.

jeffreyrogers · 11y ago

They are easier for the compiler to optimize because the pipelines are simpler... at least that's what I was taught in my compilers class.

ewzimm · 11y ago · 2 in thread

I'm disappointed that the article leads with Intel's performance lead. I've always found it to be meaningless. Their best CPU is over $2500. Their best i7 is over $1000. I use an AMD A4-3300. It's fast enough for my needs and $20. On the other hand, I also have a Bay Trail Atom tablet, and it has fantastic performance for its power usage. The article talks about what a failure Atom has been. I think part of the blame is tech journalism promoting the fastest silicon. It appeals to some psychological desire, but it doesn't make much practical sense. Battery life on mobile and value on the desktop is all I really care about. Why is it such an embarrassment for both AMD's desktop division and Intel's mobile division to be the most efficient?

runeks · 11y ago

Agreed. When I design a new PC, I don't look for the highest performance system, I look for performance per dollar. This chart is very useful: http://www.cpubenchmark.net/cpu_value_available.html

When designing a new PC about a year ago, I went with an eight-core AMD FX-8320, which was significantly cheaper than an equally fast Intel i5 or i7 CPU (although the per-thread performance is only half of a 4-core i5/i7 with the same Passmark score).
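
The ranking that chart does is just benchmark score divided by price. A tiny sketch with placeholder entries; the names, scores, and prices below are invented for illustration and are not real benchmark data.

```python
def value_ranking(cpus: list[dict]) -> list[dict]:
    """Sort CPUs by benchmark points per dollar, best value first."""
    return sorted(cpus, key=lambda c: c["score"] / c["price"], reverse=True)


# Placeholder numbers only, not measurements of any real part.
cpus = [
    {"name": "cpu_a", "score": 8000, "price": 160},  # 50 points per dollar
    {"name": "cpu_b", "score": 9000, "price": 300},  # 30 points per dollar
    {"name": "cpu_c", "score": 4000, "price": 100},  # 40 points per dollar
]
print([c["name"] for c in value_ranking(cpus)])  # ['cpu_a', 'cpu_c', 'cpu_b']
```

As the comment notes, an aggregate score hides per-thread performance, so a ranking like this is only as good as the benchmark behind it.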

moystard · 11y ago

The gaming community is generally aware that investing in an expensive CPU is not the best choice. Gaming performance will depend mostly on the GPU, and most gamers go for an AMD configuration for that specific reason.

Regarding the mobile usage, I think Intel has a clear advantage as their CPUs are more powerful and more efficient.

Handwash · 11y ago · 2 in thread

"If Epic and its Unreal engine on console don't have a threaded graphics pipeline--which to date they don't"

I'm surprised to learn that this is still the case. I thought everyone had optimized their programs for multi-threading.

m0th87 · 11y ago

Current game engines are still bound by the single-threaded game loop model, although not by necessity. Here's a really cool presentation about it: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced...

Narishma · 11y ago

Most did, but not everyone. Unfortunately, those who didn't are the most popular 3rd party engines (UE3, CryEngine, Source, Unity).

yazaddaruvala · 11y ago · 2 in thread

I'm curious.. let me know good or bad (or wrong) what you guys think of this.

Quick background. My understanding of how a mobile phone works: There is a primary CPU, running Android, which does the majority of the work on the phone. Meanwhile, there is a second CPU attached to the radio running an RTOS. This RTOS interprets the signals from the antenna and makes nice packets for Android. The RTOS and secondary processing unit can then be optimized for just that one task, and Android can be optimized for reading, writing and processing data packets.

Similarly, what if we made more parts of our infrastructure "smarter"? Take for example the monitor. Currently, it has a frame buffer, we fill it through a DVI cord and the lights change. Is there some abstraction at which we could make a monitor work "smarter"? Can we put a GPU in the monitor? Then just move memory and call OpenGL commands. Is that a better abstraction for monitors? Can we also put a CPU on the monitor and embed an RTOS/rendering engine? Would that make a better abstraction? Could we then optimize the "smarter" monitor / integration with the game logic, better than we could have if the monitor just has a frame buffer?

I don't have much domain knowledge and so I don't know what the correct abstraction for a "smarter" monitor should be. However, I do think it's a good question to be asking (and not just about monitors). I'm curious if you guys can think of such an abstraction. What would that be?

pandaman · 11y ago

GPUs are already pretty smart and are controlled by their own processor(s). Moving them into the monitor is not going to make them smarter. On the other hand, communication with a GPU hooked up with a flexible cable is going to be more complicated than with a GPU sitting on a wide internal bus or, as it becomes more common, on the same chip with the CPU.

pjc50 · 11y ago

Then you'd have to upgrade your monitor every few years as the graphics card became obsolete. Also note that "integrate monitor with game logic" turns into "run the entire game in the monitor", at which point you've just pushed the computer up the cable into the monitor.

Interestingly, smarter displays were actually available at one point as a "graphical terminal" http://en.wikipedia.org/wiki/Tektronix_4014 ; people keep trying to reinstate this with the "thin client".

The history of thin clients shows what the problem with "smart" infrastructure is. It moves the tradeoffs around rather than giving absolute benefits.

kens · 11y ago · 1 in thread

I'd like a better understanding of modern processor microarchitectures, i.e. what's happening inside the chip. What do you recommend I read? I'd like to be able to understand a diagram like: http://www.realworldtech.com/wp-content/uploads/2012/10/hasw...

Dwolb · 11y ago

I'd recommend this course: https://www.coursera.org/course/comparch

If you're too advanced for this, I'd consider writing some behavioral VHDL or Verilog for a few of the units to see how a few of the pieces fit together.

chucknelson · 11y ago · 1 in thread

First thought: "This is a Gamespot article?" Seems like something you'd see at Anandtech.

KerrickStaley · 11y ago

Except that it's covering topics that were already well-known in the tech community 6 months ago.
