I just want to vent how insane it is how the x86 ecosystem moved to 64 bit because Microsoft refused to support physical address extensions in XP. It was completely artificial, it was not a technical limitation. Hell, we really have no excuse to be on 64 bit even today - the improved integer and floating precision is nice, but few people cared about that. With PAE, you are only limited by the per-application address space... which you are still limited by today if you are running a 32 bit program. And in the Windows ecosystem, since there is so much less free software and no package management, everything is distributed as a 32 bit binary to be compatible across the board.
Which means in practice nobody ever needed 64 bit. And those that did had business reasons to do so. It is true that Itanium failed because Intel was ignorant of the massive inertial resistance to trying to bring tools across architectures - there is a reason OS/2 still exists, and that is just cross-OS, you can still use the same ASM at least. And you would always be surprised how many enterprise programs are ASM bound somewhere because some unholy desecration of coding practices goes on in the bowels of that non-source-controlled shared repository of suffering.
But I'm talking about the consumer space here - if AMD64 took off as the Xeon class of 2004 that would have been fine, but we made this transition for pretty much no reason at all. I have no idea why Microsoft thought making Windows 64 bit for the Vista release (there was 64 bit XP, but that was really rare and business focused) was easier than just supporting PAE.
But that is happening again today. Kind of in reverse, though. Or maybe it is just a precedent that was set? Apple has now opened the floodgates for ARMv8, and now everyone and their mother wants 64 bit buzzwords on their products while their phones still ship with 3GB or less of memory, and their architecture again supports PAE without issue.
Because there are performance ramifications here. You can fit less into your cache lines when every address takes up twice the space. There is a reason to try fitting your data in the smallest format possible. You end up with more pages in aggregate from all the wasted space, and thus consume more memory implicitly. And we never even got real 64 bit - x86 chips still only implement 48 bit virtual addresses, because somewhere along the line it was realized "64 bits is ludicrous amounts of memory".
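To put rough numbers on the cache-line point, here is a minimal Python sketch. The 64-byte line size and the node layout (one 4-byte payload plus two pointers, like a binary tree node) are assumptions for illustration, and alignment padding is ignored.

```python
CACHE_LINE_BYTES = 64  # assumed line size; common on x86, not universal


def nodes_per_cache_line(pointer_bytes, payload_bytes=4, pointers_per_node=2):
    """Whole nodes that fit in one cache line, ignoring alignment padding."""
    node_bytes = payload_bytes + pointers_per_node * pointer_bytes
    return CACHE_LINE_BYTES // node_bytes


nodes_32 = nodes_per_cache_line(pointer_bytes=4)  # 4 + 2*4 = 12 bytes per node
nodes_64 = nodes_per_cache_line(pointer_bytes=8)  # 4 + 2*8 = 20 bytes per node
print(nodes_32, nodes_64)
```

With real compilers the 64 bit node would usually be padded further, so the gap is if anything understated here.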
I think the best part is that memory limitations are also solvable problems. If your program hits its memory limit (i.e., a 32 bit binary with 4GB addressing) you can just fork a process instead of a thread, and suddenly you double your working space. It is insanely rare to have an active working set of more than 4GB where that data needs to be address-local available for access or else you suffer huge slowdowns, and even more so in routines where you cannot load balance to delegate processes that manage ranges of values that get that big.
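As a rough sketch of the "delegate processes that manage ranges" idea, the Python below only does the partitioning arithmetic: it splits a data set into contiguous ranges so that each worker process stays under the per-process limit. The function name and the 4GB limit are assumptions taken from the post (real 32 bit user space is often smaller), and actually spawning the workers and passing data between them is not shown.

```python
GIB = 2 ** 30
PER_PROCESS_LIMIT = 4 * GIB  # the 4GB figure from the post; an assumption, not a measured limit


def shard_ranges(total_bytes, limit=PER_PROCESS_LIMIT):
    """Split [0, total_bytes) into contiguous ranges, one per worker process, each <= limit."""
    if total_bytes <= 0:
        return []
    workers = -(-total_bytes // limit)  # ceiling division
    base, extra = divmod(total_bytes, workers)
    ranges, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first shards
        ranges.append((start, start + size))
        start += size
    return ranges


shards = shard_ranges(10 * GIB)  # a hypothetical 10GB data set
print(len(shards))
```

Each worker would then map only its own range, so no single address space ever has to hold the whole thing.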
Likewise, if you hit a kernel address limit (remember, with PAE, you can get anything from an order of magnitude to a thousandfold increase in available physical memory), your problem is so huge it makes sense to have it worked on by a server farm compute cluster. Unless you tried to voodoo all that circuitry together into some supermassive NUMA system (sounds painful), which wouldn't make any sense anyway: abstracting away network interconnects between disparate nodes at that scale adds so much latency that treating it like memory, when it is as slow as flash storage, is pointless.
....
Ok, tangential rant over. I just think 64 bit is such a stupid buzzword waste of time, and it's crazy that the industry has gone off the deep end on it for so long because it exploits some cultural tic in people that bigger is better. No, in practice it really does not "hurt" to have, but we (at least the consumer class, and 99% of business use cases) never needed it in the first place, really.
Completely wrong
"The original releases of Windows XP and Windows XP SP1 used PAE mode to allow RAM to extend beyond the 4 GB address limit. However, it led to compatibility problems with 3rd party drivers which led Microsoft to remove this capability in Windows XP Service Pack 2. Windows XP SP2 and later, by default, on processors with the no-execute (NX) or execute-disable (XD) feature, runs in PAE mode in order to allow NX.[14] The no execute (NX, or XD for execution disable) bit resides in bit 63 of the page table entry and, without PAE, page table entries on 32-bit systems have only 32 bits; therefore PAE mode is required in order to exploit the NX feature. However, "client" versions of 32-bit Windows (Windows XP SP2 and later, Windows Vista, Windows 7) limit physical address space to the first 4 GB for driver compatibility via the licensing limitation mechanism, even though these versions do run in PAE mode if NX support is enabled.
Windows 8 will only run on processors which support PAE, in addition to NX and SSE2."
- http://en.wikipedia.org/wiki/Physical_Address_Extension#Micr...
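The bit arithmetic in the quoted passage can be checked with a small Python sketch. The helper names and the example entry are made up for illustration; the only facts relied on are that NX sits in bit 63 (per the quote) and that the original PAE physical address width is 36 bits, which is treated as an assumption here.

```python
NX_BIT = 1 << 63      # no-execute flag position, as stated in the quoted passage
PRESENT_BIT = 1 << 0  # x86 "present" flag, used only to make the example entry plausible


def set_nx(entry):
    """Return the page table entry with the no-execute flag set."""
    return entry | NX_BIT


def has_nx(entry):
    return bool(entry & NX_BIT)


def fits_in_entry(entry, entry_bits):
    """True if the value can be stored in a page table entry of the given width."""
    return entry < (1 << entry_bits)


entry = set_nx(0x1000 | PRESENT_BIT)  # hypothetical entry: frame at 0x1000, present

pae_physical_bytes = 1 << 36  # assuming the original 36 bit PAE physical address width
print(fits_in_entry(entry, 32), fits_in_entry(entry, 64), pae_physical_bytes // 2 ** 30)
```

An entry with bit 63 set simply cannot be represented in a 32 bit page table entry, which is why NX drags PAE mode along with it.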
I speak only of the time period from 2004 - 2008 where we went from Pentium 4 to Core 2 (and Athlon64 to Phenom), where 64 bit became ubiquitous because Microsoft failed to provide its consumer grade OSes the ability to access the amounts of memory the hardware supported.
Edit: Actually you already addressed this in your post, with forking a process, which does seem like a potentially better solution in many situations.
Also, not sure why you're being downvoted, since your post seems intelligent and well thought out.
Almost all games are distributed as 32 bit binaries, albeit more recent titles are coming out 64 only. Point is, only one ever really comes out, because the vendors don't want to support two independent binaries, and in practice games really have no excuse to use more than 4GB of ram - you really should be caching to disk any scene or environment data that is that large, and all your texture data is already resident on the GPU. GPU drivers would have to be PAE aware, though, if you wanted to address more than 4GB of video memory (such as on the Titan).
And all the media creation tools do fall under that "business class use case". It is not that I think 64 bit is stupid - if you need 64 bit addressing, you need 64 bit addressing - just that the average consumer did not need it, at all. Business applications that needed 64 bit could have been shipped as such, hopefully for Itanium (read some docs on it, the ASM and pipeline model were way ahead of their time - to its detriment, sadly) which could have been the Xeon-esque business CPUs.
In reality, I think your statement should be "Both iOS and Android are reporting significant performance gains from ARMv8 [...]", not necessarily from going "64 bit".
The main issue is that they are based on a model of continuously modifying a very large, monolithic body of state representing fine details about what the next draw should do. At any moment a draw call may be issued to enact the current state and produce a result.
In the past, that state was represented in hardware mostly using a large collection of physical registers. Nothing else could possibly be fast enough. The API model of "set BlendStateSourceOp, set BlendStateDestOp, etc." mapped very well to the hardware. You literally were continuously mutating a large block of registers.
In the present, programmable hardware has become capable of largely taking over for fixed-function hardware. Modern GPUs have been increasingly cutting out special-purpose silicon to make room for more multi-purpose ALUs. These general-purpose ALUs represent how to draw using fairly large, allocated structures instead of single-purpose registers. These structures are not trivial to construct and modifying them continuously is not advised. However, switching between them is as trivial as moving a pointer from one to the other.
Fortunately, most games don't actually use a continuum of states when drawing. In practice, they switch repeatedly between a small number of states with very little variation between frames. Therefore, modern drivers do a lot of work to implicitly infer what state setups are heavily repeated within each run of each application. These states are baked into structures under the hood on the fly. Odd variants are expensive in this mode. But, they are also rare, so they are lower priority.
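A toy Python model of that inference: the "driver" keys a cache on the full set of register-style values and only pays the baking cost the first time it sees a combination. The class, the state names and the baking stand-in are all hypothetical; real drivers are far more involved.

```python
class StateCache:
    """Toy model of a driver baking repeated draw states on first use."""

    def __init__(self):
        self._baked = {}
        self.bake_count = 0  # how many times the expensive path ran

    def _bake(self, key):
        # Stand-in for building the large hardware structure described above.
        self.bake_count += 1
        return ("baked", key)

    def lookup(self, **register_values):
        # The full combination of values identifies a state.
        key = tuple(sorted(register_values.items()))
        if key not in self._baked:
            self._baked[key] = self._bake(key)
        return self._baked[key]


cache = StateCache()
for _ in range(1000):  # 1000 "frames" switching between the same two states
    cache.lookup(blend_src="one", blend_dst="zero")
    cache.lookup(blend_src="src_alpha", blend_dst="one_minus_src_alpha")
print(cache.bake_count)
```

Two thousand state switches cost two bakes; every odd new variant would cost another one, which is why those are the expensive case.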
Mantle, Metal and DX12 all seek to reboot the idea of graphics APIs from scratch based on how hardware actually works today. You set up an explicit set of draw state structures at init time. You switch between them explicitly and trivially at run time.
A second issue baked into OGL/D3D is that, in the past, the monolithic draw state was stratified into quite nicely orthogonal chunks dealing with separate issues such as: how to load a vertex from memory vs. how to operate on a vertex vs. how to pass data from the vertex shader to the fragment shader vs. how to operate on a fragment (sample) vs. how to blend the fragment into the framebuffer. This model made the APIs quite nice to learn and to use.
Unfortunately, it is simply not representative of how the hardware actually operates today. Today, most of those operations are actually handled by general purpose ALUs. These ALUs are running the vertex and fragment programs you wrote. But, they are also running more code to handle what used to be done in fixed-function silicon. Actually, it's worse than that. What used to be a register flip that was completely orthogonal to your vertex/fragment programs is now actually implemented by modifying code interleaved into the guts of the programs you compiled back at init time. These changes are done under the hood and on the fly.
Modifying the code under the hood is expensive. Worse, the draw state is so large and complicated that it is easy to accidentally request an invalid state. Validating each given state is expensive. Because the classic model lets you make draw state changes at any time preceding a draw and the state changes are no longer stratified, the state validation can no longer be done incrementally. Instead, every time you draw a significant amount of work is done just to make sure the request makes sense.
Again, by declaring draw states up front, compilation and validation can be done once. Switching between pre-compiled, pre-validated states is trivial.
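The shape of that model can be sketched in Python. This is not any real API: PipelineState, CommandContext and the blend factor names are invented for illustration. The point is only that validation happens once when the state object is created, and binding at draw time is a reference swap.

```python
from dataclasses import dataclass

# Hypothetical set of legal blend factors, standing in for real validation rules.
VALID_BLEND = {"one", "zero", "src_alpha", "one_minus_src_alpha"}


@dataclass(frozen=True)
class PipelineState:
    """Immutable draw state, validated once at creation (init time)."""
    vertex_shader: str
    fragment_shader: str
    blend_src: str
    blend_dst: str

    def __post_init__(self):
        if self.blend_src not in VALID_BLEND or self.blend_dst not in VALID_BLEND:
            raise ValueError("invalid blend factor")


class CommandContext:
    def __init__(self):
        self.current = None
        self.draws = []

    def bind(self, state):
        self.current = state  # just a reference swap; nothing is re-validated

    def draw(self, mesh):
        self.draws.append((mesh, self.current))


# Init time: build and validate every state the program will use.
OPAQUE = PipelineState("vs_main", "fs_main", "one", "zero")
TRANSPARENT = PipelineState("vs_main", "fs_main", "src_alpha", "one_minus_src_alpha")

# Run time: switch between them cheaply.
ctx = CommandContext()
ctx.bind(OPAQUE)
ctx.draw("terrain")
ctx.bind(TRANSPARENT)
ctx.draw("glass")
```

An invalid combination fails loudly at init, not somewhere in the middle of a frame.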
A third issue is that OGL/D3D have the genuinely great goal of preventing and/or detecting synchronization errors in the usage of the API. In other words, you really shouldn't try to have the CPU modify a given block of memory while the GPU is simultaneously reading that same memory in an uncoordinated fashion. OGL and D3D have an interface and implementation designed to prevent/detect/allow-at-a-huge-cost these usage errors as much as possible. In practice, serious programs cannot ship with these errors. That means that in practice, all serious, shipping programs do not have these errors to any significant degree, but the driver is still always doing a large amount of work checking for them all of the time.
The new-style APIs seem more inclined to declare this category of usage errors to be undefined behavior rather than pay the cost to handle them. "Here's how to avoid them. So... avoid them."
A fourth issue is that multi-core computing is much more common and important than it was in the past. OpenGL has never had an interface to issue draw commands from multiple threads of a single process. D3D11 had an interface to record commands on multiple threads and dispatch them on a primary thread, but the consensus is that D3D11's implementation did not work as well as was expected in practice.
Mantle, Metal and DX12 all have new, multi-threaded interfaces that they are quite confident will work well in practice.
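For readers who have not seen the pattern, here is a minimal Python sketch of "record on many threads, dispatch on one", the shape described above for D3D11. The function and variable names are hypothetical, the commands are plain tuples, and Python threads are used only to show the structure, not to demonstrate real parallel speedup. How the newer APIs differ in detail is not covered here.

```python
import threading


def record_commands(chunk_id, objects, out):
    """Each worker records into its own slot, so no locking is needed."""
    out[chunk_id] = [("draw", obj) for obj in objects]


# Hypothetical scene split into independent chunks, one per recording thread.
scene_chunks = [["rock", "tree"], ["house"], ["car", "road", "sign"]]
command_lists = [None] * len(scene_chunks)

threads = [
    threading.Thread(target=record_commands, args=(i, chunk, command_lists))
    for i, chunk in enumerate(scene_chunks)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Primary thread: submit the recorded lists in a fixed order.
submitted = [cmd for command_list in command_lists for cmd in command_list]
print(len(submitted))
```

Recording order between threads does not matter because submission order is fixed by the list index.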
Much of what I'm describing here is covered in this presentation from Microsoft "DirectX 12 API Preview" https://www.youtube.com/watch?v=m0QkjKGZQzI
An alternative approach has been proposed by a multi-vendor group of OpenGL driver developers. It was presented in the "Approaching Zero Driver Overhead" (AZDO) talk at GDC 2014. http://gdcvault.com/play/1020791/ and https://www.khronos.org/assets/uploads/developers/library/20...
In the AZDO approach, instead of tossing out the legacy state machine of OpenGL, they demonstrate how some current (fairly cutting edge) features that have recently been added allow a draw state to be set up that is so expressive and so extensive that it can pretty effectively represent a whole, fairly complicated scene of a modern game in a single draw state. Once you set this up, you can pretty much issue a single request to draw much-if-not-all of the current frame as an atomic operation. Further, common frame-to-frame modifications (such as moving objects around) are very cheap in this setup.
AZDO is an interesting and perfectly workable approach. I am less of a fan of that approach than I am of the DX12 approach.
I should make this into a blog post... I should start a blog...
Yes, we can still cram more transistors on a chip, but we can't get the clock speed any faster because we run into power/heat limitations. And actually if you look at the graph in the link I posted you'll see that clock speed peaked around 2005. Around that time AMD and Intel focused more on multiple cores and clock speed became less important.
Tangent: Personally, I like ARM much better than x86/x86-64. RISC based architectures just seem so much nicer. Plus, they are easier to optimize for and more predictable in terms of clock cycles per instruction.
"The number of transistors incorporated in a chip will approximately double every 24 months." - Gordon Moore: http://www.intel.com/content/www/us/en/history/museum-gordon...
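Taken at face value, that statement is just exponential growth with a 24 month doubling period. A quick Python sketch, where the function name and the starting count are made up for illustration:

```python
def projected_transistors(initial_count, months, doubling_period_months=24):
    """Transistor count after `months`, assuming a clean doubling every period."""
    return initial_count * 2 ** (months / doubling_period_months)


# Hypothetical chip with one million transistors, projected ten years out.
ten_years_later = projected_transistors(1_000_000, months=120)
print(ten_years_later)
```

Ten years is five doublings, so the count grows by a factor of 32. Nothing in this formula says anything about clock speed.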
When designing a new PC about a year ago, I went with an eight-core AMD FX-8320, which was significantly cheaper than an equally fast Intel i5 or i7 CPU (although the per-thread performance is only half of a 4-core i5/i7 with the same Passmark score).
Regarding the mobile usage, I think Intel has a clear advantage as their CPUs are more powerful and more efficient.
I'm surprised to learn that this is happening. I thought everyone had already optimized their programs for multi-threading.
Quick background. My understanding of how a mobile phone works: There is a primary CPU, running Android, which does the majority of the work on the phone. Meanwhile, there is a second CPU attached to the radio running an RTOS. This RTOS interprets the signals from the antenna and makes nice packets for Android. The RTOS and secondary processing unit can then be optimized for just that one task, and Android can be optimized for reading, writing and processing data packets.
Similarly, what if we made more parts of our infrastructure "smarter"? Take for example the monitor. Currently, it has a frame buffer, we fill it through a DVI cord and the lights change. Is there some abstraction at which we could make a monitor work "smarter". Can we put a GPU in the monitor? Then just move memory and call OpenGL commands. Is that a better abstraction for monitors? Can we also put a CPU on the monitor and embed an RTOS/rendering engine? Would that make a better abstraction? Could we then optimize the "smarter" monitor / integration with the game logic, better than we could have if the monitor just has a frame buffer?
I don't have much domain knowledge and so I don't know what the correct abstraction for a "smarter" monitor should be. However, I do think it's a good question to be asking (and not just about monitors). I'm curious if you guys can think of such an abstraction. What would that be?
Interestingly, smarter displays were actually available at one point as a "graphical terminal" http://en.wikipedia.org/wiki/Tektronix_4014 ; people keep trying to reinstate this with the "thin client".
The history of thin clients shows what the problem with "smart" infrastructure is. It moves the tradeoffs around rather than giving absolute benefits.
If you're too advanced for this, I'd consider writing some behavioral VHDL or Verilog for a few of the units to see how a few of the pieces fit together.