I'm a little surprised no better solution has come along. Vulkan didn't even exist back then (and DirectX 12 had only just released), but instead of making things better, it digs its heels in even deeper on the assumption that all shaders will be known ahead of time (resulting in long "shader compilation" dialogs on startup in many games).
I've been tempted to build my own fast shader compiler into Dolphin for many common GPU architectures. Hell, it wouldn't even be a proper compiler, more of a templated emitter as all shaders fit a pattern. Register allocation and scheduling could all be pre-calculated.
But that would be even more insane than ubershaders, as it would mean one backend per GPU architecture. And some drivers (like Nvidia's) don't provide a way to inject pre-compiled shader binaries.
On the positive side, ubershaders do solve the problem, and modern GPU drivers do a much better job at accepting ubershaders than they did 9 years ago. Though that's primarily because (as far as I'm aware) examples of Dolphin's ubershader have made their way into every single shader compiler test suite.
How'd that come to be? Just interesting code for test suites or did you guys advocate for it to be included?
Back when Vulkan was developed, there were a bunch of OpenGL drivers out there with random AST-parsing bugs (Dolphin even has a bunch of workarounds for them), so a large chunk of the motivation for SPIR-V was avoiding the need for every driver to implement its own GLSL parser and the associated bugs.
The problem for Dolphin is not the complexity of the shader, but the quantity.
Shaders in modern games are usually written manually (or authored in a shader node editor by an artist), so it's rare for a game to have more than a few thousand total. Better games might only have a few dozen for the entire game.
But because GameCube/Wii games configure the TEV pixel pipeline through a dynamic API, some games use that API in a pattern where Dolphin can find itself generating hundreds of shaders per second. Some games even manage to generate new shaders continually as you play, because they append junk state to their pixel pipeline state which Dolphin doesn't detect as a duplicate.
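One common mitigation for that junk-state problem (a sketch, not Dolphin's actual code; the field names here are invented) is to canonicalize the pipeline state before hashing it, masking out fields that can't affect the output, so trailing junk doesn't register as a brand-new shader:

```python
import hashlib

def canonical_key(state: dict) -> str:
    """Hash only the parts of a TEV-like config that affect rendering."""
    n = state["num_stages"]
    relevant = {
        "num_stages": n,
        # Drop stage configs past num_stages -- that's the "junk state"
        # which would otherwise look like a brand-new shader.
        "stages": tuple(state["stages"][:n]),
    }
    return hashlib.sha1(repr(relevant).encode()).hexdigest()

a = {"num_stages": 2, "stages": [("tex0", "mul"), ("tex1", "add"), ("junk", "x")]}
b = {"num_stages": 2, "stages": [("tex0", "mul"), ("tex1", "add"), ("other", "y")]}
print(canonical_key(a) == canonical_key(b))  # True: same shader despite junk
```

The hard part in practice is knowing which fields are truly dead for a given configuration; get that wrong and two genuinely different shaders collide.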
I had to solve a similar problem years ago, during the transition from fixed function to shaders, when shaders weren't as fast or powerful as today. We started out with an ubershader approximating the DX9/OpenGL 1.2 fixed functions, but that was too slow.
People in those days thought of rendering state as being stored in a tree, like the transform hierarchy, and you ended up with unpredictable state at the leaf nodes, sometimes leading to a very high number of permutations of possible states. At the time, I decomposed all possible pipeline state into atomic pieces, e.g. one light, the fog function, a texenv, etc. These were all annotated with inputs and outputs, and based on the state-graph traversal, we'd generate a minimal shader for each particular material automatically, while giving old tools the semblance of being able to compose fixed-function states. As with you, doing this on demand resulted in stuttering, but a single game only has so many possible states - from what I've seen, it's on the order of a few hundred to a few thousand. Once all shaders are generated, you can cache them and compile them all at startup time.
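The decomposition described above can be sketched as a small dependency walk over annotated fragments (all names invented for illustration): start from the final output and pull in only the fragments whose outputs are actually consumed.

```python
# Each atomic piece declares what it reads and what it writes.
FRAGMENTS = {
    "base_color": {"inputs": [], "output": "color",
                   "code": "vec4 color = texture(tex0, uv);"},
    "one_light":  {"inputs": ["color"], "output": "lit",
                   "code": "vec4 lit = color * light0;"},
    "fog":        {"inputs": ["lit"], "output": "final",
                   "code": "vec4 final = mix(lit, fogColor, fogF);"},
    "no_fog":     {"inputs": ["lit"], "output": "final",
                   "code": "vec4 final = lit;"},
}

def generate(enabled, want="final"):
    """Walk back from the desired output, emitting only needed fragments."""
    by_output = {FRAGMENTS[n]["output"]: n for n in enabled}
    lines, seen = [], set()
    def need(value):
        name = by_output[value]
        if name in seen:
            return
        seen.add(name)
        for dep in FRAGMENTS[name]["inputs"]:
            need(dep)  # emit dependencies before their consumer
        lines.append(FRAGMENTS[name]["code"])
    need(want)
    return "\n".join(lines)

print(generate(["base_color", "one_light", "no_fog"]))
```

A material with fog disabled simply never pulls in the fog fragment, so the generated shader stays minimal without anyone hand-writing the variant.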
I wonder if something like this would work for emulating a Gamecube. You can definitely compute a signature for a game executable, and as you encounter new shaders, you can associate them with the game. Over time, you'll discover all the possible state, and if it's cached, you can compile all the cached shaders at startup.
Anyhow, fun stuff. I used to love work like this. I've implemented 3DFx's Glide API on top of DX ages ago to play Voodoo games on my Nvidia cards, and contributed some code to an N64 emulator named UltraHLE.
That's a blast from the past. I distinctly remember reading up on UltraHLE way back when, then trying it out and, for the first time, being able to play Ocarina of Time on my middle-class PC with almost no issues. That was magical.
Many games actually dynamically generate new "shaders" on the fly, based on which lights are near an object, and in which order.
Second, we can't use those vertex/pixel pipeline states directly on a modern GPU; they need to be translated into modern shaders and then compiled by the driver for your graphics card. It's actually that compile step which causes the stuttering; Dolphin's translation is plenty fast enough.
The combination of these two facts means Dolphin can't depend on any pre-computation at all.
Prior to ubershaders the emulator took a configuration for the hardware pipeline and turned that into a shader, which took time to compile. Ubershaders work by emulating the entire fixed function pipeline in one glorious shader until the smaller, more efficient shader can be compiled and slipped in.
Basically, the ubershader is the only thing that can actually understand the "shaders" packaged with the game and start using them with zero latency.
Why not just precompile all the possible hardware combinations? There are far more combinations than atoms in the universe.

Why not just precompile the hardware combinations that the game actually uses? There's no way to tell beforehand without examining every branch of the game's code, which ranges in difficulty from "computationally prohibitive" to "the fundamental theorems of how computers work say this is impossible".
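To put rough numbers on the first point (illustrative ballpark, not Dolphin's exact figures): even a conservative count of the configurable pixel-pipeline bits blows past any precompilation scheme.

```python
# Illustrative ballpark: 16 TEV stages, each with (say) 32 bits of
# configuration. Real bit counts differ, but the conclusion doesn't:
# the configuration space is astronomically large.
stage_bits = 32
stages = 16
total_bits = stage_bits * stages          # 512 bits of state
combinations = 2 ** total_bits
atoms_in_universe = 10 ** 80              # common order-of-magnitude estimate
print(combinations > atoms_in_universe)   # True
```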
The article mentions that some users actually passed around cached shader packs, but that solution was brittle.
When you first need to run something, you run it on the interpreter (JS) / ubershader (Dolphin). But once you know it's going to be run repeatedly (rarely for JS, almost always for Dolphin), you kick off an async compilation to produce JIT code (JS) / a specialized shader (Dolphin). You continue running in the expensive mode (interpreter / ubershader) until the compilation is complete, then you switch over seamlessly.
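The tiering pattern above can be sketched in a few lines (a toy model, not Dolphin or V8 code; the "compilation" here is just building a closure in a background thread):

```python
import threading

class TieredShader:
    """Run the generic path until an async 'compile' finishes, then swap."""
    def __init__(self, config):
        self.config = config
        self.fast = None  # filled in by the background compile
        self.thread = threading.Thread(target=self._compile)
        self.thread.start()

    def _compile(self):
        # Pretend this is slow codegen; the result is a specialized function.
        scale = self.config["scale"]
        self.fast = lambda x: x * scale

    def run(self, x):
        fast = self.fast                    # read once; may flip between calls
        if fast is not None:
            return fast(x)                  # specialized path
        return x * self.config["scale"]     # generic interpreter/ubershader path

s = TieredShader({"scale": 3})
print(s.run(2))   # 6, whichever path is taken
s.thread.join()   # once compilation finishes, the specialized path is used
print(s.run(2))   # still 6, now via the compiled closure
```

The key property is exactly what the comment describes: the answer is identical on both paths, so the switchover is invisible apart from speed.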
https://v8.dev/blog/ignition-interpreter
https://firefox-source-docs.mozilla.org/js/index.html#javasc...
Both V8 and SM will interpret the bytecode until it warms up enough to be compiled to specialized machine code. ("Warms up" == "is observed to execute enough times".) There are some subtle distinctions about whether the interpreter is implemented in C++ or generated by a variant of the JIT code compiler, but as with the shaders the main point is whether it's executed in a way that works for everything or is specialized to a particular purpose (and varying degrees of specialization are implemented, with various mechanisms for falling back to a more general execution mechanism if the specialization assumptions no longer hold).
Your SpiderMonkey doc link points to a section named "JavaScript Interpreter". The title is correct, that section is indeed about the mechanisms for interpreting JavaScript [bytecode].
The V8 link is a little tricky, since it leads off with "Code is initially compiled by a baseline compiler", but if you read a little further, it says "...the V8 team has built a new JavaScript interpreter, called Ignition, which can replace V8’s baseline compiler". Basically, V8 experimented for a while with dropping the interpreter, but for the reasons described well in that document, they went back to initially running in an interpreter. The article is quite nice and describes quite a bit about the tradeoffs involved. It's 8 years old, but I believe the overall picture isn't that different today.
(Source: I am an engineer on the SpiderMonkey team.)
The developer records himself playing the game, and during the game's first load, that entire recorded playthrough is replayed at high speed in the background on the machine.
But, still... Both GPUs were pretty well suited for this ubershader approach because they had a small, fixed limit on the number of instructions they could run. And, very strictly defined functionality for each instruction. They weren't really "shaders" as much as highly flexible fixed function stages that you could reasonably wedge in a text shader compiler as a front end and only get a moderate to high amount of complaints about how strict and limited the rules were for the assembly. I recall that both shading units could reasonably be fully specified as C structs that you manually packed into the GPU registers instead of using a shader compiler at all.
If you look closely, the TEV actually shares the same limitation, it's just that the traditional representation interleaves the texture fetch and math instructions (Because the 3rd texture fetch "instruction" always feeds into the 3rd math "instruction", for example). There are two independent execution units, separated by a fifo and no way to backfeed from the math back to texture fetch.
The two GPUs are roughly equivalent. The only reason the OG Xbox is considered to "have pixel shaders" is that they were exposed through a pixel shader API, while TEV was only ever exposed through a "texture environment" based API. They are both clearly register combiners, with no control flow, but they sit right in the middle of the transition GPUs were making from register combiners to "proper" pixel shaders. The team that designed the GameCube's GPU went on to develop the first DirectX 9 GPU.
I'm pretty sure the Xbox's pixel pipeline is slightly more capable, as TEV doesn't have the Dot3 instruction (and the Xbox also has programmable vertex shaders). But developers all abandoned the Xbox in 2005. TEV has a much better reputation for flexibility because it was used in the Wii all the way to ~2013, and graphics developers who were exposed to much better shaders on the Xbox, PS3 and PC got very good at back-porting those modern techniques to the more limited Wii. More than one studio created unofficial shader compilers for the Wii so they could share the same shaders across PS3/Xbox/Wii/PC.
> I recall that both shading units could reasonably be fully specified as C structs that you manually packed into the GPU registers instead of using a shader compiler at all.
Yeah, not that they ever exposed that API.
The GameCube had great support for recording display lists, so you could record a display list while you called the API commands to configure TEV and then call that display list later to quickly load the "shader". Some games even saved those display lists to disc (or maybe generated them from scratch with external tools) as a form of offline shader compilation.
In reality, the OG Xbox and GameCube GPUs are almost identical in pixel shading capabilities (though the GameCube's vertex shading pipeline is legitimately fixed-function, but very flexible).
Despite their roughly equal capabilities, they were exposed through very different APIs. The Xbox used the new-fangled "shader" style API that Microsoft was introducing to the industry at the time, while TEV used a heavily extended version of the older "texture environment" style API that was introduced with DirectX 7 and OpenGL 1.3.
----------------
Edit: Actually, it might be better to explain from the other end:
In a true fixed-function GPU like the PlayStation 2 and Dreamcast (or the OG PlayStation... but not the N64, which is a two-stage register combiner), the pixel pipeline is limited to just one basic equation: a single texture is sampled, and that sample is multiplied with a single color interpolated from the vertex colors (which were usually derived from lights). The flexibility of the equation was limited to replacing each input with a fixed value, and then enabling a few optional post-processing stages like depth-based fog, alpha cutout, and/or blending with a few fixed blend equations.
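In code, that whole fixed-function pixel pipeline boils down to something like this (a simplified per-pixel model; real hardware works in fixed point with clamping, and the state names are invented):

```python
def fixed_function_pixel(texel, vertex_color, state):
    """One-equation pixel pipeline of a PS2/Dreamcast-era GPU (simplified)."""
    # Inputs can be replaced with fixed values, but the equation itself
    # never changes: out = texel * vertex_color.
    a = state.get("replace_texel", texel)
    b = state.get("replace_color", vertex_color)
    out = [min(1.0, x * y) for x, y in zip(a, b)]
    # Optional post-processing stage: depth fog blends toward a fog color.
    if "fog" in state:
        f, fog_color = state["fog"]
        out = [o * (1 - f) + c * f for o, c in zip(out, fog_color)]
    return out

print(fixed_function_pixel([1.0, 0.5, 0.0], [0.5, 0.5, 0.5], {}))
# [0.5, 0.25, 0.0]
```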
But the results from that single texel * vertex_color equation are limiting. A common technique to produce better results on such GPUs was "multi-texturing". Graphics developers of the era would render the same triangles two or more times, but with different textures and vertex colours, blending the result into the frame buffer. This was commonly used to achieve the illusion of more detailed textures, or texture based light-maps. Or the reflections on cars in racing games.
But blending in the frame buffer is expensive, as it wastes a lot of memory bandwidth. The PS2 is hyper-optimised for this approach: it has the VUs, which can quickly generate multiple draws of the same geometry, and fast embedded DRAM with enough read/write ports that it can do blending "for free". But in the PC world, GPUs started adding features to combine these multiple draw calls and blend the result before writing to the frame buffer. The Voodoo 2 and Nvidia TNT (Twin Texel) from 1998 are examples of GPUs that supported this single-pass multi-texturing.
DirectX and OpenGL provided the "texture environment" APIs that automatically used these new single-pass multi-texturing features when available, or would fall back to multi-pass rendering on older GPUs.
But the actual hardware was often more flexible than what DirectX/OpenGL exposed, so vendors supplied "register combiner" OpenGL extensions that exposed the full functionality (this is why John Carmack used OpenGL: so he could create optimised per-GPU render paths for each GPU). These register combiners could be "programmed" to produce pixel equations way more advanced than what could be achieved with multi-pass rendering, as they could pass more than one value between stages. And they started supporting 4 or 8 textures, plus enough math stages to combine them.
Microsoft gave up trying to expose the full capabilities of these register combiners through the older texture environment APIs and introduced pixel shaders with DirectX 8, but they were just providing a new API for features the GPUs already had. The register combiner stages were simply renamed to "instructions".
The Xbox is a register combiner with 4 texture fetch stages and 8 combiner stages; the GameCube has 16 combiner stages and 8 texture fetches (well, it technically supports 16 texture fetches, but there are only 8 sets of UV coords).
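A combiner stage is typically some variant of `d = a * b + c`, with each input selected from a small set of registers, so "running" a combiner program is just a loop over stage configs. A simplified single-channel model (not the exact Xbox/GameCube encoding):

```python
def run_combiner(stages, regs):
    """Evaluate combiner stages of the canonical equation d = a*b + c.
    Each stage names its three input registers and its destination."""
    for a, b, c, d in stages:
        regs[d] = regs[a] * regs[b] + regs[c]
    return regs

regs = {"tex0": 0.8, "vtx_color": 0.5, "zero": 0.0, "prev": 0.0}
# Stage 1: modulate texture by vertex color.
# Stage 2: square the result and add it back (a made-up second pass).
program = [("tex0", "vtx_color", "zero", "prev"),
           ("prev", "prev", "prev", "prev")]
out = run_combiner(program, regs)
print(out["prev"])  # ~0.56: (0.8*0.5)**2 + 0.8*0.5
```

Rename each tuple an "instruction" and you essentially have a DirectX 8 pixel shader, which is the point being made above.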
Ubershaders: A Ridiculous Solution to an Impossible Problem - https://news.ycombinator.com/item?id=14884992 - July 2017 (88 comments)
But Dolphin's "Ubershader" is a different beast. It's about handling all the shader variants for _all_ Dolphin games (which are made in different engines) with one shader, and the variant parameters aren't passed as nice constants (data) but as shader programs (code) that need to be interpreted to be understood. It's more like a meta-shader that takes shaders as input and produces shaders as opposed to a "normal" ubershader which takes configuration to specialize it at runtime.
I think that's right anyway. I haven't worked directly with ubershaders in either variety and my knowledge comes from building my own hobby engine and 3D pipelines.
Would it be possible to build a web-hosted database of encountered shader configs against a game id, and have Dolphin fetch that list when a game launches and start doing async compilation?
When Dolphin encounters a new shader that wasn't in the db, it phones home to request that it be added to the list.
I feel an automated sharing solution would build up coverage pretty quickly, and finding a stutter would eventually be considered an achievement - "no-one's been here before!"
For normal PCs, realistically Valve/Steam are the only people who could solve or implement this for PC games, as they have the tech and platform to distribute it all. Even with all that, it's a crazy task to try and solve, due to all of the variations and new patches for games that require the shaders to be recompiled again.
Meanwhile the programmer of a console game, not using shaders, could set GPU registers to any configuration they wanted just before rendering the next triangle. You have to actually play the game to find out what configurations it programs into the GPU, because those configurations are not neatly organized into a set of discrete shaders. Even then there is no guarantee that you found all possible configurations used by the game. The videos in the article provide a good example: the player fires a gun with luminous bullets, so on that frame the walls and floors need to be rendered with an extra light source. That requires reconfiguring the GPU to take that light source into account, then changing the configuration to render the weapon itself, then changing it again to render the HUD, and so on.
Now imagine that you go to a place on a different level where the walls are not shiny, and it doesn't bother to render the walls with the extra light source. Or it renders them with extra vertex lighting but not extra specular lighting. Now combine that with every type of wall and floor in the game; they might all need a unique shader to be lit correctly by that one gun. To find all possible GPU configurations you need to fire that gun, and every other, near every single different type of wall and floor texture used in the game. And there are a dozen different guns.
And then you need to do it all again while wearing the night-vision goggles, because that causes everything to be rendered with a different configuration yet again.
Every one of those unique combinations needs to be made into a shader, and there’s just no way to be sure that you have actually collected all of them. Or you can write a single Ubershader that can, by using branches, loops, and other advanced tricks, emulate the entire capabilities of the emulated GPU. Then you can program the Ubershader by sending all of the emulated GPUs register values as uniforms.
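That makes the ubershader essentially an interpreter: the emulated GPU's register values arrive as uniforms, and one big branchy shader evaluates whatever they describe. Sketched on the CPU in Python (the real thing is GPU shader code; register and op names are invented):

```python
def ubershader_pixel(uniforms, regs):
    """Interpret TEV-like stage registers, uploaded per-draw as 'uniforms'.
    Every operation the hardware supports is a branch inside one loop."""
    for stage in range(uniforms["num_stages"]):
        a, b, c, op, dest = uniforms["stages"][stage]
        if op == "mul":
            regs[dest] = regs[a] * regs[b]
        elif op == "add":
            regs[dest] = regs[a] + regs[b]
        elif op == "madd":
            regs[dest] = regs[a] * regs[b] + regs[c]
        else:
            raise ValueError(f"unhandled op {op}")
    return regs["prev"]

uniforms = {"num_stages": 1,
            "stages": [("tex0", "color", "zero", "mul", "prev")]}
print(ubershader_pixel(uniforms,
                       {"tex0": 0.5, "color": 0.5, "zero": 0.0, "prev": 0.0}))
# 0.25
```

Because only the uniform data changes per draw, no new shader ever needs compiling, which is exactly the zero-latency property the comment describes.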
Also, a game might know which shaders it can skip and which it can't, but Dolphin can't skip shaders that aren't compiled, because it doesn't know what the game will do with the render (e.g. Miis work by rendering their heads once into a texture and then reusing that; if Dolphin skips the render because the shader isn't ready, the Mii will just be missing forever).
Some emulators handle this by sharing "shader caches" between users so that they have a better idea of what the game will use; Dolphin opted for a different solution here.
But from what I’ve heard it’s often still an issue on PCs (I’m a Mac guy). I’ve seen videos of shader compilation stutters, even in games with a precompilation step that’s supposed to avoid that.
Digital Foundry has covered this many times. The link in a sibling comment to them on Eurogamer is a great place to start.
https://twistedvoxel.com/unreal-engine-5-pc-stuttering-issue...
https://www.eurogamer.net/digitalfoundry-2022-df-direct-week...
shaders weren't a thing when the gcn was released. it may be arguable, but nobody even used the word at the time. shader compilation time is an issue for modern games on the PC, and because of this, developers anticipate and work around it.
on the gcn, specialized fixed-function pipelines were available, and could be composed by some limited configuration (literally, 24 instructions). you may think of this as a sort of proto-shader, but significantly, the fixed-function pipelines embody quite a lot of behavior in specialized and limited hardware that is now typically achieved in software on more versatile hardware.
so, to replicate that specialized hardware, modern graphics hardware (which exposes its greater capability as simple computational primitives) must compile a shader program and run it. but on gcn, the tiny configuration of static hardware loads near-instantly.
https://en.m.wikipedia.org/wiki/TMS34010
https://en.m.wikipedia.org/wiki/RenderMan_Shading_Language
Just two examples; SIGGRAPH has plenty of papers on the subject.
https://www.gamedeveloper.com/programming/shader-integration...
It's common to have many shader variants exactly because you want to avoid 'if' statements; conditionals are turned into shader-compile-time options. Uber shaders are about turning those compile-time options back into runtime options to have a fallback shader to run while the right shader variant is being compiled.
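The difference in practice (illustrative Python generating GLSL-ish source strings; `applyFog`, `u_fog`, and `shade` are made-up names): variants bake the decision in at generation time, while the ubershader keeps it as a runtime branch on a uniform.

```python
def variant_source(fog_enabled: bool) -> str:
    """Specialized variant: the conditional disappears at generation time."""
    body = "color = applyFog(color);" if fog_enabled else ""
    return f"void main() {{ vec4 color = shade(); {body} }}"

# Ubershader: one shader, decision deferred to a runtime uniform.
UBER_SOURCE = """
uniform bool u_fog;
void main() { vec4 color = shade(); if (u_fog) color = applyFog(color); }
"""

print("applyFog" in variant_source(False))  # False: branch compiled away
print("u_fog" in UBER_SOURCE)               # True: branch kept at runtime
```

Each distinct combination of baked-in options is a separate compile, hence the variant explosion; the ubershader trades that for branchier (slower) execution.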
The fact is that a branch that the whole warp takes (or doesn't take) is relatively cheap on modern hardware, even when the condition is per-thread.