The same cannot be said for quads. If you have, e.g., 4 vertices that are not coplanar (3 of them define a plane and the 4th sits off it), there are 2 different possible ways to subdivide this into triangles, in addition to all the possible non-linear surfaces one might imagine occupying that space.
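To make the ambiguity concrete, here is a minimal sketch with toy numbers of my own (not from any API): a unit square where three corners have height 0 and the fourth has height 1. The helper name `barycentric_interp` and the corner labels are made up for illustration.

```python
def barycentric_interp(p, tri, values):
    """Linearly interpolate per-vertex values at 2D point p using triangle tri."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    w0 = ((y1 - y2) * (p[0] - x2) + (x2 - x1) * (p[1] - y2)) / det
    w1 = ((y2 - y0) * (p[0] - x2) + (x0 - x2) * (p[1] - y2)) / det
    w2 = 1.0 - w0 - w1
    return w0 * values[0] + w1 * values[1] + w2 * values[2]

# Corners of a unit square. Heights are hypothetical: A, B, D lie in z = 0, C does not.
A, B, C, D = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
zA, zB, zC, zD = 0.0, 0.0, 1.0, 0.0
center = (0.5, 0.5)

# Split along the A-C diagonal: the center lies on the shared edge of ABC and ACD.
split_ac = barycentric_interp(center, (A, B, C), (zA, zB, zC))
# Split along the B-D diagonal: the center lies on the shared edge of ABD and BCD.
split_bd = barycentric_interp(center, (A, B, D), (zA, zB, zD))
# One of the possible non-linear surfaces: bilinear interpolation over the quad.
u, v = center
bilinear = (1 - u) * (1 - v) * zA + u * (1 - v) * zB + u * v * zC + (1 - u) * v * zD

print(split_ac, split_bd, bilinear)
```

The same four vertices give a center height of 0.5, 0.0, or 0.25 depending on which interpretation is picked, which is the kind of disagreement you get between implementations.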
Old OpenGL and Direct3D versions did have quad and arbitrary polygon rendering support. IIRC, the above scenario was something you were supposed to avoid, and results varied between graphics card vendors.
On a side note: If we are talking purely 2D, some old 2D software rendering systems used trapezoids as basic primitives, defined by their top and bottom edges. For instance, the X11 XRender API explicitly supports drawing trapezoids. They are easy to rasterize, interpolate across and quite flexible. Many other 2D shapes can be conveniently composed from them, including screen space triangles, if you were to implement a software rasterizer.
> For instance, the X11 XRender API explicitly supports drawing trapezoids.
It should be said that XRender trapezoids have extra constraints: the top and bottom edges of the trapezoid must be exactly horizontal (running straight left to right). So it's basically a rectangle with sloped left/right sides. This makes it extremely easy to rasterize in software, while you can still break each trapezoid into two triangles for 3D hardware.
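A rough sketch of why this shape is so easy to rasterize in software: with horizontal top and bottom edges, every scanline is a single span between two linearly interpolated x values. The representation below (x positions of the left and right edges at the top and bottom) is a simplification I made up for illustration, not the actual XRender `XTrapezoid` layout, and the pixel-center sampling rule is likewise an assumption.

```python
import math

def rasterize_trapezoid(top, bottom, left_top, left_bottom, right_top, right_bottom):
    """Return (y, x_start, x_end) spans, x_end exclusive, for a trapezoid whose
    top and bottom edges are horizontal. A pixel is filled when its center
    (x + 0.5, y + 0.5) lies inside the shape."""
    spans = []
    # Scanlines whose center y + 0.5 falls in [top, bottom).
    for y in range(math.ceil(top - 0.5), math.ceil(bottom - 0.5)):
        t = (y + 0.5 - top) / (bottom - top)
        x_left = left_top + t * (left_bottom - left_top)
        x_right = right_top + t * (right_bottom - right_top)
        # Pixels whose center x + 0.5 falls in [x_left, x_right).
        x_start = math.ceil(x_left - 0.5)
        x_end = math.ceil(x_right - 0.5)
        if x_start < x_end:
            spans.append((y, x_start, x_end))
    return spans

# A plain rectangle, and a triangle expressed as a trapezoid with a zero-width top.
rect_spans = rasterize_trapezoid(0, 2, 0, 0, 3, 3)
tri_spans = rasterize_trapezoid(0, 4, 2, 0, 2, 4)
print(rect_spans)
print(tri_spans)
```

The triangle case shows how other shapes fall out as degenerate trapezoids; a general triangle would be split at its middle vertex into two such pieces.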
I vaguely recall that at different times people tried to build both PC and phone graphics hardware around quads and basically failed, but I don’t know why that didn’t work (aside from the issues of geometry that you mention).
[1] https://registry.khronos.org/OpenGL/extensions/NV/NV_fill_re...
This makes interpolation for normals, texture, etc trivial.
This is not true for every 4 points.
In actual hardware, shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient. I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile) taking advantage of the general spatial coherency of triangles in mesh order.
I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works, at least on nvidia.
I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad-padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping. Especially given MSAA.
For example, from the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating just that:
> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask
The mask is used by the interp instructions to load the correct interpolants from local memory.
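As a sketch of how such a mask could be decoded, here is a small function that turns it into a per-quad primitive index. The quoted text does not say which bit maps to which quad, so the ordering used here (bit 0 describes the second quad, bit 1 the third, and so on) is my assumption, as is the function name.

```python
def primitive_index_per_quad(new_prim_mask, num_quads=16):
    """Map each quad in a wavefront to the index of the primitive it belongs to.

    Assumption: bit (q - 1) of the mask describes quad q, for q = 1..15.
    Quad 0 always starts a new primitive, so it has no bit in the mask.
    """
    indices = [0]
    prim = 0
    for quad in range(1, num_quads):
        if (new_prim_mask >> (quad - 1)) & 1:
            prim += 1  # this quad begins a new primitive
        indices.append(prim)
    return indices

# Hypothetical mask: quads 1 and 3 each begin a new primitive.
example = primitive_index_per_quad(0b101)
print(example)
```

With an index like this per quad, an interpolation instruction can pick the right set of attributes for each lane, which is consistent with a single wavefront carrying pixels from several triangles.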
In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).
That being said, of course I expect this process to be "lazy": you would not want to buffer execution of a partially filled thread forever, so depending on the workload you might measure different things.
[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...
[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...
> Because the quad is rendered using two separate triangles, separate wavefronts are generated for the pixel work associated with each of those triangles. Some of the pixels near the boundary separating those triangles end up being organized into partial wavefronts
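A back-of-the-envelope model of the boundary cost, counting only 2x2 quad granularity and ignoring how quads get packed into wavefronts: every 2x2 block that the diagonal passes through is shaded once per triangle, with the uncovered lanes running as helpers. The function name, the choice of diagonal, and the tie-breaking rule for pixel centers exactly on the edge are all assumptions of this sketch, not anything taken from a real GPU.

```python
def count_lane_invocations(width, height):
    """Return (covered_pixels, shaded_lanes) for a width x height rectangle drawn
    as two triangles split along the diagonal from (0, 0) to (width, height)."""
    covered_pixels = 0
    quad_invocations = 0
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            triangles_touched = set()
            for y in (qy, qy + 1):
                for x in (qx, qx + 1):
                    if x >= width or y >= height:
                        continue  # outside the render target
                    # Edge function of the diagonal, evaluated at the pixel center.
                    edge = (x + 0.5) * height - (y + 0.5) * width
                    # Assumed tie-break: centers on the edge go to triangle 1.
                    triangles_touched.add(0 if edge > 0 else 1)
                    covered_pixels += 1
            # Each triangle touching this 2x2 block launches a full quad of 4 lanes.
            quad_invocations += len(triangles_touched)
    return covered_pixels, quad_invocations * 4

pixels, lanes = count_lane_invocations(8, 8)
print(pixels, lanes, lanes - pixels)
```

For the 8x8 case this gives 64 covered pixels but 80 shaded lanes, so 16 lanes are helper work along the diagonal. On a real screen the extra work is a small fraction of the total, and actual hardware behavior will differ from this toy model.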
It's as if sometimes one triangle was rendered before the vsync, while the other was rendered after it.
E.g., your shaders operate at float16 precision (common in the D3D9 days) on a screen with high enough resolution. The boundary pixels along the diagonal can then no longer be represented cleanly at float16 precision.
Sounds like something that would be within the margin of error? Seems especially meaningless because it's just the average of the timings, instead of something that would visualize the distribution, like a histogram or KDE.
The good thing about having 4 vertices is that you can just put a vertex position and a set of texture coordinates (x, y) on each one, and the texture can be mapped exactly.
If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?
See e.g. https://fgiesen.wordpress.com/2011/07/05/a-trip-through-the-... for a nice detailed explanation.
In the 3D graphics space, this kind of knuckle-shaving is deeply revered!