Full screen triangle optimization

(30fps.net)

115 points · rck · 3y ago · 44 comments

31 comments · 9 top-level

ttoinou · 3y ago · 10 in thread

Why didn't they ever implement a rectangle primitive to be drawn instead of a triangle? Anyway, here the perf impact is negligible.

st_goliath · 3y ago

Triangles in 3D euclidean space have the advantage of being uniquely defined by 3 vertices that have the nice property of always sitting on a single plane. There's no linear transformation you could apply that results in something other than a well-defined, planar triangle. The worst that might happen is collapsing it into a line or a point.

The same cannot be said for quads. If you e.g. have 4 vertices where 3 are sitting on the same plane, but the 4th isn't, there are 2 different possible ways to subdivide this into triangles, in addition to all the possible non-linear surfaces one might imagine occupying that space.
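To make the ambiguity concrete, here is a minimal sketch in Python. The helper names (`is_planar`, `centre_heights`) and the example quad are made up for illustration; the sketch assumes a quad whose footprint is a parallelogram, so both diagonals cross at their midpoints.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def is_planar(p0, p1, p2, p3, eps=1e-9):
    """True if p3 lies in the plane through p0, p1, p2 (scalar triple product ~ 0)."""
    normal = cross(sub(p1, p0), sub(p2, p0))
    return abs(dot(normal, sub(p3, p0))) <= eps

def centre_heights(p0, p1, p2, p3):
    """z at the quad's centre for each of the two possible triangle splits.

    Splitting along p0-p2 puts the centre on that shared edge, so its height
    is the midpoint of p0 and p2; splitting along p1-p3 uses p1 and p3 instead.
    """
    return (p0[2] + p2[2]) / 2, (p1[2] + p3[2]) / 2

# Unit square with one corner lifted out of the plane (hypothetical data).
quad = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (0, 1, 0)]
planar = is_planar(*quad)                 # the fourth vertex is off-plane
split_a, split_b = centre_heights(*quad)  # the two splits disagree at the centre
```

With a triangle there is only one candidate surface, so this question never comes up.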

Old OpenGL and Direct3D versions did have quad and arbitrary polygon rendering support. IIRC from vague memory, the above scenario was something you were supposed to avoid and results varied between graphics card vendors.

On a side note: If we are talking purely 2D, some old 2D software rendering systems used trapezoids as basic primitives, defined by their top and bottom edges. For instance, the X11 XRender API explicitly supports drawing trapezoids. They are easy to rasterize, interpolate across and quite flexible. Many other 2D shapes can be conveniently composed from them, including screen space triangles, if you were to implement a software rasterizer.

Jasper_ · 3y ago

The OpenGL specification basically had no explanation of what a "Quad" was, it didn't say anything about how to render quads, interpolate quads, etc. Pretty much every driver implemented quads as a funky way of spelling "two triangles". Much like tristrips/trifans/etc., index buffers give you all of the benefits with none of the drawbacks.

> For instance, the X11 XRender API explicitly supports drawing trapezoids.

It should be said that XRender trapezoids have extra constraints: the top and bottom edges of the trapezoid must be completely horizontal (go exactly left to right). So it's basically a rectangle with sloped left/right sides. This makes it extremely easy to software rasterize, while you can still break each trapezoid into two triangles for 3D hardware.
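A rough sketch of why such a trapezoid is cheap to rasterize in software: every scanline between the top and bottom edges is a single span. This is not XRender's actual sampling rule; `trapezoid_spans` and its parameters are hypothetical, and coverage is decided at pixel centres for simplicity.

```python
import math

def trapezoid_spans(top, bottom, left_x, right_x):
    """Return (y, x_start, x_end) pixel spans for a trapezoid with horizontal
    top and bottom edges.

    left_x and right_x are (x_at_top, x_at_bottom) pairs for the sloped sides.
    A pixel is filled when its centre (x + 0.5, y + 0.5) lies inside.
    """
    spans = []
    for y in range(math.ceil(top - 0.5), math.ceil(bottom - 0.5)):
        # Interpolate both side edges at this scanline's centre.
        t = (y + 0.5 - top) / (bottom - top)
        xl = left_x[0] + t * (left_x[1] - left_x[0])
        xr = right_x[0] + t * (right_x[1] - right_x[0])
        x0 = math.ceil(xl - 0.5)
        x1 = math.ceil(xr - 0.5)
        if x1 > x0:
            spans.append((y, x0, x1))
    return spans

# A rectangle is the degenerate case with vertical sides.
rect = trapezoid_spans(0, 2, (0, 0), (3, 3))
# A right triangle is a trapezoid whose top edge has zero length.
tri = trapezoid_spans(0, 4, (0, 0), (0, 4))
```

The inner loop needs only two linear interpolations per scanline, which is the property that made this primitive attractive for 2D software renderers.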

ttoinou · 3y ago

Of course you're right, but I was thinking about a subset of quad features for this kind of use case. But yeah, for the full feature set triangles are better!

mananaysiempre · 3y ago

> Old OpenGL and Direct3D versions did have quad and arbitrary polygon rendering support.

I vaguely recall that at different times people tried to build both PC and phone graphics hardware around quads and basically failed, but I don’t know why that didn’t work (aside from the issues of geometry that you mention).

delusional · 3y ago

A triangle is the simplest polygon. As soon as you start doing quads the complexity skyrockets: what if it self-intersects? What if it's concave? What does a winding order mean when the verts can be all out of order?

jbverschoor · 3y ago

Triangles are used to render 3D meshes because of that, yes. However, the article is about "Full screen post processing effects", which are more easily done with a rectangle.

jra101 · 3y ago

NVIDIA has an OpenGL extension that does just that [1].

[1] https://registry.khronos.org/OpenGL/extensions/NV/NV_fill_re...

Jasper_ · 3y ago

Do you mean a bounding rectangle? That exists on a lot of GPUs, where you draw a triangle and the entire bounding box surrounding the triangle is covered. However, the details differ greatly between different GPUs, meaning it's hard to standardize. Full-screen triangles are just fine.

tverbeure · 3y ago

Nvidia's NV1 used quads as its primitive. Shortly after its release, in 1995, Microsoft introduced DirectX, which was triangle-based. The NV1 didn't do well...

fooker · 3y ago

Every three points form a triangle where all points are on a single plane.

This makes interpolation for normals, texture, etc trivial.

This is not true for every 4 points.

obl · 3y ago · 3 in thread

> In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.

While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.

I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile) taking advantage of the general spatial coherency of triangles in mesh order.
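The 2x2 quad waste along the diagonal can be counted with a small sketch. This only models quad granularity (a quad is launched if a triangle covers any of its four pixel centres); it says nothing about how real hardware packs quads into warps or wavefronts. The function names are hypothetical and even screen dimensions are assumed.

```python
def quads_launched(width, height, covers):
    """Count 2x2 pixel quads in which `covers` hits at least one pixel centre."""
    count = 0
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            if any(covers(qx + dx + 0.5, qy + dy + 0.5)
                   for dy in (0, 1) for dx in (0, 1)):
                count += 1
    return count

def compare(width, height):
    """Return (quads for one covering triangle, quads for two triangles)."""
    # Two triangles split along the diagonal from (0, 0) to (width, height).
    lower = lambda x, y: y * width <= x * height
    upper = lambda x, y: y * width > x * height
    # A single oversized triangle covers every on-screen pixel.
    whole = lambda x, y: True
    one = quads_launched(width, height, whole)
    two = quads_launched(width, height, lower) + quads_launched(width, height, upper)
    return one, two

one, two = compare(8, 8)
extra_lanes = (two - one) * 4  # each extra quad occupies four shader lanes
```

On a square screen the overhead grows with the length of the diagonal, so the relative cost shrinks as resolution increases.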

moonchild · 3y ago

> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient

I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works, at least on nvidia.

I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad-padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping. Especially given MSAA.

obl · 3y ago

It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single thread can be shading multiple primitives.

For example, from the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating just that :

> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask

The mask is used by the interp instructions to load the correct interpolants from local memory.
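As a sketch of how such a mask could be decoded, the function below walks the 15 bits and assigns each of the 16 quads a primitive slot. The bit ordering (bit i describes quad i + 1) is an assumption for illustration and is not taken from the ISA document; `primitive_index_per_quad` is a hypothetical name.

```python
def primitive_index_per_quad(new_prim_mask, num_quads=16):
    """Map each quad in a wavefront to a 0-based primitive slot.

    Assumption: bit i of the mask describes quad i + 1. Quad 0 always starts
    a new primitive, which is why the mask needs only num_quads - 1 bits.
    """
    indices = [0]
    for quad in range(1, num_quads):
        starts_new = (new_prim_mask >> (quad - 1)) & 1
        indices.append(indices[-1] + starts_new)
    return indices

# All 16 quads from one primitive.
single = primitive_index_per_quad(0)
# Quad 3 begins a second primitive.
mixed = primitive_index_per_quad(0b100)
```

Each interpolation would then read its attributes from the slot given by the quad's index.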

In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).

That being said, of course I expect this process to be "lazy" : you would not want to buffer execution of a partially filled thread forever, so depending on the workload you might measure different things.

[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...

[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...

delusional · 3y ago

The linked AMD guide seems to suggest the author is correct:

> Because the quad is rendered using two separate triangles, separate wavefronts are generated for the pixel work associated with each of those triangles. Some of the pixels near the boundary separating those triangles end up being organized into partial wavefronts

londons_explore · 3y ago · 2 in thread

A bigger reason to do this is that on some (shoddy) hardware, the user sees a tear line along the diagonal of the triangles.

It's as if sometimes one triangle was rendered before the vsync, while the other was rendered after it.

AnotherGoodName · 3y ago

I haven't seen it in years, but the cause is actually the limitations of float precision and/or the way the developer has set up the world/view matrices.

E.g. your shaders operate at float16 precision (common in the D3D9 days) on a screen with high enough resolution. The boundary pixels along the diagonal then can't be represented cleanly in a float16.

quadcore · 3y ago

Mmmh, there is a rather easy way to get that bug through float precision: when the vertex positions are the result of some maths. Then two vertices which should mathematically be at the same position will not end up at exactly the same position in practice, due to float precision. The only ways to have two vertices at the exact same position are hardcoding it, assigning the position of one to the other, or using index buffers. An easy thing to miss.
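A small illustration of that failure mode, using Python doubles rather than GPU floats. The coordinates are arbitrary and the variable names are hypothetical; the point is only that two algebraically equal expressions can produce different bits, while an index buffer makes both triangles read the same stored vertex.

```python
# Two mathematically equal ways to arrive at a shared corner's x coordinate.
x_from_triangle_a = 0.1 + 0.2
x_from_triangle_b = 0.3
positions_match = (x_from_triangle_a == x_from_triangle_b)  # False with IEEE-754 doubles

# Non-indexed: each triangle carries its own copy of the shared corner.
tri_a = [(0.0, 0.0), (x_from_triangle_a, 0.0), (x_from_triangle_a, 1.0)]
tri_b = [(0.0, 0.0), (x_from_triangle_b, 1.0), (0.0, 1.0)]
shared_corner_differs = tri_a[2] != tri_b[1]

# Indexed: one vertex list, both triangles reference entries 0 and 2.
vertices = [(0.0, 0.0), (x_from_triangle_a, 0.0), (x_from_triangle_a, 1.0), (0.0, 1.0)]
indices = [0, 1, 2, 0, 2, 3]

def expand(vertices, indices):
    """Resolve an index buffer into a list of triangles."""
    return [tuple(vertices[i] for i in indices[k:k + 3])
            for k in range(0, len(indices), 3)]

first, second = expand(vertices, indices)
```

Here `first[2]` and `second[1]` are the same stored tuple, so the shared edge cannot drift apart.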

nsajko · 3y ago · 2 in thread

> In my microbenchmark1 the single triangle approach was 0.2% faster than two.

Sounds like something that would be within the margin of error? Seems especially meaningless because it's just the average of the timings, instead of something that would visualize the distribution, like a histogram or KDE.
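One way to look beyond the average, sketched with Python's standard `statistics` module: summarize each set of frame timings by median and interquartile range, then express the gap between medians in units of that spread. The function names are hypothetical, the inputs below are placeholder numbers rather than measurements, and the ratio is a rough heuristic, not a significance test.

```python
import statistics

def summarize(samples):
    """Mean, median and interquartile range of a list of timings."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {"mean": statistics.fmean(samples), "median": median, "iqr": q3 - q1}

def gap_in_iqrs(a, b):
    """Distance between the two medians, measured in units of the larger IQR."""
    sa, sb = summarize(a), summarize(b)
    spread = max(sa["iqr"], sb["iqr"])
    gap = abs(sa["median"] - sb["median"])
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

# Placeholder inputs only; real use would pass per-run timings for each variant.
one_triangle = [1, 2, 3, 4, 5, 6, 7]
two_triangles = [2, 3, 4, 5, 6, 7, 8]
ratio = gap_in_iqrs(one_triangle, two_triangles)
```

A ratio well below 1 means the two distributions overlap heavily, in which case a small difference in means says little on its own.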

gpderetta · 3y ago

I know very little of graphics programming and GPUs, but I would expect this particular shader to be very deterministic with very little jitter, right?

zamadatix · 3y ago

Probably quite far outside the margin of error, given the number of runs and the simplicity of the test, but you'd need the measured variance to be sure. The AMD study goes into more metrics and the reasons why it is faster.

lukko · 3y ago · 2 in thread

This is interesting, but also wouldn't the texture mapping / UVs be more confusing and possibly outweigh the benefit of micro-optimisation?

The good thing about having 4 vertices is that you can just put a vertex position and a set of texture coordinates (x,y) on each one and the texture is mapped exactly.

account42 · 3y ago

What UVs would you use for full-screen effects? Typically you are only interested in the screen position to sample relevant buffers, i.e. gl_FragCoord or interpolated vertex positions depending on what scale you want.

lukko · 3y ago

That’s true - not so much UVs, but I find texture coords useful for quickly flipping the screen horizontally or vertically - so storing both a vertex position and an x,y value in 0-1 for sampling a texture. I guess you could just interpolate the vertex positions directly, it’s just less intuitive with a large triangle.
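For what it's worth, the large triangle still gives exact 0-1 texture coordinates on screen, because the coordinates are just an affine function of position. The sketch below mimics the commonly used vertex-ID trick in Python (the function names are hypothetical, and the clip-space convention of -1..1 with UV origin at the bottom-left is an assumption), then interpolates with barycentric weights to check the screen corners.

```python
def fullscreen_triangle_vertex(vertex_id):
    """Vertex IDs 0, 1, 2 -> UVs (0,0), (2,0), (0,2) and positions uv * 2 - 1."""
    u = float((vertex_id << 1) & 2)
    v = float(vertex_id & 2)
    return (u * 2.0 - 1.0, v * 2.0 - 1.0), (u, v)

def interpolate_uv(x, y):
    """Barycentric interpolation of the UVs at clip-space point (x, y)."""
    (p0, t0), (p1, t1), (p2, t2) = [fullscreen_triangle_vertex(i) for i in range(3)]
    area = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])
    w1 = ((x - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (y - p0[1])) / area
    w2 = ((p1[0] - p0[0]) * (y - p0[1]) - (x - p0[0]) * (p1[1] - p0[1])) / area
    w0 = 1.0 - w1 - w2
    return (w0 * t0[0] + w1 * t1[0] + w2 * t2[0],
            w0 * t0[1] + w1 * t1[1] + w2 * t2[1])

def flip_horizontal(uv):
    """Mirror the image left to right by reflecting u."""
    return (1.0 - uv[0], uv[1])

corner_uvs = [interpolate_uv(x, y) for x, y in [(-1, -1), (1, -1), (1, 1), (-1, 1)]]
centre_uv = interpolate_uv(0, 0)
```

The visible screen corners land on UVs (0,0), (1,0), (1,1) and (0,1); values above 1 only occur in the part of the triangle that is never rasterized.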

ladon86 · 3y ago · 2 in thread

Would this still be true on a tiled rendering GPU, i.e. mobile?

If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?

pixelpoet · 3y ago

Even desktop GPUs use tiled rendering since Maxwell generation on Nvidia, and I forget which gen for AMD. I don't see how it's possible for many triangles to be faster than one for fullscreen rendering.

pixelpoet · 3y ago

Wow, downvoted for this with zero replies; lesson learnt...

ddren · 3y ago · 1 in thread

I wonder how this is implemented in the GPU. From my time working on a 3D renderer a long time ago, triangles with offscreen vertices would be clipped into smaller triangles, so in the end you would still be rendering multiple triangles anyway. I imagine it would also be possible to clip the scanlines instead.

obl · 3y ago

Actual clipping is expensive so indeed a "guard band" is used : inside the region allowed by the internal precision of the rasterizer, outside pixels are simply "ignored".

See e.g. https://fgiesen.wordpress.com/2011/07/05/a-trip-through-the-... for a nice detailed explanation.
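A very rough sketch of the decision a guard band enables, in Python. The guard extent of 8.0 is a made-up placeholder (real limits depend on the hardware's rasterizer precision), coordinates are post-divide x,y in a -1..1 viewport, and `classify_triangle` is a hypothetical name.

```python
def classify_triangle(verts, guard=8.0):
    """Decide how a triangle would be handled given a guard band.

    verts: three (x, y) pairs, with the viewport spanning -1..1 on both axes.
    guard: half-extent of the guard band (placeholder value, hardware-specific).
    """
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    # Entirely beyond one viewport edge: nothing can be visible.
    if max(xs) < -1 or min(xs) > 1 or max(ys) < -1 or min(ys) > 1:
        return "reject"
    # Fully on screen: no clipping question at all.
    if all(-1 <= c <= 1 for c in xs + ys):
        return "inside"
    # Sticks out of the viewport but stays within the guard band:
    # rasterize as-is and simply skip the off-screen pixels.
    if all(-guard <= c <= guard for c in xs + ys):
        return "guard_band"
    # Too large for the rasterizer's precision: real geometric clipping.
    return "clip"

fullscreen = classify_triangle([(-1, -1), (3, -1), (-1, 3)])
```

Under these assumptions the full-screen triangle falls into the guard-band case, so it is never split into smaller triangles.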

teucris · 3y ago

> In my microbenchmark1 the single triangle approach was 0.2% faster than two. We are definitely deep into micro-optimization territory here :)

In the 3D graphics space, this kind of knuckle-shaving is deeply revered!

ww520 · 3y ago

That's a pretty neat trick, letting the GPU exclude the out-of-bounds regions of the enlarged triangle and only render the visible rectangle.
