anon291 1y ago

Considering that each kernel / kernel size is usually custom tuned on NVIDIA, I'd say no. Working in this field at several different companies, there are likely thousands of hand-tuned variations of a simple GEMM kernel. Each one required an engineer to look at specifically, even if they're all variations on a common theme.

As far as I know (and again, I work in the field of AI compilers), we're still a ways off from complete end-to-end generation of highly optimized kernels. If you want it to go fast, you need to write it by hand [1], and then test and validate.

Moreover, chip makers are constantly adding new features (Tensor Cores in NVIDIA for example), so the compiler is always playing catch up and at some point an engineer has to sit down (likely a team of them) and think 'what's the best way to exploit this hardware functionality for software performance?'. Then they have to test and validate that, and then either write a kernel, or attempt to put that know-how into a compiler.

Multiply this times the number of kernels in a typical suite, and... yeah.

And that was my point about herculean effort on modern chips. Assembly language isn't just the old 'Add register 1 and 2 and dump in R3' anymore. It's 'Use this instruction to access memory in this way, so that it's in a compatible format for the next instruction' and 'oh yeah, make sure your memory synchronization primitives are such that the whole thing is coherent'. Good luck!

Even going one step up into a higher-level language, you have to know how the kernel gets compiled to make it worthwhile. Again, it is trivial to write a correct opencl matrix multiply, but that's never going to be the highest performance. You have to know the hardware intimately. This is where having the software co-designed with hardware is very important. Basically, every AI chipmaker of any importance does this, including the startups, like Groq and Cerebras.

[1] A lot of kernels share basic patterns, so it's not as hard as it sounds, but it definitely requires engineering effort to get the design right.
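To make the "trivially correct but never the fastest" point concrete, here is a minimal sketch in plain Python rather than OpenCL. The function names and the `tile` parameter are illustrative assumptions, not anything from a real kernel library. Both functions compute the same product; the second walks the matrices tile by tile, which is the kind of restructuring that, on real hardware, has to be matched to cache or scratch-memory sizes.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, computed block by block.

    `tile` is a hypothetical tuning knob: on real hardware its best value
    depends on the memory hierarchy, which is what makes tuning necessary.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Accumulate the contribution of one tile of A and one tile of B.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

In pure Python the tiled version will not actually be faster; the sketch only shows that correctness is the easy part and that the loop structure itself becomes a parameter to tune.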


almostgotcaught 1y ago

> Considering that each kernel / kernel size is usually custom tuned on NVIDIA, I'd say no. Working in this field at several different companies, there are likely thousands of hand-tuned variations of a simple GEMM kernel. Each one required an engineer to look at specifically, even if they're all variations on a common theme.

Lol that's absolutely not true. What you're describing is literally impossible for any company that has more than one product family on the market since each product has different scratch sizes, number of vector registers, data types supported/emulated etc.

Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention...
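As a toy illustration of what "codegened" can mean, the sketch below emits specialized source from a template instead of having someone write each variant by hand. It generates Python, not GPU code, and the names `generate_dot_kernel`, `build_kernel`, and the `unroll` parameter are assumptions made up for this example; it is not how any particular vendor's generator works.

```python
def generate_dot_kernel(unroll):
    """Emit Python source for a dot product with the main loop unrolled `unroll` times."""
    name = f"dot_unroll{unroll}"
    # One multiply-add term per unrolled step, e.g. x[i + 0] * y[i + 0] + ...
    terms = " + ".join(f"x[i + {u}] * y[i + {u}]" for u in range(unroll))
    source = (
        f"def {name}(x, y):\n"
        f"    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        f"    acc = 0.0\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        acc += {terms}\n"
        f"    for i in range(main, n):  # remainder loop\n"
        f"        acc += x[i] * y[i]\n"
        f"    return acc\n"
    )
    return name, source


def build_kernel(unroll):
    """Compile the generated source and return the resulting function."""
    name, source = generate_dot_kernel(unroll)
    namespace = {}
    exec(source, namespace)
    return namespace[name]
```

The pattern (a dot product with a remainder loop) is written once; each product-specific variant is then just a different parameter value fed to the generator.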

> Again, it is trivial to write a correct opencl matrix multiply, but that's never going to be the highest performance.

I guess you work at AMD. The reason AMD ships a whole bunch of binary kernels is not because someone tuned/designed each one but because AMD doesn't have a PTX/SASS equivalent. So each kernel has to be compiled at build time for each device (it's also why they can't have LTS support for architectures).

anon291 (OP) 1y ago

> Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention

I never said they weren't using code generation. I said that each one requires a manual tune. You will set various parameters, determine if the generated code does well enough and then if there's performance to squeeze out, you modify the code generator.

> I guess you work at AMD.

Close but not quite

almostgotcaught 1y ago

> that each one requires a manual tune

Ya definitely not - everyone uses grid search or whatever latest BPO tuning strategy.
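For readers unfamiliar with the approach, a grid search over kernel parameters can be sketched roughly as follows. This is a minimal, generic version: `grid_search`, `time_kernel`, and parameter names such as `tile_m` are hypothetical, and a real autotuner would build and benchmark actual kernels on the target device.

```python
import itertools
import time


def grid_search(param_grid, cost_fn):
    """Try every combination in `param_grid`; return (best_params, best_cost)."""
    names = list(param_grid)
    best_params, best_cost = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        cost = cost_fn(**params)  # e.g. measured runtime of the kernel variant
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


def time_kernel(kernel, *args, repeats=3):
    """Best-of-`repeats` wall-clock time in seconds for one kernel call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

A cost function would typically wrap `time_kernel` around a kernel built with the candidate parameters. The search loop itself is mechanical; which parameters go into the grid and how cost is measured are still choices someone has to make.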

anon291 (OP) 1y ago

Oh right. Those require no engineering effort because I said so.
