undefined | Better HN

0 pointsConst-me2y ago0 comments

These things are nice to have, but you don’t actually need them.

It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It’s not rocket science, the area is well researched and there’re many good articles on how to implement these BLAS routines efficiently.

Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.

It’s easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) these high-level libraries spend writing/reading temporary tensors.

It’s possible to re-shape immutable or rarely changing tensors into better memory layouts. Example for CPU compute https://stackoverflow.com/a/75567894/126995 the idea is equally good on GPUs.

It’s possible to use custom data formats, and they don’t require any hardware support. Upcasting BF16 to FP32 is 1 shader instruction (left shift), downcasting FP32 to BF16 only takes a few of them (for proper rounding), no hardware support necessary. Can pack quantized or sparse tensors into a single ByteAddressBuffer, again nothing special is required from hardware. Can implement custom compression formats for these tensors.

0 comments

4 comments · 2 top-level

kiratp2y ago· 2 in thread

The bar for the AI ecosystem to get models running is git clone + <20 lines of Python.

If AMD can’t make that work on consumer GPUs, then the only option they have is to undercut Nvidia so deeply that it makes sense for the big players to hire a team to write AMD’s software for them.

Const-meOP2y ago

> undercut Nvidia so deeply that it makes sense for the big players to hire a team

At least for some use cases, I think that has already happened. nVidia forbids using GeForce GPUs in data centers (in the EULA of the drivers), AMD allows that. Cost efficiency difference between AMD high-end desktop GPUs, and nVidia GPUs which they allow to deploy in data centers, is about an order of magnitude now. For example, L40 card costs $8000-9000, and delivers performance similar to $1000 AMD RX 7900 XTX.

For this reason, companies which run large models at scale are spending ridiculous amount of money on compute costs, often by renting nVidia’s data center targeted GPUs from IAAS providers. OpenAI’s CEO once described compute costs of running ChatGPT as “eye watering”.

For companies like OpenAI, I think investing money in development of vendor-agnostic ML libraries makes a lot of sense in terms of ROI.

mrguyorama2y ago

But you can't run most models on the consumer AMD GPUs, so even though AMD "allows" it, nobody except supercomputer clusters uses AMD GPUs for compute, because all the expensive data scientists you hired will bitch and moan until you get them something they can just run standard CUDA stuff on.

1 more reply

empyrrhicist2y ago

So everyone that wants to use GPUs to accelerate their compute needs to write a BLAS implementation in shaders... FFS that's not reasonable at all!

Sure, it's entirely possible, but there's a reason that high level libraries are the go-to for scientific compute.

j / k navigate · click thread line to collapse