0 points gpt5 12d ago 0 comments

Why is AMD not more popular then, if labs are so flexible about giving up CUDA?

mattnewton 12d ago

People are trying, especially for inference. For training, I think the risk of tanking your training run is just too high.

TPUs are at least dogfooded by Google DeepMind; no team AFAIK has gotten the AMD stack to train well.

coder-3 12d ago

Interesting. Why? My current mental model is that AMD chips are just a bit behind, so, less efficient, but no biggie. Do labs even use CUDA?

nl 12d ago

This is somewhat out of date (Dec 2024), but gives you some idea of how far behind AMD was then: https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200...

Pull quotes:

> AMD’s software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.

[snip]

> The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand crafted by an AMD principal engineer, was specifically provided for us

[snip]

> AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.

etc etc. The whole thing is worth reading.

I'm sure it has improved since then (and will continue to). I hear good things about the Lemonade team (although I think that is mostly inference?)

But the NVidia stack has improved too.

_vertigo 12d ago

That’s insane. There should be a big team of people at AMD whose whole job is just to dogfood their stuff for training like this. Speaking of which, Amazon is in the same boat: I’m constantly surprised that Amazon is not treating improving Inferentia/Trainium software as an uber-priority. (I work at Amazon)

uberduper 12d ago

AMD GPUs compete, but they lack the interconnect. NVLink performance is a huge deal for training.

0-_-0 12d ago

What I hear is that getting your network to work on AMD is a huge pain.

dnadler 12d ago

Yeah, historically it’s been software that’s limited AMD here. Not surprised to hear that may still be the issue. NVidia’s biggest edge was really CUDA.
