To give a feel: while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn how their hardware and code work together in surprisingly basic scenarios.
It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:
* It's not just writing fast low-level code, but knowing which algorithms to use. So you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level, unfamiliar to most programmers, and specific to hardware models: think scenarios like NUMA-aware data-parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manage sparsity and numerical stability.
* Ideally you write for more than one generation of architectures. Each architecture changes all sorts of basic constants around memory/thread/etc counts at multiple layers of the hierarchy. If you're good, you also add autotuning & JIT layers to adjust for different generations, models, and inputs.
* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: maintaining determinism, thinking through memory lifecycles, supporting both async and sync callers, handling multitenancy, ... . In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.
* AI has made this even more interesting. The types of processing on GPUs are richer now, multi- and many-GPU setups are much more of a thing, and so is disk IO. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
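To make the "esoteric matrix tricks to manage sparsity" point concrete, here's a toy CSR (compressed sparse row) matrix-vector product in plain Python. It's only a sketch of the data-layout reasoning involved, not GPU code: on a real GPU, how rows map to warps and whether column accesses coalesce dominates performance, and that's exactly the hardware-specific expertise the bullet is about.

```python
def csr_matvec(indptr, indices, data, x):
    # CSR stores only nonzeros: row i's entries live in data[indptr[i]:indptr[i+1]],
    # with their column ids in the parallel `indices` array.
    y = []
    for i in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

On a GPU, the irregular row lengths here are why "one thread per row" kernels lose to warp-per-row or merge-based schemes on some inputs and win on others.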
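The autotuning idea from the architectures bullet can be sketched in a few lines: sweep candidate configurations, benchmark each on the actual hardware and input, and keep the fastest. The function names and the toy "kernel" below are hypothetical stand-ins; a real CUDA autotuner sweeps block/tile sizes and caches the winner per GPU model and input shape.

```python
import time

def run_with_tile(n, tile):
    # Toy "kernel": sum 0..n-1 in chunks of `tile`.
    # Stand-in for launching a real kernel with a given tile/block size.
    total = 0
    for start in range(0, n, tile):
        total += sum(range(start, min(start + tile, n)))
    return total

def autotune(n, candidates):
    # Empirically pick the fastest config, since the right constant
    # changes per architecture generation, model, and input size.
    best, best_t = None, float("inf")
    for tile in candidates:
        t0 = time.perf_counter()
        run_with_tile(n, tile)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = tile, dt
    return best

best_tile = autotune(100_000, [256, 1024, 4096])
```

The expensive part in practice isn't this loop, it's building the search space of legal configs and deciding when retuning is worth the cost.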
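One small illustration of the determinism point: floating-point addition isn't associative, so a parallel reduction whose combine order depends on thread scheduling (e.g., atomics) can return different bits run to run. A fixed-shape pairwise tree reduction, sketched here in plain Python, makes the order a function of the input length only, which is one standard way CUDA reductions stay reproducible.

```python
def tree_reduce(xs):
    # Pairwise reduction with a shape fixed by len(xs): the combine
    # order never depends on scheduling, so results are bit-identical
    # across runs, unlike an atomics-based accumulation.
    xs = list(xs)
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

tree_reduce([1, 2, 3, 4, 5])  # 15
```

The same fixed-order idea is why "deterministic mode" in ML frameworks often costs performance: you give up the fastest scheduling-dependent reductions.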
This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd: it's basically the precursor to, and the GPU / high-performance / energy-efficient / low-latency version of, what the duckdb folks recently began on the (easier) CPU side for columnar analytics.
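The abstraction-raising idea looks roughly like this: express the work as whole-column dataframe operations instead of per-element loops, so a backend is free to compile each operation to a tuned GPU kernel. Below is a toy columnar table and groupby-sum in plain Python purely to show the shape of the API contract; the function name is made up, but the pattern is what pandas-style libraries (and cudf on GPU) expose.

```python
# Columnar table: each column is a flat array. This layout is what
# RAPIDS on GPU and duckdb on CPU exploit: whole-column operations
# can be dispatched to one tuned kernel instead of a per-row loop.
table = {"key": ["a", "b", "a", "b", "a"],
         "val": [1, 2, 3, 4, 5]}

def groupby_sum(table, key_col, val_col):
    # The kind of declarative operation a backend can lower to a
    # GPU hash-aggregate without the user writing any CUDA.
    out = {}
    for k, v in zip(table[key_col], table[val_col]):
        out[k] = out.get(k, 0) + v
    return out

groupby_sum(table, "key", "val")  # {'a': 9, 'b': 6}
```

The user-facing win is exactly the one in the paragraph above: the team writes dataframe logic once, and the hardware-specific optimization lives in the backend.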
It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to attempt it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having spent 15 years building & using these systems, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD or Intel due to the revolving door of poor software culture.
Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around because many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.
Edit: If not obvious, much of the code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?