So, performance is iffy. Density for density's sake doesn't matter, since clusters are power-limited.
At the end of the day it's about performance per dollar / TCO, too, not just raw perf. A standardized benchmark helps to evaluate that.
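To make the perf-per-dollar point concrete, here's a back-of-the-envelope cost-per-token sketch. All numbers (capex, power, throughput, utilization) are hypothetical placeholders, not vendor figures:

```python
# Back-of-the-envelope $/token sketch. Every number below is a
# hypothetical placeholder, not a real vendor figure.

def cost_per_million_tokens(capex_usd, lifetime_years, power_kw,
                            usd_per_kwh, tokens_per_second, utilization=0.6):
    """Amortized dollars per 1M tokens for one accelerator node."""
    hours = lifetime_years * 365 * 24
    capex_per_hour = capex_usd / hours
    energy_per_hour = power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1e6

# Hypothetical comparison: a cheaper, slower chip can still win on TCO
# even though it loses on raw perf.
fast = cost_per_million_tokens(300_000, 5, 10.0, 0.10, 20_000)
slow = cost_per_million_tokens(80_000, 5, 6.0, 0.10, 8_000)
print(f"fast chip: ${fast:.2f}/M tok, slow chip: ${slow:.2f}/M tok")
```

With these made-up numbers the slower chip comes out cheaper per token, which is exactly why raw perf alone isn't the metric.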
My guess is that they neglected the software component (hardware guys always disdain software) and have to bend over backwards to get their hardware to run specific models (and only those specific models) well. Or potentially common models don't run well because their cross-chip interconnect is too slow.
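On the interconnect point, a rough sanity check shows how quickly a slow cross-chip link dominates tensor-parallel decoding. All figures below (hidden size, layer count, link bandwidths, all-reduce cost model) are illustrative assumptions, not measurements of any specific chip:

```python
# Rough check of when the cross-chip interconnect bottlenecks
# tensor-parallel decoding. All figures are illustrative assumptions.

def comm_time_per_token_us(hidden_dim, num_layers, bytes_per_elem,
                           link_gb_per_s, allreduces_per_layer=2):
    """Time spent on all-reduce traffic per generated token (microseconds).

    Assumes each transformer layer does `allreduces_per_layer` all-reduces
    of one activation vector of size hidden_dim; the ~2x data movement of
    a ring all-reduce is folded into a single factor for simplicity.
    """
    bytes_per_layer = allreduces_per_layer * 2 * hidden_dim * bytes_per_elem
    total_bytes = bytes_per_layer * num_layers
    return total_bytes / (link_gb_per_s * 1e9) * 1e6

# Hypothetical 70B-class model (hidden 8192, 80 layers, fp16) on a
# 50 GB/s link vs a 900 GB/s link:
slow_link = comm_time_per_token_us(8192, 80, 2, 50)
fast_link = comm_time_per_token_us(8192, 80, 2, 900)
print(f"50 GB/s: {slow_link:.0f} us/token; 900 GB/s: {fast_link:.0f} us/token")
```

Under these assumptions the slow link spends ~100 us per token on communication alone before any compute happens, which is enough to sink interactive decoding latency.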
Artificial Analysis does good API-provider inference benchmarking and has evaluated Cerebras, Groq, SambaNova, the many Nvidia-based solutions, etc. IMO it makes way more sense to benchmark actual usable endpoints rather than submit closed and modified implementations to MLCommons. Graphcore had the fastest BERT submission at one point (when BERT was relevant lol) and it didn't really move the needle at all.
Without optimized implementations their performance will look like shit, even if their chip were years ahead of the competition.
Building efficient implementations with an immature ecosystem and toolchain doesn't sound like a good time. But yeah, huge red flag. If they can't get their chip to perform there's no hope for customers.
“nvidia’s chip is better than yours. If you can’t make your software run well on nvidia’s chip, you have no hope of making it run well on your chip, least of all the first version of your chip.”
That’s why tinycorp is betting on a simple ML framework (tinygrad, which they develop and release as open source). Because the framework is built from only a handful of primitive operations, the promise is that porting it to a new chip (e.g. yours) is easy, and then you can run ML workloads on it.
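A toy sketch of the idea (this is NOT tinygrad's actual primitive set, just an illustration of the approach): if every high-level op is composed from a tiny backend surface, porting to a new chip only means implementing the primitives.

```python
# Toy illustration of the "few primitives" idea. NOT tinygrad's actual
# op set -- just a sketch: higher-level ops are composed from a small
# backend surface, so a new chip only needs to implement the primitives.
import math

# The only operations a hypothetical backend must implement:
PRIMITIVES = {
    "exp": lambda xs: [math.exp(x) for x in xs],
    "sum": lambda xs: [sum(xs)],
    "max": lambda xs: [max(xs)],
    "sub": lambda xs, ys: [x - y for x, y in zip(xs, ys)],
    "div": lambda xs, ys: [x / y for x, y in zip(xs, ys)],
}

def softmax(xs):
    """Softmax expressed purely in terms of the primitives above."""
    m = PRIMITIVES["max"](xs)              # one-element list [max]
    shifted = PRIMITIVES["sub"](xs, m * len(xs))   # broadcast by repetition
    e = PRIMITIVES["exp"](shifted)
    s = PRIMITIVES["sum"](e)
    return PRIMITIVES["div"](e, s * len(e))

print(softmax([1.0, 2.0, 3.0]))
```

Swap the five lambdas for kernels on a new chip and softmax (and anything else built this way) comes along for free, at least functionally.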
I’m not a (real) expert in the field, but I find the reasoning compelling. It might also explain why competition for Nvidia exists in hardware but seemingly not in reality (i.e. once you include the software that actually does something with it).
This sounds easy in theory, but in practice, for current models, the implementations are often hand-tuned to run fast on a given chip. As an engineer in the ML compiler space, I think this idea of composing everything from small primitives, which comes from the compiler / bytecode world, is not going to yield acceptable performance.
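One concrete reason primitives lose: each primitive round-trips its operands through memory, while a hand-tuned fused kernel reads the input once and writes the output once. A rough byte-counting sketch (fp16 elements, simplified cost model, illustrative numbers only):

```python
# Why naive primitive composition loses on memory-bound ops: every
# primitive reads its input from memory and writes its output back,
# while a fused kernel touches memory once each way. Illustrative
# byte counting with fp16 (2-byte) elements.

def traffic_bytes(n_elems, n_primitive_ops, bytes_per_elem=2):
    # Unfused: each primitive reads its input and writes its output.
    unfused = n_primitive_ops * 2 * n_elems * bytes_per_elem
    # Fused: one read of the input, one write of the result.
    fused = 2 * n_elems * bytes_per_elem
    return unfused, fused

# Softmax as ~5 primitives (max, sub, exp, sum, div) over 1M elements:
unfused, fused = traffic_bytes(1_000_000, 5)
print(f"unfused: {unfused/1e6:.0f} MB, fused: {fused/1e6:.0f} MB "
      f"({unfused/fused:.0f}x more memory traffic)")
```

Since ops like softmax are bandwidth-bound, that traffic multiple translates almost directly into a slowdown, which is why real implementations fuse aggressively instead of chaining primitives.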
I would love to try some of the stuff I do with CUDA on AMD hardware to get some first-hand experience, but it's a tough sell: they are not as widely available to rent, and telling my boss to order a few GPUs so we can inspect that potential mess for ourselves isn't convincing either.