With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100s and V100s have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now-obsolete chips (at least one person has done this).
There are thousands, maybe tens of thousands, of P100s you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market, but they are useless because their interface was never made open, it has not been openly reverse-engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.
I'm one of those people who finds 'retro-super-computing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life for hobbyists in 8-10 years instead of being sent straight to the bin due to secret interfaces and obfuscated backplane specifications.
See FlashAttention, Triton, etc. as core enabling libraries, not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of it...
Unfortunately there is the famous "GPU poor vs GPU rich" divide, and Pascal puts you at "GPU destitute" (regardless of how much VRAM you assemble). Outside of implementations like llama.cpp, which go to incredible and impressive lengths to support these old architectures, you will very quickly run into show-stopping issues that make you wish you had just handed over the money for something with compute capability >= 7.0 (Volta or newer).
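A quick way to see where a given card falls (a rough sketch, assuming a CUDA-enabled PyTorch install; the 7.0 cutoff just mirrors the Volta-or-newer line above):

    import torch  # assumes a CUDA-enabled PyTorch build

    if not torch.cuda.is_available():
        print("no CUDA device visible")
    else:
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            name = torch.cuda.get_device_name(i)
            # Pascal reports 6.x; Volta and newer report 7.0+
            verdict = "fine for most modern kernels" if (major, minor) >= (7, 0) else "expect missing-kernel pain"
            print(f"{name}: sm_{major}{minor} -> {verdict}")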
I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
But my work/fun is in CFD. One of the main codes I use for work was written and primarily supported around the time of Pascal, along with other HPC stuff that runs via OpenCL and is still plenty compatible. Things compiled back then will still run today; it's not a moving target like ML has been.
> I'm one of those people who finds 'retro-super-computing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life for hobbyists in 8-10 years instead of being sent straight to the bin due to secret interfaces and obfuscated backplane specifications.
Very cool hobby. It's also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped, and how the remainder is typically pulled apart to the board/module level for spares. That makes it very unlikely to stumble over more or less complete-ish systems.
Yes, it has decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB, and it lacks tensor cores and the newer low-precision formats. It's in a weird place.
Now that video gaming is secondary (tertiary?) in Nvidia's revenue stream, they couldn't give a shit which brand gamers prefer. It's small time now. All that matters is who companies are buying their GPUs from for AI work. Break down that CUDA wall and it's open season. I wonder how they plan to stave that off; it's only a matter of time before people get tired of writing C++ code to interface with CUDA.
A while ago NVIDIA and the GraalVM team demoed grCUDA, which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (including JIT-compiled Python). Because it's integrated with the compiler, the invocation overhead is low:
https://developer.nvidia.com/blog/grcuda-a-polyglot-language...
And TornadoVM lets you write kernels in JVM languages that are compiled down to CUDA.
There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.
But maybe there was more I missed. I'll take another look.
I think this is literally the impetus for the OAM spec, which makes the pinout open and shareable. Until then, the actual designs of the baseboards had to be kept out of public view because that part was still Nvidia-controlled IP.
Or, these salvage operations just stripped the racks, kept the small stuff, and e-wasted the racks themselves because they figured that was a more efficient use of their storage space and easier to sell, without thinking it through.
The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
This is one of Intel's secret recipes. They can take older tech and push it a little further to catch or surpass current-gen tech, until the current gen becomes easier/cheaper to produce, acquire, and integrate.
They did it with their first quad-core processors by merging two dual-core dies in one package (the Q6xxx series), and by creating absurdly clocked single-core processors aimed at very niche market segments.
We have not seen it until now because they were asleep at the wheel, and then got knocked unconscious by AMD.
Any other examples of this? I remember the secret sauce being a process advantage over the competition, exactly the opposite of making old tech outperform the state of the art.
Would you say this means Intel is "back," or just not completely dead?
I truly do hope it is successful so we can have some alternative accelerators.
Also, hardware has a lifecycle. At some point the old hardware isn't worth running in a large scale operation because it consumes more in electricity to run 24/7 than it would cost to replace with newer hardware. But then it falls into the hands of people who aren't going to run it 24/7, like hobbyists and students, which as a manufacturer you still want to support because that's how you get people to invest their time in your stuff instead of a competitor's.
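As a toy illustration of that break-even point (every number below is a made-up placeholder, just to show the shape of the calculation, and it ignores the newer part also being faster):

    # When does 24/7 electricity for old hardware cost more than replacing it?
    old_power_w = 800        # hypothetical draw of the old node
    new_power_w = 250        # hypothetical draw of a newer, more efficient part
    price_per_kwh = 0.20     # hypothetical electricity price incl. cooling, $/kWh
    new_hw_cost = 2500       # hypothetical purchase price of the new part

    hours_per_year = 24 * 365
    extra_kwh = (old_power_w - new_power_w) / 1000 * hours_per_year
    extra_cost_per_year = extra_kwh * price_per_kwh

    print(f"extra electricity per year: ${extra_cost_per_year:.0f}")
    print(f"years for the new part to pay for itself: {new_hw_cost / extra_cost_per_year:.1f}")

Run the old hardware only a few hours a week instead of 24/7 and that payback period stretches from a couple of years to decades, which is roughly why the economics flip for hobbyists and students.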
Yesterday I thought about installing https://news.ycombinator.com/item?id=39372159 (Reor, an open-source AI note-taking app that runs models locally) and feeding it my markdown folder, but I stopped midway, asking myself "don't I need some kind of powerful GPU for that?". And now I am thinking "wait, should I wait for a `standard` pluggable AI computing device? Is Intel Gaudi 3 something like that?".
WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: how do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?
Also: if you've got that many Cat 8 cables coming out the back of the thing, how do you even access it? You'd have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of the horizontal space in a rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but wouldn't help in the density department.
I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:
> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 Gb/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:
> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross-link the 96 leaf switches with three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:
The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary InfiniBand (which Nvidia got by acquiring Mellanox). Because each accelerator can reach the other accelerators via multiple paths that will (ideally) not be over-utilized, you can perform large operations across them efficiently without needing to be especially careful about how your software manages communication.
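The port arithmetic in the quoted description does check out; here's a quick sanity check (just arithmetic over the numbers given in the article, nothing assumed beyond them):

    # Sanity-check the Gaudi 3 fabric numbers quoted above.
    ports_per_card = 24        # 200 Gb/s Ethernet ports on each Gaudi 3
    port_gbps = 200
    cards_per_node = 8

    # Inside a node: 21 ports form the all-to-all mesh, i.e. 3 links to each of 7 peers.
    assert (cards_per_node - 1) * 3 == 21

    # The remaining 3 ports per card leave the node, aggregated into 800 Gb/s OSFP cages.
    external_gbps = cards_per_node * (ports_per_card - 21) * port_gbps
    assert external_gbps == 6 * 800    # six 800 Gb/s OSFP ports per node

    # Sub-cluster: 16 nodes under three 64-port leaf switches, half the ports facing down.
    assert 16 * 6 == 3 * (64 // 2)     # 96 x 800 Gb/s links down to the servers

    # Full build: 32 sub-clusters, 96 leaves, 3 x 16 spines, 4,096 accelerators.
    assert 32 * 16 * cards_per_node == 4096
    print("the article's numbers all line up")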
https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
> RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE)[1] is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet.
Sounds very cool. 0.5-1.5/2 m copper cables are easily available and cheap, and 4-8 m (and even longer) is also possible with copper, but those tend to be more expensive and harder to come by.
Even 800 Gb/s is possible with copper cables these days, but you'll end up spending just as much, if not more, on cabling as on the rest of your kit… https://www.fibermall.com/sale-460634-800g-osfp-acc-3m-flt.h...
100GbE and faster is only supported over twinax or fiber anyway, so Cat 8 is irrelevant here. The other 3 ports are probably QSFP or something.
But I'm not sure how big and how expensive a 24-channel Cat 8 snake would be (!).
I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.
I didn't know "terabytes (TB)" was a unit of memory bandwidth...
"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
Although at least for Python JITs, Intel seems to be doing something there as well.
And then there is the graphical debugging experience for GPGPU on CUDA, which feels like doing CPU debugging.
This experience is largely a misalignment between what AMD thinks their product is and what the Linux world thinks software is. My pet theory is that it's a holdover from the GPU being primarily a games-console product, as that's what kept the company alive through the recent dark times. There's money now, but some of those best practices are sticky.
In games dev, you ship an SDK. Speaking from personal experience here, as I was on the PlayStation dev tools team. That's a compiler, debugger, profiler, language runtimes, a bunch of math libs, etc., all packaged together with a single version number for the whole thing. A games studio downloads that and uses it for the entire dev cycle of the game. They've noticed that compiler bugs move around between releases, so each game is essentially dependent on the "characteristics" of that toolchain, and persuading them to gamble on a toolchain upgrade mid-cycle requires some feature they really badly want.
HPC has some things in common with this. You "module load rocm-5.2" or whatever and now your whole environment is that particular toolchain release. That's where the math libraries are and where the compiler is.
With that context, the internal testing process makes a lot of sense. At some point AMD picks a target OS. I think it's literally an Ubuntu LTS or a Red Hat release or similar - something that is already available anyway. That gets installed on a lot of CI machines, test machines, and developer machines. Most of the boxes I can ssh into have Ubuntu on them. The userspace details don't matter much, but what this does do is fix the kernel version for a given release number, possibly to one of two similar kernel versions. Then there's a multi-month dev and testing process, all on that kernel.
Testing involves some largish number of programs that customers care about - whatever they're running on the clusters, or some AI things these days. It also involves a lot of performance testing, where things getting slower counts as a bug. The release team is very clear that nothing goes out the door if it's broken or slower, and it's not a fun time to have your commit from months ago pulled out of the bisection as the root cause. That as-shipped configuration - kernel 5.whatever, the driver you build yourself as opposed to the one that kernel shipped with, the ROCm userspace version 4.1 or so - taken together is pretty solid. It sometimes falls over in the field anyway when running applications that aren't in the internal testing set, but users of it don't seem anything like as cross as the HN crowd.
This pretty much gives you the discrepancy in user experience. If you've got a ROCm release running on one of the HPC machines, or a gaming SDK on a specific console version, things work fairly well, and because it's a fixed point, things that don't work can be patched around.
In contrast, you can take whatever Linux kernel you like and use the amdkfd driver in that, combined with whatever ROCm packages your distribution has bundled. Last I looked it was ROCm 5.2 in Debian, lightly patched. A colleague runs Arch, which I think is more recent. Gentoo will be different again; I don't know about the others. That kernel probably isn't from the magic list of versions hammered on under testing, and the driver definitely isn't. The driver people work largely upstream, but the GitLab fork can be quite divergent from it, much like the ROCm LLVM can be quite divergent from upstream LLVM.
So when you take the happy path on Linux and use whatever kernel you happen to have installed, that's a codebase that went through whatever testing the kernel project does on the driver, and it reflects the fraction of a kernel dev branch that was upstream at that point in time. Sometimes it's very stable, sometimes it's really not. I stubbornly refuse to use the binary release of ROCm, use whatever driver is in Debian testing, and occasionally have a bad time with stability as a result. But that's because I'm deliberately running a bleeding-edge dev build, so bugs I stumble across have a chance of being fixed by me before users run into them.
I don't think people using `apt-get install rocm` necessarily know whether they're using a kernel the userspace is expected to work with or a dev version of excitement, since they look the same. The documentation says to use the approved Linux release - some Ubuntu flavour with a specific version number - but doesn't draw much attention to the expected experience if you ignore that instruction.
This is strongly related to the "approved cards list" that HN also hates. It literally means the release testing passed on the cards in that list, and the release testing was not run on the other ones. So you're back into the YMMV region, along with people like me stubbornly running non-approved gaming hardware on non-approved kernels with a bunch of code I built from source using a different compiler to the one used for the production binaries.
None of this is remotely apparent to me from our documentation but it does follow pretty directly from the games dev / HPC design space.
https://www.semianalysis.com/p/is-intel-back-foundry-and-pro...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
I can't speak to the ease of configuration but know that some people have used these successfully.
That's amusing. :D
What is the programming interface here? This is not CUDA, right? So how is this being done?
https://www.intel.com/content/dam/www/central-libraries/us/e...
Not the best of times for stuff that doesn't fit matrix processing units.
How much does a single 200 Gbit active (or passive) fiber cable cost? Probably thousands of dollars... making even the cabling for each card Very Expensive. Never mind the network switches themselves...
Simultaneously impressive and disappointing.
If you're going fiber instead of twinax it's another order of magnitude and a bit for transceivers, but the cables themselves are still pretty cheap.
You seem to be loading negative energy into this release from the get-go.
If you search for "AOC Fiber", lots of results will pop up. FS.com is one helpful resource.
https://community.fs.com/article/active-optical-cable-aoc-ri...
> Active optical cable (AOC) can be defined as an optical fiber jumper cable terminated with optical transceivers on both ends. It uses electrical-to-optical conversion on the cable ends to improve speed and distance performance of the cable without sacrificing compatibility with standard electrical interfaces.