The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.
I have seen some benchmarks from academia but nothing in the private sector.
I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada as long as possible.
Not having any competition whatsoever means Nvidia can release what they like when they like.
I got an email from Vultr saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all the public clouds are going to get those soon.
You can safely assume an entity bought as many as they could.
[1] https://www.qualcomm.com/products/technology/processors/clou...
[2] https://github.com/quic/software-kit-for-qualcomm-cloud-ai-1...
For anything that can be run remotely, it'll always be deployed and optimized server-side first. Higher utilization means more economy.
Then trickle down to local and end user devices if it makes sense.
Centralization of compute has not always won (even if that compute is mostly controlled by a single company). The failure of cloud gaming vs consoles, and the success of Apple (which is very centralized but pushes a lot of ML compute out to the edge) for example.
It was a slap in the face when the 4090 had the same memory capacity as the 3090.
The A6000 is $5,000; ain't no hobbyist at home paying for that.
If you are a business user then you must pay Nvidia gargantuan amounts of money.
This is the outcome of a market leader with no real competition - you pay much more for lower power than the consumer GPUs, and you are forced into using their business GPUs through software license restrictions on the drivers.
Given the size of LLMs, this should be possible with just a little bit of extra VRAM.
ATI seems to be holding the idiot ball.
Port Stable Diffusion and CLIP to their hardware. Train an upsized version sized for a 48GB card. Release a prosumer 48GB card... get huge uptake from artists and creators using the tech.
Whether or not there is real competition depends entirely on whether Intel's Arc line of GPUs stays in the market.
AMD has strangely decided not to compete. Its newest GPU, the 7900 XTX, is an extremely powerful card, close to the top-of-the-line Nvidia RTX 4090 in raster performance.
If AMD had introduced it at an aggressively low price, they could have wedged Nvidia, which is determined to exploit its market dominance by squeezing the maximum money out of buyers.
Instead, AMD has decided to simply follow Nvidia in squeezing for maximum prices, with AMD's prices only slightly below Nvidia's.
It's a strange decision from AMD, which is well behind in market share and apparently disinterested in increasing that share by competing aggressively.
So a third player is needed - Intel - since it's a lot harder for three companies to sit on outrageously high prices for years than to compete with each other for market share.
Since Intel GPUs are again TSMC manufactured, you really aren't going to see price improvements unless Intel subsidizes all of this.
This is not correct.
Much less powerful GPUs represent better value but the market is ridiculously overpriced at the moment.
- The Intel Falcon Shores XPU is basically a big GPU that can use DDR5 DIMMS directly, hence it can fit absolutely enormous models into a single pool. But it has been delayed to 2025 :/
- AMD have not mentioned anything about the (not delayed) MI300 supporting DIMMs. If it doesn't, it's capped to 128GB, and it's being marketed as an HPC product like the MI200 anyway (which you basically cannot find on cloud services).
Nvidia also has some DDR5 Grace CPUs, but the memory is embedded and I'm not sure how much of a GPU they have. Other startups (Tenstorrent, Cerebras, Graphcore and such) seem to have underestimated the memory requirements of future models.
That's the problem. Good DDR5 gives you <100GB/s of memory bandwidth, while Nvidia's HBM goes up to 2TB/s, and memory bandwidth is still the bottleneck for most applications.
Anyway, what I was implying is that simply fitting a trillion-parameter model into a single pool is probably more efficient than splitting it up over a power-hungry interconnect. Bandwidth is much lower, but latency is also lower, and you are shuffling much less data around.
I'm not saying they shouldn't bother with RAM at all, mind you. But given some target price, it's a balance thing between compute and RAM, and right now it seems that RAM is the bigger hurdle.
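To put rough numbers on the bandwidth point (my own back-of-envelope, not benchmarks from anyone in this thread): generating one token requires reading essentially all of the weights once, so memory bandwidth puts a hard ceiling on single-stream token throughput.

```python
# Back-of-envelope only; the bandwidth and model-size numbers are illustrative.
def tokens_per_second(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    weight_gb = params_billion * bytes_per_param   # e.g. 175B params * 2 bytes (fp16) = 350 GB
    return bandwidth_gb_s / weight_gb              # weights are streamed roughly once per token

for name, bw in [("DDR5, ~100 GB/s", 100.0), ("HBM, ~2 TB/s", 2000.0)]:
    print(f"{name}: ~{tokens_per_second(175, 2, bw):.2f} tokens/s for a 175B fp16 model")
```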
Depending on the model, the performance is sometimes not all that different. I believe that for inference alone on some models the speed difference may barely be noticeable, whereas for training it may make a 10+% difference [1]; see the rough sketch after the links below.
[0] https://pytorch.org/tutorials/intermediate/model_parallel_tu...
[1] https://huggingface.co/transformers/v4.9.2/performance.html
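For reference, here is a minimal sketch in the spirit of the naive model-parallel approach the PyTorch tutorial in [0] walks through (the layer sizes and device IDs are made up, and it assumes two visible CUDA devices): half the layers go on each GPU, and the activations cross the GPU-to-GPU link once per forward pass, which is exactly where NVLink vs. PCIe shows up.

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the layers on one GPU, second half on the other
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(4096, 1024).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        x = self.part2(x.to('cuda:1'))   # activations cross the interconnect here
        return x

model = TwoGPUNet()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)             # torch.Size([8, 1024]) cuda:1
```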
> “The reason we took [NVLink] off is that we need I/O for other things, so we’re using that area to cram in as many AI processors as possible,” Jen-Hsun Huang explained of the reason for axing NVLink.[0]
"NVLink is bad for your games and AI, trust me bro."
But then this card, actually aimed at ML applications, uses it.
0. https://www.techgoing.com/nvidia-rtx-4090-no-longer-supports...
It's also enormously more expensive and I'm not sure if you can buy it new without getting the nvidia compute server.
Previously, GPUs were designed for gamers, and no game really "needs" more than 16 GB of VRAM. I've seen reviews of the A100 and H100 cards saying that 80GB is ample for even the most demanding usage.
Now? Suddenly GPUs with 1 TB of memory could be immediately used, at scale, by deep-pocket customers happy to throw their entire wallets at NVIDIA.
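For a rough sense of the scale (my own rule-of-thumb numbers, not from the thread): fp16 weights are 2 bytes per parameter, and mixed-precision Adam training adds roughly another 12 bytes per parameter for fp32 master weights plus optimizer state.

```python
# Rule-of-thumb sizing only; real usage also depends on activations, KV cache, sharding, etc.
def vram_gb(params_billion: float, training: bool = False) -> float:
    weights = params_billion * 2                    # fp16/bf16 weights: 2 bytes per parameter
    if not training:
        return weights
    # mixed-precision Adam: fp32 master weights + momentum + variance ~= 12 extra bytes/param
    return weights + params_billion * 12

for p in (70, 175, 1000):
    print(f"{p}B params: ~{vram_gb(p):.0f} GB to load, ~{vram_gb(p, training=True):.0f} GB to train")
```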
This new H100 NVL model is a Frankenstein's monster stitched together from whatever they had lying around. It's a desperate move to corner the market as early as possible. It's just the beginning, a preview of the times to come.
There will be a new digital moat, a new capitalist's empire, built upon the scarcity of cards "big enough" to run models that nobody but a handful of megacorps can afford to train.
In fact, it won't be enough to restrict access by making the models expensive to train. The real moat will be models too expensive to run. Users will have to sign up, get API keys, and stand in line.
"Safe use of AI" my ass. Safe profits, more like. Safe monopolies, safe from competition.