I expect system lockups when doing any sort of model inference. From the experiences of the last few years I assume it is driver bugs. Based on their rate of improvement they probably will get there in around 2025, but their past performance has been so bad I wouldn't recommend buying a card for machine learning until they've proven that they're taking the situation seriously.
That said, in my opinion you should buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.
If you have an officially supported card https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and are using PyTorch, then you're pretty much good to go. Also, HIPify works pretty well these days.
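A quick way to see what PyTorch thinks it is running on is something like the sketch below. It assumes a reasonably recent PyTorch build; the function name `describe_torch_backend` is made up for illustration. ROCm builds of PyTorch reuse the `torch.cuda` namespace, so the same check covers both vendors.

```python
def describe_torch_backend() -> str:
    """Report which GPU backend (if any) the installed PyTorch build can use."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no GPU visible to PyTorch"
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA"
    return f"{backend}: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If this reports no visible GPU on a card that is on the support list, the problem is more likely the driver or ROCm install than PyTorch itself.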
I think where most people have been getting into trouble is with trying to run with unsupported cards (eg, *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale/available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code to? It's completely mental.
Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.
Here’s an example https://github.com/Const-me/Whisper/issues/42 BTW, a lot of BLAS in the compute shaders of that software.
I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.
CUDA has been the default tech for GPGPU in HPC or AI applications for more than a decade now. By now, people have found most of these driver bugs, and nVidia has fixed them.
Similarly, compute shaders are the only tech for GPGPU used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].
However, OpenCL and ROCm are niche technologies. I’ve been hearing complaints about driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI or other niche GPGPU applications.
> they're underestimating the power of a few good public "how to use the damn thing" sessions
I agree the learning curve is steep, and good materials are lacking. For an introductory article, see [2]. Ignore the parts about D3D10 hardware; the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader see [4], but that example is rather advanced because of optimizations, and because of weird memory layout conventions inherited from the upstream project.
[1] https://www.youtube.com/watch?v=TMorJX3Nj6U
[2] https://developer.download.nvidia.com/compute/DevZone/docs/h...
[3] https://github.com/jstoecker/dxcompute-docs/tree/main
[4] https://github.com/Const-me/Whisper/blob/master/ComputeShade...
But it really is. That's the state of the market. The video game artists are GPU programmers; they've hit DirectX11, DX12, and Vulkan with a wide variety of video games and have made that ecosystem very stable.
-------------
DX11 has 32-bit atomics only, so I don't think it's a very serious solution in practice. Even 64-bit atomics (especially 64-bit CAS) are already very limiting compared to the CPU world, where 128-bit CAS is needed to fix the obscure ABA problem.
DX and Vulkan just have... so much API-crap you need to even get Hello World / SAXPY up.
C++ AMP was wonderful back in 2014, but it too is stuck in DirectX11 and therefore the 32-bit atomic world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP IMO.
-------------
ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.
Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.
However, I'm not sure that's necessarily a huge deal. Probably not an issue for machine learning or BLAS stuff, these use cases don't need fine-grained thread synchronization.
For applications which would benefit from such synchronization, traditional lock-free techniques ported from CPU (i.e. compare and swap atomics on global memory) can be slow due to huge count of active threads on GPUs. I mean it's obviously better than locks, but sometimes it's possible to do something better instead of CAS.
> so much API-crap you need to even get Hello World / SAXPY up.
I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is IMO reasonable. Especially for ML or similar GPGPU stuff which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.
However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, which are relatively hard to implement on top of these queries. I think Microsoft should ship them both in the Windows SDK.
It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It’s not rocket science, the area is well researched and there’re many good articles on how to implement these BLAS routines efficiently.
Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.
It’s easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) these high-level libraries spend writing/reading temporary tensors.
It’s possible to re-shape immutable or rarely changing tensors into better memory layouts. Example for CPU compute https://stackoverflow.com/a/75567894/126995 the idea is equally good on GPUs.
It’s possible to use custom data formats, and they don’t require any hardware support. Upcasting BF16 to FP32 is 1 shader instruction (left shift), downcasting FP32 to BF16 only takes a few of them (for proper rounding), no hardware support necessary. Can pack quantized or sparse tensors into a single ByteAddressBuffer, again nothing special is required from hardware. Can implement custom compression formats for these tensors.
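To make the BF16 point concrete, here is the bit manipulation sketched in Python rather than HLSL. The helper names are invented for this example, and the rounding shown is the usual round-to-nearest-even trick, which is an assumption about what "proper rounding" means here. A shader would do the same integer operations on the raw bits.

```python
import math
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit IEEE 754 single-precision pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_to_f32(h: int) -> float:
    """Upcast: BF16 is the top 16 bits of an FP32, so one left shift is enough."""
    return bits_to_f32((h & 0xFFFF) << 16)


def f32_to_bf16(x: float) -> int:
    """Downcast with round-to-nearest-even on the 16 bits being dropped."""
    bits = f32_to_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Truncate and force a mantissa bit so the result is still a NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the lowest kept bit rounds ties towards an even result.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


if __name__ == "__main__":
    h = f32_to_bf16(3.14159265)
    print(hex(h), bf16_to_f32(h))
```

Only integer add, shift, and mask operations are involved, which is why no hardware BF16 support is needed.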
I'd say, though, that the whole ROCm runtime is in a bit of a weird situation.
But if you run anything 5.15-ish or later you don’t need proprietary drivers.
In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX6800-RX6950), but again, it also means that those cards (and every other card that AMD currently sells to individuals) are "unsupported."
[1] https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...
2. Works on more consumer grade cards
3. Ecosystem advantage (lots of software developed against an existing and well supported ecosystem)
I have a laptop with a mobile 2060 and a desktop with a top-of-the-line consumer 7900XTX. As of yet, the 7900XTX isn't officially supported (and I haven't bothered to go down the obnoxious rabbit hole to figure out how to compute on it). Meanwhile, I can load up CUDA.jl on my laptop in mere minutes with absolutely no fuss.
Edit: if there are any GPU gurus out there who are capable of working on AMDGPU.jl to make it work on cards like the 7900XTX out of the box and writing documentation/tutorials for it... start a Patreon. I bet you could fund some significant effort getting that up and running!
"Formal support for RDNA 3-based GPUs on Linux is planned to begin rolling out this fall, starting with the 48GB Radeon PRO W7900 and the 24GB Radeon RX 7900 XTX, with additional cards and expanded capabilities to be released over time." [0]
[0] https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring...
Why is it so difficult for other manufacturers to provide a compatibility layer? If Apple can make DirectX 12 work on Apple Silicon, surely AMD should be able to make CUDA (which has to be much simpler than DX12) work on their graphics cards? Are there some fundamental architectural differences that stop this from working?
A modern Linux system should have uptime measured in years with minimal effort. A modern Linux system with Nvidia GPUs will have uptime of weeks with a lot of fuss.
(I'm no expert, just someone who's managed a number of PCs and a few servers.)
With that said, Pop!OS does a really nice job of handling the Nvidia software stack - I've been running it on the laptop mentioned above for several years with no issues (though I don't leave my machines on 24/7).
It’s classic first mover advantage (plus just a better product / more resourcing to make it a better product honestly). I think you have to be a really massive scale to make the cost per card worth the cost per engineer math work out, unless AMD significantly closes the compatibility gap. But AMD’s job here is to fill a leaky bucket, because new CUDA code is being written every day, and they don’t seem serious about it.
Yes, it's multiplying and adding matrices. That and mapping some simple function over an array.
Neural networks are only that.
I've only just started using it for Llama running locally on my computer at home and I have to say... colour me impressed.
It generates the output slightly faster than reading speed so for me it works perfectly well.
The 24GB of VRAM should keep it relevant for a bit too and I can always buy another and NVLink them should the need arise.
If anything, I think models are going to shrink a bit, because assumptions around small models reaching capacity during training don’t seem fully accurate in practice[0]. We’re already starting to see some effects, like Phi-1[1] (a 1.3B code model outperforming 15B+ models), and BTLM-3B-8K[2] (a 3B model outperforming 7B models).
[0]: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...
[1]: https://arxiv.org/pdf/2306.11644.pdf
[2]: https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a...
That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power law for data vs. compute. In my mind, that highlights more the imprecision of the approach than the information-theoretic content of language semantics: an omniscient being would predict things better, so the closest thing to true entropy should be computed from the list of matching text prefixes among all texts ever.
However, that split isn’t automatic. You can’t expect to run a 40GB model on that, unless perhaps if it’s been designed for that—the way llama.cpp can split a model between the GPU and CPU, for instance.
What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.
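To show what a manual split looks like, here is a toy layer-placement sketch. Everything in it is an assumption for illustration: the function name, the idea that all layers are the same size, and the 2 GiB per card set aside for activations and context. Real loaders such as llama.cpp have their own logic.

```python
def split_layers(n_layers: int, layer_mib: int, gpu_free_mib: list[int]):
    """Greedily place contiguous layers on each GPU in order.

    Returns (layers_per_gpu, layers_left_for_cpu).
    """
    remaining = n_layers
    per_gpu = []
    for free in gpu_free_mib:
        # Whole layers only: a layer cannot straddle two devices in this model.
        fit = min(remaining, free // layer_mib)
        per_gpu.append(fit)
        remaining -= fit
    return per_gpu, remaining


if __name__ == "__main__":
    # Hypothetical 40 GiB model: 80 layers of 512 MiB each.
    # Each 24 GiB card is assumed to have 22 GiB (22528 MiB) usable for weights.
    print(split_layers(80, 512, [22528]))         # one card, rest on CPU
    print(split_layers(80, 512, [22528, 22528]))  # two cards
```

With these made-up numbers a single card holds 44 layers and leaves 36 for the CPU, while two cards hold everything. The point is that something has to decide the placement; the hardware will not do it for you.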
I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.
For 33b? It should be much faster.
What stack are you running? Llama.cpp and exLlama are SOTA as far as I know.
In a competitive market that line has distortions where one player tries to undercut the other.
There are no bargains because there is almost no competitive pressure and so there is barely any distortion in that line.
"There's a full blown run on GPU compute on a level I think people do not fully comprehend right now. Holy cow.
I've talked to a lot of vendors in the last 7 days. It's crazy out there y'all. NVIDIA allegedly has sold out its whole supply through the year. So at this point, everyone is just maximizing their LTVs and NVIDIA is choosing who gets what as it fulfills the order queue.
" [0]
Stop obsessing about cloud GPUs.
Go buy retail GPUs.
Make them work. Adapt your software. Whatever, stop whining about cloud GPUs just switch to retail.
Solve the barriers/limitations. This has been the essence of computing for 60 years. Stop being intimidated by Nvidia. Get your job done with what is available.
Buy AMD, buy Intel. Work out how to make your GPU thing work on them. Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.
INNOVATE. Look beyond nvidia. Stop whining.
If you’ve bet your entire business on cloud GPUs then you’re a fool. Bet on retail GPUs.
>Go buy retail GPUs.
If you're doing this professionally, you know what you need and what your budget is.
If you're doing this personally, to simply learn? I did the math for myself and figured I could probably buy enough gpu credits for the cost of a 4090 build that it would probably serve me all the way through getting a phd.
Then watch your R&D investment go up in flames after a driver update blocks such workloads.
I don't think this is possible when you can't pool memory on the 40 series retail GPUs.
Or, I rent just as much top performance hardware as I need, scaling as I go along, and worry about execution and implementation of my niche application instead. You can see why cloud is winning right now.
> AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive.
Edit: what about Intel arc GPU? Any hope there?
This has pretty much always been true. AMD cards always had more FLOPS and ROPs and memory bandwidth than the competing nVidia cards which benchmark the same. Is that a pro for AMD? Uhhhh doesn't really sound like it.
Don't AMD deliberately gimp their consumer cards to prevent cannibalising the pro cards? I vaguely recall reading about that a while back.
That being the case, they have already done the R&D but they chose to use the tech on the higher-margin kit, thus preventing hobbyists from buying AMD.
I'm not sure exactly what AMD expected to happen when doing that, especially when Nvidia continues to support CUDA on basically every GPU they've ever made: https://developer.nvidia.com/cuda-gpus#compute (looks like back to a GeForce 9400 GT, released in 2008)
Let's say
- I have a motherboard + cpu + other components and they've both got plenty of pcie lanes to spare, total this part draws 250W (incl the 25% extra wattage headroom)
- start off with one RTX 4090, TDP 450W, with headroom ~600W.
- I want to scale up by adding more 4090s over time, as many as my pcie lanes can support.
1. How do I add more PSUs over time?
2. Recommended initial PSU wattage? Recommended wattage for each additional pair of 4090s?
3. Recommended PSU brands and models for my use case?
4. Is it better to use PCI gen5 spec-rated PSUs? ATX 3.0? 12vhpwr cables rather than the ordinary 8-pin cables? I've also read somewhere that power cables between different brands of PSUs are *not* interchangeable??
5. Whenever I add an additional PSU, do I need to do something special to electrically isolate the PCIe slots?
6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.
Each GPU means another 600W. Let's say I want to add another PSU for every 2 4090s. I understand that to sync the bootup of multiple PSUs you need an add2psu adapter. I understand the motherboard can provide ~75W for a pcie slot. I take it that the rest comes from the psu power cables. I've seen conflicting advice online - apparently miners use pcie x1 electrically isolated risers for additional power supplies, but also I've seen that it's fine as long as every input power cable for 1 gpu just comes from one psu, regardless of whether it's the one that powers the motherboard. Either way x1 risers are an unattractive option bc of bandwidth limitations.
pls help
2. You can power limit GPUs down to 250W and barely lose any performance depending on your use case, highly recommend it. So any PSU that can provide those is good.
3. HP 1200w power supplies are both plentiful and cheap on ebay. Even though they are rated at 1200w, because they are so cheap, you're better off just running them at ~500w and buying several of them instead of overheating a single one. A nice benefit of running them at lower wattages is that the very loud tiny fan doesn't have to spin as hard and create a ton of noise.
4. Not needed, but having a single cable might be convenient, they are pretty expensive though.
5. You don't need to do anything special here, except if you add too many GPUs, the motherboard might have issues booting because the 75w per gpu draw is too much, but usually those motherboards will have an extra GPU power cable (like the ROMED8-2T) and some risers let you hook up the power cable directly to them so PCIe is only used for data transfer.
6. It's not the outlet, it's the circuit that matters. And keep in mind that whatever power wattage you set on the GPU, you need to account for ac/dc loss, so you need to add an additional ~10-15% to the usage.
If you power limit it to 250W, each additional GPU is essentially an extra ~280W or so. If you plan on having like 8 GPUs or more and you plan to run them 24/7, you're better off just calling a local colocation center and run it there, since they have much cheaper electricity cost, it comes out cheaper for you and you have all the benefits of being in a datacenter.
You're going to have a bad time with this assumption; typical non-kitchen household circuits in the U.S. are 15A for the circuit. Each outlet is usually limited to 15A, but the circuit breaker serving the entire circuit is almost certainly 15A as well; one outlet at maximum load will not leave capacity for another outlet on the same circuit to be simultaneously drawing maximum amperage.
Typical residential construction would have a 15A circuit for 1-2 rooms, often with a separate circuit for lighting. Some rooms, e.g. kitchens will have 20A circuits, and some houses may have been built with 20A circuits serving more outlets / rooms.
Combined with the ~1800W per 15A circuit restriction (I wouldn’t load the circuit to 100%, so really ~1600W) I’m not sure you can achieve what you’re going for.
If you’re really wanting to do this, consider adding a say 30A circuit near the circuit breaker of your home, usually the garage or basement and put the equipment there. I would get a dehumidifier in either location.
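The numbers in this subthread can be combined into a rough budget. The sketch below takes the figures quoted above (250W for the rest of the system, roughly 10-15% AC/DC loss, and about 1600W usable on a 15A/120V circuit) and counts how many GPUs fit. The function names and the 12% loss default are assumptions picked for illustration, and it ignores anything else plugged into the same circuit.

```python
import math


def wall_watts(dc_watts: float, conversion_loss: float = 0.12) -> float:
    """Power drawn at the wall for a given DC load, adding AC/DC conversion loss."""
    return dc_watts * (1.0 + conversion_loss)


def max_gpus_on_circuit(
    gpu_dc_watts: float,
    base_dc_watts: float = 250.0,
    circuit_budget_watts: float = 1600.0,
    conversion_loss: float = 0.12,
) -> int:
    """How many GPUs fit on one circuit after the rest of the system is powered."""
    available = circuit_budget_watts - wall_watts(base_dc_watts, conversion_loss)
    if available <= 0:
        return 0
    return math.floor(available / wall_watts(gpu_dc_watts, conversion_loss))


if __name__ == "__main__":
    for limit in (250, 450, 600):
        print(f"{limit}W per GPU -> {max_gpus_on_circuit(limit)} GPUs")
```

Under these assumptions a single 15A circuit carries four GPUs power-limited to 250W, two at the stock 450W, and only one if you budget the full 600W with headroom.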
Everyone talks about Nvidia GPUs and AMD MI250/MI300, where is Intel? Would love to have a 3rd player.
I guess what Intel is missing is a competitive PC GPU (ARC is far behind AMD and Nvidia cards), so it cannot establish a developer ecosystem, and its oneAPI is a hard sell for its AI plans.
Either make ARC or whatever GPU as good as Nvidia/AMD graphics cards, or at least make lots of great AI compute accelerator sticks to stay in the game; otherwise there is no future in the AI era for Intel, sadly.
I would imagine that someone really serious about training (or any other CUDA workload) uses both.
If you only care about ML stuff, sure, the calculation is different.
For occasional use, the major constraint isn't speed so much as which models fit. I tend to look at $/GB VRAM as my major spec. Something like a 3060 12GB is an outlier for fitting sensible models while being cheap.
I don't mind waiting a minute instead of 15 seconds for some complex inference if I do it a few times per day. Or having training be slower if it comes up once every few months.
Source: https://www.anandtech.com/show/18963/samsung-completes-initi...
https://www.crucial.com/memory/ddr5/ct2k32g48c40u5
If a $1000 GPU came with that, it would blow everything else out-of-the-water for model size. Speed? No. Model size? Yes.
If it came with 320GB, I could run ChatGPT-grade LLMs. That's $800 worth of DDR5.
Instead, I get 24GB on the 3090 or 4090 for $2k.
A $3k LLM-capable card would not be a hard expense to justify.
I just waited for a reputable seller to show up. I also limited myself to professional sales from the EU to avoid any potential import issues.
I guess there is always a risk involved, but probably more buying from a first time private profile with just one object listed, than from a business that sells every day with high rep.
But I guess it is to be expected that Nvidia doesn't want to cannibalize the 4080.
It's hard to blame nvidia when nobody seems to be trying to compete with them on the low end of ML and DL.
One issue is price, it costs almost twice as much. Another one is memory bandwidth, RTX 4000 SFF only delivers 320 GB/second. That is much slower than 4070Ti (504 GB/second) and slightly faster than 4060Ti (288 GB/second). Also the clock frequencies are half of 4070Ti, so the compute performance is worse.
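To see why the bandwidth numbers matter for LLM inference, a common rule of thumb (not something from this thread) is that single-stream token generation has to read every weight once per token, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below applies that to the three bandwidth figures above; the 13B model at 4 bits per weight is a hypothetical example, and real throughput will be lower.

```python
def decode_tokens_per_second_ceiling(
    bandwidth_gb_s: float, params_billion: float, bits_per_weight: float
) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once."""
    weight_gb = params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Bandwidth figures as quoted in the comment above (GB/s).
    cards = {"RTX 4000 SFF": 320.0, "4070Ti": 504.0, "4060Ti": 288.0}
    for name, bandwidth in cards.items():
        ceiling = decode_tokens_per_second_ceiling(bandwidth, 13.0, 4.0)
        print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

By this estimate the ordering of the cards for inference follows the bandwidth ordering, regardless of clock speeds.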
Specs: 2x RTX 3090, NVLink Bridge, 128GB DDR4 3200 RAM, Ryzen 7 3700X, X570 SLI mainboard, 2TB M.2 NVMe SSD, air cooled mesh case.
Finding the 3-slot nvlink bridge is hard and it's usually expensive. I think it's not worth it in most cases. I managed to find a cheap used one. Cooling is also a challenge. The cards are 2.7 slots wide and the spacing is usually 3 slots, so there isn't much room. Some people are putting 3d printed shrouds on the back of the PC case to suck the air out of the cards with an extra external fan. Also limiting the power from 350W to 280W or so per card doesn't cost a lot of performance. The CPU is not limiting the performance at all, as long as you have 4 cores per GPU you're good.
2x RTX 3090
128 GB DDR5
Intel core i9 600 series
Z790 Mainboard
I used Intel instead of AMD for the cpu, which pushed my prices higher... but I saved on the back side by skipping the NVLink Bridge.
Good to know I'm not missing much without the Bridge, since I get about 13tok/s on Llama-65B 4 bit if I push all layers onto the GPU.
ps: ~280W power limit is a good call, it won't heat up your room too much.
But after Stable Diffusion came out, I started to play around with it and was pleasantly surprised that the GPU could handle it!
The setup is a little messy, and Linux only.
For someone targeting AI, definitely pick an Nvidia card with 12+ GBs of VRAM.
4090 ($1,600) > 3090 ($1300 new - $600 used) > 3060 ($300)
A used 3090 is the ideal value. Lots of models will need the 24GB of VRAM.
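The $/GB VRAM comparison is easy to run on the prices listed above. This is only a sketch: prices are the ones quoted in this comment, VRAM sizes are 24GB for the 3090/4090 and 12GB for the 3060, and the tie-break (prefer more total VRAM) is my own assumption.

```python
# (name, price in USD, VRAM in GB), using the prices quoted above.
CARDS = [
    ("4090 new", 1600, 24),
    ("3090 new", 1300, 24),
    ("3090 used", 600, 24),
    ("3060 new", 300, 12),
]


def rank_by_vram_value(cards):
    """Sort by dollars per GB of VRAM; on a tie, prefer the card with more VRAM."""
    return sorted(cards, key=lambda card: (card[1] / card[2], -card[2]))


if __name__ == "__main__":
    for name, price, vram in rank_by_vram_value(CARDS):
        print(f"{name}: ${price / vram:.2f}/GB ({vram}GB)")
```

The used 3090 and the 3060 both come out at $25/GB, so the used 3090 wins only because it holds models twice as large.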
Two, I recommend ignoring electricity cost and using all you can. If it's cheaper now than it ever will be, use it while it's cheap. If it will go down due to renewables, nuclear, etc in the future, it's good to buy up the GPUs while their price is artificially depressed from energy fears.
Third, go for server type PSUs and breakout boards. The server PSUs can't be beaten in watts per dollar, and are extremely efficient.
Finally, consider scooping up some x79 and x99 xeon boards from Chinese sellers. They're cheap as hell, have PCIe lanes out the wazoo, etc. This means you don't have to fool with as many mobos to run the same number of gpus. If you go this route, don't get the bottom of the barrel no-name motherboards. Machinist is a decent one.
But Nvidia's monopoly means they cripple their retail cards and push the AI stuff to data centers.
If only there were many manufacturers of AI hardware and software, there would be abundant cheap products at every level.
AMD and Intel don’t seem to be able to compete and there’s no sign that will change.
So AI is going to remain expensive and hard to get for a very long time.