Which GPU(s) to Get for Deep Learning

(timdettmers.com)

223 points · snow_mac · 2y ago · 128 comments

96 comments · 24 top-level

roenxi2y ago· 16 in thread

Evaluating AMD GPUs by their specs is not going to paint the full picture. Their drivers are a serious problem. I've managed to get ROCm mostly working on my system (ignoring all the notifications of what is officially supported, the jammy debs from the official repo seem to work on Debian testing). The range of supported setups is limited so it is quite easy to end up in a similar situation.

I expect system lockups when doing any sort of model inference. From the experiences of the last few years I assume it is driver bugs. Based on their rate of improvement they probably will get there in around 2025, but their past performance has been so bad I wouldn't recommend buying a card for machine learning until they've proven that they're taking the situation seriously.

Although in my opinion, buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.

lhl2y ago

In the data center, I think AMD is a lot more viable than most people think. MosaicML recently did a test and were able to swap MI250s with A100s basically seamlessly, within a single training run even, and ran into no issues: https://www.mosaicml.com/blog/amd-mi250

If you have an officially supported card https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and are using PyTorch, then you're pretty much good to go. Also, HIPify works pretty well these days.

I think where most people have been getting into trouble is with trying to run with unsupported cards (eg, *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale/available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code to? It's completely mental.

FredPret2y ago

I got really excited until you said all of their consumer cards are out. That's even more infuriating - people have mammoth computing devices lying around and they can't make full use of them, because of drivers.

Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.

shiftpgdn2y ago

I think tinygrad is working on AMD and Snapdragon adoption.

figomore2y ago

You can use the WebGPU backend on tinygrad. It's working well for my test with an Nvidia 960 running inference (Unet 3D). I don't know how well WebGPU is supported on AMD GPUs.

Const-me2y ago

ROCm is not the only option; compute shaders are very reliable on all GPUs. And thanks to Valve’s work on DXVK 2.0, modern Linux runs Windows D3D11 software just fine.

Here’s an example https://github.com/Const-me/Whisper/issues/42 BTW, a lot of BLAS in the compute shaders of that software.

roenxi2y ago

I dunno, are they? AMD should pay someone to put up some "how to multiply a 2x2 matrix on our GPU for the average programmer!" tutorials somewhere obvious. I saw a lot of GPU lockups before I gave up on trying and decided that it wasn't worth it. Maybe compute shaders were a thing I should have tried. To be honest, I don't know much about them because my attempts in the space were shut down pretty hard by driver bugs linked to OpenCL and ROCm.

I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.

Const-me2y ago

> driver bugs linked to OpenCL and ROCm

CUDA is the default tech for GPGPU in HPC or AI applications, for more than a decade now. By now, people have found most of these driver bugs, and nVidia has fixed them.

Similarly, compute shaders are the only tech for GPGPU used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].

However, OpenCL and ROCm are niche technologies. I’ve been hearing complaints about driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies, over the bugs which only affect a few people working on HPC, AI or other niche GPGPU applications.

> they're underestimating the power of a few good public "how to use the damn thing" sessions

I agree the learning curve is steep, with the lack of good materials. For an introductory article, see [2]. Ignore the parts about D3D10 hardware, the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader see [4], but that example is rather advanced because of optimizations, and because of weird memory layout conventions inherited from the upstream project.

[1] https://www.youtube.com/watch?v=TMorJX3Nj6U

[2] https://developer.download.nvidia.com/compute/DevZone/docs/h...

[3] https://github.com/jstoecker/dxcompute-docs/tree/main

[4] https://github.com/Const-me/Whisper/blob/master/ComputeShade...

dragontamer2y ago

It disappoints me that DirectX remains one of the best GPU-compute solutions in practice right now. And Vulkan too I guess.

But it really is. That's the state of the market. The video game artists are GPU-programmers, they've hit DirectX11, DX12, and Vulkan with a wide variety of video games and have made that ecosystem very stable.

-------------

DX11 is 32-bit only atomics, I don't think it's a very serious solution in practice. 64-bit atomics (especially 64-bit CAS) are already very limiting compared to CPU-world where 128-bit CAS is needed to fix the obscure ABA-problem.

DX and Vulkan just have... so much API-crap you need to even get Hello World / SAXPY up.

C++ AMP was wonderful back in 2014, but it too is stuck in DirectX11 and therefore 32-bit atomic world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP IMO.

-------------

ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.

Const-me2y ago

> 32-bit only atomics, I don't think it's a very serious solution in practice

Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.

However, I'm not sure that's necessarily a huge deal. Probably not an issue for machine learning or BLAS stuff, these use cases don't need fine-grained thread synchronization.

For applications which would benefit from such synchronization, traditional lock-free techniques ported from CPU (i.e. compare and swap atomics on global memory) can be slow due to huge count of active threads on GPUs. I mean it's obviously better than locks, but sometimes it's possible to do something better instead of CAS.

> so much API-crap you need to even get Hello World / SAXPY up.

I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is IMO reasonable. Especially for ML or similar GPGPU stuff which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.

However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, which are relatively hard to implement on top of these queries. I think Microsoft should ship them both in Windows SDK.

singhrac2y ago

Sure, compute shaders might work, but don’t you need rocBLAS, rocSPARSE, MIOpen, etc? Are people reinventing those in compute shaders in another package?

Const-me2y ago

These things are nice to have, but you don’t actually need them.

It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It’s not rocket science, the area is well researched and there’re many good articles on how to implement these BLAS routines efficiently.

Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.

It’s easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) these high-level libraries spend writing/reading temporary tensors.

It’s possible to re-shape immutable or rarely changing tensors into better memory layouts. Example for CPU compute https://stackoverflow.com/a/75567894/126995 the idea is equally good on GPUs.

It’s possible to use custom data formats, and they don’t require any hardware support. Upcasting BF16 to FP32 is 1 shader instruction (left shift), downcasting FP32 to BF16 only takes a few of them (for proper rounding), no hardware support necessary. Can pack quantized or sparse tensors into a single ByteAddressBuffer, again nothing special is required from hardware. Can implement custom compression formats for these tensors.
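A minimal sketch of the BF16 point above, written in Python rather than HLSL so it can be run anywhere. It assumes the usual layout where a BF16 value is just the top 16 bits of an FP32 value, and it assumes round-to-nearest-even as the "proper rounding" for the downcast; the function names are made up for illustration and are not taken from the linked project.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a Python float as the 32 bits of an IEEE-754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret 32 bits as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_to_f32(h: int) -> float:
    """Upcast: BF16 is the upper half of an FP32, so one left shift is enough."""
    return bits_to_f32((h & 0xFFFF) << 16)


def f32_to_bf16(x: float) -> int:
    """Downcast with round-to-nearest-even (an assumed rounding mode)."""
    bits = f32_to_bits(x)
    # NaN: exponent all ones and a non-zero mantissa. Keep it a NaN after truncation.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit, then drop the lower 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


if __name__ == "__main__":
    for value in (1.0, -2.0, 3.140625, 0.1):
        h = f32_to_bf16(value)
        print(f"{value!r} -> 0x{h:04X} -> {bf16_to_f32(h)!r}")
```

The same few integer operations map directly onto shader instructions, which is why no dedicated hardware support is needed for the format itself.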

irusensei2y ago

What do you mean by drivers? The kernel ones? AMDGPU and KFD run out of the box and without problems for my use case so far.

I'd say though that the whole ROCm runtime is in a bit of a weird situation.

But if you run anything 5.15-ish or later you don’t need proprietary drivers.

lhl2y ago

The more relevant question is which GPU is the OP using? The only officially ROCm supported GPU available for retail purchase is the RDNA2-based Radeon Pro W6800. [1]

In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX6800-RX6950), but again, it also means those cards (and every other card that AMD currently sells to individuals) are "unsupported."

[1] https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...

JonChesterfield2y ago

Are you running the ROCm jobs on the same GPU as the system GUI? I use built from source rocm on debian with reasonable success, but I do remember gnome crashing pretty reliably when trying to run compute tests on my laptop.

api2y ago

How are the Windows drivers for AMD? OS shouldn't matter all that much if its primary role is to host or train models. As long as your code can run under the OS in question it's fine.

pantalaimon2y ago

I hope RustiCL will become a viable alternative there.

andy_ppp2y ago· 10 in thread

I hear a lot about CUDA and how bad ROCm is etc. and I’ve been trying to understand what exactly CUDA is doing that is so special; isn’t the maths for neural networks mostly multiplying large arrays/tensors together? What magic is CUDA doing that is so different for other vendors to implement? Is it just lock-in, the type of operations that are available, some kind of magical performance advantage or something else that CUDA is doing?

empyrrhicist2y ago

1. Driver stability

2. Works on more consumer grade cards

3. Ecosystem advantage (lots of software developed against an existing and well supported ecosystem)

I have a laptop with a mobile 2060 and a desktop with a top-of-the-line consumer 7900XTX. As of yet, the 7900XTX isn't officially supported (and I haven't bothered to go down the obnoxious rabbit hole to figure out how to compute on it). Meanwhile, I can load up CUDA.jl on my laptop in mere minutes with absolutely no fuss.

Edit: if there are any GPU gurus out there who are capable of working on AMDGPU.jl to make it work on cards like the 7900XTX out of the box and writing documentation/tutorials for it... start a Patreon. I bet you could fund some significant effort getting that up and running!

officialchicken2y ago

As of today, there is zero consumer card support from AMD. It is an option only if you have a PRO card.

"Formal support for RDNA 3-based GPUs on Linux is planned to begin rolling out this fall, starting with the 48GB Radeon PRO W7900 and the 24GB Radeon RX 7900 XTX, with additional cards and expanded capabilities to be released over time." [0]

[0] https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring...

empyrrhicist2y ago

Right, which SUCKS. Everyone who wants to prototype on their existing gear before jumping into a big pro card purchase is stuck with Nvidia, and the availability/performance of the software stack shows it.

andy_ppp2y ago

I’m asking at a lower level than this, CUDA presumably has a list of functionality for GPGPU stuff like tensors, loading data, splitting up training, and building pipelines of networks/attention stuff that can efficiently fit neural networks to many sorts of data.

Why is it so difficult for other manufacturers to provide a compatible layer? If Apple can make DirectX 12 work on Apple Silicon surely AMD should be able to make CUDA (which has to be much simpler than DX12) work on their graphics cards? Are there some fundamental architectural differences that stop this from working?

adfiognquio32y ago

Nvidia's software is also pretty atrocious, in my opinion. The output of various tools is cryptic, updates regularly result in a totally broken system, and things often stop working for no discernible reason. Nvidia GPUs are always the most finicky part of a system.

A modern Linux system should have uptime measured in years with minimal effort. A modern Linux system with Nvidia GPUs will have uptime of weeks with a lot of fuss.

(I'm no expert, just someone who's managed a number of PCs and a few servers.)

empyrrhicist2y ago

Right, but they can get away with that because they have essentially no competition.

With that said, Pop!OS does a really nice job of handling the Nvidia software stack - I've been running it on the laptop mentioned above for several years with no issues (though I don't leave my machines on 24/7).

mattnewton2y ago

Everyone else does the work to make sure it runs on cuDNN, because they bought the hardware when it was the only reasonable solution, and if it works on anything else that’s just a happy accident. So you’ll spend weeks of your incredibly expensive engineering or researcher time fighting compatibility issues because you saved $1k by going with an AMD card. Your researchers/engineers conclude it’s the only reasonable solution for now and build on Nvidia.

It’s classic first mover advantage (plus just a better product / more resourcing to make it a better product honestly). I think you have to be at a really massive scale to make the cost-per-card versus cost-per-engineer math work out, unless AMD significantly closes the compatibility gap. But AMD’s job here is to fill a leaky bucket, because new CUDA code is being written every day, and they don’t seem serious about it.

andy_ppp2y ago

Yup, filling the bucket could be worth hundreds of billions though. Maybe even trillions, seems like a sensible punt.

marcosdumay2y ago

> isn’t the maths for neural networks mostly multiplying large arrays/tensors together?

Yes, it's multiplying and adding matrices. That and mapping some simple function over an array.

Neural networks are only that.
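To make that concrete, here is a rough sketch of one dense layer in plain Python: a matrix multiply, a bias add, and a simple function (ReLU, chosen here only as an example) mapped over the result. The helper names and the numbers are made up for illustration; real frameworks do the same thing with GPU kernels.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def relu(m):
    """Map a simple element-wise function over the matrix."""
    return [[max(0.0, v) for v in row] for row in m]


def dense(x, w, b):
    """One fully connected layer: relu(x @ w + b)."""
    return relu(add_bias(matmul(x, w), b))


if __name__ == "__main__":
    x = [[1.0, 2.0]]                # one input row with two features
    w = [[1.0, -1.0], [0.5, 2.0]]   # 2x2 weight matrix (illustrative values)
    b = [0.0, -4.0]                 # bias vector (illustrative values)
    print(dense(x, w, b))
```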

yeahwhatever102y ago

It’s the inter GPU communication. Scatter and Gather have much worse performance on AMD GPUs.

ItsBob2y ago· 9 in thread

Just as an FYI/additional data point, I bought a 3090 FE from Ebay a few months ago for £605 including delivery.

I've only just started using it for Llama running locally on my computer at home and I have to say... colour me impressed.

It generates the output slightly faster than reading speed so for me it works perfectly well.

The 24GB of VRAM should keep it relevant for a bit too and I can always buy another and NVLink them should the need arise.

espadrine2y ago

> The 24GB of VRAM should keep it relevant for a bit too

If anything, I think models are going to shrink a bit, because assumptions around small models reaching capacity during training don’t seem fully accurate in practice[0]. We’re already starting to see some effects, like Phi-1[1] (a 1.3B code model outperforming 15B+ models), and BTLM-3B-8K[2] (a 3B model outperforming 7B models).

[0]: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

[1]: https://arxiv.org/pdf/2306.11644.pdf

[2]: https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a...

wongarsu2y ago

We had a long phase of "models aren't good enough but get better if we make them bigger, let's see how far we can go". This year we finally reached "some models are pretty great, let's see if we can do the same with smaller models". I'm excited for where this will take us.

XCSme2y ago

Is there any way to compute the "capacity" of a model? In theory, if it's encoding all data with 100% efficiency, I guess the data stored in the model should be something like 2^parameters count (weights + biases) ?

espadrine2y ago

There’s a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. There’s at least one set of weight values which minimizes it, for which the idealized bit-per-byte entropy can be computed.

That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power law for data vs. compute. In my mind, that highlights more the imprecision of the approach than the information-theoretic content of language semantics: an omniscient being would predict things better, so the closest thing to true entropy should be computed from the list of matching text prefixes among all texts ever.

[0]: https://arxiv.org/pdf/2001.08361

PeterStuer2y ago

Anyone with experience running 2 linked consumer GPUs want to chime in on how well this works in practice?

Filligree2y ago

You get a fast link between the GPUs, which should help when you’ve got a model split between them.

However, that split isn’t automatic. You can’t expect to run a 40GB model on that, unless perhaps if it’s been designed for that—the way llama.cpp can split a model between the GPU and CPU, for instance.

What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.

marcyb5st2y ago

I think you need enterprise grade cards for it to work. If I remember correctly consumer cards with NVLink can't share resources to host a 40GB model in VRAM.

gymbeaux2y ago

I bought a used 3090 FE from eBay for $600 too! Mine is missing the connector latch, but seems to be firmly inserted so I think fire risk is negligible.

I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.

brucethemoose22y ago

> It generates the output slightly faster than reading speed

For 33b? It should be much faster.

What stack are you running? Llama.cpp and exLlama are SOTA as far as I know.

nl2y ago· 6 in thread

You can tell how NVIDIA dominates the market by the fact their price/performance "curve" is almost a straight line.

In a competitive market that line has distortions where one player tries to undercut the other.

There are no bargains because there is almost no competitive pressure and so there is barely any distortion in that line.

MrBuddyCasino2y ago

I suppose this is one of the reasons (besides AMD dropping the ball) they aren't even trying to be competitive in the gaming market - they can sell the same mm² of silicon area for much more to AI startups:

"There's a full blown run on GPU compute on a level I think people do not fully comprehend right now. Holy cow.

I've talked to a lot of vendors in the last 7 days. It's crazy out there y'all. NVIDIA allegedly has sold out its whole supply through the year. So at this point, everyone is just maximizing their LTVs and NVIDIA is choosing who gets what as it fulfills the order queue.

" [0]

[0] https://twitter.com/Suhail/status/1683642991490269185

andrewstuart2y ago

Suhail lacks wisdom.

Stop obsessing about cloud GPUs.

Go buy retail GPUs.

Make them work. Adapt your software. Whatever, stop whining about cloud GPUs just switch to retail.

Solve the barriers/limitations. This has been the essence of computing for 60 years. Stop being intimidated by Nvidia. Get your job done in what is available.

Buy AMD, buy Intel. Work out how to make your GPU thing work on them. Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.

INNOVATE. Look beyond nvidia. Stop whining.

If you’ve bet your entire business on cloud GPUs then you’re a fool. Bet on retail GPUs.

wing-_-nuts2y ago

>Stop obsessing about cloud GPUs.

>Go buy retail GPUs.

If you're doing this professionally, you know what you need and what your budget is.

If you're doing this personally, to simply learn? I did the math for myself and figured I could probably buy enough gpu credits for the cost of a 4090 build that it would probably serve me all the way through getting a phd.

oskarw852y ago

>Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.

Then watch your R&D investment go up in flames after a driver update blocks such workloads.

cosentiyes2y ago

> Solve the barriers/limitations

I don't think this is possible when you can't pool memory on the 40 series retail GPUs.

epups2y ago

Ok, so make an enormous capital investment in tech that will be outdated in two years - if not already outdated, as you mentioned AMD and Intel. Then, I'll need to hire geniuses who can extract juice out of this hardware at a scale that not even Google, Amazon, Microsoft and others could.

Or, I rent just as much top performance hardware as I need, scaling as I go along, and worry about execution and implementation of my niche application instead. You can see why cloud is winning right now.

politelemon2y ago· 6 in thread

So Nvidia is going to pretty much corner the market for a long time? This bit I expected but was still sad to read. Surely we would benefit from competition. It would probably take a lot of investment from AMD to make that happen, I imagine.

> AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive.

Edit: what about Intel Arc GPUs? Any hope there?

formerly_proven2y ago

> AMD GPUs are great in terms of pure silicon

This has pretty much always been true. AMD cards always had more FLOPS and ROPs and memory bandwidth than the competing nVidia cards which benchmark the same. Is that a pro for AMD? Uhhhh doesn't really sound like it.

espadrine2y ago

That’s the one thing that I feel is a bit misleading in the article (to be fair, it was initially written years ago, and got rewritten a bit recently). FLOPS comparisons given in the wild are not always apples-to-apples (e.g. not including Tensor cores for NVIDIA, but including V_DUAL_DOT2ACC_F32_F16 for AMD), while on the flip side, AMD’s WMMA should address the same goals as Tensor cores. I have an article on comparing the two: https://espadrine.github.io/blog/posts/recomputing-gpu-perfo...

ItsBob2y ago

> It would probably take a lot of investment from AMD to make that happen, I imagine

Don't AMD deliberately gimp their consumer cards to prevent cannibalising the pro cards? I vaguely recall reading about that a while back.

That being the case, they have already done the R&D but they chose to use the tech on the higher-margin kit, thus preventing hobbyists from buying AMD.

lhl2y ago

A few years ago AMD split off their GPU architectures to CDNA (focused on data center compute) and RDNA (focused on rendering for gaming and workstations). This in itself is fine and what Nvidia was already doing, it makes sense to optimize silicon for each use case, but where AMD took a massive wrong turn is that they decided to stop supporting compute completely for their RDNA (and all legacy) cards.

I'm not sure exactly what AMD expected to happen when doing that, especially when Nvidia continues to support CUDA on basically every GPU they've ever made: https://developer.nvidia.com/cuda-gpus#compute (looks like back to a GeForce 9400 GT, released in 2008)

empyrrhicist2y ago

It's like they don't care about having a pipeline of programmers ready to use their hardware, and want to ignore most of the workstation market.

disintegore2y ago

Sadly this is still a market segment in which a proprietary stack dominates. From the perspective of AMD, they could be looking at a situation in which they can either throw billions of dollars at a monopoly protected by intellectual property law, and probably fail, or take a Pareto principle approach and cover their usual niche.

pizza2y ago· 4 in thread

Trying to build a scalable home 4090 cluster but running into a lot of confusion...

Let's say

- I have a motherboard + cpu + other components and they've both got plenty of pcie lanes to spare, total this part draws 250W (incl the 25% extra wattage headroom)

- start off with one RTX 4090, TDP 450W, with headroom ~600W.

- I want to scale up by adding more 4090s over time, as many as my pcie lanes can support.

    1. How do I add more PSUs over time? 

    2. Recommended initial PSU wattage? Recommended wattage for each additional pair of 4090s?

    3. Recommended PSU brands and models for my use case?

    4. Is it better to use PCIe Gen 5 spec-rated PSUs? ATX 3.0? 12VHPWR cables rather than the ordinary 8-pin cables? I've also read somewhere that power cables between different brands of PSUs are *not* interchangeable??

    5. Whenever I add an additional PSU, do I need to do something special to electrically isolate the PCIe slots?

    6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.

Each GPU means another 600W. Let's say I want to add another PSU for every 2 4090s. I understand that to sync the bootup of multiple PSUs you need an add2psu adapter.

I understand the motherboard can provide ~75W for a pcie slot. I take it that the rest comes from the psu power cables. I've seen conflicting advice online - apparently miners use pcie x1 electrically isolated risers for additional power supplies, but also I've seen that it's fine as long as every input power cable for 1 gpu just comes from one psu, regardless of whether it's the one that powers the motherboard. Either way x1 risers is an unattractive option bc of bandwidth limitations.

pls help

ftufek2y ago

1. You can pair normal ATX PSUs for the motherboard/CPU and server PSUs for the GPUs using breakout boards.

2. You can power limit GPUs down to 250W and barely lose any performance depending on your use case, highly recommend it. So any PSU that can provide that is good.

3. HP 1200W power supplies are both plentiful and cheap on eBay. Even though they are rated at 1200W, they are so cheap that you're better off running them at ~500W and buying several instead of overheating a single one. A nice benefit of running them at lower wattages is that the very loud tiny fan doesn't have to spin as hard and create a ton of noise.

4. Not needed, but having a single cable might be convenient, they are pretty expensive though.

5. You don't need to do anything special here, except if you add too many GPUs, the motherboard might have issues booting because the 75W per GPU draw is too much, but usually those motherboards will have an extra GPU power cable (like the ROMED8-2T) and some risers let you hook up the power cable directly to them so PCIe is only used for data transfer.

6. It's not the outlet, it's the circuit that matters. And keep in mind that whatever power wattage you set on the GPU, you need to account for AC/DC loss, so you need to add an additional ~10-15% to the usage.

If you power limit it to 250W, each additional GPU is essentially an extra ~280W or so. If you plan on having like 8 GPUs or more and you plan to run them 24/7, you're better off just calling a local colocation center and running it there, since they have much cheaper electricity cost, it comes out cheaper for you and you have all the benefits of being in a datacenter.
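The ~280W figure can be reproduced with a quick back-of-the-envelope script. This is only a sketch: the 12% conversion overhead is an assumed value picked from inside the ~10-15% range mentioned above, the 250W base system draw is taken from the question, and the function name is made up for illustration.

```python
def wall_power_watts(n_gpus: int,
                     gpu_limit_w: float = 250.0,
                     base_system_w: float = 250.0,
                     psu_overhead: float = 0.12) -> float:
    """Estimate draw at the wall for a power-limited multi-GPU box.

    psu_overhead is the assumed AC/DC conversion loss added on top of the
    DC load (0.12 here, an assumption inside the ~10-15% range).
    """
    dc_load_w = base_system_w + n_gpus * gpu_limit_w
    return dc_load_w * (1.0 + psu_overhead)


if __name__ == "__main__":
    # Cost of one extra power-limited GPU, ignoring the rest of the system.
    print(round(wall_power_watts(1, base_system_w=0.0)))
    # Whole system with 1 to 8 GPUs.
    for n in range(1, 9):
        print(n, round(wall_power_watts(n)))
```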

steffan2y ago

> 6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.

You're going to have a bad time with this assumption; typical non-kitchen household circuits in the U.S. are 15A for the circuit. Each outlet is usually limited to 15A, but the circuit breaker serving the entire circuit is almost certainly 15A as well; one outlet at maximum load will not leave capacity for another outlet on the same circuit to be simultaneously drawing maximum amperage.

Typical residential construction would have a 15A circuit for 1-2 rooms, often with a separate circuit for lighting. Some rooms, e.g. kitchens will have 20A circuits, and some houses may have been built with 20A circuits serving more outlets / rooms.
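As a rough illustration of why the circuit, not the outlet, is the budget: the sketch below computes how many GPUs fit on one breaker once everything else on that circuit is subtracted. The 0.8 derating factor for continuous loads is an assumption (a common rule of thumb, not something stated in this thread), and the per-GPU and other-load wattages are placeholder values.

```python
import math


def circuit_watts(breaker_amps: float, volts: float = 120.0) -> float:
    """Nominal capacity of a circuit, shared by every outlet on it."""
    return breaker_amps * volts


def max_gpus_on_circuit(breaker_amps: float,
                        per_gpu_wall_w: float,
                        other_load_w: float = 0.0,
                        derate: float = 0.8,
                        volts: float = 120.0) -> int:
    """How many GPUs fit after derating and subtracting other loads.

    derate=0.8 is an assumed safety margin for loads that run for hours.
    """
    usable_w = circuit_watts(breaker_amps, volts) * derate - other_load_w
    if usable_w <= 0:
        return 0
    return math.floor(usable_w / per_gpu_wall_w)


if __name__ == "__main__":
    print(circuit_watts(15))  # the ~1800W figure from the question
    # Placeholder loads: ~280W at the wall per power-limited GPU,
    # ~280W for the rest of the machine on the same circuit.
    print(max_gpus_on_circuit(15, 280.0, other_load_w=280.0))
    print(max_gpus_on_circuit(20, 280.0, other_load_w=280.0))
```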

gymbeaux2y ago

So those miner motherboards with the crap ton of PCIe x1 slots typically have a molex connector on the motherboard for each of those slots. Molex is famous for starting fires. I’m not sure I would ever go with a setup with molex connectors, but then I’m not sure you have another option. The issue is if they used PCIe power connectors instead, you often wouldn’t have enough of those left over for your GPU, so I get why they went with molex, it’s just a very old, and by modern standards crappy connector.

Combined with the ~1800W per 15A circuit restriction (I wouldn’t load the circuit to 100%, so really ~1600W) I’m not sure you can achieve what you’re going for.

If you really want to do this, consider adding, say, a 30A circuit near the circuit breaker of your home, usually the garage or basement, and put the equipment there. I would get a dehumidifier in either location.

Tepix2y ago

Have you read Tim's guide?

synergy202y ago· 4 in thread

4090 is now in high end PCs, with 24GB VRAM, that's what I'm going to buy.

Everyone talks about Nvidia GPUs and AMD MI250/MI300, where is Intel? Would love to have a 3rd player.

whywhywhywhy2y ago

Consider the 3090, same memory but was way cheaper than the 4090 when I was looking, might be a good trade off if you don't really need the 40-series speed boost.

synergy202y ago

3090 is still underpowered by quite a bit, though it does have 24GB.

singhrac2y ago

Intel has Habana Gaudi2, which is an A100 competitor, but you can only access it on Intel’s developer cloud, apparently.

synergy202y ago

Yes, even MI300 from AMD is data center only just like A100 and H100.

I guess what Intel is missing is that it does not have a PC GPU (Arc is far behind AMD and Nvidia GPU cards), so it cannot establish its developer ecosystem and its oneAPI is a hard sell for its AI plan.

Either make Arc or whatever GPU as good as Nvidia/AMD graphics cards, or at least make lots of great AI compute accelerator sticks to stay in the game, or there is no future in the AI era for Intel, sadly.

xnx2y ago· 4 in thread

Do local GPUs make sense? For the same price, can't you get a full year's worth of cloud GPU time?

Yenrabbit2y ago

Cloud GPU providers are running low on capacity at the moment as people frantically suck up capacity to hop on the AI bandwagon, raising worries about availability. So having guaranteed access is maybe one motivation for local GPUs. But for me the main reason to go local is more psychological. I've mostly used cloud compute up until now but whenever I'm paying an hourly cost (even a small one) there is a pressure to 'make it worthwhile' and I feel guilty when the GPU is sitting idle. This disincentivizes playing and experimentation, whereas when you can run things locally there is almost no friction for quickly trying something out.

disintegore2y ago

Looking at the pricing, if you only spin those instances up when you need them, you can go a while before you break even. Otherwise it only takes a few months depending on the GPU.

I would imagine that someone really serious about training (or any other CUDA workload) uses both.
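The break-even point mentioned above can be sketched as a small calculation. This is only an illustration: the function name `break_even_hours` and every number in the example (hardware cost, hourly cloud rate, power draw, electricity price) are placeholder assumptions, not figures from the thread or from any provider's price list.

```python
def break_even_hours(hardware_cost: float,
                     cloud_rate_per_hour: float,
                     local_power_kw: float = 0.0,
                     electricity_per_kwh: float = 0.0) -> float:
    """Hours of GPU use at which buying hardware costs the same as renting.

    All inputs are in the same currency. Electricity is the only running
    cost modelled for the local machine (an assumption; resale value,
    cooling and your own time are ignored).
    """
    local_rate_per_hour = local_power_kw * electricity_per_kwh
    saving_per_hour = cloud_rate_per_hour - local_rate_per_hour
    if saving_per_hour <= 0:
        raise ValueError("with these inputs, local use never pays for itself")
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Placeholder numbers, chosen only to show the shape of the trade-off.
    hours = break_even_hours(hardware_cost=2000.0,
                             cloud_rate_per_hour=0.50,
                             local_power_kw=0.45,
                             electricity_per_kwh=0.15)
    for hours_per_day in (2, 8, 24):
        print(f"{hours_per_day:>2} h/day -> break even after "
              f"{hours / hours_per_day:.0f} days")
```

The result depends almost entirely on utilization: the same card breaks even quickly when it runs around the clock and very slowly when it is only used for a couple of hours a day, which is consistent with both views in this subthread.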

wing-_-nuts2y ago

Having looked at the pricing of retail card vs cloud, I came to the conclusion I could probably buy enough cloud compute to complete a phd before I 'paid for' the cost of a 4090 build...

TillE2y ago

Buying a high-end gaming GPU also lets you do, well, high-end gaming, 3D and video renders, etc.

If you only care about ML stuff, sure, the calculation is different.

frognumber2y ago· 3 in thread

I think there's one more axis: Frequency-of-use.

For occasional use, the major constraint isn't speed so much as which models fit. I tend to look at $/GB VRAM as my major spec. Something like a 3060 12GB is an outlier for fitting sensible models while being cheap.

I don't mind waiting a minute instead of 15 seconds for some complex inference if I do it a few times per day. Or having training be slower if it comes up once every few months.

_cnmh2y ago

Hopefully the next generation of cards has high-VRAM variants.

bick_nyers2y ago

"As for capacity, Samsung’s first GDDR7 chips are 16Gb, matching the existing density of today’s top GDDR6(X) chips. So memory capacities on final products will not be significantly different from today’s products, assuming identical memory bus widths. DRAM density growth as a whole has been slowing over the years due to scaling issues, and GDDR7 will not be immune to that."

Source: https://www.anandtech.com/show/18963/samsung-completes-initi...

frognumber2y ago

I can buy a DDR5 64GB kit from Crucial for $160.

https://www.crucial.com/memory/ddr5/ct2k32g48c40u5

If a $1000 GPU came with that, it would blow everything else out-of-the-water for model size. Speed? No. Model size? Yes.

If it came with 320GB, I could run ChatGPT-grade LLMs. That's $800 worth of DDR5.

Instead, I get 24GB on the 3090 or 4090 for $2k.

A $3k LLM-capable card would not be a hard expense to justify.
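The arithmetic in the comment above can be checked in a few lines. The sketch uses only the numbers quoted in the comment ($160 for a 64GB DDR5 kit, 24GB for about $2k on a 3090/4090); the assumption that price scales linearly with capacity is mine, and the per-GB figure for the GPU is the price of the whole card, not of its memory chips.

```python
# Figures quoted in the comment above.
KIT_CAPACITY_GB = 64
KIT_PRICE_USD = 160

# Linear scaling is an assumption; real pricing varies by kit and density.
DDR5_USD_PER_GB = KIT_PRICE_USD / KIT_CAPACITY_GB


def ddr5_cost_usd(capacity_gb: float) -> float:
    """Cost of `capacity_gb` of DDR5 at the quoted per-GB price."""
    return capacity_gb * DDR5_USD_PER_GB


# Whole-card price divided by VRAM, for comparison only.
GPU_USD_PER_GB = 2000 / 24

if __name__ == "__main__":
    print(f"DDR5: ${DDR5_USD_PER_GB:.2f}/GB")
    print(f"320GB of DDR5: ${ddr5_cost_usd(320):.0f}")
    print(f"24GB card at $2k: ${GPU_USD_PER_GB:.2f} per GB of VRAM")
```

This confirms the $800 figure for 320GB. It says nothing about speed, which the comment already concedes.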

PeterStuer2y ago· 3 in thread

I'm sticking with nVidia for now (currently a 3090 bought secondhand off eBay) as it is the most tested/supported by far, but it is great to see AMD making progress (finally), as some competition in this segment is desperately needed.

fnands2y ago

Any tips for getting one off ebay without getting screwed? I want to pull the trigger, but a bit scared.

PeterStuer2y ago

It was my first purchase off eBay, so I'm not sure I can advise much.

I just waited for a reputable seller to show up. I also limited myself to professional sales from the EU to avoid any potential import issues.

I guess there is always a risk involved, but probably more buying from a first time private profile with just one object listed, than from a business that sells every day with high rep.

stephenitis2y ago

Ditto. Second-hand graphics cards are such a wild west to me.

arvinsim2y ago· 3 in thread

Really a shame that the 4070ti doesn't have 16GB.

But I guess it is to be expected that Nvidia doesn't want to cannibalize the 4080.

teruakohatu2y ago

Every level below the *100 series has some sort of limitation to give incentives to upgrade one or two levels.

It's hard to blame nvidia when nobody seems to be trying to compete with them on the low end of ML and DL.

Const-me2y ago

nVidia has a 20 GB GPU with the same chip as 4070Ti, the model is RTX 4000 SFF.

One issue is price, it costs almost twice as much. Another one is memory bandwidth, RTX 4000 SFF only delivers 320 GB/second. That is much slower than 4070Ti (504 GB/second) and slightly faster than 4060Ti (288 GB/second). Also the clock frequencies are half of 4070Ti, so the compute performance is worse.

XCSme2y ago

> RTX 4000 SFF

Max Power Consumption - 70W.

Huh?

1 more reply

Tepix2y ago· 2 in thread

I used Tim's guide to build a dual RTX 3090 PC, paying 2300€ in total by getting used components. It can run inference of Llama-65B 4bit quantized at more than 10tok/s.

Specs: 2x RTX 3090, NVLink Bridge, 128GB DDR4 3200 RAM, Ryzen 7 3700X, X570 SLI mainboard, 2TB M.2 NVMe SSD, air cooled mesh case.

Finding the 3-slot nvlink bridge is hard and it's usually expensive. I think it's not worth it in most cases. I managed to find a cheap used one. Cooling is also a challenge. The cards are 2.7 slots wide and the spacing is usually 3 slots, so there isn't much room. Some people are putting 3d printed shrouds on the back of the PC case to suck the air out of the cards with an extra external fan. Also limiting the power from 350W to 280W or so per card doesn't cost a lot of performance. The CPU is not limiting the performance at all, as long as you have 4 cores per GPU you're good.
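A rough check of why a 4-bit Llama-65B fits on this build: the weights alone take about half a byte per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions. It counts weights only (no KV cache, activations, or quantization metadata such as scales), treats 1 GB as 10^9 bytes, and the function name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Weights only: KV cache, activations and quantization overhead are
    ignored, so real memory use is higher.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Two RTX 3090s at 24GB each, as in the build described above.
TOTAL_VRAM_GB = 2 * 24

if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_footprint_gb(65e9, bits)
        verdict = "fits" if size < TOTAL_VRAM_GB else "does not fit"
        print(f"65B at {bits}-bit: {size:.1f} GB -> {verdict} in {TOTAL_VRAM_GB} GB")
```

At 4 bits the weights come to roughly 32.5 GB, which leaves some headroom in 48 GB but would not fit on a single 24GB card.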

horsawlarway2y ago

My build is close to this. I purchased everything new except the 3090s, and I paid about $3000.

2x RTX 3090

128 GB DDR5

Intel core i9 600 series

Z790 Mainboard

I used Intel instead of AMD for the cpu, which pushed my prices higher... but I saved on the back side by skipping the NVLink Bridge.

Good to know I'm not missing much without the bridge, since I get about 13tok/s on Llama-65B 4-bit if I push all layers onto the GPU.

bwv8482y ago

Managed to snatch a 3090 during the GPU shortage in 2020. Did a lot of training and mining, and got some of my results published; I think I gained much more than the cost of the hardware. Kinda miss the days of ETH mining. The 3090 is still a good card and I'm pretty sure your rig is going to serve you well.

ps: ~280W power limit is a good call, it won't heat up your room too much.
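For anyone wanting to try the ~280W limit discussed here, a minimal sketch follows. It only builds the `nvidia-smi` command line (the `-i` and `-pl` options select the GPU index and set the power limit in watts); actually applying it normally needs root, and the allowed range depends on the card. The helper name `power_limit_command` is hypothetical.

```python
def power_limit_command(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi invocation that caps one GPU's power draw.

    The command is returned rather than executed, because running it
    requires an NVIDIA driver and (usually) root privileges.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    # Dual-GPU build: print one command per card for review.
    for index in (0, 1):
        print("sudo " + " ".join(power_limit_command(index, 280)))
```

To apply a command from a script, pass the returned list to `subprocess.run(..., check=True)`. The limit typically does not persist across reboots, so it has to be reapplied at startup.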

fnands2y ago· 1 in thread

App based on this post to help you decide what to buy: https://nanx.me/gpu/

cosmojg2y ago

TL;DR, your best option right now is the RTX 4090 with the budget picks being either a used RTX 3090 or a used RTX 3090 Ti.

graton2y ago· 1 in thread

I almost immediately became suspicious of the accuracy of this article when they said the "Nvidia RTX 40 Ampere series". Ampere was the architecture name for the RTX 30 series. Ada Lovelace is the architecture name for the RTX 40 series.

fnands2y ago

Probably just an accident. Tim Dettmers has been updating this post for years and it's a super valuable resource.

savandriy2y ago

I bought a Radeon RX 6700XT (12GB) last year, primarily for playing games.

But after Stable Diffusion came out, I started to play around with it and was pleasantly surprised that the GPU could handle it!

The setup is a little messy, and Linux only.

For someone targeting AI, definitely pick an Nvidia card with 12+ GBs of VRAM.

reducesuffering2y ago

You'll want lots of memory, so depends on your price point.

4090 ($1,600) > 3090 ($1300 new - $600 used) > 3060 ($300)

A used 3090 is the ideal value. Lots of models will need the 24GB of VRAM.
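The ranking above can be put next to the $/GB VRAM metric suggested elsewhere in the thread. The prices are the ones quoted in this comment and the VRAM sizes are the ones mentioned in the thread (24GB for the 3090 and 4090, 12GB for the 3060). The metric deliberately ignores speed, so treat it as a sketch of one axis only.

```python
# (price in USD, VRAM in GB), taken from the comment above and the thread.
CARDS = {
    "RTX 4090": (1600, 24),
    "RTX 3090 (new)": (1300, 24),
    "RTX 3090 (used)": (600, 24),
    "RTX 3060 12GB": (300, 12),
}


def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """Price per GB of VRAM; lower is better for fitting models cheaply."""
    return price_usd / vram_gb


# Cheapest VRAM first. Ties keep their original order (sorted is stable).
ranked = sorted(CARDS.items(), key=lambda item: dollars_per_gb(*item[1]))

if __name__ == "__main__":
    for name, (price, vram) in ranked:
        print(f"{name:<16} {vram:>2} GB  ${dollars_per_gb(price, vram):6.2f}/GB")
```

On these numbers the used 3090 and the 3060 12GB tie on price per GB, but only the 3090 reaches 24GB on a single card, which matches the "ideal value" verdict above.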

paul_funyun2y ago

One, don't use a case. Look at how miners mounted their hardware on racks and take notes. Cheaper, better for temps, and the most efficient use of space.

Two, I recommend ignoring electricity cost and using all you can. If it's cheaper now than it ever will be, use it while it's cheap. If it will go down due to renewables, nuclear, etc in the future, it's good to buy up the GPUs while their price is artificially depressed from energy fears.

Three, go for server-type PSUs and breakout boards. Server PSUs can't be beaten in watts per dollar, and they are extremely efficient.

Finally, consider scooping up some X79 and X99 Xeon boards from Chinese sellers. They're cheap as hell, have PCIe lanes out the wazoo, etc. This means you don't have to fool with as many mobos to run the same number of GPUs. If you go this route, don't get the bottom-of-the-barrel no-name motherboards. Machinist is a decent one.

andrewstuart2y ago

There’s clearly demand to buy AI capable GPUs at the store at a low price.

But Nvidia's monopoly means they cripple their retail cards and push the AI stuff to data centers.

If only there were many manufacturers of AI hardware and software, there would be abundant cheap products at every level.

AMD and Intel don’t seem to be able to compete and there’s no sign that will change.

So AI is going to remain expensive and hard to get for a very long time.

jcuenod2y ago

Any advice for mobile gpus? I'm interested in getting a laptop (preferably in the portable category). Obviously it's not going to be in 4090 territory, that's a tradeoff I'm willing to make.

adultSwim2y ago

Weird to leave out Apple. They seem to be the cheapest option to get a large amount of GPU memory.

justinclift2y ago

Raw performance rating for the RTX 3070 seems very weirdly placed in the chart. It's below the RTX 3060 Ti, which doesn't seem to make any sense.

lyapunova2y ago

I never tire of this. Tim is a wonderful no nonsense person. I love these posts and I love that it stays up to date.

kristianp2y ago

For a compromise, how is the recently released 4060 Ti with 16GB RAM? It's about a third the price of a 4090.

32gbsd2y ago

Omg that's a long read but very informative
