smallnamespace 7d ago

It’s architecturally not a good approach. System RAM is much slower so you should put data that doesn’t need to be used often on it. That knowledge is at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.

The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.

jbverschoor 7d ago

Not true for unified systems. And for Strix Halo you need to dedicate the amount, which is annoying.

You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea because there’s L1 cache/SRAM, which is faster than the rest.

Tuna-Fish 7d ago

On some workloads, swapping is a bad idea.

The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.

This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.

In the long term, compute is probably going to move towards the memory.
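
The anti-locality point above can be checked with a toy simulation. This is only a sketch: `lru_hit_rate` is a hypothetical helper, the "weights" are just block indices, and real GPU drivers do not necessarily use plain LRU. It shows that an LRU cache holding even 99% of the blocks gets no hits at all when the blocks are read in order, over and over.

```python
from collections import OrderedDict

def lru_hit_rate(num_blocks, cache_blocks, passes):
    """Hit rate of an LRU cache when blocks 0..num_blocks-1 are read in order, repeatedly."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = total = 0
    for _ in range(passes):
        for block in range(num_blocks):
            total += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)       # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict the least recently used block
    return hits / total

# Cache one block too small: the block about to be needed is always the one just evicted.
print(lru_hit_rate(num_blocks=100, cache_blocks=99, passes=10))   # 0.0
# Cache fits everything: only the first pass misses.
print(lru_hit_rate(num_blocks=100, cache_blocks=100, passes=10))  # 0.9
```

The cliff between the two cases is why "a very large pool of very fast RAM" is the fallback: with this access pattern a slightly-too-small fast tier behaves like no fast tier.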

zozbot234 7d ago

The main blocker with swapping is not even the limited bandwidth, it's actually the extreme write workload on data elements such as the per-layer model activations - and, to a much lesser extent, the KV-cache. In contrast, there are elements such as inactive experts for highly sparse MoE models, where swapping makes sense since any given expert will probably be unused. You're better off using that VRAM/RAM for something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, finally use storage as a last resort or for read-only data" is still quite valid.
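
The tiering logic in this comment can be sketched as a greedy placement. Everything concrete here is an assumption for illustration: the `place` helper, the tensor names, the sizes in GB, and the "touches per token" scores are made up, not measured from any model or taken from the project under discussion.

```python
def place(tensors, tiers):
    """Greedy placement: the most frequently touched data goes to the fastest tier with room.

    tensors: list of (name, size_gb, touches_per_token)
    tiers:   list of (tier_name, capacity_gb), fastest first
    """
    free = [capacity for _, capacity in tiers]
    placement = {}
    # Visit tensors from most to least frequently touched.
    for name, size, _touches in sorted(tensors, key=lambda t: t[2], reverse=True):
        for i, (tier_name, _) in enumerate(tiers):
            if size <= free[i]:
                free[i] -= size
                placement[name] = tier_name
                break
        else:
            raise ValueError(f"{name} does not fit in any tier")
    return placement

# Hypothetical numbers: a 24 GB GPU, 64 GB of system RAM, storage treated as unbounded.
tiers = [("vram", 24), ("system_ram", 64), ("storage", float("inf"))]
tensors = [
    ("activations", 2, 1.0),      # rewritten every layer, so keep in VRAM
    ("kv_cache", 6, 0.9),
    ("dense_weights", 14, 0.8),
    ("hot_experts", 20, 0.3),
    ("cold_experts", 100, 0.01),  # read-only and rarely used
]
placement = place(tensors, tiers)
print(placement)
```

With these inputs the write-heavy activations and KV-cache stay in VRAM, the hot experts spill to system RAM, and the cold experts end up on storage, which matches the ordering argued for above.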

dataflow 7d ago

> You’re basically stating that swapping is also a bad idea.

Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.

dahart 6d ago

Try turning swap off and really find out if you’re not grateful for it. Might be fine if you’re never using all your RAM, but if you are, swap off isn’t fun and you might realize you’ve been unconsciously grateful this whole time. ;) Swap might be important for GPU usage even when not using something like greenboost, since display GPUs sometimes use system RAM to back the GPU VRAM.

literalAardvark 6d ago

If you've ever used any unreserved VM, you're grateful for swapping.

Somewhat indirectly but still.

Tsiklon 6d ago

Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main memory pool appropriately on both Linux and Windows.

stuaxo 6d ago

Does this work on the open source amdgpu drivers?

I've been a bit too busy to turn mine on for a while.

imtringued 7d ago

It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.

It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.

Yes, swapping is a bad idea.

Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have cache for system RAM when they don't is pretty suspect.

timnetworks 7d ago

Some people are not concerned with having it run the fastest, just having it run at all may be enough.

m-schuetz 7d ago

From my experience, accessing system RAM from the GPU is so slow that it might as well count as "does not work". It's orders of magnitude faster to memcpy the large swaths of memory you are going to use over to the GPU than to access system memory from a kernel, which then waits ages for one small block/page of memory, then waits again for the next one, and so on. Latency hiding doesn't work anymore if the latency is that large.
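
A back-of-the-envelope model of that difference, as a sketch only. The two helper functions are hypothetical, and the numbers (4 KiB pages, 10 µs per on-demand page fetch, 25 GB/s link bandwidth) are illustrative assumptions, not measurements of any particular GPU or driver. The model also assumes the worst case where page fetches are fully serialized.

```python
def paged_access_seconds(total_bytes, page_bytes, latency_s, bandwidth_bps):
    """Every page is fetched on demand and pays the full round-trip latency, with no overlap."""
    pages = -(-total_bytes // page_bytes)  # ceiling division
    return pages * (latency_s + page_bytes / bandwidth_bps)

def bulk_copy_seconds(total_bytes, latency_s, bandwidth_bps):
    """One large transfer pays the latency once, then streams at link bandwidth."""
    return latency_s + total_bytes / bandwidth_bps

GIB = 1024 ** 3
paged = paged_access_seconds(GIB, page_bytes=4096, latency_s=10e-6, bandwidth_bps=25e9)
bulk = bulk_copy_seconds(GIB, latency_s=10e-6, bandwidth_bps=25e9)
print(f"paged: {paged:.3f} s, bulk: {bulk:.3f} s, ratio: {paged / bulk:.0f}x")
```

Under these assumed numbers the per-page latency term dominates completely, so the on-demand path comes out tens of times slower for the same gigabyte. Larger pages or overlapping many outstanding requests would shrink the gap, which is the "latency hiding" that stops working once the latency gets too large.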

dahart 6d ago

You’re right for some workloads, but not all of them. The same could have been said for disk swap since the beginning though, and people still found it valuable. Disk swapping with spinning drives used to be multiple orders of magnitude slower than RAM. But it prevented applications or the system from crashing.

Using system memory from the GPU isn’t that bad if your compute is high enough and you don’t transfer that much data. There are commercial applications that support it and only see low 2-digit percentage perf impact and not the multiples you might expect. Plus on Windows on Nvidia hardware, the driver will automatically use system memory if you oversubscribe VRAM, and I believe this was introduced to support running Stable Diffusion on smaller GPUs.

nl 7d ago

But then you can use CPU/RAM offload, which already allows you to offload without a kernel module.

jmward01 6d ago

> It’s architecturally not a good approach.

Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that, it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:

- Models that use a lot of weight reuse: If you strategically reuse layers 3-4x that could give a lot of time for async loading of future weights.

- Models that select experts for several layers at a time: Same thing; while crunching on the current layer you have teed up future layers that can be transferring in.

- HW makers start improving memory bandwidth: This is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, though still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine that could help; things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now and the equation may change.
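
The first two bullets come down to overlapping the transfer of the next layer with compute on the current one. A rough timing model, as a sketch: the function names, the `reuse` parameter, and the per-layer times below are hypothetical, and it assumes simple double buffering where only one layer can be in flight at a time.

```python
def serial_seconds(layers, load_s, compute_s, reuse=1):
    """Load a layer, then compute on it `reuse` times, with no overlap."""
    return layers * (load_s + reuse * compute_s)

def pipelined_seconds(layers, load_s, compute_s, reuse=1):
    """Double buffering: layer i+1 loads while layer i is being computed."""
    step = max(load_s, reuse * compute_s)  # each overlapped step takes the slower of the two
    # The first load and the last compute cannot be hidden behind anything.
    return load_s + (layers - 1) * step + reuse * compute_s

# Hypothetical per-layer times: 40 ms to transfer, 10 ms to compute once.
for reuse in (1, 4):
    s = serial_seconds(32, 0.04, 0.01, reuse)
    p = pipelined_seconds(32, 0.04, 0.01, reuse)
    print(f"reuse={reuse}: serial {s:.2f} s, pipelined {p:.2f} s")
```

With no reuse the pipeline is still bound by the transfer. Once each layer is reused 4x in this example, compute time matches transfer time, so four times the work is done in almost the same wall-clock time as the transfer-bound case.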

These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared to be dead ends until one change makes them viable and then allows them to take over. It may not happen. It may be a dead end. But that logic means we will never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy for people to try out ideas to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
