Well, that does at least answer my immediate question about why I would ever swap from expensive RAM to really expensive RAM:) Feels niche, but when you want it it's a good idea.
Edit: Although, this is predicated on the system being able to release VRAM that is acting as swap when it's time to start a game. Can it do that?
The reason I wrote this is I run this laptop in hybrid (AMD display + NVIDIA as swap). So all that VRAM was going to waste.
On your question re: switchable swap. It's on my to-do list ;)
Microsoft: hold my beer
On the old Amstrad PCWs that were everywhere at least in the UK in the mid 80s to mid 90s you could have up to 512kB of RAM, a fair chunk of which could be a RAM disk. This made compiling stuff in Turbo Pascal really fast too :-)
That said, still a nice and fun concept. Though caching has gotten better since, I assume :)
How is it reported? As SWAP space, not as RAM.
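To see the split for yourself, here is a minimal sketch that parses text in the format of /proc/meminfo, where RAM and swap show up as separate fields (MemTotal vs SwapTotal). The sample values below are made up for illustration and the helper name is hypothetical; on a real machine you would pass in the contents of /proc/meminfo instead.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Hypothetical sample: 16 GiB of RAM plus 8 GiB of swap (e.g. VRAM-backed).
SAMPLE = """\
MemTotal:       16777216 kB
MemFree:         4194304 kB
SwapTotal:       8388608 kB
SwapFree:        8388608 kB
"""

info = parse_meminfo(SAMPLE)
print(f"RAM:  {info['MemTotal'] / 2**20:.1f} GiB")
print(f"Swap: {info['SwapTotal'] / 2**20:.1f} GiB")
```

The VRAM-backed device only adds to SwapTotal; MemTotal stays the same.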
There are a bunch of datacenter GPUs that support full cache coherency, but if you used them like that the VRAM would be very high latency from the CPU. So it would only be really slow.
Well, GPUs also have stupid amounts of compute on them. I have to imagine that there is some kind of database format that's useful with GPU compute attached.
Since the data is already in VRAM, the GPU can sort, join, or otherwise manipulate data as needed.
I believe within 2-3 years databases and data warehouses on GPU will be common. The widespread use of agents to query data will be a part of this, as there will be a need to run far more queries at lower latency than needed for the ETL and BI workloads of the past.
Where does a few more steps of evolution take us? A wide path between a few heavy devices, and then the CPU off to the side just orchestrating the data flow?
It must have failed because I never heard of an update to this GPU. But AMD definitely made a GPU with 4x NVMe SSDs attached to the GPU.
This HN comment and the linked post brought up a lot of good points. The main takeaway is that swap should primarily be considered a mechanism for equality of reclamation, not for emergency extra memory, where equality of reclamation means file-backed pages and anonymous pages are subject to similar criteria for being evicted from physical memory.
I used to have zero swap on my Linux desktop and this convinced me to add at least a small swap partition.
I don't consider swap to be emergency RAM storage. I know that the kernel will decide by itself to use swap even if it has plenty of available RAM and the swappiness threshold is not reached.
Nevertheless, my two decent laptops (one with 16 GB RAM, the other with 64 GB RAM) never swap, even with Docker Swarm and multiple stacks, multiple VMs, desktop activities, and gaming.
It's been a while since I last saw a physical machine actively swapping.
I understand that some limited hardware may need swap, but I can't see such hardware having a GPU with plenty of VRAM.
That said, hacking things is always fun :)
Edit: Typo
Is not popular in general, so yes. But also no - I don't use swap ever, if I have to go over the RAM (32GB being low, with 64GB the norm), might as well consider the system dead.
>Sequential throughput: ~1.3 GB/s
[on a RTX 3070 Laptop]
This RTX 3070 chip is on PCIe 4.0 x16 which should give about 32GB/s in each direction (64GB/s combined). The 8GB of GDDR6 is 448GB/s.
Swapping to an NVMe drive would be twice as fast, but with higher latency.
Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.
First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it has to first copy it into a userspace-facing buffer. The userspace program then has to wake back up and issue the CUDA operation to copy the page into device memory.
nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least a million kernel/userspace context switches per second just to handle 4 GB/s (4 GB / 4K page), let alone 64 GB/s. And that's just the NBD portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but in order to get anything even resembling the full bandwidth, you have to use DMA engines with long page lists. Having to set up a transfer for every 4K page over PCIe will not reach full saturation of the bus.
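The arithmetic behind that "million per second" figure can be sketched as below. It assumes exactly one kernel/userspace round trip per 4K page and decimal gigabytes, which is a simplification (the real count per page may be higher); the function name is just for illustration.

```python
PAGE_SIZE = 4096  # bytes; the 4K page size from the comment above
GB = 10**9        # assumption: decimal gigabytes


def page_ops_per_second(bandwidth_bytes_per_s: float,
                        page_size: int = PAGE_SIZE) -> float:
    """Single-page transfers per second needed to sustain a given bandwidth,
    assuming no coalescing (one transfer per page)."""
    return bandwidth_bytes_per_s / page_size


# The measured 1.3 GB/s, the 4 GB/s example, and the 64 GB/s bus figure.
for gb_per_s in (1.3, 4, 64):
    ops = page_ops_per_second(gb_per_s * GB)
    print(f"{gb_per_s:>5} GB/s -> {ops:,.0f} page transfers/s")
```

At 4 GB/s this lands just under a million transfers per second, and the count scales linearly with bandwidth, which is why per-page setup cost dominates long before the bus does.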
Swapping to NVMe is a very optimized path -> the swapper can submit lists of pages directly to the NVMe driver and the controller can DMA them directly out of RAM, no copies or context switches CPU side at all.
This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.
Even if the swap system overhead drops to just a data copy, the memory management layer prevents swap from scaling to higher bandwidths. The issue is not data movement; it is in the page unmapping step (which requires expensive TLB shootdowns). Larger kernel changes are required.
My group wrote a paper on this: https://dl.acm.org/doi/10.1145/3731569.3764842
Linux's swap system has been undergoing some large refactors lately. Hopefully some insights either from our work or Hermit (NSDI '23) can make it into the mainline. I think Hermit's `rmap` optimization in particular is a candidate for upstream use.
one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.
RAM/VRAM don’t degrade from use.
but flash endurance isn't a strong argument here. you probably have O(TB) of flash, and aren't going to produce PB of swap writes any time soon. if you do a lot of swapping to a small flash device, it'll happen sooner.
I'm typing from a quite old 4GB laptop, which swaps heavily to a 250G SATA ssd. sure, it's not great, but it also costs zero. currently 9GB of swap is used, and it's not really noticeable. if I open 20 more tabs, it can introduce pauses.
google says this drive was released in 2014, and SMART says POH is about 10 years.
SMART also says wear leveling count is 665 and total written is 165327189538 LBAs (78834 GiB, or 338 drive-writes). I'm not expecting it to die soon, though using a 4G laptop is a bit of a stunt these days...
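For anyone wanting to redo that conversion on their own drive, here is a small sketch. It assumes the SMART counter is in 512-byte LBAs and that "250G" means 250 * 10^9 bytes; both are assumptions, since some drives report in other units.

```python
SECTOR_BYTES = 512  # assumption: SMART "total LBAs written" uses 512-byte units


def lbas_to_gib(lbas: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert an LBA count to GiB (2**30 bytes)."""
    return lbas * sector_bytes / 2**30


def drive_writes(lbas: int, capacity_bytes: int,
                 sector_bytes: int = SECTOR_BYTES) -> float:
    """Total bytes written expressed as multiples of the drive's capacity."""
    return lbas * sector_bytes / capacity_bytes


total_lbas = 165_327_189_538    # figure quoted in the comment above
capacity = 250 * 10**9          # assumption: "250G" is decimal gigabytes

print(f"{lbas_to_gib(total_lbas):,.0f} GiB written")
print(f"{drive_writes(total_lbas, capacity):.1f} full drive writes")
```

Under those assumptions the result agrees with the 78834 GiB and roughly 338 drive-writes quoted above.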
the point is that a system that has sustained heavy swapping for years has not generated enough writes to worry about. a modern system with 10x speed and 10x capacity (and probably less RAM deficit) would have even less effect. even for QLC with its few-hundred-cycle endurance spec...
All of this is to say that it does have a potential impact on flash if you rebuild often, which tends to happen on Gentoo.
Hard drives that huge scare me as it would take days to back up all the data off them.
[0]: https://knowyourmeme.com/memes/its-one-banana-michael-what-c...
(Kinda goes against the original spirit of the reference)
In the end I just had to bite the bullet and take a gamble on finding ECC DDR4 RAM that would work with the ancient AMD chipset...
This particular implementation seems to be running over too many layers to be particularly performant. Why not a custom block driver instead?
The problem with putting (system) RAM on a PCIe card is that PCIe is not a cache-coherent interconnect. If you have a cache line that resides on your GPU sitting inside your processor's cache, a remote modification to that memory by either the GPU, another CPU core or some other PCIe device will NOT invalidate the CPU cache line. You also have the fun situation that if it's modified on both ends simultaneously the resulting state will be non-deterministic.
Device drivers have to be very careful about synchronization when accessing memory-like areas on PCIe. CXL adds a cache coherency protocol among other things, so that invalidations and snoops can be exchanged over the interconnect.
Why can I not just enter a simple command to entirely-disable swapfile, like with Linux's:
>>>>swapoff -a
Seems kind of silly, unless the point is intentionally to wear-down the SSD's lifespan.
Having a GUI swapfile-disable system preference would be awesome. It would also be awesome if Apple finally abandoned this system settings/layout "phase" – it's still word-salad (compared to decades of preference panes).
#Apple #Feedback #swapfile
----
>I have 20GB of RAM free (me)
>>~need to have quick access to main memory
>I have 20GB of RAM free (still me)
>>~yeah but "quickly reclaimed for another purpose"
>I have 20GB of RAM free (!m!e!)
//of//32GB//ttl//
----
In linuxland, I'd just type `sudo swapoff -a` and be done with it. That machine has 96GB of RAM, so it would have ~84GB of RAM free (if, hypothetically, the same hardware/configuration were operating that system).
Does. not. need. a swapfile.
The operating system, during bootup, should think "hey I have dozens of gigabytes of RAM, won't be needing any swapfiles" – behind the scenes and without input.
That page also has a fuse filesystem implementation on top of opencl - https://github.com/Overv/vramfs - which may be more compatible.
Man, that brings back memories.
>GpuRamDrive
>Create a virtual drive backed by GPU RAM.
https://github.com/prsyahmi/GpuRamDrive
Fork with AMD support:
--------
[0] I want to code, I like the nitty-gritty, and if I want to outsource I'd prefer to outsource to a human¹ than GenAI
[1] they might outsource to GenAI of course, that is their choice and as long as they properly verify the output before handing it on to me I shouldn't have to care
With X11 it's not that bad (buffers are pre-allocated), but with Wayland allocations are a lot more dynamic, so running low on VRAM can easily crash the whole desktop. I just had a few such crashes with Hyprland+llama-server+KVM switching between computers without freeing VRAM.
sounds VERY low, also, wouldn't random read/write speed be MUCH more relevant here?
Originally there was https://github.com/Overv/vramfs, however that has the overhead of a FUSE filesystem + loop device when used as a swap device.
The performance is rather lackluster, however. It's far from a miracle "now you effectively have more ram for a 90% performance drop" - it definitely feels like traditional swapping
Fresh benchmarks against NVMe included:
https://www.seanlobjoit.com/posts/2026-06-12-vram-swap-two-w...
The kernel's job is to manage resources - and GPU RAM is one such resource, and it can be used for many of the same purposes as regular RAM.
Now if it could be dynamically used and vacated on other GPU workloads?
It would be nice to have dynamic scaling or even just auto-shutoff on VRAM pressure if I forget I have this enabled and then fire up a game or LLM.
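A rough sketch of what such an auto-shutoff watchdog could look like, not something the project ships as far as I know. It assumes an NVIDIA card with `nvidia-smi` on the PATH, and the device path `/dev/nbd0` and the 512 MiB headroom are made-up placeholders. Note that the VRAM held by the swap device itself counts as "used", so the check looks at remaining free VRAM rather than at usage by other programs.

```python
import subprocess

SWAP_DEVICE = "/dev/nbd0"  # hypothetical path; use whatever your setup exposes


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (MiB) as printed by nvidia-smi with
    --query-gpu=memory.used,memory.total --format=csv,noheader,nounits."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def should_release_swap(used_mib: int, total_mib: int,
                        headroom_mib: int = 512) -> bool:
    """True when free VRAM has dropped below the desired headroom."""
    return (total_mib - used_mib) < headroom_mib


def query_vram() -> tuple[int, int]:
    """Ask nvidia-smi for (used, total) MiB of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_line(out.splitlines()[0])


def release_swap(device: str = SWAP_DEVICE) -> None:
    """Disable swap on the device (needs root); pages move back to RAM or
    other swap, so this can take a while or fail if nothing can hold them."""
    subprocess.run(["swapoff", device], check=True)


# Dry run on a sample reading instead of touching the real system.
used, total = parse_memory_line("7900, 8192")
print("release swap?", should_release_swap(used, total))
```

A real version would poll `query_vram()` in a loop and call `release_swap()` when the check trips. Whether the userspace daemon actually frees its VRAM allocation after `swapoff` is a separate question the sketch does not answer.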