You don't want this kind of thing happening when it is running a filesystem.
Maybe someone could run workloads across CUDA and ZLUDA (Nvidia and other hardware), but what we really need is more robustness before we can run a file system efficiently and reliably across disparate GPU hardware.
Edit: blind as a bat; it says so right in the paper, of course:
>PMem is mapped directly to the GPU, and NVMe memory is accessed via Peer to Peer-DMA (P2PDMA)
[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-...
[2]: https://lwn.net/Articles/767281/
[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabr...
Once you have that, the CPU is just the orchestrator and wouldn't necessarily need to be so beefy.
Also, Optane was roughly $4 per GB, so even a moderately sized 256 GB drive comes to over $1,000 ($4 × 256 = $1,024).
>GpuRamDrive
>Create a virtual drive backed by GPU RAM.
https://github.com/prsyahmi/GpuRamDrive
Fork with AMD support:
https://github.com/brzz/GpuRamDrive/
Fork that has fixes and support for other cards and additional features:
                read     write
  seq1m q1t1    2205     2190
  rnd q32       41.31    38.77
  rnd q1t1      34.70    32.80
To be honest, I didn't know what to expect aside from very high read and write speeds. I was a bit disappointed to see that random reads and writes were so slow. The only use I could think of would be keeping photosets or similar working files there and saving the session to an SSD when closing the program, but a newer NVMe SSD solves that just as easily.
1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?
2) Does this work assume the storage device is only accessed by the GPU? Otherwise, how do you guarantee consistency when multiple processes can map, read and write the same files? You mention POSIX. POSIX has MAP_SHARED. How is this situation handled?
3) Related to (2): at the device level, how do you synchronize CPU accesses (from multiple cores on an SMP system) with GPU accesses?
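For context on question 2, here is a minimal sketch of the CPU-side guarantee at stake. With MAP_SHARED, POSIX specifies that writes change the underlying object, so on a typical system (Linux included) a store through one shared mapping of a file is visible through another shared mapping of the same file, with no msync() needed for visibility. The sketch uses two mappings in one process as a stand-in for two processes; the helper name shared_mapping_demo and the file path are hypothetical. Whether and how GPU4FS preserves this when GPU threads touch the same pages is exactly the open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: maps one file twice with MAP_SHARED, stores a byte
 * through the first mapping and returns the byte read back through the
 * second mapping, or -1 on error. */
int shared_mapping_demo(const char *path)
{
    const size_t len = 4096; /* one page on most systems */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* Give the file a size so the mapped range is backed by the file. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    /* Two independent shared mappings, standing in for two processes
     * (or a CPU and a GPU) that access the same file. */
    char *writer = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    volatile char *reader = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mappings remain valid after the descriptor is closed */
    if (writer == MAP_FAILED || reader == MAP_FAILED) {
        if (writer != MAP_FAILED)
            munmap(writer, len);
        if (reader != MAP_FAILED)
            munmap((void *)reader, len);
        return -1;
    }

    writer[0] = 'X';                     /* store through one mapping */
    int seen = (unsigned char)reader[0]; /* volatile load through the other */

    munmap(writer, len);
    munmap((void *)reader, len);
    return seen;
}
```

On Linux, shared_mapping_demo("/tmp/map_shared_demo.bin") returns 'X'. Both mappings resolve to the same page-cache page, which is the coherence a GPU-resident file system would have to keep once the GPU also writes to that page.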
Just quoting the paper:
>Using GPUfs, Silberstein et al. [24] demonstrate that offering a library interface to CPU FS eases access to storage for GPU programmers, but GPUfs only calls a CPU-side file system. GPU4FS offers a similar interface to GPUfs, but runs the file system on the GPU.
In this case, it is indeed novel to run the logic of the filesystem on the GPU itself. It's definitely worth the investigation!
(I worked on a FUSE filesystem that had these issues.)
I think the main benefit here is not having to do memory copies through the CPU, which frees up memory bandwidth for other things.
For example, issuing individual truncates for 1B files can be just as much a CPU problem as an I/O one.
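To make the truncate point concrete: POSIX offers no call that truncates many files at once, so each file costs its own truncate() call, with a user/kernel transition, a path lookup and a metadata update, even though no file data moves. The sketch below (the helper name truncate_each is hypothetical) issues one call per file and reports the average wall-clock time per call, which is the figure to extrapolate from.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: truncates every file in `paths` to zero bytes, one
 * syscall per file, and stores the average wall-clock nanoseconds per call
 * in *ns_per_call. Returns how many files were truncated successfully. */
size_t truncate_each(const char *const *paths, size_t n, double *ns_per_call)
{
    struct timespec start, end;
    size_t ok = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < n; i++) {
        /* Each iteration is a separate kernel entry: path lookup,
         * inode locking and a metadata update, no bulk data transfer. */
        if (truncate(paths[i], 0) == 0)
            ok++;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    if (ns_per_call != NULL)
        *ns_per_call = n ? elapsed_ns / (double)n : 0.0;
    return ok;
}
```

At a hypothetical 2 µs per call, 10^9 files would take about 2,000 s (roughly 33 minutes) of syscall time on a single core, regardless of how fast the underlying device is.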
It would be interesting to know if this approach could optimize the performance of training and inference for large models.
Are shaders Turing complete? ;)