Full-scale file system acceleration on GPU [pdf]

(dl.gi.de)

148 points · west0n · 2y ago · 45 comments

multimind · 2y ago · 5 in thread

A friend of mine used to work for a GPU database startup as an integration engineer. He got frustrated because GPU drivers (not just AMD but also Nvidia) are intrinsically unstable and not designed for long flawless runs. If a few bits have a wrong value in a deep neural network or a pixel is wrong in a game, it does not matter much. In databases (or file systems for that matter) it means everything! It is hard to believe at first, but his former company now offers solutions without GPU acceleration that simply work, though they also lost their USP.

amelius · 2y ago

Yeah, I had a lot of nVidia GPUs suddenly disappear mid-training when even nvidia-smi couldn't find them; this was on different systems (Linux) and only a reboot fixed it.

You don't want this kind of thing happening when it is running a filesystem.

LeanderK · 2y ago

Strange. I never had any problem with nvidia GPUs, but I only ever used data center GPUs like the V100 (and I don't set them up myself). There's a lot of things that go wrong, but at least my nvidia GPU always works.

solardev · 2y ago

Could you use some sort of RAID array of GPUs to compensate...?

nonplus · 2y ago

nvidia-smi exposes all cards, so you could run the same workload on multiple cards. This (likely) won't solve the problem of certain failure modes being intrinsic to the work being completed/compute environment. I would speculate some of those aggressive failure modes would present themselves across all the hardware.

Maybe someone could run workloads across CUDA and ZLUDA (Nvidia, and other hardware), but really we just might need more reliability to efficiently and reliably run a file system across disparate GPU hardware.
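To make the "same workload on multiple cards" idea concrete, here is a minimal sketch of majority voting over redundant runs. Everything in it is an assumption for illustration: `majority_result` and the `workload` callable are hypothetical names, and no real GPU API is called. As noted above, voting only helps against independent faults; a failure mode shared by all cards would still win the vote.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_result(workload: Callable[[int], Hashable],
                    device_ids: Sequence[int]) -> Hashable:
    """Run `workload` once per device and return the strict-majority result.

    `workload` is a hypothetical callable that takes a device index and
    returns a hashable result (for example a checksum of the output).
    """
    if not device_ids:
        raise ValueError("need at least one device")
    results = [workload(device) for device in device_ids]
    value, votes = Counter(results).most_common(1)[0]
    # Require more than half of the devices to agree on the same value.
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among devices: %r" % (results,))
    return value


if __name__ == "__main__":
    # Simulated workload: device 2 silently returns a wrong value.
    def fake_workload(device: int) -> int:
        return 41 if device == 2 else 42

    print(majority_result(fake_workload, [0, 1, 2]))
```

With three simulated devices, one wrong answer is outvoted; with three different answers the function raises instead of guessing.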

yosefk · 2y ago

If the game or your training crashes though, it matters a lot. What sort of bugs give you wrong values without crashing, especially driver bugs?.. something is strange here

magicalhippo · 2y ago · 4 in thread

Given that PCIe allows data to be piped directly from one device to another without going through the host CPU[1][2], I guess it might make sense to just have the GPU read blocks straight from the NVMe (or even NVMe-oF[3]) rather than having the CPU do a lot of work.

edit: blind as a bat, says so right in the paper of course:

>PMem is mapped directly to the GPU, and NVMe memory is accessed via Peer to Peer-DMA (P2PDMA)

[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-...

[2]: https://lwn.net/Articles/767281/

[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabr...

wtallis · 2y ago

I'm not sure they're actually doing NVMe yet; using Optane PMem is a bit of a cheat so that accessing storage is just plain memory reads and writes over PCIe. Implementing an NVMe device driver to set up and interact with command queues would be an extra layer of complexity that I think they left for future work.

magicalhippo · 2y ago

Sure, but my point was that it should be quite possible to get regular NVMes working.

Once you've got that, the CPU is just the orchestrator, and wouldn't necessarily need to be so beefy.

nine_k · 2y ago

Didn't they stop making Optane? :(

Also, Optane was like $4 per GB, so a moderately-sized drive, like 256GB, is already above $1000.
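The arithmetic behind that estimate, taking the roughly $4 per GB figure at face value (it is the commenter's recollection, not a verified price):

```python
# Back-of-the-envelope cost from the approximate price quoted above.
PRICE_PER_GB_USD = 4   # assumption: "like $4 per GB" from the comment
CAPACITY_GB = 256      # the "moderately-sized drive" from the comment


def drive_cost(capacity_gb: int, price_per_gb: float = PRICE_PER_GB_USD) -> float:
    """Return the estimated drive price in USD."""
    return capacity_gb * price_per_gb


if __name__ == "__main__":
    print(drive_cost(CAPACITY_GB))  # 1024, i.e. just above $1000
```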

az226 · 2y ago

For GPUs where Nvidia has turned off P2P, can RAM or NVMe drives be used for emulating P2P? Let’s assume you have a RAID AIC with 4 or 8 high speed SSDs. Could you make 3 3090s work as well as 3 A5000 RTX for training a model?

molticrystal · 2y ago · 2 in thread

While it is not a 1:1 comparison, there has been a driver for Windows that allows the creation of a RAM drive from VRAM for NVIDIA cards.

>GpuRamDrive

>Create a virtual drive backed by GPU RAM.

https://github.com/prsyahmi/GpuRamDrive

Fork with AMD support:

https://github.com/brzz/GpuRamDrive/

Fork that has fixes and support for other cards and additional features:

https://github.com/Ado77/GpuRamDrive

amarcheschi · 2y ago

I tried and tested it on my 5700 XT. In CrystalDiskMark I got (5 repeated runs on 1 GiB):

Test         Read (MB/s)   Write (MB/s)
seq1m q8t1   2339          2620
seq1m q1t1   2205          2190
rnd q32      41.31         38.77
rnd q1t1     34.70         32.80

To be honest, I didn't know what to expect, aside from very high read and write speeds. I was a bit disappointed to see that random reads and writes were so slow. The only use I could think of would be keeping photosets or things like that there and then saving the session to SSD when closing the program, but that is easily solved by using a newer NVMe SSD.

Zambyte · 2y ago

For Linux: https://wiki.archlinux.org/title/Swap_on_video_RAM

afr0ck · 2y ago · 2 in thread

I didn't fully read the paper, but a few questions come to mind.

1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?

2) Does this work assume the storage device is only accessed by the GPU? Otherwise, how do you guarantee consistency when multiple processes can map, read and write the same files? You mention POSIX. POSIX has MAP_SHARED. How is this situation handled?

3) Related to (2), on the device level, how do you sync CPU (on an SMP, multiple cores) and GPU accesses?

[1] https://dl.acm.org/doi/10.1145/2553081

riedel · 2y ago

> 1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?

Just quoting the paper:

>Using GPUfs, Silberstein et al. [24] demonstrate that offering a library interface to CPU FS eases access to storage for GPU programmers, but GPUfs only calls a CPU-side file system. GPU4FS offers a similar interface to GPUfs, but runs the file system on the GPU.

afr0ck · 2y ago

Thanks for the quote!

In this case, it is indeed novel to run the logic of the filesystem on the GPU itself. It's definitely worth the investigation!

KingOfCoders · 2y ago · 2 in thread

Like Microsoft DirectStorage?

wtallis · 2y ago

Nope. This is an implementation of one of several things that people often imagine Microsoft's DirectStorage to be, but the real DirectStorage is a lot more mundane.

KingOfCoders · 2y ago

I have no clue, which is why I asked. What is the difference?

ec109685 · 2y ago · 2 in thread

Interesting they would discuss system call overhead of opening a file, reading from it and closing it. Seems like in almost all cases the open and close calls would be overwhelmed by the other operations.

eru · 2y ago

For lots of small files, that might not be the case.

(I worked on a FUSE filesystem that had these issues.)
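A rough way to check the small-files point on your own machine is to time open + read + close per file against reading the same number of bytes from one already-open file. This is only a sketch: the function names, file counts and sizes are made up for illustration, the data will most likely come from the page cache (so it measures per-call overhead rather than the storage device), and the numbers will vary by OS and file system.

```python
import os
import tempfile
import time


def read_many_small(directory: str, count: int, size: int) -> tuple[float, int]:
    """Time open + read + close on `count` separate files of `size` bytes."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"small_{i}")
        with open(path, "wb") as f:
            f.write(b"x" * size)
        paths.append(path)

    total = 0
    start = time.perf_counter()
    for path in paths:
        fd = os.open(path, os.O_RDONLY)   # one open per file
        total += len(os.read(fd, size))
        os.close(fd)                      # one close per file
    return time.perf_counter() - start, total


def read_one_large(directory: str, count: int, size: int) -> tuple[float, int]:
    """Time `count` reads of `size` bytes from a single file opened once."""
    path = os.path.join(directory, "large")
    with open(path, "wb") as f:
        f.write(b"x" * (count * size))

    total = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)       # a single open for all reads
    for _ in range(count):
        total += len(os.read(fd, size))
    os.close(fd)
    return time.perf_counter() - start, total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        small_time, _ = read_many_small(tmp, count=2000, size=512)
        large_time, _ = read_one_large(tmp, count=2000, size=512)
    print(f"many small files: {small_time:.4f} s")
    print(f"one large file:   {large_time:.4f} s")
```

Both functions read the same number of bytes, so any gap between the two timings comes from the extra open and close calls and the per-file lookups.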

loeg · 2y ago

It seems more straightforward to fix your data-in-files layout than to implement a novel in-GPU filesystem, though.

I think the main benefit here is not having to do memory copies through the CPU, which frees up memory bandwidth for other things.

amelius · 2y ago · 2 in thread

A GPU seems overkill when the bottleneck is the I/O.

OlivierLi · 2y ago

In systems performance I would advise never thinking of any workload as unidimensional (i.e., assuming that any file system optimization can either improve IO latency or be useless).

Issuing individual truncates of 1B files can be just as much of a CPU problem as an IO one, for example.

amelius · 2y ago

But why wouldn't using one of many CPU cores be sufficient?

west0n (OP) · 2y ago · 1 in thread

According to this paper, GPU4FS is a file system that can run on the GPU and be accessed by applications. Since GPUs cannot make system calls, GPU4FS uses shared video memory (VRAM) and a parallel queue implementation. Applications running on the GPU can utilize GPU4FS after modifying their code, eliminating the need for a CPU-side file system when accessing the file system. The experiments are done on Optane memory.
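To illustrate the general idea of replacing system calls with a queue in shared memory, here is a toy single-producer, single-consumer ring buffer. It is not GPU4FS's actual design: the paper describes a parallel queue in shared VRAM, while this sketch uses a plain `bytearray` as a stand-in for the shared region, and the request layout (opcode, file id, offset) and all names are hypothetical.

```python
import struct

# Hypothetical layouts, little-endian with no padding.
HEADER = struct.Struct("<II")   # head (next slot to consume), tail (next slot to fill)
SLOT = struct.Struct("<IIQ")    # opcode, file id, byte offset


class RingQueue:
    """Toy request queue living entirely inside one flat byte buffer."""

    def __init__(self, slots: int) -> None:
        self.slots = slots
        # Stand-in for a memory region visible to both sides.
        self.buf = bytearray(HEADER.size + slots * SLOT.size)

    def _slot_offset(self, index: int) -> int:
        return HEADER.size + (index % self.slots) * SLOT.size

    def push(self, opcode: int, file_id: int, offset: int) -> bool:
        """Enqueue a request; return False if the queue is full."""
        head, tail = HEADER.unpack_from(self.buf, 0)
        if tail - head == self.slots:
            return False
        SLOT.pack_into(self.buf, self._slot_offset(tail), opcode, file_id, offset)
        # Publish the new tail only after the slot contents are written.
        # (Counters are 32-bit here; wraparound is ignored in this sketch.)
        HEADER.pack_into(self.buf, 0, head, tail + 1)
        return True

    def pop(self):
        """Dequeue the oldest request as a tuple, or None if the queue is empty."""
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None
        request = SLOT.unpack_from(self.buf, self._slot_offset(head))
        HEADER.pack_into(self.buf, 0, head + 1, tail)
        return request


if __name__ == "__main__":
    queue = RingQueue(slots=2)
    queue.push(1, 7, 0)       # e.g. "read file 7 at offset 0"
    queue.push(1, 7, 4096)
    print(queue.pop())
    print(queue.pop())
    print(queue.pop())
```

One side calls `push` to submit requests and the other polls `pop`, so no call ever crosses into the operating system. A real implementation would need atomic updates and memory fences between the two sides, which this single-threaded sketch leaves out.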

It would be interesting to know if this approach could optimize the performance of training and inference for large models.

t-3 · 2y ago

GPUs seem to have a lot of memory these days - from my limited knowledge, games and other graphics-intensive applications will use too much to make this approach particularly useful but do other applications have a similar level of utilization?

yeison · 2y ago

How to get hired by NVIDIA! If it does work it's a brilliant idea.

_kdave · 2y ago

I'm glad that research papers don't start with "we've analyzed linux kernel 2.6.18 sources (because this is what we had on our lab machines) and determined that ext3 is the best filesystem for our research purpose and now present you with a novel idea of using high-tech device on that". The paper acknowledges modern features, takes design from other filesystems (mentioned BTRFS and tree structures). Overall the idea is interesting and promising.

hieu229 · 2y ago

I hope GPU files lead to faster databases.

brcmthrowaway · 2y ago

Is this implementing a file system using shader code? That's insane.

Are shaders Turing complete? ;)

touisteur · 2y ago

Now this is all fun, but has anyone managed to make these mechanisms work with Multicast PCIe? I really need GPUDirect and StorageDirect to support this, until PCIe catches up to today's (or Blackwell's) NVLink ... around PCIe 12?
