How do you do that exactly? Are you using eBPF or something else?
Also, for my ML workloads the most common bottleneck is GPU VRAM <-> RAM copies. Doesn't this dramatically increase latency? Or does it only add latency on the first data transfer, so you're fine as long as you dump everything into VRAM all at once at the beginning? I'd expect this wouldn't play super well with stuff like PyTorch data loaders, but I'd be curious to hear how you've fared when testing.