Making the requests asynchronous and issuing lots of them in parallel is what makes it possible to get good performance out of flash-based storage; P2P DMA would be a relatively minor optimization on top of that. DirectStorage isn't the only way to asynchronously issue batches of storage requests: Windows has long had I/O completion ports (IOCP), and more recently added IoRing, which is modeled on Linux's io_uring.
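To make the "lots of requests in parallel" point concrete, here is a minimal Python sketch. It is not IOCP, io_uring, or DirectStorage: it uses a thread pool as a portable stand-in for a real asynchronous submission queue, and `read_chunks_parallel`, `chunk_size`, and `max_in_flight` are names made up for this illustration. `os.pread` is only available on Unix-like systems.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunks_parallel(path, chunk_size=64 * 1024, max_in_flight=32):
    """Read a file as independent chunk requests, keeping many in flight.

    Hypothetical helper: a thread pool stands in for a real async queue.
    """
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.pread takes an explicit offset, so all threads can share one
        # file descriptor without racing on a shared file position.
        with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
            # map() queues every request up front; at most max_in_flight
            # run at once, and results come back in offset order.
            # A robust version would also loop on short reads.
            return list(pool.map(lambda off: os.pread(fd, chunk_size, off), offsets))
    finally:
        os.close(fd)
```

The part that matters is the shape: the requests are independent of each other, so the device always has a deep queue to work on instead of waiting for one read to finish before it sees the next.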
DirectStorage 1.1 introduced an optional feature for GPU decompression, so that data stored on disk in the one supported compressed format can be streamed to the GPU and decompressed there, instead of needing a round-trip through the CPU and its RAM for decompression. This could help make the P2P DMA option more widely usable by reducing the cases which need to fall back to the CPU. Decompressing on the GPU is nothing that applications couldn't already implement for themselves, though; DirectStorage just provides a convenient standardized API for it so that GPU vendors can provide a well-optimized decompression implementation. When P2P DMA isn't available, the compressed data still has to make a trip through the CPU's RAM, but you can at least offload the decompression work from the CPU to the GPU.
(Note: the official docs for DirectStorage don't really say anything about P2P DMA, but the API is clearly being designed to allow for it in the future.)
The GPU4FS described here is a project to implement the filesystem entirely on the GPU: the code to e.g. walk the directory hierarchy and locate which address actually holds the file contents runs not on the CPU but on the GPU. This approach means the application running on the GPU needs exclusive ownership of the device holding the filesystem. For now, they're using persistent memory as the backing store, but in the future they could add NVMe support and have storage requests originate from the GPU and be delivered directly to the SSD with no CPU or OS involvement.
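As a toy illustration of what "walk the directory hierarchy and locate the address" involves, here is a rough Python sketch. The `DirNode`/`FileNode` layout and the `resolve` function are invented for this example and have nothing to do with GPU4FS's actual on-disk format; in GPU4FS the equivalent logic would run in GPU code rather than on the CPU.

```python
from dataclasses import dataclass, field


@dataclass
class FileNode:
    offset: int   # byte address of the file contents on the device
    length: int   # size of the file contents in bytes


@dataclass
class DirNode:
    entries: dict = field(default_factory=dict)  # name -> DirNode or FileNode


def resolve(root, path):
    """Walk the directory tree and return (offset, length) for a file path."""
    node = root
    for name in path.strip("/").split("/"):
        # Each path component is one lookup in the current directory.
        if not isinstance(node, DirNode) or name not in node.entries:
            raise FileNotFoundError(path)
        node = node.entries[name]
    if not isinstance(node, FileNode):
        raise IsADirectoryError(path)
    return node.offset, node.length


# Hypothetical tree: /assets/mesh.bin stored at byte 4096, 1024 bytes long.
root = DirNode({"assets": DirNode({"mesh.bin": FileNode(offset=4096, length=1024)})})
print(resolve(root, "/assets/mesh.bin"))
```

Once the lookup has produced an address and a length, the read itself is just a block request, which is why doing the lookup on the GPU would in principle let the whole request originate there.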