There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high bandwidth applications (>4GB/sec) without O_DIRECT its easy to become CPU limited in the blk/midlayer so again we win.
Now that said, I haven't tried the latest blk-mq, scsi-mq, etc patches which are tuned for higher IOP rates. These patches were driven by people plugging in high performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple high end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern xeon.