The CPU core/cache uses physical addresses to route loads and stores, but they don't have to go to the memory controller on the chip. They could go to the SMP unit, if the memory controller that owns the data is on another chip. Or to a PCI host bridge on the chip: either directly to its register space, or to the register space or memory of a device behind it. These addresses are called MMIO (memory-mapped I/O) and are the way configuration is done. They are also slow, low-bandwidth, and synchronous.
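As a rough sketch of what MMIO looks like from software: you lay a struct over the device's register block and access it through volatile pointers so the compiler emits every load and store. The device name and register layout below are made up for illustration; real layouts come from the device's datasheet, and `base` would normally come from mmap()ing a PCI BAR.

```c
#include <stdint.h>

/* Hypothetical register layout for an imaginary device. `volatile`
 * forces the compiler to emit every load/store in order, instead of
 * caching values in registers or optimizing accesses away. */
struct fake_dev_regs {
    volatile uint32_t ctrl;    /* control register          */
    volatile uint32_t status;  /* status register           */
    volatile uint32_t dma_lo;  /* DMA address, low 32 bits  */
    volatile uint32_t dma_hi;  /* DMA address, high 32 bits */
};

/* Enable the device by setting bit 0 of CTRL, then read STATUS back.
 * Each access here would be an uncached, synchronous bus transaction
 * on real hardware; this is exactly why MMIO is slow. */
static uint32_t fake_dev_enable(struct fake_dev_regs *base)
{
    base->ctrl = base->ctrl | 1u;  /* MMIO store */
    return base->status;           /* MMIO load: full round trip */
}
```

Since the function just takes a base pointer, you can exercise the same code against a plain in-memory struct when no hardware is present.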
x86 also has I/O "ports". I don't know about modern implementations, but I would guess they are implemented as MMIO out the back end of the core (i.e., the CPU turns inb/outb instructions into accesses to special memory ranges).
Either way, MMIO is the way to configure devices.
DMA is how data moves between a device and the CPU, or between devices. But nowadays, with high-performance devices, you don't set up a request, send an MMIO command to process it, take an interrupt, and then do another MMIO to read the command's completion status. The commands and completions themselves are DMAed. So you have ring buffers (multiple rings for a multi-queue or virtualized device) of commands and completions, which get DMAed in and out. MMIOs and interrupts are used only to manage the starting and stopping (fill, empty, etc.) conditions of the queues.
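The command-ring idea above can be sketched in a few lines. This is a generic illustration, not any particular device's ABI: the driver writes a descriptor into a DMA-visible array in ordinary memory, advances a tail index, and then does a single MMIO "doorbell" write to tell the device the tail moved. The descriptor layout and the names (`cmd_desc`, `ring_post`, the doorbell register) are all assumptions for the sketch.

```c
#include <stdint.h>

#define RING_SIZE 256  /* power of two so wrapping is a cheap mask */

/* Hypothetical command descriptor; real devices define their own. */
struct cmd_desc {
    uint64_t buf_addr;  /* DMA address of the data buffer */
    uint32_t len;
    uint32_t flags;
};

struct cmd_ring {
    struct cmd_desc slots[RING_SIZE]; /* DMA-visible; device reads these  */
    uint32_t tail;                    /* next free slot (driver-owned)    */
    uint32_t head;                    /* device's consume point, learned
                                       * from DMAed completions           */
    volatile uint32_t *doorbell;      /* one MMIO register: "tail moved"  */
};

/* Post one command. Returns 0 on success, -1 if the ring is full.
 * Only the final doorbell store is MMIO; the descriptor itself goes
 * to plain DMA-visible memory. A real driver would also need a write
 * barrier before ringing the doorbell so the device never sees the
 * new tail before the descriptor contents. */
static int ring_post(struct cmd_ring *r, uint64_t addr, uint32_t len)
{
    uint32_t next = (r->tail + 1) & (RING_SIZE - 1);
    if (next == r->head)
        return -1;                      /* ring full */
    r->slots[r->tail].buf_addr = addr;  /* ordinary memory writes */
    r->slots[r->tail].len = len;
    r->slots[r->tail].flags = 1;
    r->tail = next;
    *r->doorbell = r->tail;             /* the one MMIO write */
    return 0;
}
```

Note the cost model this buys you: N descriptors can be posted with N cheap memory writes and (at most) one doorbell MMIO per batch, instead of one synchronous MMIO per command.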
DMAs can go directly into caches on some architectures, but as I said, L3 caches are only so large, and the data volumes are so large that it can be hard to arrange. You would hope your queues at least stay mostly cached, but in reality I don't know how well that actually works out.