The theoretical maximum for current chips is less than 16 bytes per cycle. On Haswell you can process (in parallel) 7 blocks in roughly the time it would take to process 1. The latency of each round is 7 cycles, a full AES-128 10 rounds is ~70 cycles, so effectively you can process at most 1.6 bytes per cycle, or 1.14 if you use 256-bit keys (ignoring the cost of key scheduling and overhead here).
Even if you dedicate all CPU cores to the task of encrypting memory, you still stop short of exceeding theoretical memory bandwidth by quite a bit.