The throughput of the AVX-512 computation instructions is matched to the throughput of loads from the L1 cache memory, on all CPUs.
Therefore, to reach the maximum throughput, the data must already be in the L1 cache. Because L1 is private to each core, the aggregate L1 throughput scales proportionally with the number of cores, so L1 itself can never become a bottleneck.
So the most important optimization target for programs that use AVX-512 is to ensure that the data is already in L1 whenever it is needed. One of the main ways to achieve this is to use memory access patterns that trigger the hardware prefetchers, so that they fill the L1 cache ahead of time.
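To make the access-pattern point concrete, here is a minimal sketch in plain C (no intrinsics, so it builds anywhere). The function names are made up for illustration, and this is not a benchmark. It only contrasts a unit-stride traversal, which is the kind of pattern hardware prefetchers are designed to recognize, with a large-stride traversal of the same row-major array.

```c
#include <assert.h>
#include <stddef.h>

/* Unit-stride traversal of a row-major array: addresses increase by
   sizeof(float) on every step, so consecutive accesses fall in the
   same or the next cache line. */
float sum_unit_stride(const float *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    float sum = 0.0f;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += a[r * cols + c];
    return sum;
}

/* Column-first traversal of the same array: every step jumps by
   cols * sizeof(float) bytes. Once a row is longer than a cache line,
   each access touches a different line and uses one float from it. */
float sum_large_stride(const float *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += a[r * cols + c];
    return sum;
}
```

Both functions read exactly the same elements; only the order of the memory accesses differs. How large the speed difference is depends on the array size and the CPU, so it is worth measuring on your own machine.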
Main memory throughput is not much lower than that of the L1 cache, but main memory is shared by all cores, so if all cores request data from main memory at the same time, performance can drop dramatically.