It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.
------
If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.
If you're less than that, like 0.1 instructions per clock (ie: 10 clocks per instruction), then you're Cache and/or RAM-bound.
-----
From there, you continue your exploration. You count up L1 cache hits, L2 cache hits, L3 cache hits and cache-misses. IIRC, there are some performance counters that even get into the inter-thread communications (but I forget which ones off the top of my head). Assuming you were cache/ram bound of course (if you were CPU-bound, then check your execution unit utilization instead).
EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.
While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...
Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.
By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle at understanding what the hell you're even looking at once all the data is in.