To "fix" performance (i.e., increase throughput by close to 35%) one has to mess with the "energy performance bias" on the (Broadwell) platform, e. g. using x86_energy_perf_policy[0] or cpupower[1]. Otherwise, the CPUs/platform firmware will select to operate in a very dissatisfactory compromise between high-ish power consumption (~90W per socket), but substantially less performance than with having all idle states disabled (= CPU in POLL at all times, resulting in ~135W per core) completely. One can tweak things to reach a sweet spot in the middle, where you can achieve ~99% of the peak performance at very sensible idle power draw (i.e., ~25W when the link isn't loaded).
With Zen3, this hardly mattered at all.
I also got to witness that using IPv4 for the wireguard "overlay" network yielded about 30% better performance than when using IPv6 with ULA prefixes.
[0]: https://man.archlinux.org/man/x86_energy_perf_policy.8
[1]: https://linux.die.net/man/1/cpupower