As he wrote, CPUs are most efficient (compute per watt) at a specific frequency, and if his CPU mostly waits for RAM, that waiting can happen at a low clock and low power.
It's probably possible to build x86-64 CPUs with narrower backends (fewer execution units) and microcode-emulated 128- and 256-bit registers/operations (maybe even an emulated FPU), and get a cheaper and faster build server, if it were economical to fab such narrow-use-case chips. Those would be good for redis/memcached too, I imagine.
He actually did `make -j32`, not 16. Which is going to absolutely devastate the cache.
`make -j<number of cores x 2>` was a good rule of thumb back when you had 1/2/4 physical CPUs with their own sockets on a motherboard and spinning rust hard disks. A lot of "compilation" time was reading the source code off the disk. But it doesn't make any sense anymore with so many cores, hyperthreading, and SSDs that serve you the file in milliseconds.
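A minimal sketch of the modern rule of thumb: one job per hardware thread, no 2x multiplier, using Python's `os.cpu_count()` (which counts SMT threads). The best value still depends on the project and on how much RAM each compiler process eats:

```python
import os

# With SSDs, jobs rarely block on I/O, so one job per hardware thread
# (what os.cpu_count() reports, SMT included) is a sane default.
jobs = os.cpu_count() or 1
print(f"make -j{jobs}")
```

If link steps or template-heavy translation units run out of RAM, dialing the number down below the thread count is usually the first fix.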
If he's bandwidth limited, he would gain a significant performance improvement by reducing the number of processes.
It's certainly possible to build a more lightweight core, but most of that work is reducing the complexity of the out-of-order machinery. The FPU+ALU is under a quarter of each Zen core. https://en.wikichip.org/w/images/c/cb/amd_zen_core_%28annota...
Don't get mixed up between "microcoding" and "micro-ops" - the former is something different and slower, and usually requires some kind of transition in the decoders and uop caches to start reading microcoded ops. The latter is the "normal" or "fast" mode for the CPU, and just because one instruction turns into two uops (or macro-ops, or whatever AMD calls them) doesn't mean it's microcoded.
In that picture (thanks!) I see the FPU is big, the decoder is big, the branch predictor is big, and the rest is probably needed. Maybe an emulated FPU is good for some workloads, and maybe the ability to program in microinstructions instead of x86-64 is useful too. But maybe silicon area is not the expensive thing (dark silicon, etc.).
Which means the CPU already ships with a TDP chosen for reasonable performance per watt, and pushing it above TDP gives diminishing returns in performance.
However, it would be interesting to see benchmarks at TDPs much lower than 105 W, to find out how far the TDP can go down before performance drops off sharply.
It looks something like this: 4.0 GHz 120 W, 3.8 90 W, 3.6 65 W, 3.4 50 W, 3.2 42 W, 3.0 33 W, 2.0 13 W. This excludes the SoC.
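Taking those numbers at face value, a quick script shows efficiency (GHz per watt) rising monotonically as frequency drops. The figures below are the ones quoted above, not independent measurements:

```python
# (frequency GHz, package power W) pairs as quoted above
points = [(4.0, 120), (3.8, 90), (3.6, 65), (3.4, 50),
          (3.2, 42), (3.0, 33), (2.0, 13)]
for ghz, watts in points:
    print(f"{ghz:.1f} GHz @ {watts:3d} W -> {ghz / watts:.4f} GHz/W")
```

Going from 4.0 GHz to 2.0 GHz halves the throughput per core but cuts power by roughly 9x, so per-watt efficiency more than quadruples on these numbers.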
3.6 GHz at 65 W is impressive: almost stock speed at nearly half the TDP.
> Of course, in the server space, we've known for a long time that maximum efficiency occurs with a high number of cores running at lower frequencies, and that efficiency trumps performance on machines with high core counts. But I never considered that the consumer Ryzen CPUs could also benefit from the same thing until now.
makes no sense. This principle applies to every CPU from the smallest SoCs to the largest server chips; why on Earth would you not expect it to apply to desktop CPUs?
You could do the same thing with a 6 core i7 and 2133 memory. Intel CPUs have long supported an adjustable power limit to constrain operating frequency based on power consumption just like he describes for Ryzen.
This is like putting an LS engine in an otherwise stock Miata and acting surprised that you can run the engine at lower RPM and still put in good lap times.
For servers, the purchasing decisions are probably much more quantitative, and if you are buying a high-core-count machine it probably means you have a parallelizable workload. So efficiency naturally comes into play: you have a lot of choice along the frequency/core-count spectrum, traded against power, money, space, etc.
Why not? I like it when my electricity bill is less severe. For my wallet and the environment alike.
I didn't know that compilation was memory speed limited.
Are there any good benchmarks on it?
Anyone have any examples of getting faster memory boosting build speed?
Over the last few years I'd settled into thinking that high speed ram barely did anything. I guess I was wrong!
https://www.phoronix.com/scan.php?page=article&item=ryzen-dd...
https://www.techpowerup.com/231585/amd-ryzen-infinity-fabric...
Still, the conclusion holds: if most of the time is spent waiting for values to come back from memory, a higher core frequency has strongly diminishing returns.
That's not enough to turn gcc from a largely latency bound load to a memory bandwidth hog!
Even adding more cache ways wouldn't help much with this workload.
Adding to that, there are other effects that let core frequency leak into the performance of memory-bound programs: a higher frequency lets the core run ahead more quickly and get more memory requests in flight, recover more quickly after a branch misprediction, etc. Try it sometime: find something that is really memory bound and crank the frequency way down. There will probably be a significant effect, but not nearly in proportion to the frequency difference.
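A rough way to see latency-bound behavior on your own machine is a dependent pointer chase through an array much larger than L3, which serializes on memory latency. Caveat: in CPython the interpreter overhead per step is large, so the effect is muted compared to the same loop in C; `N` and `STEPS` below are arbitrary choices for illustration:

```python
import random
import time

N = 1 << 21          # ~2M entries; as a list of Python ints, far bigger than L3
STEPS = 1_000_000

# Sattolo's algorithm: a random permutation that forms one single cycle,
# so the chase below never gets stuck in a short loop.
perm = list(range(N))
for i in range(N - 1, 0, -1):
    j = random.randrange(i)
    perm[i], perm[j] = perm[j], perm[i]

idx = 0
t0 = time.perf_counter()
for _ in range(STEPS):
    idx = perm[idx]   # each load depends on the previous one
elapsed = time.perf_counter() - t0
print(f"~{elapsed / STEPS * 1e9:.0f} ns per dependent step")
```

Run it once as-is, then again with the CPU governor pinned to a low frequency: the per-step time should change far less than the clock ratio would suggest.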
That would be much more CPU-bound. Yes, some optimization passes do global traversals.
I wonder if javac/clang has the same characteristics as gcc.