> Zen 4 is AMD's first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It's not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can't think of any other reason AMD would mess with Zen 4's frontend this far into the core's lifecycle.
> Still, the Cyberpunk 2077 data bothers me. Performance counters also indicate higher average IPC with the loop buffer enabled when the game is running on the VCache die. Specifically, it averages 1.25 IPC with the loop buffer on, and 1.07 IPC with the loop buffer disabled. And, there is a tiny performance dip on the new BIOS.
Smells of microcode mitigations if you ask me, but naturally let’s wait for the CVE.
If you don't disclose the vulnerability then affected parties cannot start taking countermeasures, except out of sheer paranoia.
Disclosing a vulnerability is a way to shift liability onto the end user. You didn't update? Then don't complain. Only rarely do disclosures lead to product liability. I don't remember this (liability) happening with Meltdown and Spectre either, so I wouldn't assume this is AMD being secretive.
I can't say more. :(
If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".
I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.
In both cases, the project shouldn't have shipped.
It was shipped anyway because it can be disabled with a firmware update, and because drastically altering physical hardware layouts mid-design was likely to have worse impacts.
Building chips is a multiyear process and most folks don’t understand this.
no. once the core has it and you realize it doesn't help much, it absolutely is a risk to remove it.
Remember most of the technical analysis on Chips and Cheese is a one person effort, and I simply don't have infinite free time or equipment to dig deeper into power. That's why I wrote "Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out."
This is often pretty common, as the performance characteristics are often unknown until late in the hardware design cycle - it would be "easy" if each cycle was just changing that single unit with everything else static, but that isn't the case as everything is changing around it. And then by the time you've got everything together complete enough to actually test end-to-end pipeline performance, removing things is often the riskier choice.
And that's before you even get to the point of low-level implementation/layout/node specific optimizations, which can then again have somewhat unexpected results on frequency and power metrics.
I wonder if that will be the key benefit of Google's switch to two "major" Android releases each year: it will get people used to nothing newsworthy happening within a version increment. And I also wonder if that's intentional, and my guess is not the tiniest bit.
There is an added benefit though - that the new programmers now are fluent in the code base. That benefit might be worth more than LOCs or performance.
It tests the performance benefit hypothesis in different scenarios and does not find evidence that supports it. It makes one best effort attempt to test the power benefit hypothesis and concludes it with: "Results make no sense."
I think the real take-away is that performance measurements without considering power tell only half the story. We came a long way when it comes to the performance measurement half but power measurement is still hard. We should work on that.
Tell that to the shareholders. As a public company, they can very quickly lose enormous amounts of money by being behind or below on just about anything.
There will be a certain number of people who will delay an upgrade a bit longer because the new machines don't have enough extra oomph to warrant it. Little's Law can apply to finance when the quantity is the interval between purchases.
> Perhaps the best way of looking at Zen 4's loop buffer is that it signals the company has engineering bandwidth to go try things. Maybe it didn't go anywhere this time. But letting engineers experiment with a low risk, low impact feature is a great way to build confidence. I look forward to seeing more of that confidence in the future.
With more detailed power measurements, it could be possible to determine if this is thermal/power budget related? It does sound like the feature was intended to conserve power…
Most of the cores on CCD0 of my non-X3D chip hit 5.6-5.75 GHz. CCD1 has cores topping out at 5.4-5.5 GHz.
V-Cache chips for Zen 4 take a large clock penalty; however, the extra cache more than makes up for it.
Did he test CCD1 on the same chip with both the feature disabled and enabled? Did he attempt to isolate other changes like security fixes as well? He admitted “no” in his article.
The only proper way to test would be to find a way to disable the feature on a bios that has it enabled and test both scenarios across the same chip, and even then the result may still not be accurate due to other possible branch conditions. A full performance profile could bring accuracy, but I suspect only an AMD engineer could do that…
That being said, some workloads will see a small regression, however AMD has made some small performance improvements since launch.
They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.
That's not quite how it was implemented.
Instead, the second 68000 was halted and disconnected from the bus until the first 68000 (the executor) triggered a fault. Then the first 68000 would be held in halt, disconnected from the bus, and the second 68000 (the fixer) would take over the bus to run the fault handler code.
After the fault had been handled, the first 68000 could be released from halt and it would resume execution of the instruction, with all state intact.
As for the cost of a second 68000, extra logic and larger PCBs? Well, the cost of the Motorola 68451 MMU (or equivalent) absolutely dwarfed the cost of everything else, so adding a second CPU really wasn't a big deal.
Technically it didn't need to be another 68000, any CPU would do. But it's simpler to use a single ISA.
For more details, see Motorola's application note here: http://marc.retronik.fr/motorola/68K/68000/Application%20Not...
could anyone do any better on 68000? My incomplete history of CPU dedicated fast paths for moving data:
- 1982 Intel 186/286 'rep movsw' at theoretical 2 cycles per byte (I think it's closer to 4 in practice). Brilliant, then Intel drops the ball for 20 years :|
- 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) at hilarious 7 cycles per byte. Slower than unrolled code, 2x slower than unrolled code using zero page. Afaik no loop buffer meant it's re-fetching the whole instruction every loop.
- 1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft HuC6280 Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) at hysterical 6 cycles per byte plus 17 cycles startup. (17 + 6x) = ~160KB/s at 7.16 MHz CPU. For comparison IBM XT with 4.77 MHz NEC V20 does >300KB/s
- 1993 Pentium 'rep movsd' at theoretical 4 bytes per cycle, 0.31 cycles per byte in practice http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...
- 1995 Pentium Pro "fast string mode" strongly hinted at REP MOVS as the optimal way to copy memory.
- 1997 Pentium MMX 'rep movsd' 0.27 cycles per byte. Mem copy with MMX registers 0.29 cycles per byte.
- 2000 SSE2 optimized copy hack.
- 2008 AVX optimized copy hack at ~full L2/memory bus speed for large enough transfers.
- 2012 Ivy Bridge Enhanced REP MOVSB (ERMSB), but funnily still slower than even the SSE2 variants.
- 2019 Ice Lake Fast Short REP MOVSB (FSRM) still somewhat slower than AVX variants on unaligned accesses.
- 2020 Zen3 FSRM !20 times! slower than AVX unaligned, 30% slower on aligned https://lunnova.dev/articles/ryzen-slow-short-rep-mov/
- 2023 And then Intel got Reptar https://lock.cmpxchg8b.com/reptar.html :)
I guess this could also be used as an optimization target at least on devices that are more long lived designs (eg consoles).
Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).
While you can somewhat isolate for this by doing hundreds of runs for both on and off, that takes tons of time and still won’t be 100% accurate.
Even disabling the feature can cause the code to use a different branch which may shift everything around.
I am not specifically familiar with this issue, but I have seen cases where disabling a feature shifted the load from integer units to the FPU or the GPU as an example, or added 2 additional instructions while taking away 5.
But when the CPU is pulling 100w under load? Well now we're talking an amount so small it's irrelevant. Maybe with a well calibrated scope you could figure out if it was on or not.
Since this is in the micro-op queue in the front end, it's going to be more about that very low total power draw side of things where this comes into play. So this would have been something they were doing to see if it helped for the laptop skus, not for the desktop ones.
It's very very expensive to fix a bug in a CPU, so it's easier to expose control flags or microcode so you can patch it out.
https://www.ardent-tool.com/CPU/Cyrix_Cx486.html#soft
https://www.vogons.org/viewtopic.php?t=45756 Register settings for various CPUs
https://www.vogons.org/viewtopic.php?t=30607 Cyrix 5x86 Register Enhancements Revealed
L1, Branch Target Buffer, LSSER (load/store reordering), Loop Buffer, Memory Type Range Registers (Write Combining, Cacheability), all controlled using client side software.
Cyrix 5x86 testing of Loop Buffer showed 0.2% average boost and 2.7% maximum observable speed boost.
( Idly waiting for x86 to try and compete with ARM on efficiency. Unfortunately I dont see Zen 6 or Panther Lake getting close. )
> Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”.