> Zen 4 is AMD's first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It's not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can't think of any other reason AMD would mess with Zen 4's frontend this far into the core's lifecycle.
> Still, the Cyberpunk 2077 data bothers me. Performance counters also indicate higher average IPC with the loop buffer enabled when the game is running on the VCache die. Specifically, it averages 1.25 IPC with the loop buffer on, and 1.07 IPC with the loop buffer disabled. And, there is a tiny performance dip on the new BIOS.
Smells of microcode mitigations if you ask me, but naturally let’s wait for the CVE.
If you don't disclose the vulnerability then affected parties cannot start taking countermeasures, except out of sheer paranoia.
Disclosing a vulnerability is a way to shift liability onto the end user. You didn't update? Then don't complain. Only rarely do disclosures lead to product liability. I don't remember this (liability) happening with Meltdown and Spectre either, so I wouldn't assume this is AMD being secretive.
I can't say more. :(
If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".
I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.
In both cases, the project shouldn't have shipped.
It was shipped anyway because it can be disabled with a firmware update, and because drastically altering physical hardware layouts mid-design was likely to have worse impacts.
Building chips is a multiyear process and most folks don’t understand this.
no. once the core has it and you realize it doesn't help much, it absolutely is a risk to remove it.
Remember most of the technical analysis on Chips and Cheese is a one person effort, and I simply don't have infinite free time or equipment to dig deeper into power. That's why I wrote "Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out."
This is often pretty common, as the performance characteristics are often unknown until late in the hardware design cycle - it would be "easy" if each cycle was just changing that single unit with everything else static, but that isn't the case as everything is changing around it. And then by the time you've got everything together complete enough to actually test end-to-end pipeline performance, removing things is often the riskier choice.
And that's before you even get to the point of low-level implementation/layout/node specific optimizations, which can then again have somewhat unexpected results on frequency and power metrics.
I wonder if that will be the key benefit of Google's switch to two "major" Android releases each year: it will get people used to nothing newsworthy happening within a version increment. And I also wonder if that's intentional, and my guess is not the tiniest bit.
There is an added benefit though - that the new programmers now are fluent in the code base. That benefit might be worth more than LOCs or performance.
It tests the performance benefit hypothesis in different scenarios and does not find evidence that supports it. It makes one best effort attempt to test the power benefit hypothesis and concludes it with: "Results make no sense."
I think the real take-away is that performance measurements without considering power tell only half the story. We came a long way when it comes to the performance measurement half but power measurement is still hard. We should work on that.
Tell that to the shareholders. As a public company, they can very quickly lose enormous amounts of money by being behind or below on just about anything.
There will be a certain number of people who will delay an upgrade a bit longer because the new machines don't have enough extra oomph to warrant it. Little's Law can apply to finance when the quantity is the interval between purchases.
> Perhaps the best way of looking at Zen 4's loop buffer is that it signals the company has engineering bandwidth to go try things. Maybe it didn't go anywhere this time. But letting engineers experiment with a low risk, low impact feature is a great way to build confidence. I look forward to seeing more of that confidence in the future.
With more detailed power measurements, it could be possible to determine if this is thermal/power budget related? It does sound like the feature was intended to conserve power…
Most of the cores on CCD0 of my non-X3D chip hit 5.6-5.75 GHz. CCD1 has cores topping out at 5.4-5.5 GHz.
V-Cache chips for Zen 4 take a large clock penalty; however, the extra cache more than makes up for it.
Did he test CCD1 on the same chip with both the feature disabled and enabled? Did he attempt to isolate other changes like security fixes as well? He admitted “no” in his article.
The only proper way to test would be to find a way to disable the feature on a bios that has it enabled and test both scenarios across the same chip, and even then the result may still not be accurate due to other possible branch conditions. A full performance profile could bring accuracy, but I suspect only an AMD engineer could do that…
That being said, some workloads will see a small regression, however AMD has made some small performance improvements since launch.
They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.
That's not quite how it was implemented.
Instead, the second 68000 was halted and disconnected from the bus until the first 68000 (the executor) triggered a fault. Then the first 68000 would be held in halt, disconnected from the bus, and the second 68000 (the fixer) would take over the bus to run the fault handler code.
After the fault had been handled, the first 68000 could be released from halt and it would resume execution of the instruction, with all state intact.
As for the cost of a second 68000, extra logic and larger PCBs? Well, the cost of the Motorola 68451 MMU (or equivalent) absolutely dwarfed the cost of everything else, so adding a second CPU really wasn't a big deal.
Technically it didn't need to be another 68000, any CPU would do. But it's simpler to use a single ISA.
For more details, see Motorola's application note here: http://marc.retronik.fr/motorola/68K/68000/Application%20Not...
could anyone do any better on 68000? My incomplete history of CPU dedicated fast paths for moving data:
- 1982 Intel 186/286 'rep movsw' at theoretical 2 cycles per byte (I think it's closer to 4 in practice). Brilliant, then Intel drops the ball for 20 years :|
- 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) at hilarious 7 cycles per byte. Slower than unrolled code, 2x slower than unrolled code using zero page. Afaik no loop buffer meant it's re-fetching the whole instruction every loop.
- 1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft HuC6280 Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) at hysterical 6 cycles per byte plus 17 cycles startup. (17 + 6x) = ~160KB/s at 7.16 MHz CPU. For comparison IBM XT with 4.77 MHz NEC V20 does >300KB/s
- 1993 Pentium 'rep movsd' at theoretical 4 bytes per cycle, 0.31 cycles per byte in practice http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...
- 1995 Pentium Pro "fast string mode" strongly hinted at REP MOVS as the optimal way to copy memory.
- 1997 Pentium MMX 'rep movsd' 0.27 cycles per byte. Mem copy with MMX registers 0.29 cycles per byte.
- 2000 SSE2 optimized copy hack.
- 2008 AVX optimized copy hack at ~full L2/memory bus speed for large enough transfers.
- 2012 Ivy Bridge Enhanced REP MOVSB (ERMSB), but funnily still slower than even the SSE2 variants.
- 2019 Ice Lake Fast Short REP MOVSB (FSRM) still somewhat slower than AVX variants on unaligned accesses.
- 2020 Zen3 FSRM !20 times! slower than AVX unaligned, 30% slower on aligned https://lunnova.dev/articles/ryzen-slow-short-rep-mov/
- 2023 And then Intel got Reptar https://lock.cmpxchg8b.com/reptar.html :)
I guess this could also be used as an optimization target at least on devices that are more long lived designs (eg consoles).
Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).
While you can somewhat isolate for this by doing hundreds of runs for both on and off, that takes tons of time and still won’t be 100% accurate.
Even disabling the feature can cause the code to use a different branch which may shift everything around.
I am not specifically familiar with this issue, but I have seen cases where disabling a feature shifted the load from integer units to the FPU or the GPU as an example, or added 2 additional instructions while taking away 5.
But when the CPU is pulling 100w under load? Well now we're talking an amount so small it's irrelevant. Maybe with a well calibrated scope you could figure out if it was on or not.
Since this is in the micro-op queue in the front end, it's going to be more about that very low total power draw side of things where this comes into play. So this would have been something they were doing to see if it helped for the laptop skus, not for the desktop ones.
It's very very expensive to fix a bug in a CPU, so it's easier to expose control flags or microcode so you can patch it out.
https://www.ardent-tool.com/CPU/Cyrix_Cx486.html#soft
https://www.vogons.org/viewtopic.php?t=45756 Register settings for various CPUs
https://www.vogons.org/viewtopic.php?t=30607 Cyrix 5x86 Register Enhancements Revealed
L1, Branch Target Buffer, LSSER (load/store reordering), Loop Buffer, Memory Type Range Registers (Write Combining, Cacheability), all controlled using client side software.
Cyrix 5x86 testing of Loop Buffer showed 0.2% average boost and 2.7% maximum observable speed boost.
( Idly waiting for x86 to try and compete with ARM on efficiency. Unfortunately I dont see Zen 6 or Panther Lake getting close. )
> Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”.