Seems to be a known erratum that was fixed by a microcode update:
The XSAVES instruction may fail to save XMM registers to the provided state save area if all of the following are true:
• All XMM registers were restored to the initialization value by the most recent XRSTORS instruction because the XSTATE_BV[SSE] bit was clear.
• The state save area for the XMM registers does not contain the initialization state.
• The values in the XMM registers match the initialization value when the XSAVES instruction is executed.
• The MXCSR register has been modified to a value different from the initialization value since the most recent XRSTORS instruction.
Apparently some Ryzen models have no fixed microcode available. You can boot with clearcpuid=xsaves as a workaround, probably at some performance cost.
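To check whether the workaround took effect, something like the following should do on Linux. It is a minimal sketch that assumes the kernel's view of CPU features is listed on the `flags` line of /proc/cpuinfo and that, as I understand it, a feature cleared with clearcpuid no longer shows up there. The helper names `cpu_flags` and `xsaves_enabled` are made up for illustration.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def xsaves_enabled(cpuinfo_text: str) -> bool:
    """True if the kernel lists the xsaves feature as available."""
    # Exact set membership, so "xsave" or "xsaveopt" do not count.
    return "xsaves" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        print("xsaves enabled:", xsaves_enabled(path.read_text()))
```

Note that this only reflects what the kernel reports; a user program executing CPUID directly may still see the feature bit.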
I.e., that's what this repo seems to be: https://github.com/platomav/CPUMicrocodes
In my mind a CPU instruction is hardwired on the chip, and it blows my mind that we keep finding workarounds for already-released hardware.
Maybe someone could dumb that down for me?
You can change the mapping from x86_64 instruction to sequence of micro-operations during boot on modern CPUs. That's what we mean by updating the microcode.
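A toy model of that idea, in Python. This is only a sketch of the concept: the table contents, the micro-op names and the `Decoder` class are all invented for illustration, and it says nothing about how AMD or Intel actually store or apply patches.

```python
# Hypothetical built-in decode table: instruction name -> micro-op sequence.
ROM_TABLE = {
    "ADD": ["read_src", "alu_add", "write_dst"],
    "XSAVES": ["uop_a", "uop_b_buggy"],
}


class Decoder:
    def __init__(self, rom):
        self.rom = rom
        self.patches = {}  # empty after reset, filled during boot

    def load_patch(self, patches):
        """Install replacement micro-op sequences (the 'microcode update')."""
        self.patches.update(patches)

    def decode(self, instruction):
        """Prefer a patched sequence, fall back to the built-in one."""
        if instruction in self.patches:
            return self.patches[instruction]
        return self.rom[instruction]


decoder = Decoder(ROM_TABLE)
decoder.load_patch({"XSAVES": ["uop_a", "uop_b_fixed"]})
print(decoder.decode("XSAVES"))  # patched sequence
print(decoder.decode("ADD"))     # untouched built-in sequence
```

The patch table starts empty on every reset in this model, which is why the update has to be applied again on each boot.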
At least that's my understanding, as someone who has implemented a few toy CPUs in digital logic simulation tools and has consumed a bunch of material on the topic as a hobbyist, but who has no actual knowledge of the particulars of how AMD and Intel do things.
That part is driven via microcode (kind of like firmware) and fixes in that can fix some CPU bugs.
Which also means that one core can essentially run multiple assembler instructions in parallel (say fetch memory at same time floating point operation is running, at the same time some other integer operation is running etc.) and it just makes it look like it was done serially.
8-bit CPU control logic https://youtu.be/dXdoim96v5A
8-bit CPU reprogramming microcode https://youtu.be/JUVt_KYAp-I
In short, the microcode instructions are a bunch of flags that enable different parts of the processor during that clock cycle (e.g. is data being loaded off the bus into a register? Is the adder active? Etc.), so to implement an instruction that says add value from memory a to value from memory b and store in memory c, the microcode might be: copy memory a onto bus, store bus to register, copy memory b to bus, store b to register, add both registers and put result on the bus, store value on bus to memory c. (That is in a hypothetical simple CPU like the one Ben built; a real one is obviously much more sophisticated.) So in Ben’s toy CPU, the instructions are just indices into an EEPROM that stores the control logic bit pattern “microcode” for each instruction, and IIRC each instruction takes however many cycles the longest instruction requires (in real life that would be optimised of course).
This is also how some processors like the 6502 have “undocumented” instructions: they’re just bit patterns enabling parts of the processor that weren’t planned or intended.
So you can see that it may be possible to fix a bug in instructions by changing the control logic in this way, even though the actual units being controlled are hard wired. I guess it very much depends on what the bug is. Of course I only know how Ben’s microcode works and not how an advanced processor like the one in question does it, but I imagine the general theme is similar.
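The add-from-memory sequence described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Ben's actual design: the flag names in `Ctl` are made up, the bus is 8 bits wide, and "put on bus" and "latch from bus" are merged into a single clock step, so the sequence takes three steps instead of six.

```python
from enum import Flag, auto


class Ctl(Flag):
    MEM_OUT = auto()  # memory drives the bus
    MEM_IN = auto()   # memory latches the bus
    A_IN = auto()     # register A latches the bus
    B_IN = auto()     # register B latches the bus
    SUM_OUT = auto()  # adder result (A + B) drives the bus


def add_microcode(a, b, c):
    """Microcode for 'mem[c] = mem[a] + mem[b]': (control word, address) per clock."""
    return [
        (Ctl.MEM_OUT | Ctl.A_IN, a),
        (Ctl.MEM_OUT | Ctl.B_IN, b),
        (Ctl.SUM_OUT | Ctl.MEM_IN, c),
    ]


def run(steps, memory):
    """Execute control words one clock at a time against a list-backed memory."""
    reg_a = reg_b = 0
    for word, addr in steps:
        bus = 0
        if Ctl.MEM_OUT in word:
            bus = memory[addr]
        if Ctl.SUM_OUT in word:
            bus = (reg_a + reg_b) & 0xFF  # 8-bit adder, carry dropped
        if Ctl.A_IN in word:
            reg_a = bus
        if Ctl.B_IN in word:
            reg_b = bus
        if Ctl.MEM_IN in word:
            memory[addr] = bus
    return memory


print(run(add_microcode(0, 1, 2), [3, 4, 0]))  # [3, 4, 7]
```

Changing what an instruction does here means changing the list returned by `add_microcode`, while `run` (standing in for the hardwired units) stays the same.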
The thing is, the microcode is often using instructions that are a very different "shape" from sensible machine-code instructions, because quite often they have to drive gates within the chip directly and not all combinations might make sense. So you might have an instruction that breaks down as "load register A into the ALU A port, load register X into the ALU B port, carry out an ADD and be ready to latch the result into X but don't actually latch it for another clock cycle in case we're waiting for carry to stabilise", much of which you simply don't want to care about. The instructions might be many, many bits long, with a lot of those bits "irrelevant" for a particular task.
The 6502 CPU was a directly-wired CPU where everything was decoded from the current opcode. It doesn't really have "microcode" but it does have a state machine that'll carry out instructions in phases across a few clocks. It does actually have a lot of "undefined" instructions, which are where the opcode decodes into something nonsensical like "load X and Y at the same time into the ALU" which returns something unpredictable.
Take a simple example: the registers are made up of latches that hold onto values and have a set of transistors that switch their latches to connect to the BUS lines or disconnect from them, along with a line that makes them emit their latched value or take a new value to latch. This forms a simple read/write primitive.
If the microcode wants to move the result of an ADD out of the ALU into register R1 then it will assert the relevant control lines:
1. It will set the WRITE line of the ALU's hidden SUM register high, which connects the output of its latches to the lines of the bus. For a 64-bit chip there would be 64 lines, one per bit. Each bit line will then go high or low to match the contents of SUM.
2. It will also set R1's READ line high, meaning the transistors that connect R1's bit latch inputs to the bus lines will switch ON, allowing the voltages on each bus line to force R1's latch input lines high or low (for 1 or 0).
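The two steps above can be modelled roughly like this. It is a deliberately simplified sketch: `Register` and `clock` are hypothetical names, the naming follows the comment (WRITE drives the bus, READ latches from it), and an undriven bus is assumed to read as 0, which real hardware does not guarantee.

```python
class Register:
    def __init__(self, name):
        self.name = name
        self.value = 0
        self.write = False  # WRITE high: drive the latched value onto the bus
        self.read = False   # READ high: latch whatever is on the bus


def clock(bus_width, registers):
    """Run one clock: at most one register drives the bus, any number latch it."""
    mask = (1 << bus_width) - 1
    drivers = [r for r in registers if r.write]
    if len(drivers) > 1:
        raise ValueError("bus contention: more than one register driving the bus")
    bus = drivers[0].value & mask if drivers else 0  # assumption: idle bus is 0
    for r in registers:
        if r.read:
            r.value = bus
    return bus


alu_sum = Register("SUM")
r1 = Register("R1")
alu_sum.value = 42

# Microcode asserts the two control lines, then the clock ticks.
alu_sum.write = True
r1.read = True
clock(64, [alu_sum, r1])
print(r1.value)  # 42
```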
In a real modern CPU things are vastly more complex than this, but it is just nicer abstractions built on top of these kinds of simple ideas. Microcode doesn't actually control the cache with control lines; it issues higher-level instructions to the cache unit, which takes responsibility. The cache unit itself may have a microcode engine that itself delegates operations to even simpler functional units until you eventually get to something that is managing control lines to connect/disconnect/trigger things. Much like in software, higher-level components offer their "API" and internally break operations down into multiple simpler steps until you get to the lowest layer doing the actual work.
Sometimes, CPU vendors run out of space for such bug fixes. They have to re-introduce another bug to free up space to fix a more serious one. That one kinda blew my mind.
fp registers shared with mmx, sse registers (xmm), avx registers (ymm), and a truckload of them.
Modern implementations have extremely complex frontends, full of elaborate hacks to get performance despite x86.
Complexity breeds bugs, such as this one.
Then Apple switched to x86, and from day to night we witnessed the magnificent spectacle of the entire Apple fanbase performing a whiplash-inducing collective pirouette towards the narrative that, after all, x86 was not so bad.
ARM did not. There are efforts but they are recent, adoption is bad.
RISC-V, on the other hand, put significant effort into this early on, preventing the situation ARM is in.
Sure, ARMs are better in the performance-per-watt game, but the question of how to scale them to the level of high-end x86 desktop processors is still open. For now, I'd argue it's not even clear if that's possible.
The difference in engineer years between designing and testing a complicated x86 chip that works correctly and doing the same for an ARM chip, however, is pretty big.
The answer is no. The question of how to scale them to the level of high end x86 desktop processors is absolutely not open. It’s clearly a solved problem.
The xmm registers are the low 128 bits of the ymm registers. So you're left with three register files, consisting of 8, 16, and 16 registers each. The only thing out of the ordinary there is the x87 register state, although maybe you'd consider the unusually small size of the register sets as out of the ordinary nowadays (e.g., RISC-V and AArch64 each provide 32 registers in their two register files).
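The aliasing is easy to picture with big integers standing in for the register contents. A small sketch, with the class and method names invented for illustration; the two write behaviours modelled are that legacy SSE instructions leave the upper 128 bits of the ymm register alone, while VEX-encoded 128-bit instructions zero them.

```python
XMM_MASK = (1 << 128) - 1
YMM_MASK = (1 << 256) - 1


class VectorRegister:
    """One 256-bit ymm register; xmm is just a view of its low 128 bits."""

    def __init__(self):
        self.ymm = 0

    @property
    def xmm(self):
        return self.ymm & XMM_MASK

    def write_xmm_legacy(self, value):
        """Legacy SSE write: upper 128 bits of ymm are preserved."""
        upper = self.ymm & YMM_MASK & ~XMM_MASK
        self.ymm = upper | (value & XMM_MASK)

    def write_xmm_vex(self, value):
        """VEX-encoded 128-bit write: upper 128 bits of ymm are zeroed."""
        self.ymm = value & XMM_MASK


reg = VectorRegister()
reg.ymm = (0xAA << 128) | 0x11
print(hex(reg.xmm))  # 0x11
```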
Internally CPUs have large register files and register renaming circuitry anyway, so I suppose everything is shared with everything else, except maybe ESP and EBP.
Interesting to see such big leaps in CPUs still happening. Popcorn!
M1's integrated memory is completely generic LPDDR5. It's the same stuff the rest of the industry uses, there's no power savings here (and it's not any more "integrated" than AMD & Intel have been doing for half a decade or more, either)
The primary interesting thing about the M1/M2's memory is the width of the memory bus, which is how it gets such huge bandwidth numbers. But if anything that's costing power, not saving it.
Although the fact that there apparently exist some APUs that have not had the update applied because of shitty OEMs is very concerning.
One of the fundamentals of selling shit profitably is that you try to sell the vast majority of what you make. One of the consequences is that the vast majority of users exist outside of your company and after development.
So you throw some very smart people at the problems of building and validating and building and validating, and hope that no two of them make or repeat a mistake.
Yes, there've been some disturbingly cavalier attitudes on the patchable software side towards release quality, and more recently by Boeing on the hardware side (which they proceeded to patch from the software side), but I have not seen that sort of wanton disregard for quality from AMD/Intel/NVidia.
Honestly, the fact that this is even interesting enough to show up on HN tells you enough about how frequently it happens.
Edit: @empyrrhicist: ah, my mistake. I misunderstood, thinking more recently manufactured CPUs might come with updated microcode. Thanks for the quick help.
I'm not that familiar with the details of recent CPUs, but at least on the Intel side, I believe they do as part of the stepping identifier.
Who the hell came up with such a layout?
Is it really this hard to copy an industry standard like GitHub?
It's just copied off basic mailing list designs from the 90s (or the 80s?).
That design is basically what happens when you tell a kernel dev to write a mailing list archiving UI. It's clean because they value simplicity, but they obviously have ~0 non-CLI UI/UX experience.
See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... https://lpc.events/event/11/contributions/983/attachments/75...