Solving the Mystery of ARM7TDMI Multiply Carry Flag (opens in new tab)

(bmchtech.github.io)

90 pointsskrrtww1y ago33 comments

33 comments

24 comments · 4 top-level

ujikoluk1y ago· 10 in thread

> Seriously, they decided that the program counter should be a general purpose register. Why???

Don't really understand this reaction. Why not? Seems to make for a nice regular design that the PC is just another register.

DevilStuff1y ago

The big issue is that it doesn't really need to be a GPR. You never find yourself using the PC in instructions other than in, say, the occasional add instruction for switch case jumptables, or pushes / pops. So it ends up wasting instruction space, when you could've had an additional register, or encoded a zero register (which is what AARCH64 does nowadays).

Dwedit1y ago

PC-relative instructions is how ARM loads 32-bit literals. "ldr,=label" is a psuedoinstruction that becomes relative to the PC, and it picks the nearest literal pool as the acutal address to read from.

robinsonb51y ago

Funnily enough, for my own toy CPU project I elected to make the program counter a GPR, both readable and writable - because it allowed me to eliminate jump and branch instructions.

Reads from the PC return the address of the next instruction to be executed, so a simple exchange between two registers performs the branch, and supplies the return address. (I did end up special-casing the add instruction so that when adding to the PC the return address ends up in the source register.)

joosters1y ago

But it is (or was originally) used in lots of places, not just jump tables, generally to do relative addressing, for example when you want to refer to data nearby, e.g.

ADD r0, r15, #200

LDR r1, [r15, #-100]

etc

1 more reply

mmaniac1y ago

The thing is... it isn't just another regular register. Reads from the PC are 8 bytes ahead, because that's where the PC is in the pipeline at read time. Branches are PC+8 relative for the same reason.

It's a design choice that makes sense in the classic RISC world where pipeline details are leaked to make implementation simpler. Delay slots in other RISCs work the same way. But it causes a lot of pain as implementations evolve beyond past the original design, and it's why Aarch64 junked a lot of Aarch32's quirks.

dfox1y ago

This is one of the features of early ARM cores that show that the thing is not a classical RISC design, but traditional CISC core with RISC-like ISA and optimized internal buses.

userbinator1y ago

Prefetching and other pipeline interactions need to be special-cased when any instruction reads or writes it.

joosters1y ago

Yeah, it was that way for all previous ARM processors too, for exactly that reason. Adding special cases would have increased the transistor count, for no great benefit.

The only downside was that it exposed internal details of the pipelining IIRC. In the ARM2, a read of the PC would give the current instruction's location + 8, rather than its actual location, because by the time the instruction 'took place' the PC had moved on. So if/when you change the pipelining for future processors, you either make older code break, or have to special case the current behaviour of returning +8.

Anyway, I don't like their reaction. What they mean is 'this decision makes writing an emulator more tricky' but the author decides that this makes the chip designers stupid. If the author's reaction to problems is 'the chip designers were stupid and wrong, I'll write a blog post insulting them' then the problem is with the author.

DevilStuff1y ago

Hey, I'm the author, sorry it came off that way, I was really just poking fun. I should've definitely phrased that better!

But no, I really think that making the program counter a GPR isn't a good design decision - there's pretty good reasons why no modern arches do things that way anymore. I admittedly was originally in the same boat when I first heard of ARMv4T - I thought putting the PC as a GPR was quite clean, but I soon realized it just wastes instruction space, makes branch prediction slightly more complex, decrease the number of available registers (increasing register pressure), all while providing marginal benefit to the programmer

3 more replies

Joker_vD1y ago

> Adding special cases would have increased the transistor count, for no great benefit.

No, it's exactly backwards: supporting PC as a GPR requires special circuitry, especially in original ARM where PC was not even fully a part of the register file. Stephen Furber in his "VLSI RISC Architecture and Organization" describes in section 4.1 "Instruction Set and Datapath Definition" that quite a lot of additional activity happens when PC is involved (which may affect the instruction timings and require additional memory cycles).

userbinator1y ago· 4 in thread

And just to get this out of the way, the carry flag’s behavior after multiplication isn’t an important detail to emulate at all. Software doesn’t rely on it.

On as fixed of a hardware as a game console, and with the accompanying anti-piracy/anti-cheating/emulation efforts of that industry, I'd expect it to be. From the history of emulating previous consoles, we know that any deterministic difference can and will be exploited, either to determine whether the hardware is authentic, or incidentally as a result of unintentional bugs.

This reminds me of the Z80, where two undefined flags resisted analysis for several decades; a 2-year-old set of slides on the state of that here: https://archive.fosdem.org/2022/schedule/event/z80/attachmen...

f1shy1y ago

>> the carry flag’s behavior after multiplication isn’t an important detail to emulate at all. Software doesn’t rely on it.

Famous last words:

https://www.hyrumslaw.com/

comex1y ago

While it’s a bit newer than the GBA, there is at least one Wii game with intentional anti-emulation measures:

https://tcrf.net/Cars_2_(PlayStation_3,_Xbox_360,_Windows,_W...

1 more reply

DevilStuff1y ago

In the GBA scene, people didn't actually tend to exploit the carry flag at all. If there was any anti emulation, it was usually flashcart related or cpu timing related.

Dwedit1y ago

It was mostly the Game Pak Prefetch feature that was used to foil GBA emulators. A game can detect if the number of cycles to access ROM (with prefetch enabled) is incorrect.

skrrtwwOP1y ago· 3 in thread

https://shonumi.github.io/blog/nds_rolling.html

More context on how this value affects (at least one) DS game- see post from December 27th, 2019.

userbinator1y ago

I'm not really familiar with ARM Asm but do you think that's handwritten Asm that its author overlooked the effects of the carry and it just happened to work, or a clever "emulator trap" added by Nintendo's compiler?

bonzini1y ago

It's handwritten and hand-obfuscated assembly, the author didn't know the effect of the carry but knew it was deterministic; and the code worked.

Though it's not clear to me where the corruption after the second MLA instruction comes from, because the second block of three instructions should produce the same output as the first. It is possible that it was copied/pasted incorrectly.

comex1y ago

Are you sure it's handwritten or obfuscated?

I remember from when I used to disassemble compiled ARM code (not on the NDS though) that it was common to see CMP, followed by a bunch of instructions with one condition predicate, followed by a bunch of instructions with the opposite predicate.

In this case, it's subtly wrong to use that pattern, but only on older versions of ARM. That could reflect a very sneaky attempt to break emulators… but it could also just be a compiler bug.

That said, I too don't understand how corruption could be produced unless there was a copy/paste mistake.

flohofwoe1y ago· 3 in thread

> it allows the program counter to be used a general purpose register

From a CPU emulator writer's perspective this isn't all that strange. For instance on Z80 the immediate jump instruction `JP nnnn` is loading a 16-bit immediate value into the internal PC register, which is the same thing as loading a 16-bit value into a regular register pair (e.g. 'LD HL,nnnn') - e.g. the mnemonics for the jump instruction could just as well be `LD PC,nnnn` ;)

A relative jump (which does a signed-add of an 8-bit offset value to the 16-bit address in PC) is the same math as the Z80 indexed addressing mode (IX+d) and (IY+d) (I don't know though if the same transistors are used).

A RET (load 16-bit value from stack into PC) is the the same operation as a POP (load 16-bit value from stack into a regular register pair).

...so it's almost surprising that the program counter isn't exposed as a regular register in most (traditional) CPUs. I guess in modern CPUs it's not so simple because of the internal pipelining though.

adrian_b1y ago

The reason why the PC is not normally exposed as a regular register is that the set of operations needed for the PC is different than for the regular registers.

There are operations required for the PC that are not needed for regular registers, e.g. conditional add (a.k.a. conditional relative jump), add-and-store and load-and-store (a.k.a. procedure call).

On the other hand, there are many operations that are needed for regular registers and which are useless for the PC, e.g. logical operations, shift/rotate, multiplication and division and others.

Because of this, encoding the PC as a regular register is pointless and wasteful of the instruction encoding space.

Moreover, when the ISA has an implicit stack pointer, which is also the only register that can be used as a stack pointer, like the x86 ISA, the set of operations that are used with the SP is a very small subset of the operations available for the regular registers, so encoding the SP as a regular register is also wasteful. Especially in 32-bit x86, where the number of architectural registers was very small, it would have been better if the SP would not have been encoded as a regular register, wasting a register number.

addaon1y ago

> The reason why the PC is not normally exposed as a regular register is that the set of operations needed for the PC is different than for the regular registers.

I'd add to that, that what you give is the reason it's /okay/ to expose the PC as a special register instead of a GPR. The reason that it's /important/ to is that the PC is accessed on every instruction fetch, so if it's part of a uniform register file, it basically eats up an entire read port of that register file. Register file size scaled badly with port count (much worse than it does with register count), so this ends up adding quite a bit of area. (You can hack around this by having a single dedicated read port for just the PC register, but then you're half way to an SPR.)

dfox1y ago

It does not make sense to dedicate full read port for reading PC. Just run a bus directly from the register cells, which is exactly how it is done in ARM7TDMI in question.

1 more reply

j / k navigate · click thread line to collapse

33 comments

24 comments · 4 top-level

ujikoluk1y ago· 10 in thread

> Seriously, they decided that the program counter should be a general purpose register. Why???

Don't really understand this reaction. Why not? Seems to make for a nice regular design that the PC is just another register.

DevilStuff1y ago

Dwedit1y ago

robinsonb51y ago

Funnily enough, for my own toy CPU project I elected to make the program counter a GPR, both readable and writable - because it allowed me to eliminate jump and branch instructions.

joosters1y ago

But it is (or was originally) used in lots of places, not just jump tables, generally to do relative addressing, for example when you want to refer to data nearby, e.g.

ADD r0, r15, #200

LDR r1, [r15, #-100]

etc

1 more reply

mmaniac1y ago

dfox1y ago

This is one of the features of early ARM cores that show that the thing is not a classical RISC design, but traditional CISC core with RISC-like ISA and optimized internal buses.

userbinator1y ago

Prefetching and other pipeline interactions need to be special-cased when any instruction reads or writes it.

joosters1y ago

Yeah, it was that way for all previous ARM processors too, for exactly that reason. Adding special cases would have increased the transistor count, for no great benefit.

DevilStuff1y ago

Hey, I'm the author, sorry it came off that way, I was really just poking fun. I should've definitely phrased that better!

3 more replies

Joker_vD1y ago

> Adding special cases would have increased the transistor count, for no great benefit.

userbinator1y ago· 4 in thread

And just to get this out of the way, the carry flag’s behavior after multiplication isn’t an important detail to emulate at all. Software doesn’t rely on it.

f1shy1y ago

>> the carry flag’s behavior after multiplication isn’t an important detail to emulate at all. Software doesn’t rely on it.

Famous last words:

https://www.hyrumslaw.com/

comex1y ago

While it’s a bit newer than the GBA, there is at least one Wii game with intentional anti-emulation measures:

https://tcrf.net/Cars_2_(PlayStation_3,_Xbox_360,_Windows,_W...

1 more reply

DevilStuff1y ago

In the GBA scene, people didn't actually tend to exploit the carry flag at all. If there was any anti emulation, it was usually flashcart related or cpu timing related.

Dwedit1y ago

It was mostly the Game Pak Prefetch feature that was used to foil GBA emulators. A game can detect if the number of cycles to access ROM (with prefetch enabled) is incorrect.

skrrtwwOP1y ago· 3 in thread

https://shonumi.github.io/blog/nds_rolling.html

More context on how this value affects (at least one) DS game- see post from December 27th, 2019.

userbinator1y ago

bonzini1y ago

It's handwritten and hand-obfuscated assembly, the author didn't know the effect of the carry but knew it was deterministic; and the code worked.

comex1y ago

Are you sure it's handwritten or obfuscated?

In this case, it's subtly wrong to use that pattern, but only on older versions of ARM. That could reflect a very sneaky attempt to break emulators… but it could also just be a compiler bug.

That said, I too don't understand how corruption could be produced unless there was a copy/paste mistake.

flohofwoe1y ago· 3 in thread

> it allows the program counter to be used a general purpose register

A RET (load 16-bit value from stack into PC) is the the same operation as a POP (load 16-bit value from stack into a regular register pair).

adrian_b1y ago

The reason why the PC is not normally exposed as a regular register is that the set of operations needed for the PC is different than for the regular registers.

There are operations required for the PC that are not needed for regular registers, e.g. conditional add (a.k.a. conditional relative jump), add-and-store and load-and-store (a.k.a. procedure call).

On the other hand, there are many operations that are needed for regular registers and which are useless for the PC, e.g. logical operations, shift/rotate, multiplication and division and others.

Because of this, encoding the PC as a regular register is pointless and wasteful of the instruction encoding space.

addaon1y ago

> The reason why the PC is not normally exposed as a regular register is that the set of operations needed for the PC is different than for the regular registers.

dfox1y ago

It does not make sense to dedicate full read port for reading PC. Just run a bus directly from the register cells, which is exactly how it is done in ARM7TDMI in question.

1 more reply

j / k navigate · click thread line to collapse