Z8086: Rebuilding the 8086 from Original Microcode

(nand2mario.github.io)

51 points by nand2mario 4mo ago

24 comments

Despite what the article says, the 68000 was microcoded too. Another difference is that the 68K was a 32b architecture, not 16b, and that required investing more transistors for the register file and datapath.

ErroneousBosh 4mo ago

Not only was it microcoded, but it was sufficiently divorced from the assumptions of the 68000 instruction set that IBM were able to have Motorola make custom "68000-based" chips that ran S/370 code directly.

Want a different architecture? Sure, just draw it with a different ROM. Simple (if you've got IBM money to throw around).

tasty_freeze 4mo ago

I read (30 years ago) the book "Microprocessor Design" by Nick Tredennick. He was the architect and wrote the microcode for both the 68K and the S/370. The S/370 was based on his recent design experience with the 68K, but it wasn't just a microcode swap. In the book he describes his process, where he would write the ucode for each instruction on a 3"x5" card (or was it 4"x7"?). At times he'd find sequences that were too clunky, and then go back to the circuit design folks and ask them to add some extra logic for that corner case to make the ucode more efficient.

The book also had a glossary section in the back and a number of the entries were funny. One I recall was his definition for "methodology", which was something like "A word people use when 99% of the time they mean 'method'."

slartibardfast0 4mo ago

https://archive.org/details/tredennick-microprocessor-logic-...

ErroneousBosh 4mo ago

Oh right, nerdsniped into hunting that book down.

jecel 4mo ago

The 68000 actually had both microcode and nanocode, so it was even further from hardwired control logic than the 8086. In terms of performance the 68000 was slightly faster than the 286 and way faster than the 8088 (I never used an 8086 machine).

tom_ 4mo ago

The 286 looks like it ought to be usefully quicker in general? Motorola did a good job on the programming model, but you can tell that the 68000 is from the 1970s. Nearly all the 68000 instructions take like 8+ cycles, and addressing modes can cost extra. On the 286, on the other hand, pretty much everything is like 2-4 cycles, or maybe 5-7 if there's a memory operand. (The manual seems to imply that every addressing mode has the same cost, which feels a bit surprising to me, but maybe it's true.) 286 ordinary call/ret round trip time is also shorter, as are conditional branches and stack push/pop.

adrian_b 4mo ago

The timings given in the datasheet of 286 are very optimistic and they can almost never be encountered in a real program.

They assume that instructions have been fetched concurrently without ever causing a stall and that memory accesses are implemented with 0 wait states.

In reality, instruction fetching was frequently a bottleneck and implementing a memory with 0 wait states for 80286 was much more difficult than for MC68000 or MC68010.

With the available DRAM, normally both 80286 and 80386 would have needed a cache memory. Later, after the launch of 80386DX, cache memories became common on 386DX MBs, but I have not seen any 80286 motherboard with cache memory.

They might have existed at an earlier time when 286 was the highest end, but by the time of the coexistence with 386 the 286 became the cheap option, so its motherboards never had cache memory, thus the memory accesses always had wait states, increasing the probability of instruction fetch bottlenecks and resulting in significantly more clock cycles per instruction than in the datasheet.

raphlinus 4mo ago

My reading is that there aren't really a lot of addressing modes on the 286, as there are on the 68000 and friends; rather, every address is generated by summing an optional immediate 8 or 16 bit value and zero to two registers. There aren't modes where you do one memory fetch, then use that as the base address for a second fetch, which is arguably a vaguely RISC flavored choice. There is a one cycle penalty for summing 3 elements ("based indexed mode").
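A minimal sketch of the address sum described above, in Python. The function names and register values are hypothetical, segmentation is left out entirely, and the only behavior modeled is the 16-bit wraparound and the sign extension of an 8-bit displacement:

```python
def sign_extend8(value: int) -> int:
    """Interpret the low 8 bits of value as a signed displacement."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def effective_address(base: int = 0, index: int = 0, disp: int = 0,
                      disp_bits: int = 16) -> int:
    """Sum zero to two registers and an optional displacement, modulo 2**16."""
    if disp_bits == 8:
        disp = sign_extend8(disp)
    return (base + index + disp) & 0xFFFF


# Something like [BX + SI + 10h], with made-up register contents.
ea = effective_address(base=0x1000, index=0x0020, disp=0x10, disp_bits=8)
print(hex(ea))
```

Leaving out `base`, `index`, or `disp` covers the simpler forms, which is why this reads as one mode with optional terms rather than a list of distinct modes.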

retrac 4mo ago

Not for the data path; the 68000 operates on 32 bit values 16 bits at a time, both through its external 16 bit bus and internal 16 bit ALU. Most 32 bit operations take more cycles. But yes, it has a 32 bit programming model.
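To illustrate why a 32 bit operation costs extra passes on a 16 bit ALU, here is a rough Python sketch of a 32 bit add done as two 16 bit adds with a carry between them. It is not a model of the 68000's actual micro-operations; the helper names are made up for the example:

```python
MASK16 = 0xFFFF


def add16(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """One pass through a 16-bit adder: returns (16-bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32_in_two_passes(a: int, b: int) -> tuple[int, int]:
    """Add two 32-bit values using two 16-bit passes, low half first."""
    lo, carry = add16(a, b)                    # pass 1: low words
    hi, carry = add16(a >> 16, b >> 16, carry) # pass 2: high words plus carry
    return (hi << 16) | lo, carry


result, carry_out = add32_in_two_passes(0x0001FFFF, 0x00000001)
print(hex(result), carry_out)
```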

jecel 4mo ago

Actually, the 68000 had one full (all operations) 16 bit ALU and two simpler 16 bit ALUs (add/subtract only, so AU might be a better name), so in the best case it could crunch 48 bits per clock cycle. The 8086 had one full 16 bit ALU and one simple 16 bit ALU (the ancestor of today's AGUs, address generation units).

CodeWriter23 4mo ago

"Oddball string instructions", as an assembler coder bitd, they were a welcome feature as opposed to running out of registers and/or crashing the stack with a Z-80.

tasty_freeze 4mo ago

The Z80 had LDIR, which was a string copy instruction. The byte at (HL) would be read from memory, then written to (DE), HL and DE would be incremented, and BC decremented, and this repeated until BC became zero.

LDDR was the same but decremented HL and DE on each iteration instead.

There were versions for doing IN and OUT as well, and there was an instruction for finding a given byte value in a string, but I never used those so I don't recall the details.
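The LDIR/LDDR behavior described above can be sketched in Python roughly as follows. `block_copy` is a hypothetical helper, memory is a flat 64 KB `bytearray`, and flags and timing are ignored:

```python
def block_copy(memory: bytearray, hl: int, de: int, bc: int,
               step: int = 1) -> tuple[int, int, int]:
    """Copy bytes from (HL) to (DE) until BC reaches zero.

    step=+1 behaves like LDIR, step=-1 like LDDR.
    Returns the final (HL, DE, BC).
    """
    while True:
        memory[de] = memory[hl]       # read (HL), write (DE)
        hl = (hl + step) & 0xFFFF     # registers wrap at 16 bits
        de = (de + step) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if bc == 0:
            break
    return hl, de, bc


mem = bytearray(65536)
mem[0x0100:0x0104] = b"ABCD"
hl, de, bc = block_copy(mem, hl=0x0100, de=0x0200, bc=4)
print(mem[0x0200:0x0204], hex(hl), hex(de), bc)
```

Because the count is decremented before it is tested, starting with BC = 0 copies 65536 bytes rather than none.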

CodeWriter23 4mo ago

LDIR? We used DMA for that.

I was referring to LODSB/W (x86) which is quite useful for processing arrays.
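As a rough illustration of why LODSB suits array processing, here is a Python sketch of its effect: load the byte at SI into AL, then step SI. The function names are hypothetical, the DS segment is ignored, and the loop stands in for what would be a LODSB inside a counted loop in assembly:

```python
def lodsb(memory: bytes, si: int, direction_flag: bool = False) -> tuple[int, int]:
    """Return (AL, new SI): load one byte and advance or retreat SI."""
    al = memory[si]
    si = (si + (-1 if direction_flag else 1)) & 0xFFFF
    return al, si


def sum_bytes(memory: bytes, si: int, count: int) -> int:
    """Walk an array with lodsb, accumulating a sum."""
    total = 0
    for _ in range(count):
        al, si = lodsb(memory, si)
        total += al
    return total


print(sum_bytes(bytes([1, 2, 3, 4]), si=0, count=4))
```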

rasz 4mo ago

LDIR sounds great on paper but is implemented terribly, making it slower than a manually unrolled loop

https://retrocomputing.stackexchange.com/questions/4744/how-...

Repeat is done by decrementing PC by 2 and re-loading whole instruction in a loop. 21 cycles per byte copied :o

To be fair, Intel did the same fail implementation of REP MOVSB/MOVSW in the 8088/8086, reloading the whole instruction per iteration. REP MOVSW is ~14 cycles/byte on the 8088 (9+27/rep) and ~9 cycles/byte on the 8086 (9+17/rep), about the same cost as the non-REP versions (28 and 18). NEC V20/V30 improved on that by almost 2x, to 8 cycles/byte on the V20 or an unaligned V30 (11+16/rep) and 4 cycles/byte on a V30 with fully aligned accesses (11+8/rep), with the non-REP cost being 19 and 11 respectively. The V30 pretty much matched the Intel 80186 at 4 cycles/byte (8+8/rep, 9 non-REP). The 286 was another jump, to 2 cycles/byte (5+4/rep). The 386 ran at the same speed; the 486 was much slower for small rep counts but under a cycle per byte for big REP MOVSD. Pentium got up to 0.31 cycles per byte, MMX 0.27 cycles/byte (http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...), then 2009 AVX doing block moves at full L2 cache speed, and so on.
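The "setup + N/rep" figures above turn into cycles per byte like this. A small Python sketch, assuming each REP MOVSW iteration moves 2 bytes and using only the numbers quoted in this comment; the dictionary and function names are made up:

```python
# (setup cycles, cycles per repetition) for REP MOVSW, as quoted above.
REP_MOVSW = {
    "8088": (9, 27),
    "8086": (9, 17),
    "V20 / unaligned V30": (11, 16),
    "V30 aligned": (11, 8),
    "80186": (8, 8),
    "80286": (5, 4),
}


def rep_cycles_per_byte(setup: int, per_rep: int, reps: int,
                        bytes_per_rep: int = 2) -> float:
    """Average cycles per byte for a block move of `reps` repetitions."""
    return (setup + per_rep * reps) / (bytes_per_rep * reps)


for cpu, (setup, per_rep) in REP_MOVSW.items():
    cpb = rep_cycles_per_byte(setup, per_rep, reps=1000)
    print(f"{cpu:22s} {cpb:.2f} cycles/byte")
```

For large counts the setup term washes out and the result approaches `per_rep / 2`; for short copies the setup cost dominates.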

In the 6502 corner there was nothing until the 1986 WDC W65C816 with Move Memory Negative (MVN) and Move Memory Positive (MVP) at 7 cycles/byte. Slower than unrolled code, 2x slower than unrolled code using zero page. Similarly bad implementation (no loop buffer), re-fetching the whole instruction every iteration.

The 1987 NEC TurboGrafx-16/PC Engine used a 6502 clone by HudsonSoft, the HuC6280, with Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD) and Transfer Increment Increment (TII) at a theoretical 6 cycles/byte (17+6/rep). I saw one post a long time ago claiming block transfer throughput of ~160KB/s on a 7.16 MHz NEC-manufactured TurboGrafx-16 (a hilarious 43 cycles/byte), so I don't know what to think of it, considering the NEC V20 inside an OG 4.77MHz IBM XT does >300KB/s.
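The throughput comparison above is just clock rate divided by bytes per second. A quick Python check, assuming "KB" means 1024 bytes (with 1000-byte KB the results come out slightly higher):

```python
def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Clock cycles spent per byte at a given sustained throughput."""
    return clock_hz / bytes_per_second


KB = 1024  # assumption: the quoted KB/s figures use 1024-byte kilobytes

tg16 = cycles_per_byte(7_160_000, 160 * KB)  # claimed TurboGrafx-16 figure
xt = cycles_per_byte(4_770_000, 300 * KB)    # V20 in a 4.77 MHz XT
print(f"TurboGrafx-16: {tg16:.1f}  XT with V20: {xt:.1f}")
```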

    CPU / Instruction   Cycles per Byte
    Z80 LDIR 8-bit              21
    8088 MOVSW 8bit             ~14
    6502 LDA/STA 8bit           ~14
    8086 MOVSW                  ~9
    NEC V20 MOVBKW 8bit         ~8
    W65C816 MVN/MVP 8bit        ~7  block move
    HuC6280 T[DIAX]/TIN 8bit    ~6  block transfer instructions
    80186 MOVSW 16bit           ~4
    NEC V30 MOVSW               ~4
    80286 MOVSW                 ~2
    486 MOVSD                   <1
    Pentium MOVSD               ~0.31
    Pentium MMX MOVSD           ~0.27 http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DURT1.HTM

rep_lodsb 4mo ago

Only the Z80 refetched the entire instruction, x86 never did it this way. Each bus transfer (read or write) takes multiple clocks:

    CPU                        Clocks per transfer  Transfer size    Theoretical minimum clocks per byte (block move)
    Z80 instruction fetch      4                    byte
    Z80 data read/write        3                    byte             6
    80(1)88, V20               4                    byte             8
    80(1)86, V30               4                    byte/word        4
    80286, 80386 SX            2                    byte/word        2
    80386 DX                   2                    byte/word/dword  1

LDIR (etc.) are 2 bytes long, so that's 8 extra clocks per iteration. Updating the address and count registers also had some overhead.
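The per-byte minimum follows from the bus timings if each unit moved costs one read and one write bus cycle. A Python sketch under that assumption, with the bus parameters taken from the table and the 21-cycle LDIR figure from the parent comment; the names are hypothetical:

```python
# (clocks per bus transfer, widest transfer in bytes)
BUS = {
    "Z80": (3, 1),
    "8088/80188/V20": (4, 1),
    "8086/80186/V30": (4, 2),
    "80286/80386SX": (2, 2),
    "80386DX": (2, 4),
}


def min_clocks_per_byte(clocks_per_transfer: int, bytes_per_transfer: int) -> float:
    """One read plus one write per transfer, spread over the bytes moved."""
    return 2 * clocks_per_transfer / bytes_per_transfer


for cpu, (clocks, width) in BUS.items():
    print(f"{cpu:16s} {min_clocks_per_byte(clocks, width):g}")

# LDIR per iteration: 2 opcode bytes refetched at 4 clocks each, plus the data
# read and write; whatever is left of the 21 clocks is internal overhead.
ldir_fetch = 2 * 4
ldir_data = int(min_clocks_per_byte(*BUS["Z80"]))
ldir_overhead = 21 - ldir_fetch - ldir_data
print(ldir_fetch, ldir_data, ldir_overhead)
```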

The microcode loop used by the 8086/8088 also had overhead; this was improved in the following generations. Then it became somewhat neglected, since compilers / runtime libraries preferred to use sequences of vector instructions instead.

And with modern processors there are a lot of complications due to cache lines and paging, so there's always some unavoidable overhead at the start to align everything properly, even if then the transfer rate is close to optimal.

MarkusQ 4mo ago

Did anyone else read the headline and think....Zilog? WTF?

rep_lodsb 4mo ago

Yes, with that name it really should have an emulation mode similar to the NEC V20/V30 (although that one only did 8080, not Z80)
