Most CPU instructions are 1-to-1 with their microcode. I dare say that microcode is nearly irrelevant, any high-performance instruction (ex: multiply, add, XOR, etc. etc.) is but a single instruction anyway.
Load/Store are memory dependent in all architectures. So that's just a different story as CPUs and GPUs have completely different ideas of how caches should work. (CPUs aim for latency, GPUs for bandwidth + incredibly large register spaces with substantial hiding of latency thanks to large occupancies).
-------------
That being said: reorder buffers on CPUs are well over 400-instructions these days, with super-large cores (like Apple's M4) is apparently on the order of 600 to 800 instructions.
Reorder buffers are _NOT_ translation. They're Tomasulo's algorithm (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm). If you want to know how CPUs do out-of-order, study that.
I'd say CPUs have small register spaces (16 architectural registers, maybe 32), but large register files of maybe 300 or 400+. Tomasulo's algorithm is used to out-of-order access registers.
You should think of instructions like "mov rax, [memory]" as closer to "rax = malloc(register); delayed-load(rax, memory); Out-of-order execute all instructions that don't use RAX ahead of us in instruction stream".
Tomasulo's algorithm means using ~300-register file to _pretend_ to be just 16 architectural registers. The 300 registers keeps the data out-of-order and allows you to execute. Registers in modern CPUs are closer to unique_ptr<int> in C++, assigning them frees (aka: reorder buffer) and also mallocs a new register off the register-file.