From the ISA perspective, however, adding more registers means you have to spend more bits naming a register. With 16 registers, you need 12 bits of your instruction just to name the operands of a typical 3-address instruction (rA = rB op rC). With 128 registers, that is now a whopping 21 bits, which means code density is a more pressing issue.
If the compiler could use the hidden registers, then the cpu would know that it could run this instruction ahead of time.
It is probably not worth it, since it adds a lot of complexity to an already complex system, which is why it isn’t done.
But actually the decision about which general purpose registers to use for what is made at compile time (hence we're discussion a compiler flag here, the frame pointer is not a hardware dictated feature), so the question is actually kind of moot. If the compiler is out of registers to allocate and instead uses the stack, the CPU isn't reasonably going to be able to undo that.
The current practice allows for CPUs to transparently increase their physical register count (to gain performance) and still run old code -- and older CPUs can still run new code. That's usually quite practical...
Adding more register names takes more bits for the register numbers -- which leads to larger instructions. It also leads to more complicated encodings if we want backwards compatibility. AMD64 does that by adding an optional prefix byte that carries a payload of 4 more instruction bits. That's one bit each for the three possible register names encoded in a traditional IA32 instruction + a bit to indicate whether to operate on 32-bit or 64-bit data (the actual rules are a bit more complex). Intel published a whitepaper recently suggesting a future encoding with a different (optional) prefix that encodes 8 more bits -- so each of the three register names can be extended to 5 bits (32 register names). It all ends up being quite complicated + new code won't run on older CPUs, which is not great.
I think you are suggesting not just bigger register names but also doing away with register renaming -- that would be... less than entirely useful because you would lose almost all your out-of-order capability and thereby almost all your ability to hide cache misses. Cache misses are very, very hard to predict statically (before actually running the code on a real CPU with real data) so good luck trying to do magic ahead-of-time allocation of those registers...
he AMD64 architecture only has 15 general-purpose registers (because the stack pointer is mostly treated as if it were a GPR as well). It is customary to use one of those (bp/ebp/rbp depending on mode) as a base pointer register. That leaves 14 GPR register names.
The physical CPU the code runs on might have 200 physical registers -- those are the ones that matter for speculative, out-of-order execution -- but the code itself can only refer to 14 (or 15) GPRs at a time and has to include instructions to transfer values to/from memory or to/from XMM registers if that's not enough. Those extra instructions take up space + might slow the code down.