Not necessarily; it's an interesting idea. Thanks to the branch predictor, the CPU already has a virtual view of the instruction stream. If we tolerate a bit of latency, all we have to do is inject a magic "jump to ISR" instruction into the predicted stream. It's rather like self-modifying code, except that the code in memory is never modified; the substitution happens only at the instruction fetch point. State still has to be saved, but that can be done with PUSH instructions in the ISR.
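To make that concrete, here is a toy sketch in Python of the fetch-point substitution. Everything in it is made up for illustration: the opcode names, `ISR_ADDR`, and the `run` function are hypothetical, and it ignores pipelining and prediction entirely. It also assumes the injected instruction saves the return address itself, like a call, which is my assumption rather than something stated above.

```python
ISR_ADDR = 100  # hypothetical ISR entry point

# Instruction memory: address -> (opcode, argument). Never written to.
PROGRAM = {
    0: ("NOP", None),
    1: ("NOP", None),
    2: ("NOP", None),
    3: ("HALT", None),
    # The ISR saves and restores state itself with PUSH/POP.
    100: ("PUSH", "r0"),
    101: ("POP", None),
    102: ("IRET", None),
}

def run(program, irq_at_cycle, max_cycles=50):
    """Run the toy machine and return the list of executed opcodes."""
    pc, stack, trace = 0, [], []
    irq_pending = False
    for cycle in range(max_cycles):
        if cycle == irq_at_cycle:
            irq_pending = True
        if irq_pending:
            # Injected at fetch: this instruction exists nowhere in memory.
            op, arg = "JMP_ISR", ISR_ADDR
            irq_pending = False
        else:
            op, arg = program[pc]
        trace.append(op)
        if op == "JMP_ISR":
            stack.append(pc)  # assumed: return to the displaced instruction
            pc = arg
        elif op == "PUSH":
            stack.append(arg)
            pc += 1
        elif op == "POP":
            stack.pop()
            pc += 1
        elif op == "IRET":
            pc = stack.pop()
        elif op == "HALT":
            break
        else:  # NOP
            pc += 1
    return trace
```

Calling `run(PROGRAM, irq_at_cycle=1)` diverts into the ISR on the second fetch and then resumes at the instruction that was displaced, while `PROGRAM` itself stays untouched.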
> Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"
It can be done with mailboxes/FIFOs, but yes, this requires a dedicated design. And of course the CPU that makes the call then sits idle while it waits, I think?
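A rough Python sketch of the mailbox idea, with two bounded FIFOs standing in for the hardware queues between an application core and a service core. The names (`requests`, `replies`, `HANDLERS`, `syscall`, `service_core`) and the two example calls are invented for illustration. Note that Python's `queue.Queue` relies on shared memory and locks internally, so this only models the message-passing semantics, not a coherency-free implementation.

```python
import queue
import threading

requests = queue.Queue(maxsize=4)  # application core -> service core
replies = queue.Queue(maxsize=4)   # service core -> application core

# Made-up "system calls" handled by the service core.
HANDLERS = {
    "add": lambda a, b: a + b,
    "neg": lambda a: -a,
}

def service_core():
    """Drain the request FIFO until a None sentinel arrives."""
    while True:
        msg = requests.get()
        if msg is None:
            break
        name, args = msg
        replies.put(HANDLERS[name](*args))

def syscall(name, *args):
    """Post a request, then block until the reply shows up."""
    requests.put((name, args))
    return replies.get()  # the calling core does nothing useful here

worker = threading.Thread(target=service_core, daemon=True)
worker.start()
```

The blocking `replies.get()` is where the calling CPU idles. Avoiding that would need the caller to post the request and carry on with other work, picking up the reply later.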