My experience from actually writing low-latency schedulers in user space, as well as the publicly available material (Ardour's, for example), points to different conclusions from yours.
Keep in mind that a naive benchmark like "CPU usage" is entirely meaningless. What you should look at is the round-trip latency required to stay under a given threshold of underruns/missed deadlines. Threading requires additional latency, and process synchronization even more. While I'm sure you see fewer underruns when splitting off into sandboxed plugins, I suspect it's hitting the same performance, in terms of latency, as simply doubling or tripling the buffer size would have in the first place.
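
To put rough numbers on the buffer-size comparison, here is a minimal sketch. The 48 kHz sample rate, the 128-frame base buffer, and the "two periods on input plus two on output" round-trip model are all assumptions picked for illustration, not measurements from any particular system, and the function names are hypothetical.

```python
def buffer_latency_ms(frames: int, sample_rate: int = 48_000) -> float:
    """Duration of one buffer of `frames` samples, in milliseconds."""
    return frames / sample_rate * 1000.0


def round_trip_ms(frames: int, sample_rate: int = 48_000, periods: int = 2) -> float:
    """Crude round-trip estimate: `periods` buffers on input plus `periods` on output.

    Assumption: converter, driver, and scheduling overhead are ignored,
    so real figures will be higher.
    """
    return 2 * periods * buffer_latency_ms(frames, sample_rate)


base_frames = 128  # assumed base buffer size
for factor in (1, 2, 3):
    frames = base_frames * factor
    print(
        f"{frames:4d} frames: "
        f"{buffer_latency_ms(frames):.2f} ms per buffer, "
        f"~{round_trip_ms(frames):.2f} ms round trip"
    )
```

The point of the comparison: whatever latency a threaded or sandboxed design adds has to be weighed against these figures, at the same underrun threshold, rather than against CPU usage.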