There's already many fast, portable, standalone C libraries for this. For instance, see table 1 of
https://www.gnu.org/software/pth/rse-pmt.ps. The assembly code for a context switch is pretty minimal (on most platforms, it's just setjmp and longjmp) if you don't try manage the signal state. I would be surprised if there was significant variance in performance (other than whether it chooses to switch the signal mask). Additionally, many language runtimes directly support it without any extra effort on your part. So if you choose to instead use one of those language, you get a fast portable solution without needing to do any extra work to pick a support library (for example, D-Lang LDC, PyPy, Go, Julia).