Only stackless co-routines require a state machine transformation. User mode threading built on stackful co-routines generally just swaps the IO primitives for versions that issue the operation asynchronously and then immediately call into the user mode scheduler, which picks some ready-to-resume co-routine, switches to its stack, and resumes it. Such a runtime might include a preemption facility (beyond just the OS's preemption of the underlying kernel threads), but that is not required and is largely a language/runtime design decision.
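To make "state machine transformation" concrete, here is a minimal Python sketch. The generator is the co-routine as you would write it; `CountdownMachine` is a hand-written approximation of what a compiler for a stackless design would produce from it. The names `countdown_gen`, `CountdownMachine`, and `drain` are made up for illustration, and this is not how CPython actually implements generators.

```python
def countdown_gen(n):
    """Stackless co-routine written naturally; suspends at each yield."""
    while n > 0:
        yield n
        n -= 1


class CountdownMachine:
    """Hand-written equivalent: locals become fields, and an explicit
    state field records which suspension point we are parked at."""

    def __init__(self, n):
        self.n = n
        self.state = 0  # 0 = not started, 1 = suspended at the yield, 2 = done

    def resume(self):
        if self.state == 1:
            self.n -= 1  # the code that follows the yield
        if self.state == 2 or self.n <= 0:
            self.state = 2
            raise StopIteration
        self.state = 1
        return self.n  # the yielded value


def drain(machine):
    """Resume the machine until it finishes, collecting yielded values."""
    values = []
    while True:
        try:
            values.append(machine.resume())
        except StopIteration:
            return values
```

A stackful co-routine needs none of this: the suspended function's locals simply stay on its own stack until the scheduler switches back to it.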
The big headaches with stackful co-routine based user mode threading come from two sources. One is allocating the stack. If your language requires a contiguous stack, then you either need to make the stacks small and risk running out, or make them big, which can be a problem on 32-bit platforms (you can run out of address space) and on platforms with strict commit-charge based memory accounting. Both problems can be mitigated by allowing non-contiguous stacks or relocatable contiguous stacks (so that small stacks can grow later without headaches), although that obviously has performance implications.
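A rough back-of-the-envelope calculation shows why the stack size matters on 32-bit platforms. The numbers are assumptions for illustration: 2 GiB of usable user address space, nothing else mapped in the process, and two hypothetical stack sizes.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def max_stacks(address_space, stack_size):
    """Upper bound on contiguous stacks that fit in the address space,
    ignoring the heap, code, guard pages, and fragmentation."""
    return address_space // stack_size


# Assumed: 2 GiB of usable address space in a 32-bit process.
big = max_stacks(2 * GIB, 1 * MIB)    # 2048 co-routines at most
small = max_stacks(2 * GIB, 8 * KIB)  # 262144 co-routines at most
```

The real limits are lower once everything else in the process is accounted for, but the ratio is the point: large fixed stacks cap you at a few thousand co-routines.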
The other stackful co-routine headache is calling into code written in another language (i.e. via FFI), which may make direct blocking system calls and end up starving you of your OS threads.
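The starvation effect can be simulated with a plain thread pool standing in for the scheduler's OS threads. This is only a model, not a real user mode scheduler: `blocking_foreign_call`, `quick_task`, and `measure_starvation` are hypothetical names, and `time.sleep` stands in for a blocking system call made by foreign code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_foreign_call(seconds):
    """Stand-in for an FFI call that blocks its OS thread directly."""
    time.sleep(seconds)


def quick_task(submitted_at):
    """Stand-in for a ready co-routine; returns how long it waited to run."""
    return time.monotonic() - submitted_at


def measure_starvation(workers=2, block_for=0.2):
    """Occupy every worker with a blocking call, then time a quick task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(blocking_foreign_call, block_for)
        # No free OS thread is left, so this waits for a blocker to finish.
        future = pool.submit(quick_task, time.monotonic())
        return future.result()
```

With every worker stuck in a blocking call, the quick task waits roughly `block_for` seconds even though it needs almost no CPU time.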
I do agree that in purely CPU- or memory-bound applications a classical thread pool makes more sense. The advantages of either type of co-routine based user mode threading apply mainly to IO-heavy or mixed workloads.