They don't have to be. First of all, even ordinary threads are more efficient than you might think. On a really awful low-end Android 4.1 device, I can pthread_create and pthread_join over 5,000 threads per second. On a real computer, my X1 Carbon Gen4, I can create and join over 110,000 threads per second. (And keep in mind that each create-join pair also forces two full context switches.)
For most applications, performance of regular threads is perfectly adequate. In these environments, the maintainability and debuggability advantages of using plain old boring threads makes it really hard to justify using something exotic.
But suppose you do have big performance requirements: you can still use normal-looking threaded code. There's a difference between how we represent threads in source code and how we implement them. It's possible to provide green, userspace-switched threads without requiring "await" and "async" keywords everywhere. GNU Pth did it a long time ago, and there are lots of other fibers implementations.
> the CPU has to shuffle a lot of data every time you switch
Any green-threaded system (with or without explicit preemption points) also does context switches! Such a system maintains in user space a queue of things to work on: as the system switches from one of these work items to another, it's switching contexts! You have the same kind of register reloading and cache coldness problems that switching thread contexts has. There's no particular reason that you can do it much better than the kernel can do it, especially since switching threads in the same address space is pretty efficient.