Stackful coroutines are clearly a viable tool, but they don't work in all use cases. They require either segmented stacks, a precise GC, or memory usage comparable to kernel threads. They are tricky to implement correctly on Windows. Etc.
In my ideal world, we'd have stackless coroutines with great debugger support everywhere, with languages free to experiment with the syntax- explicit suspension points, implicit suspension points, effect polymorphism to make it look like you're yielding across nested function calls, etc...
> They require either segmented stacks, a precise GC
Which is _exactly_ what stackless coroutines and promises do in a very roundabout manner. Some languages are moving toward annotations to automate chaining, but the problem with having to explicitly annotate coroutines is that you no longer have (or can have) first-class functions; at best you now have multiple classes of functions that only interoperate seamlessly with their own kind, which is the opposite of first-class functions. Plus it's much slower than just using a traditional call stack.Implementing transparently growable or moveable stacks can be difficult, yes. Solving C ABI FFI issues is a headache. And languages that stack-allocate variables but cannot easily move them are in a real pickle. Though, they can do as Go and only stack-allocate variables that don't have their address taken, and in any event that only applies to C++ and Rust. There's no excuse for all the other modern languages. Languages like Python and JavaScript don't have stackful coroutines because of short-sighted implementation decisions that are now too costly to revisit. Similarly, Perl 6 doesn't officially have them because they prematurely optimized their semantics for targets like the JVM where efficient implementations were thought to be difficult. (Moar VM implements stackful coroutines to support gather/take, which shows that it was simply easier to implement the more powerful construct in order to support the less powerful gather/take construct.)
If it were easy we wouldn't have stackless coroutines at all because they're objectively inferior in every respect, and absent external constraints (beholden to the C stack) can result in less memory usage and fewer wasted CPU cycles in both the common and edge cases. But both PUC Lua and LuaJIT do it properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.
I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.
For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) support something called Compact C Type Format (CTF) for light-weight type descriptions, which shows how system ABIs could usefully evolve.
Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics--properly abstract call stacks/call state into a first-class construct (i.e. thread of control), while simultaneously pushing the implementation details down so they can be performant.
Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.
Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.
When I say thread in my previous post I'm referring to the abstract construct that represents the flow of control of sequentially executed instructions, regardless of the underlying implementation, and regardless of how you initiate, yield, or resume that control. For languages that support function recursion that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most objects) in a program have a lifetime that naturally coincides with the execution of the enclosing function.
Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (i.e. IBM mainframes) implement call stacks as linked-lists of activation frames, which effectively is the data structure you're creating when you chain stackless coroutines.
The two best sources I've found that help untangle the mess in both theoretical and practical terms are
* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)
* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)
1/ The typical C stack is fixed size and code is written accordingly.
2/ `mmap` can allocate virtual memory, so you can allocate a large stack but that's an upper limit, it might be paged out if unused.
Thoughts?