undefined | Better HN

0 pointsRusky8y ago0 comments

There's nothing about stackless coroutines that means you can't have good stack traces. For example, C# already does it, at least to some degree.

Stackful coroutines are clearly a viable tool, but they don't work in all use cases. They require either segmented stacks, a precise GC, or memory usage comparable to kernel threads. They are tricky to implement correctly on Windows. Etc.

In my ideal world, we'd have stackless coroutines with great debugger support everywhere, with languages free to experiment with the syntax- explicit suspension points, implicit suspension points, effect polymorphism to make it look like you're yielding across nested function calls, etc...

0 comments

10 comments · 2 top-level

wahern8y ago· 7 in thread

  > They require either segmented stacks, a precise GC

Which is _exactly_ what stackless coroutines and promises do in a very roundabout manner. Some languages are moving toward annotations to automate chaining, but the problem with having to explicitly annotate coroutines is that you no longer have (or can have) first-class functions; at best you now have multiple classes of functions that only interoperate seamlessly with their own kind, which is the opposite of first-class functions. Plus it's much slower than just using a traditional call stack.

Implementing transparently growable or moveable stacks can be difficult, yes. Solving C ABI FFI issues is a headache. And languages that stack-allocate variables but cannot easily move them are in a real pickle. Though, they can do as Go and only stack-allocate variables that don't have their address taken, and in any event that only applies to C++ and Rust. There's no excuse for all the other modern languages. Languages like Python and JavaScript don't have stackful coroutines because of short-sighted implementation decisions that are now too costly to revisit. Similarly, Perl 6 doesn't officially have them because they prematurely optimized their semantics for targets like the JVM where efficient implementations were thought to be difficult. (Moar VM implements stackful coroutines to support gather/take, which shows that it was simply easier to implement the more powerful construct in order to support the less powerful gather/take construct.)

If it were easy we wouldn't have stackless coroutines at all because they're objectively inferior in every respect, and absent external constraints (beholden to the C stack) can result in less memory usage and fewer wasted CPU cycles in both the common and edge cases. But both PUC Lua and LuaJIT do it properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.

I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.

For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) support something called Compact C Type Format (CTF) for light-weight type descriptions, which shows how system ABIs could usefully evolve.

Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics--properly abstract call stacks/call state into a first-class construct (i.e. thread of control), while simultaneously pushing the implementation details down so they can be performant.

pcwalton8y ago

I think you're misunderstanding the benefit of stackless coroutines. Here's the core of the issue: Stacks, by definition, are dynamically growable objects. But sometimes (in most cases?) the state that needs to be stored across yield points can be statically bounded up front. In that case, stackful coroutines (really, threads) add unnecessary overhead, because you pay for that ability to grow. Golang, for instance, has to start stacks small and grow them by copying, because it can't reliably statically analyze the goroutine to determine the needed stack space. A properly-implemented stackless coroutine system (or even the classic C approach of writing state machines by hand), by contrast, can determine the amount of space necessary for each coroutine and allocate it precisely. There are even slab allocation tricks for many coroutines of the same size. Fundamentally, this is why manually-written C code using epoll can outperform any stackful coroutine system: optimal allocation of stackful coroutines in a language with recursion and first-class functions requires higher-order control flow analysis, which is generally considered infeasible.

Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.

RuskyOP8y ago

I'm not at all convinced. Stackless coroutines have a statically-known fixed stack size, which can be treated as an anonymous type a la C++ or Rust closures. This is far cheaper and easier to optimize than any possible implementation of stackful coroutines, regardless of what your platform ABI looks like.

Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.

zwieback8y ago

This seems like a super-interesting discussion but I can't quite follow due to the mixed-up naming. Is there a good paper or website that clarifies what stackful/less means in the various contexts?

wahern8y ago

The distinction are difficult to grasp, and it doesn't help that meanings evolve. For example, many people today understand "thread" to mean a preemptively scheduled kernel resource with a contiguous region of memory used for storing objects and return addresses, which shares an address space with other such threads. That sort of "thread" is what 10 years ago many people called a LWP (lightweight process), when thread didn't imply preemptive scheduling. And what was a "thread" 10 years ago is often called a "fiber" or "microthread" today. Though I'm coming from the world of Unix. I think connotations had a slightly different evolution in the world of Windows, largely because Unix emphasized the "process" (concurrent execution, non-shared address space) and modern concurrent execution with shared address space were only formalized in the late 1990s, whereas Windows didn't get isolated virtual address spaces until Windows 95, so the primary construct was always a "thread" (with or without preemptive execution, depending on era).

When I say thread in my previous post I'm referring to the abstract construct that represents the flow of control of sequentially executed instructions, regardless of the underlying implementation, and regardless of how you initiate, yield, or resume that control. For languages that support function recursion that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most objects) in a program have a lifetime that naturally coincides with the execution of the enclosing function.

Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (i.e. IBM mainframes) implement call stacks as linked-lists of activation frames, which effectively is the data structure you're creating when you chain stackless coroutines.

The two best sources I've found that help untangle the mess in both theoretical and practical terms are

* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)

* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)

1 more reply

lboasso8y ago

A good start could be "Coroutines in Lua" [1]. It explains well the terminology and makes comparisons between designs in different languages.

[1] http://www.inf.puc-rio.br/~roberto/docs/corosblp.pdf

2 more replies

ioquatix8y ago

I appreciate all your points. However, in my experience implementing coroutines and fibers, where I allocated stacks using `mmap`, with guard pages, I'm not completely convinced that fixed-size stacks are an issue:

1/ The typical C stack is fixed size and code is written accordingly.

2/ `mmap` can allocate virtual memory, so you can allocate a large stack but that's an upper limit, it might be paged out if unused.

Thoughts?

WillReplyfFood8y ago

How about fixed size stacks- with a canary of a expansionfunction at the top? That way you can have it both- the speed and fixed size- and expandability on demand- by simply checking for the canary in a pattern and executing it ..

bonzini8y ago· 1 in thread

Stackful coroutines are trivial to implement on Windows---it supports "fibers" natively.

RuskyOP8y ago

Trivial to use the included fiber API; impossible to implement yourself without the risk of a future OS update breaking your implementation.

j / k navigate · click thread line to collapse

0 comments

10 comments · 2 top-level

wahern8y ago· 7 in thread

  > They require either segmented stacks, a precise GC

pcwalton8y ago

RuskyOP8y ago

Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.

zwieback8y ago

This seems like a super-interesting discussion but I can't quite follow due to the mixed-up naming. Is there a good paper or website that clarifies what stackful/less means in the various contexts?

wahern8y ago

The two best sources I've found that help untangle the mess in both theoretical and practical terms are

* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)

* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)

1 more reply

lboasso8y ago

A good start could be "Coroutines in Lua" [1]. It explains well the terminology and makes comparisons between designs in different languages.

[1] http://www.inf.puc-rio.br/~roberto/docs/corosblp.pdf

2 more replies

ioquatix8y ago

1/ The typical C stack is fixed size and code is written accordingly.

2/ `mmap` can allocate virtual memory, so you can allocate a large stack but that's an upper limit, it might be paged out if unused.

Thoughts?

WillReplyfFood8y ago

bonzini8y ago· 1 in thread

Stackful coroutines are trivial to implement on Windows---it supports "fibers" natively.

RuskyOP8y ago

Trivial to use the included fiber API; impossible to implement yourself without the risk of a future OS update breaking your implementation.

j / k navigate · click thread line to collapse