Are stackless C++20 coroutines a problem?

I use stackless coroutines on small, hard-realtime ARM Cortex-M0 targets, with 32kb of RAM, where there's no heap allocator present at all: all memory is statically preallocated. The stackless coroutines are a make-or-break, and stackful coroutines that I had previously used were a pain to get right, and were essentially a hack wholly based on implementation-specific behavior. Going from that mess to standards-compliant, portable C++, was wonderful. I shudder to think that someone might suggest going back.

  • Stackless coroutines don't imply heap use: you have full control over how the coroutine frame is allocated (via void * operator new(size_t) member in promise type).

  • co_await can be nested just fine, in fact it's a common use case.

  • Stackful coroutines have to allocate those stacks somewhere as well, and it's perhaps ironic that they can't use the thread's primary stack for that. These stacks are allocated on the heap, perhaps via a pool allocator that gets a block from heap and then subdivides it.

  • Stackless coroutine implementations can elide frame allocation, such that the promise's operator new is not called at all, whereas stackful coroutines always allocate the stack for the coroutine, whether needed or not, because the compiler can't help the coroutine runtime with eliding it (at least not in C/C++).

  • The allocations can be elided precisely by using the stack where the compiler can prove that the life of the coroutine doesn't leave the scope of the caller. And that's the only way you can use alloca. So, the compiler already takes care of it for you. How cool is that!

    Now, there's no requirement that the compilers actually do this elision, but AFAIK all implementations out there do this, with some sane limits on how complex that "proof" can be - in some cases it's not a decidable problem (IIRC). Plus, it's easy to check whether the compiler did as you expected: if you know that all coroutines with a particular promise type are nested-only (reasonable in small embedded projects but not only!), you can declare operator new in the promise type but not define it, and then the code won't link if the compiler "goofed up".

    A pragma could be added to a particular compiler implementation to declare that a particular coroutine frame doesn't escape even if the compiler isn't clever enough to prove it - I didn't check if anyone bothered to write these yet, because my use cases are reasonable enough that the compiler always does the right thing.

    Memory allocated with alloca cannot be used after you return from the caller. The use case for alloca, in practice, is to be a slightly more portable way of expressing gcc's variable-size automatic array extension.

In essentially all implementations of stackful coroutines in C-like lanaguages, the one and only supposed "benefit" of stackfull-ness is that the frame is accessed using the usual base-pointer-relative addressing, and push and pop where appropriate, so "plain" C code can run on this made-up stack, with no changes to code generator. No benchmarks support this mode of thinking, though, if you have lots of coroutines active - it's a fine strategy if there's a limited number of them, and you have the memory to waste to start with.

Stack has to be overallocated, decreasing locality of reference: a typical stackful coroutine uses a full page for the stack at the minimum, and the cost of making this page available is not shared with anything else: the single coroutine has to bear it all. That's why it was worthwhile to develop stackless python for multiplayer game servers.

If there's a couple of couroutines only - no problem. If you've got thousands of network requests all handled by stackful coroutines, with a light networking stack that doesn't impose overhead that monopolizes the performance, the performance counters for cache misses will make you cry. As Nicol has stated in the other answer, this becomes somewhat less relevant the more layers there are between the coroutine and whatever asynchronous operation it's handling.

It has been long since any 32+-bit CPU had performance benefits inherent to memory access via any particular addressing mode. What matters is cache-friendly access patterns and leveraging prefetch, branch prediction and speculative execution. Paged memory and its backing store are just two further levels of cache (L4 and L5 on desktop CPUs).

  1. Why would C++ choose to use stackless coroutines? Because they perform better, and no worse. On the performance side, there can be only benefits to them. So it's a no-brainer, performance-wise, to just use them.

  2. Can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation. No. It'd be a solution to a nonexistent problem. Stackful coroutines don't actually allocate on the existing stack: they create new stacks, and those are allocated on the heap by default, just as C++ coroutine frames would be (by default).

  3. Are my assumptions about the c++ coroutines wrong, why? See above.

  4. More verbose code because of the need for custom allocators and memory pooling. If you want stackful coroutines to perform well, you'll be doing the same thing to manage the memory areas for the stacks, and it turns out that it's even harder. You need to minimize memory waste, and thus you need to minimally overallocate the stack for the 99.9% use case, and deal somehow with coroutines that exhaust this stack.

    One way I have dealt with it in C++ was by doing stack checks in branch points where code analysis indicates more stack may be needed, then if the stack would overflow, an exception was thrown, the coroutine's work undone (the design of the system had to support it!), and then the work restarted with more stack. It's an easy way to quickly lose benefits of tightly packed stack-fuls. Oh, and I had to provide my own __cxa_allocate_exception for that to work. Fun, eh?

One more anecdote: I'm playing with using coroutines inside Windows kernel-mode drivers, and there the stacklessness does matter - to the extent that if the hardware allows, you can allocate the packet buffer and the coroutine's frame together, and these pages are pinned when they are submitted to the network hardware for execution. When the interrupt handler resumes the coroutine, the page is there, and if the network card allows, it could even prefetch it for you so it'll be in the cache. So that works well - it's just one use case, but since you wanted embedded - I've got embedded :).

It's perhaps not common to think of drivers on desktop platforms as "embedded" code, but I see lots of similarities, and an embedded mindset is needed. Last thing you want is kernel code that allocates too much, especially if it would add per-thread overhead. A typical desktop PC has a few thousand threads present, and a lot of them are there to handle I/O. Now imagine a diskless system that uses iSCSI storage. On such a system, anything I/O bound that's not bound to USB or GPU will be bound to the network hardware and the networking stack.

Finally: Trust benchmarks, not me, and read Nicol's answer too!. My perspective is shaped by my use cases - I can generalize, but I claim no first-hand experience with coroutines in "generalist" code where performance is of less concern. Heap allocations for stackless coroutines are very often hardly noticeable in performance traces. In general-purpose application code, it's rarely going to be a problem. It does get "interesting" in library code, and some patterns have to be developed to allow the library user to customize this behavior. These patterns will be found and popularized as more libraries use C++ coroutines.


Forward: When this post says just "coroutines", I am referring to the concept of a coroutine, not the specific C++20 feature. When talking about this feature, I will refer to it as "co_await" or "co_await coroutines".

On dynamic allocation

Cppreference sometimes uses looser terminology than the standard. co_await as a feature "requires" dynamic allocation; whether this allocation comes from the heap or from a static block of memory or whatever is a matter for the provider of the allocation. Such allocations can be elided in arbitrary circumstances, but since the standard does not spell them out, you still have to assume that any co_await coroutine may dynamically allocate memory.

co_await coroutines do have mechanisms for users to provide allocation for the coroutine's state. So you can substitute the heap/free store allocation for any particular pool of memory you prefer.

co_await as a feature is well-designed to remove verbosity from the point of use for any co_await-able objects and functionality. The co_await machinery is incredibly complicated and intricate, with lots of interactions between objects of several types. But at the suspend/resume point, it always looks like co_await <some expression>. Adding allocator support to your awaitable objects and promises requires some verbosity, but that verbosity lives outside of the place where those things get used.

Using alloca for a coroutine would be... highly inappropriate for most uses of co_await. While the discussion around this feature tries to hide it, the fact of the matter is that co_await as a feature is designed for asynchronous use. That's its intended purpose: to halt the execution of a function and schedule that function's resumption on potentially another thread, then shepherding any eventually generated value to some receiving code which may be somewhat distant from the code which invoked the coroutine.

alloca is not appropriate for that particular use case, since the caller of the coroutine is allowed/encouraged to go do whatever so that the value can be generated by some other thread. The space allocated by alloca would therefore no longer exist, and that is kind of bad for the coroutine that lives in it.

Also note that allocation performance in such a scenario will generally be dwarfed by other considerations: thread scheduling, mutexes, and other things will often be needed to properly schedule the coroutine's resumption, not to mention the time it takes to get the value from whatever asynchronous process is providing it. So the fact that a dynamic allocation is needed is not really a substantial consideration in this case.

Now, there are circumstances where in-situ allocation would be appropriate. Generator use cases are for when you want to essentially pause a function and return a value, then pick up where the function left off and potentially return a new value. In these scenarios, the stack for the function which invokes the coroutine will certainly still be around.

co_await supports such scenarios (though co_yield), but it does so in a less-than-optimal way, at least in terms of the standard. Because the feature is designed for up-and-out suspension, turning it into a suspend-down coroutine has the effect of having this dynamic allocation that doesn't need to be dynamic.

This is why the standard does not require dynamic allocation; if a compiler is smart enough to detect a generator pattern of usage, then it can remove the dynamic allocation and just allocate the space on the local stack. But again, this is what a compiler can do, not must do.

In this case, alloca-based allocation would be appropriate.

How it got into the standard

The short version is that it got into the standard because the people behind it put in the work, and the people behind the alternatives did not.

Any coroutine idea is complicated, and there will always be questions about implementability with regard to them. For example, the "resumeable functions" proposals looked great, and I would have loved to see it in the standard. But nobody actually implemented it in a compiler. So nobody could prove that it was actually a thing you could do. Oh sure, it sounds implementable, but that doesn't mean it is implementable.

Remember what happened the last time "sounds implementable" was used as the basis for adopting a feature.

You don't want to standardize something if you don't know it can be implemented. And you don't want to standadize something if you don't know if it actually solves the intended problem.

Gor Nishanov and his team at Microsoft put in the work to implement co_await. They did this for years, refining their implementation and the like. Other people used their implementation in actual production code and seemed quite satisfied with its functionality. Clang even implemented it. As much as I personally don't like it, it is undeniable that co_await is a mature feature.

By contrast, the "core coroutines" alternatives that were brought up a year ago as competing ideas with co_await failed to gain traction in part because they were difficult to implement. That's why co_await was adopted: because it was a proven, mature, and sound tool that people wanted and had the demonstrated ability to improve their code.

co_await is not for everyone. Personally, I will likely not use it much, as fibers work much better for my use cases. But it is very good for its specific use case: up-and-out suspension.


stackless coroutines

  • stackless coroutines (C++20) do code transformation (state machine)
  • stackless in this case means, that the application stack is not used to store local variables (for instance variables in your algorithm)
  • otherwise the local variables of the stackless coroutine would be overwritten by invocations of ordinary functions after suspending the stackless coroutine
  • stackless coroutines do need memory to store local variables too, especially if the coroutine gets suspended the local variables need to be preserved
  • for this purpose stackless coroutines allocate and use a so-called activation record (equivalent to a stack frame)
  • suspending from a deep call stack is only possible if all functions in between are stackless coroutines too (viral; otherwise you would get a corrupted stack)
  • some clang developers are sceptical that the Heap Allocation eLision Optimization (HALO) can always be applied

stackful coroutines

  • in its essence a stackful coroutine simply switches stack and instruction pointer
  • allocate a side-stack that works like a ordinary stack (storing local variables, advancing the stack pointer for called functions)
  • the side-stack needs to be allocated only once (can also be pooled) and all subsequent function calls are fast (because only advancing the stack pointer)
  • each stackless coroutines requires its own activation record -> called in a deep call chain a lot activation records have to be created/allocated
  • stackful coroutines allow to suspend from a deep call chain while the functions in between can be ordinary functions (not viral)
  • a stackful coroutine can outlive its caller/creator
  • one version of the skynet benchmarks spawns 1 million stackful coroutines and shows that stackful coroutines are very efficient (outperforming version using threads)
  • a version of the skynet benchmark using stackless coroutiens was not implemented yet
  • boost.context represents the thread's primary stack as a stackful coroutine/fiber - even on ARM
  • boost.context supports on demand growing stacks (GCC split stacks)