Show HN: Coros – A Modern C++ Library for Task Parallelism

129 points by singledigits 9 months ago

Hello Hacker News.

I’m Martin, a graduate student from Prague, and I’ve been working on Coros, a C++ library for task-based parallelism.

After spending some time with OpenMP and oneTBB, I wanted to try building a library using modern features from the C++ standard library. I’ve used coroutines for task encapsulation and C++23 expected for exception handling, while trying to maintain good performance.

Additionally, I’ve implemented monadic-like behavior to allow easy chaining of tasks, similar to the monadic operations in std::expected.

You can check out the project here: https://github.com/mtmucha/coros

While this library isn’t fully-fledged or production-ready, I’d really appreciate your feedback!

throwaway17_17 9 months ago

I am pretty okay with the code (I'm essentially talking about the usage syntax for the library and its type) shown in the examples. However, at this point any parallel computing implementation must address the baseline issues presented in "Scalability! But at what COST?" (McSherry, Isard, Murray 2015), a paper whose central question is whether a parallel computation can exhibit a Configuration that Outperforms a Single Thread (the COST in the title). [1] There is a good discussion of the paper and its applicability to parallel (and distributed) computation implementations in Richard Feldman's 2024 Distributed Systems talk "Distributed Pure Functions". [2]

At this point in the life-cycle of the concept of parallel computation, I think it has become somewhat imperative that devs in the area begin to honestly evaluate the practicality and benefits/drawbacks of using the techniques for a given application area and attempt to 'sell' their libraries, techniques, idioms, etc using a more transparent approach. Also, I generally think that people that argue for more prevalence of parallel code, especially those arguing for the default being parallel (or concurrent), have to wrestle with and address these same issues.

Again, I don't dislike the premise of the library, think the usage examples seem very sensible and well designed, and I really like parallel computation as an area of study in general. Further, I really think that setting out a task for one's self

'to try building a library using modern features from the C++ standard library. I’ve used coroutines for task encapsulation and C++23 expected for exception handling, while trying to maintain good performance.'

after taking inspiration from two well respected and frequently utilized libraries in the space is great and the internals of the library I saw look clean and well architected.

1 - https://www.usenix.org/system/files/conference/hotos15/hotos...
2 - https://youtu.be/ztY1YRiaSiE?si=npBREw9vdF5dHcJh&t=350

SolarNet 9 months ago

I think you are misapplying that paper? This as a library is the "batteries" to C++'s no-batteries-included standard library which does not implement asynchronous coroutines at all.
The paper is much more on the side of application and system performance. But you couldn't even write such a system without a library like this providing you the tools to do so. This is much more in the domain of "basic tool for ecosystem" than "library for specific tasks". It's on the user of the tool to address the paper's question, not the builder of the tools.
- throwaway17_17 9 months ago
  
  You are not incorrect in stating that the primary focus of the paper is more on the application side. However, I think providers of a parallel computation infrastructure would benefit from profiling a wide range of potential use cases across several work load sizes. This could then lead to a section in a README where the baseline overhead was broken down per workload/worksize measurements and a back of the envelope estimate by an application developer would be more particularly motivated when deciding which infrastructure tool may be the best fit for their application's specific requirements.
  - phaedrus 9 months ago
    
    "Concurrency != parallelism" is an important distinction in this context. The base C++ coroutines feature is not about threads or parallel processing, but rather is a generalization of the concept of "subroutine" with respect to control flow and stack usage. An example using coroutines to service many tasks (if not otherwise involving threading features) is not much different (at the level of what the CPU sees) from a single threaded implementation using a loop or continuation passing to process concurrent tasks. Performance should be identical between both the language-supported coroutines code and the manually implemented single threaded loop if the work is batched the same.
    
    throwaway17_17 9 months ago
    
    I would tend to agree with your opening assertion that Concurrency != Parallelism. However, I'm not sure that it is really germane in this situation. I am aware that the library in question here uses C++ coroutines for 'task encapsulation' according to the developer; however, this library is being compared directly to TBB and OpenMP, which are two of the 'go-to' implementations for parallel computation. So I don't think the focus on parallelism in this comment chain is in any way inaccurate or misapplied.
singledigits 9 months ago

Thank you for your thoughtful feedback.
I've just skimmed through the paper, and it raises an interesting and valid point about scalability in parallel computing. I'll definitely look into it more thoroughly, as well as the talk you mentioned.
I'm glad you find the usage examples well-designed and appreciate your positive remarks about the library's architecture. Thank you again for your insights.

Koshkin 9 months ago

There's also a high-quality, sophisticated Threading Building Blocks by Intel (which I wish would become a part of the C++ standard library).

https://en.wikipedia.org/wiki/Threading_Building_Blocks

jcelerier 9 months ago

TBB was already far from the state-of-the-art 7/8 years ago, and there are continuously new approaches that outperform it such as https://github.com/taskflow/taskflow ; https://github.com/google/marl ; and the most recent contender https://github.com/dpuyda/scheduling
- krapht 9 months ago
  
  I know TBB will still be supported 5 years from now, though.
secondcoming 9 months ago

TBB is a required dependency on some systems if you use ‘std::execution::parallel_policy’ functions
Zitrax 9 months ago

You can see in the repository that it was benchmarked against oneTBB.

dkasper 9 months ago

At Meta folly::Coro is used pretty heavily. Have you taken a look at it? Wondering if there are any advantages. The api seems fairly similar to me at a glance.

https://github.com/facebook/folly/tree/main/folly/experiment...

singledigits 9 months ago

Thank you for bringing this up.
I hadn’t come across folly::Coro until now. It does seem quite similar at first glance, and some of the utility functions they have are ones I’m also planning to implement, as others have also pointed out are currently missing.
One difference is that they use a custom Try<T> type for handling exceptions and values, whereas I’ve opted for std::expected, introduced in C++23. I’ve also added "monadic-like" chaining of tasks.
Overall, it’s a very similar library, and I’ll definitely look into it for inspiration and potential improvements.

jcarrano 9 months ago

It would be great to have examples with I/O and timers/sleeps, since this is what people will likely use the library for. Also, communication between tasks.

I tried getting into C++ coroutines in the past and I was put off because of the complexity and the lack of an I/O system that was understandable by a human being.

gsliepen 9 months ago

It would be nice if there was a function to wait for tasks and to return the results at the same time, so that you could write something like:

    auto [a, b] = co_await coros::wait_tasks(fib(n - 1), fib(n - 2));
    co_return a + b;

singledigits 9 months ago

Thank you for your feedback.
I understand that working with tasks and retrieving values can feel a bit clunky. The main reason I've structured it this way is that individual tasks are RAII objects, and their coroutine state is destroyed once they go out of scope. However, I could modify the awaitable returned from wait_tasks to store tasks, and then return values directly to the user. This could definitely be a more ergonomic overload for the function. I'll look into it!
fooker 9 months ago

If you need this interface, use threads.
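
For comparison, here is a rough sketch of what "use threads" could look like for the Fibonacci case, using only std::async from the standard library. The names fib_pair and parallel_fib are made up for this illustration and have nothing to do with the Coros API; only the top-level call forks, so the sketch does not spawn a thread per recursive call.

```cpp
#include <future>
#include <utility>

// Plain sequential Fibonacci, used for both branches.
long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

// Fork-join with a thread: run fib(n - 1) on another thread, compute
// fib(n - 2) on the calling thread, then wait and return both results.
std::pair<long, long> fib_pair(int n) {
    std::future<long> left = std::async(std::launch::async, fib, n - 1);
    long right = fib(n - 2);
    return {left.get(), right};  // get() blocks until the other thread is done
}

// Hypothetical wrapper showing the "wait and unpack" usage style.
long parallel_fib(int n) {
    if (n < 2) return n;
    auto [a, b] = fib_pair(n);
    return a + b;
}
```

The trade-off is that each fork costs an OS thread, which is exactly the overhead a task scheduler with work stealing tries to avoid.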

throwaway_94404 9 months ago

I just can't get my brain around coroutines.

Can anyone recommend a good tutorial or resource for me to read.

I find it so frustrating as I don't think it's necessarily a complex subject but my brain just doesn't get it.

Related perhaps but many (many, many) years ago, when learning BASIC, I assumed GOSUB went off and started executing the code in the subroutine as well as the rest of the inline code. That suggests to me that I should perhaps have a deeper understanding of this but I really don't...

singledigits 9 months ago

I feel you! Coroutines can be tricky at first. I recommend Lewis Baker's blog about coroutines [1], which is detailed and insightful. Additionally, cppreference [2] is a great resource to understand how coroutines work in C++.
In a nutshell, C++ coroutines are almost like regular functions, except that they can be "paused" (suspended), and their state is stored on the heap so they can be resumed later. When you resume a coroutine, its state is loaded back, and execution continues from where it left off.
The complicated part comes from the interfaces through which you use coroutines in C++. Each coroutine needs to be associated with a promise object, which defines how the coroutine behaves (for example, what happens when you co_return a value). Then, there are awaiters, which define what happens when you co_await them. For example, C++ provides a built-in awaiter called suspend_always{}, which you can co_await to pause the coroutine.
If you take your time and go thoroughly through the blog and Cppreference, you'll definitely get the hang of it.
Hope this helps.
[1] https://lewissbaker.github.io/ [2] https://en.cppreference.com/w/cpp/language/coroutines
- loeg 9 months ago
  
  They're just green threads with some nice syntax sugar, right? Instead of an OS-level "pause" with a futex-wait or sleep (resumed by the kernel scheduler), they do an application-level pause and require some application-level scheduler. (But coroutines can still call library or kernel functions that block/sleep, breaking userspace scheduling?)
  - singledigits 9 months ago
    
    Yes, exactly. Coroutines are one possible implementation of green threads. Once they are scheduled/loaded on an OS thread, they behave just like regular functions with their own call stack. This means they can indeed call blocking operations at the OS level. A possible approach to handle such operations would be to wrap the blocking call, suspend the coroutine, and then resume it once the operation is complete, perhaps by polling (checking for completion).
marhee 9 months ago

Coroutines are just multiple call stacks. If coroutine A calls coroutine B then B executes on its own stack and can ‘yield’ a value back to A. Yielding is just a return across stacks without destroying the current stack. So A continues with the yielded value on its own stack and, when ready, calls B again, which continues on its own stack with the next statement after the previous yield. Etc.
Notice that this does not necessarily involve parallelism, although it can. For example, Lua has non-parallel (cooperative) coroutines. Go has parallel coroutines, called goroutines, but theoretically only if they use channels to exchange values. Otherwise, if they’re not exchanging information, they would not be coroutines in the sense that they work together in solving something.
jpc0 9 months ago

Dumbed down way too far.
A coroutine is a function that can remember where it is in its own execution, so when it is called later it continues execution where it left off.
There are many many ways of implementing that functionality, C++ standard coroutines are only one such implementation.
What you do with them is whatever you want, it's pretty common to handle IO using them but generators are also a pretty common example. But that is generally high level.
C++ coroutines are basic building blocks and are very low level, there is no executor ( rust tokio / python asyncio ) so don't be worried if it seems hard to use, it is hard to use.
Look at std::generator for how coroutines are used to implement a generator, cppcoro is also a pretty popular library that builds abstractions on top of coroutines and also has some executors if I remember correctly.
SolarNet 9 months ago

Co-routines can be a nebulous sort of concept because it means different things in different places and not all of them have the same features. But some of the big points are:
- Heap allocated call frame. Instead of being pushed onto the stack, co-routines tend to have their call frame (local variables, arguments, etc.) placed into heap memory (or at least may be place-able into heap memory). This often enables the other features.
- Control can leave co-routines in more ways than standard function calls. Generally this means returning (often called "yield") to the caller without completing the whole function. It can then be later resumed, returning to where the function originally left off. Generators are a common pattern enabled by co-routines that rely on only this part (and so many systems can optimize out the heap usage, for example).
- A co-routine is usually an object with an interface that allows you to move it around and resume it in different places than it was originally called. This can include on different threads, or depending on the sophistication of the system, different processes or machines.
Those are the three big points in my mind. I'd recommend trying Lua coroutines, personally (I like minimalist engines like Defold to use it in) to really get a feel for how these are on the edge between "language feature" and "library feature".
dxuh 9 months ago

Coroutines themselves are a really simple concept. But in practice they give you all the headaches async stuff generally gives you. And in C++ there is a ton of extra complication, especially because there is no support library. I wrote this in a tutorial a while ago:
> they are functions that can suspend themselves, meaning they stop themselves without returning, even in the middle of their body, and can later be resumed, continuing execution at the point they suspended from earlier.
If you want to use coroutines in C++ specifically you can have a look at this tutorial, if you want: https://theshoemaker.de/posts/yet-another-cpp-coroutine-tuto... I don't know of anyone that read it, but I spent a lot of time on it.
It essentially tries to explain how to build a coroutine support library yourself, but if you don't care about that, skip it and just use libcoro or cppcoro. They have examples too. My little async io library has some examples as well if you want to get an idea.
Koshkin 9 months ago

One way to get a sense of coroutines is to consider the behavior presented by the async/await design pattern [1], where 'await' suspends the execution of the currently running code and yields control to the 'async' task. (As an adage goes, "async is not asynchronous, and await does not await anything.") Yet another pattern is "promise/future", where the code execution is (or may be) suspended as soon as the code tries to obtain the promised result.
[1] https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous...
dataflow 9 months ago

Do you mean C++ coroutines, or coroutines in general? If you're new to the concept I would try to start with Python's, then Javascript or C#. C++'s is way more complicated.
- pjmlp 9 months ago
  
  Note that C# and C++ are quite similar; the biggest differences are the lifetime gotchas and the lack of a coroutine runtime in the standard library.
  Their design has a common source, and the magic methods for awaitables as well.
binary132 9 months ago

I didn't understand C++ coroutines until I learned to use Lua coroutines. It's basically not that different from gotos, if goto saved local state.
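
The "goto that saves local state" idea can be sketched without any coroutine support at all, as a hand-written state machine. Countdown is a made-up class for this illustration; the locals and the resume point are stored in the object, and the switch jumps back to where the previous call left off.

```cpp
// A hand-written "coroutine": state that would normally sit on the stack
// lives in the object, together with the point to continue from.
class Countdown {
public:
    explicit Countdown(int from) : n_(from) {}

    // Each call continues after the previous "yield".
    // Returns true and sets `out` while there are values, false when finished.
    bool resume(int& out) {
        switch (state_) {
        case 0:  // first call: start at the top
            while (n_ > 0) {
                out = n_;
                state_ = 1;   // save the resume point
                return true;  // "yield" the current value
        case 1:  // later calls jump straight back into the loop body
                --n_;
            }
            state_ = 2;  // finished: every later call falls into default
            break;
        default:
            break;
        }
        return false;
    }

private:
    int n_;          // the saved "local variable"
    int state_ = 0;  // the saved "program counter"
};
```

Roughly speaking, this is the transformation a compiler performs for a stackless coroutine, except that there the saved state goes into a compiler-generated frame rather than a class you write yourself.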
baq 9 months ago

imagine a virtual (green) thread which the kernel doesn't run in parallel until you tell it it's ok to do so (when you explicitly yield control) and then can continue from that place when you explicitly tell it to.
you can even try to run those virtual threads on real threads. much fun to be had.
IshKebab 9 months ago

Yeah there aren't many good resources on it unfortunately. One thing to note is C++20's coroutine support is really low level. It's designed for library authors so that they can build the kinds of things "normal" people want - tasks, generators, futures, promises, etc.
This video is the best intro I've found. It actually explains what is happening in memory, which is the only way to really understand anything in C++.
https://youtu.be/aibjUHx7vew
Also this is decent:
https://www.scs.stanford.edu/~dm/blog/c++-coroutines.html
But don't try and write a coroutine library yourself. Use something like libcoro.

leeter 9 months ago

Looks like a good start. I'm not actually sure I'd use it on windows however. CPPWinrt has a really decent coroutine support library with tools like winrt::resume_background() [1], I use it extensively even in desktop apps because it makes using the windows threadpool (which is active by default for all windows processes since at least windows 7) trivial. I've basically moved most of my threading code onto that unless I need a dedicated thread to hold a context for some reason. But, that's a windows specific thing as far as I know.

[1] https://learn.microsoft.com/en-us/uwp/cpp-ref-for-winrt/resu...

singledigits 9 months ago

Thank you for your response!
I don't have experience with WinRT, but it does seem quite similar at first glance. One of the key reasons I focused on modern C++ was to ensure cross-platform compatibility. However, I completely understand that if you're working on Windows and are already familiar with WinRT, sticking with it makes perfect sense. I'll take a closer look at WinRT to see if there are any significant differences.
- leeter 9 months ago
  
  My suggestion is aim for compatibility with cppwinrt, but not anything else. That way devs can freely intermix and get the best of the utilities of both.

viralsink 9 months ago

Is there a way to prevent callback hell in C++ when doing asynchronous communication with C++ before 20? Coroutines seem to be the only clean solution. Promises can work, but they tend to be difficult to reason about if branching is involved.

darknavi 9 months ago

Traditionally the way to prevent "callback hell" is to use something like async/await syntax. Without that there aren't a ton of good options. Like you mentioned, you could switch to promises with polling.
gpderetta 9 months ago

There are library-only stackful coroutines options, like in boost.

BenFrantzDale 9 months ago

Have you compared perf with the reference implementation of the P2300 “Senders” proposal? https://github.com/NVIDIA/stdexec

mrkent27 9 months ago

I came here to comment the same thing. Senders/receivers have the added advantage of not allocating, unlike coroutines, which (usually) have to allocate on the heap.
- singledigits 9 months ago
  
  I haven't compared performance with the P2300 proposal yet. It seems like it's trying to unify asynchronous and parallel execution for C++, which is much broader in scope than my library.
  It's true that coroutines can avoid heap allocation, but I haven't tested when or if that happens in my implementation. From the papers, it's clear that certain conditions must be met for the compiler to optimize this. If you know of any good sources on this, please let me know.
  I think it's definitely worth looking into this optimization and possibly using custom allocators for specific cases. I'll also compare performance with the proposal's implementation[1] to see the difference.
  Thank you for your feedback.
  [1] https://github.com/NVIDIA/stdexec

ska 9 months ago

Martin, interesting; have you had a look at https://github.com/taskflow/taskflow ?

singledigits 9 months ago

Thank you for asking!
I've skimmed through Taskflow, and from what I understand, its main focus is on graph parallelism, allowing users to express computations as a graph.
I haven't done extensive benchmarking against 3rd-party libraries yet, which others have also mentioned. I'll definitely do more performance testing in the future to better assess and optimize performance.
- ska 9 months ago
  
  I think that’s about right on focus, I brought it up more for the “modern c++” aspect…
  - singledigits 9 months ago
    
    Thank you for clearing that up.
    Regarding modern features, for example, the return value from the executor in TaskFlow is a custom tf::Future derived from std::future. This means if you want to check for the result, you need to use a try-catch block with the get() method on the future.
    Personally, I prefer using std::expected for value/exception handling. It allows checking for errors with a simple if statement, and if you don't want to handle the exception, you can just return an error value from the coroutine.
    As for "monadic-chaining," TaskFlow can achieve the same thing with its easy graph construction, you can set up a graph and execute it, which is comparable to and_then chaining in my library.
    Another point is related to performance and ease of use. In a simple example like Fibonacci, TaskFlow requires using subflows, which I feel is less "ergonomic".
    Overall, for simple task parallelism, if you don't need the graph expressiveness of TaskFlow, I believe my library is a more "ergonomic" choice (though I may be biased here). I also find value handling simpler in my library with std::expected. That said, TaskFlow is a much larger library with more features like GPU integration.
    I hope this better addresses your point!

tlb 9 months ago

In your deque/circular buffer implementation, how is it able to grow the queue without locking?

The code seems to rely on atomics for head & tail, but grows the queue without any special provisions I can see.

https://github.com/mtmucha/coros/blob/ee30d3c1d0602c3071aa26...

singledigits 9 months ago

The concept behind the deque is explained in Correct and Efficient Work-Stealing for Weak Memory Models [1].
The idea is that only the owning thread can push tasks into the deque. If the owning thread detects that the deque is full, it creates a new one and copies the original values. Once the copy is ready, the owning thread "publishes" it by storing it in the buffer variable. Pointers to the deque are atomic, as well as the indices. Other threads can manipulate only the indices, and even if a stealing thread has an old pointer, it still points to valid data.
I hope I understood your question correctly and that this answer is helpful. You can find more details in the paper mentioned above.
[1] https://inria.hal.science/hal-00802885/document
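
The growth step described above can be sketched roughly as follows. This is a simplified, owner-side-only illustration, not the code from the repository or the paper: Ring and GrowableDeque are made-up names, the steal and pop paths are omitted, the memory orderings are simplified compared to the fences used in the paper, and old rings are simply kept alive until the deque is destroyed.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Fixed-size circular array; capacity is a power of two so indices wrap with a mask.
class Ring {
public:
    explicit Ring(std::size_t capacity)
        : mask_(capacity - 1), slots_(new std::atomic<int>[capacity]) {}
    std::size_t capacity() const { return mask_ + 1; }
    int get(std::int64_t i) const {
        return slots_[static_cast<std::size_t>(i) & mask_].load(std::memory_order_relaxed);
    }
    void put(std::int64_t i, int v) {
        slots_[static_cast<std::size_t>(i) & mask_].store(v, std::memory_order_relaxed);
    }
private:
    std::size_t mask_;
    std::unique_ptr<std::atomic<int>[]> slots_;
};

// Owner-side half of a growable work-stealing deque.
class GrowableDeque {
public:
    GrowableDeque() {
        rings_.push_back(std::make_unique<Ring>(2));
        buffer_.store(rings_.back().get(), std::memory_order_relaxed);
    }

    // Called only by the owning thread.
    void push(int value) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed);
        std::int64_t t = top_.load(std::memory_order_acquire);
        Ring* ring = buffer_.load(std::memory_order_relaxed);
        if (b - t >= static_cast<std::int64_t>(ring->capacity())) {
            // Full: copy the live entries into a ring twice as large.
            rings_.push_back(std::make_unique<Ring>(ring->capacity() * 2));
            Ring* bigger = rings_.back().get();
            for (std::int64_t i = t; i < b; ++i) bigger->put(i, ring->get(i));
            // Publish the new ring. The old one stays alive in rings_, so a
            // thief still holding the old pointer keeps reading valid data.
            buffer_.store(bigger, std::memory_order_release);
            ring = bigger;
        }
        ring->put(b, value);
        bottom_.store(b + 1, std::memory_order_release);
    }

    std::size_t capacity() const { return buffer_.load(std::memory_order_acquire)->capacity(); }
    std::int64_t size() const { return bottom_.load() - top_.load(); }
    // Reads the i-th element counted from the top (inspection helper for the example).
    int at(std::int64_t i) const {
        return buffer_.load(std::memory_order_acquire)->get(top_.load() + i);
    }

private:
    std::atomic<std::int64_t> top_{0};
    std::atomic<std::int64_t> bottom_{0};
    std::atomic<Ring*> buffer_{nullptr};
    std::vector<std::unique_ptr<Ring>> rings_;  // owns every ring ever published
};
```

Because indices are never reset when the ring grows, the same index refers to the same logical element in both the old and the new ring, which is what lets thieves keep working with plain index arithmetic.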

cxx 9 months ago

This looks very promising, it's refreshing to see a library with a sane interface.

One thing I'd like to see is the possibility to run the coroutines in the main thread, without spawning any new threads in the thread pool. It might seem strange but sometimes you just need to do I/O stuff concurrently in a place where you're not allowed to spawn other threads.

Other than that congrats on the release, I hope you keep working on it!

singledigits 9 months ago

Thank you for your feedback and appreciation.
During development, I initially tried implementing coroutines in a way that executing them without spawning a new thread would be possible. However, it introduced complications, so I eventually scrapped that approach.
Now, with an eye on potential improvements, I can revisit this idea from the perspective of I/O operations.

ldb 9 months ago

A great project.

FYI: I guess there is a minor typo in the README example: the argument of the second call to fib() in the non-coros version of the code should be "n-2" and not "n-1".

singledigits 9 months ago

Thanks for spotting that! I appreciate the feedback and will update the README.

neonsunset 9 months ago

This looks exactly like .NET's task abstraction.

If it works anywhere near as well, I'm definitely giving this a try next time I need to work on a C++ project. Thanks!

pjmlp 9 months ago

As a historical note for those who don't follow C++: C++20 co-routines grew out of the work done on asynchronous programming on WinRT for C# and C++, inspired by Midori and .NET async/await.
Most of the magic methods expected by C++ compilers in awaitable types are also present in the structured typing used by C# for awaitables.
The preview implementations for VC++ and Clang were done by a Microsoft employee, Gor Nishanov; his talks are always quite interesting.
singledigits 9 months ago

Hopefully, you find it useful! If you have any ideas or suggestions for improvement, feel free to open an issue or let me know. Thanks for considering it!

OnlyMortal 9 months ago

Have you had a look at SeaStar and how it works with coroutines?

germandiago 9 months ago

How is this library different from Boost.Cobalt and cppcoro?

singledigits 9 months ago

Thank you for your question.
I've included a link to Lewis Baker's blog (the author of CppCoro) in my repository as an excellent explanation of coroutines. From my understanding, after reviewing his library, it is no longer in active development and hasn’t been updated for a couple of years. CppCoro was an experimental library intended to explore coroutines while they were still an experimental feature. For example, CppCoro uses a custom type for storing values, similar to std::optional from the standard library (if I'm not mistaken).
For my implementation, I've opted to leverage std::expected from C++23 for storing values. I've also implemented monadic-like chaining. CppCoro, however, seems to focus more on asynchronous operations, whereas my library focuses more on task-based parallelism.
I don't have experience with Boost.Cobalt, so I can't provide insights there, but I will definitely look into it now that you've mentioned it.
Hope this helps.
- germandiago 9 months ago
  
  Thanks for the update! Sorry, I haven't been around much lately.
  Boost.Cobalt can be found here: https://www.boost.org/doc/libs/1_85_0/libs/cobalt/doc/html/i...
feverzsj 9 months ago

I think Op's lib is for fork-join style parallel algorithms. It's like TBB but is based on continuation stealing. Boost.Cobalt and cppcoro are general coroutine libs. They are mostly used for async IO programming.