Replies: 35 comments 61 replies
-
aw_spawn_fork contains two things which are non-movable: …

Since the child tasks begin executing immediately and capture a pointer to this control block (this happens during the call to fork()), its location must be pinned before they begin execution. There are several possible workarounds using detached tasks: …

The need for these workarounds is a weakness of the current API that will be rectified in the future by #62, which will enable std::future-like behaviors for tasks. Another option would be #75. As long as you don't use any blocking waits (tmc::post_waitable, std::future, std::mutex), all of the work will be executed eventually, even on a single-core machine.
-
In my benchmarks the performance is similar. One major difference is that … An additional reason to use …
-
Threads don't suspend on locks in TMC. Only tasks. If you have an
-
This is one of the core issues with stackless coroutines: they introduce the "function coloring problem". If you want to call … You could call …
-
Yes... like you said, on some systems …
-
OMG, thanks a lot for such detailed explanations! I'll keep the issue open for a while in case I have more questions, OK?
-
It's not THAT bad in the engine I work on (this piece was not written by me), but it's still a synchronous while-yield-blah waste of resources. I haven't found ANY DirectX API that would allow doing that asynchronously (post a request and get notified when the data is ready). Ideally we could suspend the rendering coroutine here and resume it once notified. What would be the best approach for this? I don't want to do sync-over-async crap. Something like: move such pieces to the main/legacy thread, suspend the rendering coro once it hits the point where we need the result, then wait for this loop in the main thread and post a coro resuming the rendering one?
Wouldn't this generate more code / yield worse optimization than a regular sync call to an inline which returns 2? (I use Clang 21 with …)

I suspect it would be more efficient to use the second approach?
-
If the API requires you to poll periodically, you will need to use a timer. Some coroutine libraries offer async timer facilities, but TMC does not offer them directly - rather, you can use them via the Asio integration in tmc-asio. There are a couple of examples of using the Asio timer facilities here: https://github.com/tzcnt/tmc-examples/blob/main/examples/asio/timer_mem_bench.cpp

However, it's worth noting that there is no truly async timer. Under the hood, a thread is blocking on the OS timer syscall, and then posting the results back to the executor queue when ready. Your approach of doing this manually using the main thread is equivalent. I like the SwitchToThread() call - this may actually help performance in some cases, as it's more lightweight than actually blocking on the timer.
-
Compilers are supposed to be able to inline coroutines, but often fail to do so at this time. The Clang 20 attributes (#61) are supposed to help with this. This particular item is near the top of my priority list... so it will be coming soon(tm). Notably, you don't need to use inline:

tmc::task<int> fn() { co_return 2; }

tmc::task<void> fn2()
{
...
int x = co_await fn();
...
}
-
I'm not sure about this. It depends on the implementation of …
-
In LLVM libc++, which I use, …

I'm a Linux dev, too, but this game is Windows-exclusive unfortunately :D (and porting its renderer from DX 11.2 to Vulkan would be hell for me)
-
The advantage of the synchronous unlock is if the unlocking task is latency-sensitive and you don't want it to suspend, or if you want to make use of the RAII lock scope object.
-
I'm going to convert this to a discussion; feel free to continue with any further questions there.
-
Not really related to TMC, rather to TMC-ASIO/ASIO... As for timers, you/we do: … What is the best approach to run arbitrary functions on ASIO and then get the TMC completion token? The ASIO documentation is very cryptic and there are no good tutorials TBH.

I'm confused tho that …

Why I need this (I think I mentioned it somewhere already): the whole game engine is planned to be run as coroutines on the … So, my idea is: when I need to execute something outside of coroutines, I delegate it to ASIO and let it resume the calling coroutine upon completion. Of course, I could just create a standalone jthread and avoid using ASIO completely. But this would imply implementing an awaitable similar to …

Offtopic P.S.: Great news regarding atomics on Windows, at least when using LLVM's libc++: in LLVM 22, they finally implemented … This API is not tied to scheduler slices, unlike …
-
Usecase: in many places in the game engine, you need to update object state only once a frame, but then this state can be read concurrently from N threads when rendering. An async version of … IOW (pseudo-code): …

The tricky part is that … I see that currently most of such places are done like …, but with this approach each thread tries to take a lock even for a few nanosecs, which means redundant contention/suspending/resuming. If we know in advance when to run an update and we're sure we'll fire it before any reader, then I think I can just use a manual-reset event, but it's not a common case.

UPD: Maybe something like …
-
Not sure where to comment so I'll leave it here. According to this file: https://github.com/llvm/llvm-project/blob/main/libcxx/src/atomic.cpp every OS has its own restrictions on the size of an atomic object to implement an efficient wait/notify API. Linux requires the object to be 4 bytes, Apple allows waiting on 4 and 8 bytes, while FreeBSD and Windows only implement efficient waiting on 8-byte objects (on x64). I'm mentioning this as IIRC you use the atomic wait/notify API in TMC. It's probably not a good idea to introduce platform/OS-specific macros/branches in the code, but unfortunately it seems like there's no one-size-fits-all.
-
Even before I started reworking the engine to TMC, I heard that it's not safe to use capturing lambda coroutines.

auto second = co_await tmc::fork_clang(
[this, &SecondThreadTasksEndTime] [[nodiscard]] -> tmc::task<void> {
// ...
SecondThreadTasksEndTime = std::chrono::high_resolution_clock::now() - SecondThreadTasksStartTime;
}(),
tmc::current_executor(), xr::tmc_priority_any); // offtop: tmc_priority_any is shared between P-Cores and E-Cores
// some code that continues in parallel, including several co_awaits
co_await std::move(second);

This expectedly crashes when trying to write to … On the other hand, sometimes we can't do something without capturing. Let's say in https://github.com/tzcnt/tmc-examples/blob/main/examples/alignment.cpp#L67 you use captures (by reference) and then call … Maybe you could give a quick list of TMC "spawners" and say which ones can use lambda captures and which shouldn't? I have an assumption that if a coroutine is immediately awaited, like in the example above with …
The second statement is because, when you're creating a window, Windows records the thread ID which created it and then doesn't allow you to do anything with the window from any other thread.

My current TMC settings:

auto topo = tmc::topology::query();
auto& cpu = tmc::cpu_executor();
if (topo.is_hybrid())
{
tmc::topology::topology_filter p_cores, e_cores;
p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);
// xr::tmc_priority_high{0};
// xr::tmc_priority_any{1};
// xr::tmc_priority_low{2};
cpu.add_partition(p_cores, xr::tmc_priority_high, xr::tmc_priority_any + 1).add_partition(e_cores, xr::tmc_priority_any, xr::tmc_priority_low + 1);
}
cpu.fill_thread_occupancy().init();

I think it's the same as in your example: prio 0 is exclusive to P-Cores, prio 1 is shared, prio 2 is exclusive to E-Cores. (It's interesting tho why 8 E-Cores have 2 cache groups on my Alder Lake.)

The rework is far from done, but is ongoing. The engine initialization is done (inc. concurrent parts), and the main menu works.
-
Any good solution to preserve …? Imagine this:

[[nodiscard]] bool key_press(gsl::index key);

The compiler will warn if …

[[nodiscard]] tmc::task<bool> key_press(gsl::index key);

The compiler still warns and that's good: a good protection and a reminder that it's a coroutine, so it should either be …
-
Yet another question... I already refactored the main rendering function to be able to use …

Pseudo-code which could help understand this:

tmc::task<void> CRender::render()
{
auto main = tmc::fork_clang(run_main());
const bool sun = ...; // here we check if we need to render the sun
if (sun)
// here we need to fork sun_run();
... // render() continues here, draws the main scene etc.
if (sun)
// here we need to co_await sun_run()'s fork;
...
}
-
Might be helpful: … Too bad it didn't make it into LLVM 22.
-
Breh, I completely forgot about the utility awaitables and was wrapping oneliners in lambdas... When I need to execute something on the standalone …

co_await tmc::spawn([](auto& last) -> tmc::task<void> {
PIX_EVENT(DEFER_FLUSH_OCCLUSION);
for (auto light : last)
{
if (light == nullptr)
continue;
for (auto& svis : light->svis)
svis.flushoccq();
}
last.clear();
co_return;
}(Lights_LastFrame)).run_on(xr::tmc_cpu_st_executor());

Now I'm thinking of just …

{
auto scope = co_await tmc::enter(xr::tmc_cpu_st_executor());
PIX_EVENT(DEFER_FLUSH_OCCLUSION);
for (auto light : Lights_LastFrame)
{
if (light == nullptr)
continue;
for (auto& svis : light->svis)
svis.flushoccq();
}
Lights_LastFrame.clear();
co_await scope.exit();
}

The second should be more optimal I guess? IIRC …

P.S. In case you haven't seen my comment in #175: …

:3 It's funny nonetheless that TMC is 3x faster than TF even without Chase-Lev queues :D
-
I've found at least one such pattern in the code:

m_playing_sounds.erase(std::remove_if(m_playing_sounds.begin(), m_playing_sounds.end(), CInappropriateSoundPredicate(sound_mask)), m_playing_sounds.end());

where I need to change the … For now it looks like I need to convert this to a manual loop in order to do that; there is no other way?
-
Re "why I convert so many functions to coroutines" (note to myself, mostly). For sure, I could just leave everything as it is, just use …

Lots of folks kept telling me that it's impossible to replace Luabind with Sol in the engine. It took me half a year and several thousand LOCs, but I made it. So I'm pretty sure this challenge is doable as well, especially given that I receive such huge and helpful support from you. (The same folks told me it's not possible to switch the engine from MSVC to Clang/clang-cl due to the legacy code being too broken to be fixed -- lol, that was EZ honestly, even though I wasn't good at C++ back then.)

BTW I'm curious how Tracy will work after the engine is coroutine-based -- I haven't read its code deeply, but my impression was that Tracy expects every function to start and end on the same thread and doesn't expect that a function can suspend.
-
Just curious:

tmc::post(
tmc::cpu_executor(),
tmc::detail::client_main_awaiter(
static_cast<tmc::task<int>&&>(ClientMainTask), &exitCode
),
0, 0
);

Why is the root coroutine run with thread hint == 0 in …?
-
Since it looks like you've finished the initial pass of the migration, I'd love to try playing the game. I was able to get your branch to build in Release, but no luck in Debug. However, I'm not able to run it - the splash screen only pops up for a second and then disappears. When running in the debugger I see it's unable to find fsgame.ltx. However, it doesn't emit any logs, so I'm not sure what the next step is. I tried moving the files around and passing the -fsgame parameter, but no luck. I'm using the Steam 1.006 (?) version and was able to run the upstream OGSR engine just fine using the -steam parameter. Would you be willing to take some time to help me debug this? I joined the OGSR Engine and Open XRay discords, so you can contact me there. If not, I would appreciate some tips on getting it to find the fsgame / generate meaningful logs. Also, is there a specific mod / graphical overhaul that I should be using?
-
I've seen you're planning to introduce an option to make TMC fully header-only. Could you maybe make it a tri-state, where the third option would make it header-only but leave the hwloc-related functionality in an .ipp which I'd need to define in a .cpp file? I probably wouldn't be so critical of it if the hwloc headers didn't include this horrendous windows.h. I'm planning to get rid of including it project-wide in the future (vanilla engine code legacy), but give header-only TMC a chance, as project-wide windows.h is a very bad idea.
-
I know it's not your problem and I (and other users) should be more careful, so I'm not asking for any changes, just curious... Is …? When I refactor something like

void process_events()
{
// ...
}
// to
tmc::task<void> process_events()
{
// ...
}

then the compiler will notify me if I missed a … But when changing

bool net_spawn()
{
// ...
}
// to
tmc::task<bool> net_spawn()
{
// ...
}
// later
if (!net_spawn())
//

then unfortunately the compiler is not able to catch a missing … I shot myself in the foot a couple of times already; fortunately it was easy to find and fix.
-
Hey, I've noticed the development of TMC has slowed down a bunch lately. Just wanted to make sure you are fine, no burnouts, no motivation loss, no problems in real life etc. Take care! (also seems like libfork has finally woken up, I'm curious what the new version will offer, although I'll stay with TMC with no doubts)
-
Can I ask a few silly questions since I'm new to coroutines, but eager to switch my oneTBB-based engine to TMC?
Can I put aw_spawn_fork in a global struct and co_await it later from some other coroutine, not the one that spawned it?

(My assumption comes from that aw_spawn_fork doesn't have a default constructor, so putting it in a struct is tricky. Moreover, you mentioned in the docs that they contain pointers to this, which I read as "you must always co_await the results of fork() within the coroutine that spawned them".)

What would be the best way for the following scenario: …
Currently, I run two task_groups from the former and then wait for them from the latter. The task_groups lay in a shared struct; I don't pass pointers to them around the code. I'd like to not use post_awaitable() and then block on the futures later, since that's not what coroutines are about.

I thought of something like: detach() them immediately, then co_await the condvar/barrier to suspend instead of blocking in case they are not ready yet. But will this guarantee that the coroutines which I'm waiting for will execute for sure, even if someone has 1 core and everything executes serially there? And this doesn't look like an intended/obvious way...
Since aw_mutex and ex_braid do nearly the same stuff, which one is faster from your code's PoV? The mutex seems to be a bit heavier, since it builds an awaiters list and needs to repost every awaiter on each unlock? But if the contention is really narrow and it's unlikely for this mutex to be blocked, ex_braid can incur more overhead?

aw_mutex has unlock() and co_unlock(). The latter can do a sync transfer. Can sync transfer lead to a situation where, say, 8 threads are suspended on the same lock, but, when using co_unlock(), these 8 coroutines will continue execution on only one thread serially even after this mutex's scope?
Does it mean that if I want to run a coroutine from
f3(), I need to callpost*()from it, even though it's still run onex_cpualready, but doesn't have a suspension point? Or, the preferred alternative would be to convert each of those functions to a coroutine?My main idea is to not block any of the coroutines I want to introduce with serial stuff like generic mutexes/futures/etc, so that only one thread (which runs main synchronous code) could be blocked at a time.
You could use std::hardware_destructive_interference_size instead of hardcoding to 64; it's constexpr IIRC. But I've also seen that some developers started multiplying it by two, since modern CPUs (at least x86_64) often tend to fetch 2 cache lines at a time instead of one, which could still provoke false sharing. Anyway, only benchmarking could give a reliable answer here.