Replies: 35 comments 61 replies
-
aw_spawn_fork contains two things which are non-movable: …

Since the child tasks begin executing immediately and capture a pointer to this control block (this happens during the call to fork()), its location must be pinned before they begin execution. There are several possible workarounds using detached tasks: …

The need for these workarounds is a weakness of the current API that will be rectified in the future by #62, which will enable std::future-like behaviors for tasks. Another option would be #75. As long as you don't use any blocking waits (tmc::post_waitable, std::future, std::mutex), all of the work will be executed eventually, even on a single-core machine.
-
In my benchmarks the performance is similar. One major difference is that … An additional reason to use …
-
Threads don't suspend on locks in TMC. Only tasks. If you have an
-
This is one of the core issues with stackless coroutines: they introduce the "function coloring problem". If you want to call … You could call …
-
Yes... like you said, on some systems …
-
OMG, thanks a lot for such detailed explanations! I'll keep the issue open for a while in case I have more questions, OK?
-
It's not THAT bad in the engine I work on (this piece was not written by me), but it's still a synchronous while-yield-blah waste of resources. I haven't found ANY DirectX API that would allow doing that asynchronously (post a request and get notified when the data is ready). Ideally we could suspend the rendering coroutine here and resume it once notified. What would be the best approach for this? I don't want to do sync-over-async crap. Something like: move such pieces to the main/legacy thread, suspend the rendering coro once it hits the point where we need the result, then wait for this loop in the main thread and post a coro resuming the rendering one?
Wouldn't this generate more code / yield worse optimization than a regular sync call to an inline which returns 2? (I use Clang 21 with …)

I suspect it would be more efficient to use the second approach?
-
If the API requires you to poll periodically, you will need to use a timer. Some coroutine libraries offer async timer facilities, but TMC does not offer them directly - rather, you can use them via the Asio integration in tmc-asio. There are a couple of examples of using the Asio timer facilities here: https://github.com/tzcnt/tmc-examples/blob/main/examples/asio/timer_mem_bench.cpp

However, it's worth noting that there is no truly async timer. Under the hood, a thread is blocking on the OS timer syscall, and then posting the results back to the executor queue when ready. Your approach of doing this manually using the main thread is equivalent. I like the SwitchToThread() call - this may actually help performance in some cases, as it's more lightweight than actually blocking on the timer.
-
Compilers are supposed to be able to inline coroutines, but often fail to do so at this time. The Clang 20 attributes (#61) are supposed to help with this. This particular item is near the top of my priority list... so it will be coming soon(tm). Notably, you don't need to use inline:

tmc::task<int> fn() { co_return 2; }

tmc::task<void> fn2()
{
...
int x = co_await fn();
...
}
-
I'm not sure about this. It depends on the implementation of …
-
In LLVM libc++, which I use, …

I'm a Linux dev, too, but this game is Windows-exclusive unfortunately :D (and porting its renderer from DX 11.2 to Vulkan would be hell for me)
-
The advantage of the synchronous unlock is if the unlocking task is latency-sensitive and you don't want it to suspend, or if you want to make use of the RAII lock scope object.
-
I'm going to convert this to a discussion; feel free to continue with any further questions there.
-
Not really related to TMC, rather to TMC-ASIO/ASIO... As for timers, you/we do: … What is the best approach to run arbitrary functions on ASIO and then get the TMC completion token? The ASIO documentation is very cryptic and there are no good tutorials TBH.

I'm confused tho that …

Why I need this (I think I mentioned it somewhere already): the whole game engine is planned to be run as coroutines on the … So, my idea is: when I need to execute something outside of coroutines, I delegate it to ASIO and let it resume the calling coroutine upon completion. Of course, I could just create a standalone jthread and avoid using ASIO completely. But this would imply implementing an awaitable similar to …

Offtopic P.S.: Great news regarding atomics on Windows, at least when using LLVM's libc++: in LLVM 22, they finally implemented … This API is not tied to scheduler slices, unlike …
-
Usecase: in many places in the game engine, you need to update object state only once a frame, but then this state can be read concurrently from N threads when rendering. An async version of … IOW (pseudo-code): …

The tricky part is that … I see that currently most of such places are done like …, but with this approach each thread tries to take a lock even for a few nanosecs, which means redundant contention/suspending/resuming. If we know in advance when to run an update and we're sure we'll fire it before any reader, then I think I can just use a manual-reset event, but it's not a common case.

UPD: Maybe something like …
-
Not sure where to comment so I'll leave it here. According to this file: https://github.com/llvm/llvm-project/blob/main/libcxx/src/atomic.cpp every OS has its own restrictions on the size of an atomic object to implement an efficient wait/notify API. Linux requires the object to be 4 bytes, Apple allows waiting on 4 and 8 bytes, while FreeBSD and Windows only implement efficient waiting on 8-byte objects (on x64). I'm mentioning this as IIRC you use the atomic wait/notify API in TMC. It's probably not a good idea to introduce platform/OS-specific macros/branches in the code, but unfortunately it seems like there's no one-size-fits-all.
-
Even before I started reworking the engine to TMC, I heard that it's not safe to use capturing lambda coroutines.

auto second = co_await tmc::fork_clang(
[this, &SecondThreadTasksEndTime] [[nodiscard]] -> tmc::task<void> {
// ...
SecondThreadTasksEndTime = std::chrono::high_resolution_clock::now() - SecondThreadTasksStartTime;
}(),
tmc::current_executor(), xr::tmc_priority_any); // offtop: tmc_priority_any is shared between P-Cores and E-Cores
// some code that continues in parallel, including several co_awaits
co_await std::move(second);

This expectedly crashes when trying to write to … On the other hand, sometimes we can't do something without capturing. Let's say in https://github.com/tzcnt/tmc-examples/blob/main/examples/alignment.cpp#L67 you use captures (by reference) and then call … Maybe you could give a quick list of TMC "spawners" and say which ones can use lambda captures and which shouldn't? I have an assumption that if a coroutine is immediately awaited, like in the example above with …
The second statement is because, when you're creating a window, Windows records the thread ID which created it and then doesn't allow you to do anything with the window from any other thread.

My current TMC settings:

auto topo = tmc::topology::query();
auto& cpu = tmc::cpu_executor();
if (topo.is_hybrid())
{
tmc::topology::topology_filter p_cores, e_cores;
p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);
// xr::tmc_priority_high{0};
// xr::tmc_priority_any{1};
// xr::tmc_priority_low{2};
cpu.add_partition(p_cores, xr::tmc_priority_high, xr::tmc_priority_any + 1).add_partition(e_cores, xr::tmc_priority_any, xr::tmc_priority_low + 1);
}
cpu.fill_thread_occupancy().init();

I think it's the same as in your example: prio 0 is exclusive to P-Cores, prio 1 is shared, prio 2 is exclusive to E-Cores. (It's interesting tho why 8 E-Cores have 2 cache groups on my Alder Lake.)

The rework is far from done, but is ongoing. The engine initialization is done (inc. concurrent parts), and the main menu works.
-
Any good solution to preserve …? Imagine this:

[[nodiscard]] bool key_press(gsl::index key);

The compiler will warn if …

[[nodiscard]] tmc::task<bool> key_press(gsl::index key);

The compiler still warns and that's good: a good protection and a reminder that it's a coroutine, so it should either be …
-
Yet another question... I already refactored the main rendering function to be able to use …

Pseudo-code which could help understand this:

tmc::task<void> CRender::render()
{
auto main = tmc::fork_clang(run_main());
const bool sun = ...; // here we check if we need to render the sun
if (sun)
// here we need to fork sun_run();
... // render() continues here, draws the main scene etc.
if (sun)
// here we need to co_await sun_run()'s fork;
...
}
-
Might be helpful: … Too bad it didn't make it into LLVM 22.
-
Breh, I completely forgot about the utility awaitables and was wrapping oneliners in lambdas... When I need to execute something on the standalone …

co_await tmc::spawn([](auto& last) -> tmc::task<void> {
PIX_EVENT(DEFER_FLUSH_OCCLUSION);
for (auto light : last)
{
if (light == nullptr)
continue;
for (auto& svis : light->svis)
svis.flushoccq();
}
last.clear();
co_return;
}(Lights_LastFrame)).run_on(xr::tmc_cpu_st_executor());

Now I'm thinking of just …

{
auto scope = co_await tmc::enter(xr::tmc_cpu_st_executor());
PIX_EVENT(DEFER_FLUSH_OCCLUSION);
for (auto light : Lights_LastFrame)
{
if (light == nullptr)
continue;
for (auto& svis : light->svis)
svis.flushoccq();
}
Lights_LastFrame.clear();
co_await scope.exit();
}

The second should be more optimal I guess? IIRC …

P.S. In case you haven't seen my comment in #175: …

:3 It's funny nonetheless that TMC is 3x faster than TF even without Chase-Lev queues :D
-
I've found at least one such pattern in the code:

m_playing_sounds.erase(std::remove_if(m_playing_sounds.begin(), m_playing_sounds.end(), CInappropriateSoundPredicate(sound_mask)), m_playing_sounds.end());

where I need to change the … For now it looks like I need to convert this to a manual loop in order to do that; there is no other way?
-
Re "why I convert so many functions to coroutines" (note to myself, mostly). For sure, I could just leave everything as it is, just use …

Lots of folks kept telling me that it's impossible to replace Luabind with Sol in the engine. It took me half a year and several thousand LOCs, but I made it. So I'm pretty sure this challenge is doable as well, especially given that I receive such huge and helpful support from you. (The same folks told me it's not possible to switch the engine from MSVC to Clang/clang-cl due to the legacy code being too broken to be fixed -- lol, that was EZ honestly, even though I wasn't good at C++ back then.)

BTW I'm curious how Tracy will work after the engine is coroutine-based -- I haven't read its code deeply, but my impression was that Tracy expects every function to start and end on the same thread and doesn't expect that a function can suspend.
-
Just curious:

tmc::post(
tmc::cpu_executor(),
tmc::detail::client_main_awaiter(
static_cast<tmc::task<int>&&>(ClientMainTask), &exitCode
),
0, 0
);

Why is the root coroutine run with thread hint == 0 in …?
-
Since it looks like you've finished the initial pass of the migration, I'd love to try playing the game. I was able to get your branch to build in Release, but no luck in Debug. However, I'm not able to run it - the splash screen only pops up for a second and then disappears. When running in the debugger I see it's unable to find fsgame.ltx. However, it doesn't emit any logs, so I'm not sure what the next step is. I tried moving the files around and passing the -fsgame parameter, but no luck. I'm using the Steam 1.006 (?) version and was able to run the upstream OGSR engine just fine using the -steam parameter. Would you be willing to take some time to help me debug this? I joined the OGSR Engine and Open XRay discords, so you can contact me there. If not, I would appreciate some tips on getting it to find the fsgame / generate meaningful logs. Also, is there a specific mod / graphical overhaul that I should be using?
-
I've seen you're planning to introduce an option to make TMC fully header-only. Could you maybe make it a tri-state, where the third option would make it header-only but leave the hwloc-related functionality in an .ipp which I'd need to define in a .cpp file? I probably wouldn't be so critical of it if the hwloc headers didn't include this horrendous windows.h. I'm planning to get rid of including it project-wide in the future (vanilla engine code legacy), but give header-only TMC a chance, as project-wide windows.h is a very bad idea.
-
I know it's not your problem and I (and other users) should be more careful, so I'm not asking for any changes, just curious... Is …? When I refactor something like

void process_events()
{
// ...
}
// to
tmc::task<void> process_events()
{
// ...
}

then the compiler will notify me if I missed a … But when changing

bool net_spawn()
{
// ...
}
// to
tmc::task<bool> net_spawn()
{
// ...
}
// later
if (!net_spawn())
//

then unfortunately the compiler is not able to catch a missing … I shot myself in the foot a couple of times already; fortunately it was easy to find and fix.
-
Hey, I've noticed the development of TMC has slowed down a bunch lately. Just wanted to make sure you are fine, no burnouts, no motivation loss, no problems in real life etc. Take care! (also seems like libfork has finally woken up, I'm curious what the new version will offer, although I'll stay with TMC with no doubts)
-
Can I ask a few silly questions since I'm new to coroutines, but eager to switch my oneTBB-based engine to TMC?
Can I put aw_spawn_fork in a global struct and co_await it later from some other coroutine, not the one that spawned it?

(My assumption comes from that aw_spawn_fork doesn't have a default constructor, so putting it in a struct is tricky. Moreover, you mentioned in the docs that they contain pointers to this, which I read as "you must always co_await the results of fork() within the coroutine that spawned them".)

What would be the best way for the following scenario: …
Currently, I run two task_groups from the former and then wait for them from the latter. The task_groups lay in a shared struct; I don't pass pointers to them around the code. I'd like to not use post_awaitable() and then block on the futures later, since that's not what coroutines are about.

I thought of something like: detach() them immediately, then co_await the condvar/barrier to suspend instead of blocking in case they are not ready yet. But will this guarantee that the coroutines which I'm waiting for will execute for sure, even if someone has 1 core and everything executes serially there? And this doesn't look like an intended/obvious way...
Since aw_mutex and ex_braid do nearly the same stuff, which one is faster from your code's PoV? The mutex seems to be a bit heavier, since it builds an awaiters list and needs to repost every awaiter on each unlock? But if the contention is really narrow and it's unlikely for this mutex to be blocked, ex_braid can incur more overhead?

aw_mutex has unlock() and co_unlock(). The latter can do a sync transfer. Can sync transfer lead to a situation where, say, 8 threads are suspended on the same lock, but, when using co_unlock(), these 8 coroutines will continue execution on only one thread serially even after this mutex's scope?
Does it mean that if I want to run a coroutine from
f3(), I need to callpost*()from it, even though it's still run onex_cpualready, but doesn't have a suspension point? Or, the preferred alternative would be to convert each of those functions to a coroutine?My main idea is to not block any of the coroutines I want to introduce with serial stuff like generic mutexes/futures/etc, so that only one thread (which runs main synchronous code) could be blocked at a time.
You could use std::hardware_destructive_interference_size instead of hardcoding to 64; it's constexpr IIRC. But I've also seen that some developers started multiplying it by two, since modern CPUs (at least x86_64) often tend to fetch 2 cache lines at a time instead of one, which could still provoke false sharing. Anyway, only benchmarking could give a reliable answer here.