Conversation

@ariawisp ariawisp commented Sep 11, 2025

This PR introduces precise, non‑blocking completion fences across CubeCL runtimes, with small client conveniences that make async integration (e.g., Kotlin coroutines via UniFFI) natural and ergonomic—without disrupting existing backends or APIs.

Why

  • Let users wait precisely for submitted work completion without blocking ingestion or busy‑waiting.
  • Provide a portable primitive that maps to native events on CUDA/HIP and queue completion on WGPU.
  • Unlock clean suspend‑friendly bindings in downstream hosts with no extra glue.

Behavior

  • New ComputeServer::work_done() -> DynFut<()> (default: sync()), returning a future that resolves when all work submitted up to the call completes.
  • ComputeChannel::work_done()
    • MPSC: non‑blocking path; server loop stays responsive, wait happens off‑thread.
    • Mutex/Cell: forwards to server implementation.
  • ComputeClient helpers (usage sketch after this list)
    • fence() -> DynFut<()> — awaitable completion.
    • fence_handle() -> ClientFence — optional blocking handle with wait().
    • execute_async(..) -> DynFut<()> and write_async(..) -> DynFut<Result<(), IoError>> — convenience wrappers that submit and then await completion.
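
A minimal usage sketch of these helpers as proposed here. `Server` and `Channel` are placeholders for a concrete runtime pairing; only the fence APIs described in the list above are exercised.

// Sketch only: `Server` and `Channel` stand in for a concrete runtime's types.
async fn wait_for_submitted_work(client: &ComputeClient<Server, Channel>) {
    // Resolves once everything submitted before this call has completed,
    // without blocking the calling thread.
    client.fence().await;
}

fn wait_for_submitted_work_blocking(client: &ComputeClient<Server, Channel>) {
    // For non-async callers: take a handle and block until completion.
    let fence = client.fence_handle();
    fence.wait();
}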

Backends

  • WGPU: replaces the old readback fallback with Queue::on_submitted_work_done (wgpu#6395, landed in wgpu 26). We call flush() first so the fence observes all previously submitted work, then wrap the callback into a Future; the result is fully non‑blocking and precise (see the sketch after this list).
  • CUDA: records a CUevent on the stream; work_done() returns a future that waits for the event. Adds Drop on fence to avoid event leaks on cancellation.
  • HIP: same pattern via hipEvent/hipStreamWaitEvent; also adds Drop for safety.
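
The WGPU wrapping mentioned above has roughly the shape below. This is a sketch, not the PR's exact code: it assumes a oneshot channel from the futures-channel crate is available and that any pending CubeCL commands have already been flushed to the queue.

use futures_channel::oneshot;

// Returns a future that resolves once all work already submitted to `queue` completes.
fn submitted_work_done(queue: &wgpu::Queue) -> impl core::future::Future<Output = ()> {
    let (sender, receiver) = oneshot::channel::<()>();
    // wgpu invokes this callback once previously submitted work has finished on the GPU.
    queue.on_submitted_work_done(move || {
        let _ = sender.send(());
    });
    async move {
        // A cancellation error would only occur if the callback is never delivered.
        let _ = receiver.await;
    }
}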

Compatibility & scope

  • Backwards‑compatible: default work_done() defers to sync(). No behavior changes required in existing backends to adopt this API.

Validation

  • Adds fence_completes_submitted_work on the dummy backend to verify ordering and completion semantics.
  • The MPSC path keeps ingestion non‑blocking; long waits do not stall the server loop.
  • Runs clean under cargo xtask validate (audit, fmt, clippy, build, doc‑tests).

Notes

  • The WGPU TODO referencing queue.on_submitted_work_done being unavailable is now obsolete; we use the native callback on wgpu 26 as intended.
  • CUDA/HIP fence futures are safe to drop without leaking events.
  • This enables clean, suspend‑friendly host bindings (e.g., UniFFI‑based Kotlin coroutines) without additional runtime glue.

PR has been validated with Burn — no compilation or test errors.

@nathanielsimard nathanielsimard (Member) left a comment

work_done and sync are essentially the same. We could update the sync implementations in both cuda and hip to the new way, using events, but I would remove work_done altogether.

Comment on lines +66 to +73
impl Drop for Fence {
    fn drop(&mut self) {
        if !self.event.is_null() {
            unsafe {
                // Destroy the CUDA event; errors are ignored since Drop cannot propagate them.
                let _ = cudarc::driver::result::event::destroy(self.event);
                self.event = core::ptr::null_mut();
            }
        }
    }
}
@nathanielsimard nathanielsimard (Member) left a comment

The fence is destroyed right after we wait on it. The methods wait_async and wait_sync take ownership, so you can't wait more than once on an event. I don't mind removing the destroy call in both functions, but we should also change the method signatures to take &self instead, so that we can wait multiple times on the same event.
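
A possible shape for that change, sketched against the Drop impl shown above. The cudarc path used for synchronization here is an assumption (the crate may expose event synchronization differently); the point is only the &self signature plus leaving cleanup to Drop.

impl Fence {
    // Takes &self so the same recorded event can be waited on any number of times;
    // destruction stays in Drop, so dropping an unawaited fence does not leak the event.
    pub fn wait_sync(&self) {
        unsafe {
            // Assumed cudarc wrapper around cuEventSynchronize; the exact path may differ.
            let _ = cudarc::driver::result::event::synchronize(self.event);
        }
    }
}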

- Add ComputeServer::work_done() (default to sync)
- Channels: non-blocking WorkDone (MPSC); forwarders on Mutex/Cell
- WGPU: use Queue::on_submitted_work_done for precise, non-blocking completion
- CUDA/HIP: native event fences; add Drop to avoid event leaks on cancellation
- ComputeClient: fence(), fence_handle() (blocking wait), execute_async(), write_async()
- Test: fence_completes_submitted_work (dummy backend)

Single commit; vendor-only (no softmax or unrelated changes)
Follow-up commits (titles truncated):
… Drop. Remove work_done API and route client helpers through sync(); keep WGPU sync via on_submitted_work_done
…hods take &self and support multi-wait semantics
@ariawisp ariawisp (Author) commented

Per feedback, I simplified the API and adjusted fence semantics:

  • Removed work_done from ComputeServer/ComputeChannel (and Message::WorkDone). Client helpers now await sync():

    • fence() → await channel.sync()
    • execute_async(..) → submit, then await sync()
    • write_async(..) → write, then await sync()
  • CUDA/HIP Fence supports multi‑wait:

    • wait_sync/wait and wait_async take &self; cleanup moved to Drop (idempotent).
    • Updated call sites; no leaks on cancellation.
  • WGPU sync() uses Queue::on_submitted_work_done (wgpu 26) after flush().

  • Added compile‑time tests (CUDA/HIP) asserting fence waits take &self (multi‑wait) without requiring a device; a sketch follows this list.

  • Workspace build and clippy are green.
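
The compile-time assertion mentioned in the list above can be as small as coercing the method to a function pointer; a sketch, assuming wait_sync(&self) returns () and the Fence type is in scope:

#[test]
fn fence_wait_takes_shared_reference() {
    // This coercion only compiles while wait_sync takes `&self`,
    // so the check needs no device and runs as an ordinary unit test.
    fn assert_shared_ref(_f: fn(&Fence)) {}
    assert_shared_ref(Fence::wait_sync);
}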

@ArthurBrussee ArthurBrussee (Contributor) commented

Is it really needed to add execute_async and write_async? In the common case you shouldn't execute one kernel and then immediately wait. People can call sync() themselves once they've submitted all the work.

write_async doesn't make sense to me either; why would you have to wait for the write to complete? To be clear, any work queued up after the write() will observe the write to have happened, so there's no need to explicitly synchronize the GPU each time.
