Conversation

@ariawisp ariawisp commented Sep 11, 2025

This PR introduces precise, non‑blocking completion fences across CubeCL runtimes, with small client conveniences that make async integration (e.g., Kotlin coroutines via UniFFI) natural and ergonomic—without disrupting existing backends or APIs.

Why

  • Let users wait precisely for submitted work completion without blocking ingestion or busy‑waiting.
  • Provide a portable primitive that maps to native events on CUDA/HIP and queue completion on WGPU.
  • Unlock clean suspend‑friendly bindings in downstream hosts with no extra glue.

Behavior

  • New ComputeServer::work_done() -> DynFut<()> (default: sync()), returning a future that resolves when all work submitted up to the call completes.
  • ComputeChannel::work_done()
    • MPSC: non‑blocking path; server loop stays responsive, wait happens off‑thread.
    • Mutex/Cell: forwards to server implementation.
  • ComputeClient helpers (usage sketch after this list)
    • fence() -> DynFut<()> — awaitable completion.
    • fence_handle() -> ClientFence — optional blocking handle with wait().
    • execute_async(..) -> DynFut<()> and write_async(..) -> DynFut<Result<(), IoError>> — convenience wrappers that submit and then await completion.
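
A minimal usage sketch of these helpers as proposed here. `Server` and `Channel` are placeholders for a concrete runtime pairing; only the fence APIs described in the list above are exercised.

// Sketch only: `Server` and `Channel` stand in for a concrete runtime's types.
async fn wait_for_submitted_work(client: &ComputeClient<Server, Channel>) {
    // Resolves once everything submitted before this call has completed,
    // without blocking the calling thread.
    client.fence().await;
}

fn wait_for_submitted_work_blocking(client: &ComputeClient<Server, Channel>) {
    // For non-async callers: take a handle and block until completion.
    let fence = client.fence_handle();
    fence.wait();
}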

Backends

  • WGPU: replaces the old readback fallback with Queue::on_submitted_work_done (wgpu#6395, landed in wgpu 26). We call flush() first so the fence observes all previously submitted work, then wrap the callback into a Future; the result is fully non‑blocking and precise (see the sketch after this list).
  • CUDA: records a CUevent on the stream; work_done() returns a future that waits for the event. Adds Drop on fence to avoid event leaks on cancellation.
  • HIP: same pattern via hipEvent/hipStreamWaitEvent; also adds Drop for safety.
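
The WGPU wrapping mentioned above has roughly the shape below. This is a sketch, not the PR's exact code: it assumes a oneshot channel from the futures-channel crate is available and that any pending CubeCL commands have already been flushed to the queue.

use futures_channel::oneshot;

// Returns a future that resolves once all work already submitted to `queue` completes.
fn submitted_work_done(queue: &wgpu::Queue) -> impl core::future::Future<Output = ()> {
    let (sender, receiver) = oneshot::channel::<()>();
    // wgpu invokes this callback once previously submitted work has finished on the GPU.
    queue.on_submitted_work_done(move || {
        let _ = sender.send(());
    });
    async move {
        // A cancellation error would only occur if the callback is never delivered.
        let _ = receiver.await;
    }
}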

Compatibility & scope

  • Backwards‑compatible: default work_done() defers to sync(). No behavior changes required in existing backends to adopt this API.

Validation

  • Adds fence_completes_submitted_work on the dummy backend to verify ordering and completion semantics.
  • The MPSC path keeps ingestion non‑blocking; long waits do not stall the server loop.
  • Runs clean under cargo xtask validate (audit, fmt, clippy, build, doc‑tests).

Notes

  • The WGPU TODO referencing queue.on_submitted_work_done being unavailable is now obsolete; we use the native callback on wgpu 26 as intended.
  • CUDA/HIP fence futures are safe to drop without leaking events.
  • This enables clean, suspend‑friendly host bindings (e.g., UniFFI‑based Kotlin coroutines) without additional runtime glue.

PR has been validated with Burn — no compilation or test errors.

@nathanielsimard nathanielsimard (Member) left a comment

work_done and sync are essentially the same. We could update the sync implementations in both cuda and hip to the new way, using events, but I would remove work_done altogether.

Comment on lines +66 to +73
impl Drop for Fence {
    fn drop(&mut self) {
        if !self.event.is_null() {
            unsafe {
                // Destroy the CUDA event; errors are ignored since Drop cannot propagate them.
                let _ = cudarc::driver::result::event::destroy(self.event);
                self.event = core::ptr::null_mut();
            }
        }
    }
}
@nathanielsimard nathanielsimard (Member) left a comment

The fence is destroyed right after we wait on it. The methods wait_async and wait_sync take ownership, so you can't wait more than once on an event. I don't mind removing the destroy call in both functions, but we should also change the method signatures to take &self instead, so that we can wait multiple times on the same event.
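
A possible shape for that change, sketched against the Drop impl shown above. The cudarc path used for synchronization here is an assumption (the crate may expose event synchronization differently); the point is only the &self signature plus leaving cleanup to Drop.

impl Fence {
    // Takes &self so the same recorded event can be waited on any number of times;
    // destruction stays in Drop, so dropping an unawaited fence does not leak the event.
    pub fn wait_sync(&self) {
        unsafe {
            // Assumed cudarc wrapper around cuEventSynchronize; the exact path may differ.
            let _ = cudarc::driver::result::event::synchronize(self.event);
        }
    }
}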

- Add ComputeServer::work_done() (default to sync)
- Channels: non-blocking WorkDone (MPSC); forwarders on Mutex/Cell
- WGPU: use Queue::on_submitted_work_done for precise, non-blocking completion
- CUDA/HIP: native event fences; add Drop to avoid event leaks on cancellation
- ComputeClient: fence(), fence_handle() (blocking wait), execute_async(), write_async()
- Test: fence_completes_submitted_work (dummy backend)

Single commit; vendor-only (no softmax or unrelated changes)
Follow-up commits (titles truncated):
… Drop. Remove work_done API and route client helpers through sync(); keep WGPU sync via on_submitted_work_done
…hods take &self and support multi-wait semantics
@ariawisp ariawisp (Author) commented

Per feedback, I simplified the API and adjusted fence semantics:

  • Removed work_done from ComputeServer/ComputeChannel (and Message::WorkDone). Client helpers now await sync():

    • fence() → await channel.sync()
    • execute_async(..) → submit, then await sync()
    • write_async(..) → write, then await sync()
  • CUDA/HIP Fence supports multi‑wait:

    • wait_sync/wait and wait_async take &self; cleanup moved to Drop (idempotent).
    • Updated call sites; no leaks on cancellation.
  • WGPU sync() uses Queue::on_submitted_work_done (wgpu 26) after flush().

  • Added compile‑time tests (CUDA/HIP) asserting fence waits take &self (multi‑wait) without requiring a device; a sketch follows this list.

  • Workspace build and clippy are green.
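
The compile-time assertion mentioned in the list above can be as small as coercing the method to a function pointer; a sketch, assuming wait_sync(&self) returns () and the Fence type is in scope:

#[test]
fn fence_wait_takes_shared_reference() {
    // This coercion only compiles while wait_sync takes `&self`,
    // so the check needs no device and runs as an ordinary unit test.
    fn assert_shared_ref(_f: fn(&Fence)) {}
    assert_shared_ref(Fence::wait_sync);
}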

@ArthurBrussee ArthurBrussee (Contributor) commented

Is it really needed to add execute_async and write_async? In the common case you shouldn't execute one kernel and then immediately wait. People can call sync() themselves once they've submitted all the work.

write_async doesn't make sense to me either; why would you have to wait for the write to complete? To be clear, any work queued up after the write() will observe the write to have happened, so there's no need to explicitly synchronize the GPU each time.
