Skip to content

Conversation

ariawisp
Copy link

@ariawisp ariawisp commented Sep 11, 2025

Scope

  • Adds row‑pitched (strided) buffer I/O on WGPU (read/write) for inner‑contiguous tensors with rank ≥ 2; contiguous path unchanged.
  • CPU parity: pitched read/write with the same constraints; contiguous path unchanged.
  • Centralizes stride logic in cubecl_runtime::stride and adopts it across CPU/HIP/CUDA/STD; removes duplicate helpers in STD.
  • Adds client‑side size preflight and a clear error for data/shape mismatches.

Key Changes

  • Stride helpers: add InnerContiguousRows, is_inner_contiguous_rows, and row_pitch_elems; keep contiguous_strides and is_contiguous.
  • Allocation: for AllocationKind::Optimized and rank > 1, WGPU/CPU/HIP/CUDA allocate pitched rows (row pitch aligned to the backend’s alignment); Contiguous unchanged.
  • Gating: WGPU/HIP/CUDA/CPU can_read_tensor accept fully contiguous or inner‑contiguous rows.
  • Client: read and write preflight for stride compatibility; write_async/write helpers; to_client_tensor picks preferred allocation; size preflight returns IoError::InvalidSize.
  • Utilities: expose pitched_rows_layout(..) and preferred_allocation_kind(..) for consistent planning.

Limits

  • Only inner‑most contiguous, rank ≥ 2 layouts are supported; others return UnsupportedStrides.

Notes

  • Prefer cubecl_runtime::stride::{contiguous_strides, is_contiguous, is_inner_contiguous_rows, row_pitch_elems}.
  • cubecl_std::tensor::{is_contiguous, compact_strides} were removed.
  • New error: IoError::InvalidSize { expected, actual } for create‑with‑data mismatches.

PR has been validated with Burn — no compilation or test errors.

@ariawisp ariawisp marked this pull request as draft September 11, 2025 04:01
@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch from 416ed9f to ef10850 Compare September 11, 2025 04:09
@ariawisp ariawisp changed the title wgpu: 2D pitched (strided) tensor read/write for buffers runtimes: strided (pitched) buffer I/O + centralized stride helpers Sep 11, 2025
@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch 10 times, most recently from 5b24b30 to ce5fd54 Compare September 11, 2025 05:11
@ariawisp ariawisp marked this pull request as ready for review September 11, 2025 05:12
@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch 2 times, most recently from a1ba972 to db1bd8f Compare September 11, 2025 05:22
Copy link
Collaborator

@wingertge wingertge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in general looks extremely clean, honestly one of the best PRs I've seen in a while. The describe function also seems extremely useful for inferring the correct variant of LinearLayout.

General comment, but I wonder if preferred_strides should be a device property. I haven't benchmarked the wgpu copy, but it seems like it might be slower compared to CUDA where the striding is handled by specialized GPU hardware and incurs no meaningful performance penalty. That way, a caller could decide on the allocation type based on their needs (i.e. optimized for long-lived tensors like weights, but the runtime preferred value for immediate inputs like images for a recognition model).

@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch from 1f9c499 to 0132fa8 Compare September 11, 2025 20:21
@ariawisp
Copy link
Author

Implemented device-level preference for allocation layout:

  • Added default_alloc_rank_gt1: AllocationKind to DeviceProperties.
    • WGPU/CPU default to Contiguous.
    • CUDA/HIP default to Optimized (pitched rows).
  • ComputeClient::empty_tensor now picks the device default for rank > 1.
  • ComputeClient::to_client_tensor uses the device default when the source is inner‑contiguous rows; otherwise falls back to contiguous.
  • Centralized stride helpers remain the same (is_contiguous, is_inner_contiguous_rows, row_pitch_elems, pitched_rows_layout); IO still only accepts contiguous or inner‑contiguous rows.
  • Keeps behavior focused: no changes to pitched copy logic; just device‑aware defaults. We can refine defaults with benchmarks later if needed.
  • Workspace build and clippy are green.

@nathanielsimard
Copy link
Member

@ariawisp The merge conflicts would need to be solved.

@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch 2 times, most recently from c050144 to e111e07 Compare September 19, 2025 21:56
Summary
- WGPU pitched I/O (rank>=2 inner-contiguous rows) and pitched allocation for Optimized.
- CPU pitched read/write parity and pitched allocation for Optimized.
- Centralize stride logic with runtime helpers: contiguous_strides, is_contiguous, is_inner_contiguous_rows, row_pitch_elems; remove duplicates in STD.
- Backend gating: WGPU/HIP now accept inner-contiguous rows; no change to CUDA logic.
- Docs: update book examples to use runtime::stride; rename PR doc.

Details
- WGPU: stream.rs read/write uses row_pitch_elems; server.rs create uses pitched rows when Optimized.
- CPU: compute/server.rs create/read/write implement pitched behavior.
- Runtime: stride.rs adds InnerContiguousRows, is_inner_contiguous_rows, row_pitch_elems.
- HIP/WGPU runtime.rs can_read_tensor accepts pitched rows.
- STD: remove is_contiguous/compact_strides; adjust LinearLayout to use runtime is_contiguous.

runtime(stride/io): add device-level default allocation preference for rank>1 (contiguous vs pitched). WGPU/CPU default to contiguous; CUDA/HIP to pitched. Use preference in empty_tensor and to_client_tensor.

� Conflicts:
�	crates/cubecl-convolution/src/kernels/layered/algorithm/simple.rs
�	crates/cubecl-cpu/src/compute/server.rs
�	crates/cubecl-cpu/src/runtime.rs
�	crates/cubecl-cuda/src/compute/server.rs
�	crates/cubecl-cuda/src/runtime.rs
�	crates/cubecl-hip/src/compute/server.rs
�	crates/cubecl-hip/src/runtime.rs
�	crates/cubecl-runtime/src/server.rs
�	crates/cubecl-runtime/tests/dummy/compute.rs
�	crates/cubecl-wgpu/src/compute/server.rs
�	crates/cubecl-wgpu/src/compute/stream.rs
�	crates/cubecl-wgpu/src/runtime.rs
@ariawisp ariawisp force-pushed the cubecl-wgpu-strided-io branch from e111e07 to 0b326e3 Compare September 19, 2025 22:05
@ariawisp
Copy link
Author

Branch has been rebased on main (cf40b4e) and confirmed CI workflow passes on my fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants