runtimes: strided (pitched) buffer I/O + centralized stride helpers #879

ariawisp · 2025-09-11T03:15:56Z

Scope

Adds row‑pitched (strided) buffer I/O on WGPU (read/write) for inner‑contiguous tensors with rank ≥ 2; contiguous path unchanged.
CPU parity: pitched read/write with the same constraints; contiguous path unchanged.
Centralizes stride logic in cubecl_runtime::stride and adopts it across CPU/HIP/CUDA/STD; removes duplicate helpers in STD.
Adds client‑side size preflight and a clear error for data/shape mismatches.

Key Changes

Stride helpers: add InnerContiguousRows, is_inner_contiguous_rows, and row_pitch_elems; keep contiguous_strides and is_contiguous.
Allocation: for AllocationKind::Optimized and rank > 1, WGPU/CPU/HIP/CUDA allocate pitched rows (row pitch aligned to the backend’s alignment); Contiguous unchanged.
Gating: WGPU/HIP/CUDA/CPU can_read_tensor accept fully contiguous or inner‑contiguous rows.
Client: read and write preflight for stride compatibility; write_async/write helpers; to_client_tensor picks preferred allocation; size preflight returns IoError::InvalidSize.
Utilities: expose pitched_rows_layout(..) and preferred_allocation_kind(..) for consistent planning.

Limits

Only inner‑most contiguous, rank ≥ 2 layouts are supported; others return UnsupportedStrides.

Notes

Prefer cubecl_runtime::stride::{contiguous_strides, is_contiguous, is_inner_contiguous_rows, row_pitch_elems}.
cubecl_std::tensor::{is_contiguous, compact_strides} were removed.
New error: IoError::InvalidSize { expected, actual } for create‑with‑data mismatches.

PR has been validated with Burn — no compilation or test errors.

wingertge

The code in general looks extremely clean, honestly one of the best PRs I've seen in a while. The describe function also seems extremely useful for inferring the correct variant of LinearLayout.

General comment, but I wonder if preferred_strides should be a device property. I haven't benchmarked the wgpu copy, but it seems like it might be slower compared to CUDA where the striding is handled by specialized GPU hardware and incurs no meaningful performance penalty. That way, a caller could decide on the allocation type based on their needs (i.e. optimized for long-lived tensors like weights, but the runtime preferred value for immediate inputs like images for a recognition model).

ariawisp · 2025-09-11T20:22:51Z

Implemented device-level preference for allocation layout:

Added default_alloc_rank_gt1: AllocationKind to DeviceProperties.
- WGPU/CPU default to Contiguous.
- CUDA/HIP default to Optimized (pitched rows).
ComputeClient::empty_tensor now picks the device default for rank > 1.
ComputeClient::to_client_tensor uses the device default when the source is inner‑contiguous rows; otherwise falls back to contiguous.
Centralized stride helpers remain the same (is_contiguous, is_inner_contiguous_rows, row_pitch_elems, pitched_rows_layout); IO still only accepts contiguous or inner‑contiguous rows.
Keeps behavior focused: no changes to pitched copy logic; just device‑aware defaults. We can refine defaults with benchmarks later if needed.
Workspace build and clippy are green.

nathanielsimard · 2025-09-15T13:15:41Z

@ariawisp The merge conflicts would need to be solved.

Summary - WGPU pitched I/O (rank>=2 inner-contiguous rows) and pitched allocation for Optimized. - CPU pitched read/write parity and pitched allocation for Optimized. - Centralize stride logic with runtime helpers: contiguous_strides, is_contiguous, is_inner_contiguous_rows, row_pitch_elems; remove duplicates in STD. - Backend gating: WGPU/HIP now accept inner-contiguous rows; no change to CUDA logic. - Docs: update book examples to use runtime::stride; rename PR doc. Details - WGPU: stream.rs read/write uses row_pitch_elems; server.rs create uses pitched rows when Optimized. - CPU: compute/server.rs create/read/write implement pitched behavior. - Runtime: stride.rs adds InnerContiguousRows, is_inner_contiguous_rows, row_pitch_elems. - HIP/WGPU runtime.rs can_read_tensor accepts pitched rows. - STD: remove is_contiguous/compact_strides; adjust LinearLayout to use runtime is_contiguous. runtime(stride/io): add device-level default allocation preference for rank>1 (contiguous vs pitched). WGPU/CPU default to contiguous; CUDA/HIP to pitched. Use preference in empty_tensor and to_client_tensor. � Conflicts: � crates/cubecl-convolution/src/kernels/layered/algorithm/simple.rs � crates/cubecl-cpu/src/compute/server.rs � crates/cubecl-cpu/src/runtime.rs � crates/cubecl-cuda/src/compute/server.rs � crates/cubecl-cuda/src/runtime.rs � crates/cubecl-hip/src/compute/server.rs � crates/cubecl-hip/src/runtime.rs � crates/cubecl-runtime/src/server.rs � crates/cubecl-runtime/tests/dummy/compute.rs � crates/cubecl-wgpu/src/compute/server.rs � crates/cubecl-wgpu/src/compute/stream.rs � crates/cubecl-wgpu/src/runtime.rs

ariawisp · 2025-09-19T22:17:42Z

Branch has been rebased on main (cf40b4e) and confirmed CI workflow passes on my fork.

ariawisp marked this pull request as draft September 11, 2025 04:01

ariawisp force-pushed the cubecl-wgpu-strided-io branch from 416ed9f to ef10850 Compare September 11, 2025 04:09

ariawisp changed the title ~~wgpu: 2D pitched (strided) tensor read/write for buffers~~ runtimes: strided (pitched) buffer I/O + centralized stride helpers Sep 11, 2025

ariawisp force-pushed the cubecl-wgpu-strided-io branch 10 times, most recently from 5b24b30 to ce5fd54 Compare September 11, 2025 05:11

ariawisp marked this pull request as ready for review September 11, 2025 05:12

ariawisp force-pushed the cubecl-wgpu-strided-io branch 2 times, most recently from a1ba972 to db1bd8f Compare September 11, 2025 05:22

nathanielsimard requested a review from wingertge September 11, 2025 13:02

wingertge reviewed Sep 11, 2025

View reviewed changes

ariawisp force-pushed the cubecl-wgpu-strided-io branch from 1f9c499 to 0132fa8 Compare September 11, 2025 20:21

wingertge approved these changes Sep 12, 2025

View reviewed changes

ariawisp force-pushed the cubecl-wgpu-strided-io branch 2 times, most recently from c050144 to e111e07 Compare September 19, 2025 21:56

ariawisp force-pushed the cubecl-wgpu-strided-io branch from e111e07 to 0b326e3 Compare September 19, 2025 22:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

runtimes: strided (pitched) buffer I/O + centralized stride helpers #879

runtimes: strided (pitched) buffer I/O + centralized stride helpers #879

Uh oh!

ariawisp commented Sep 11, 2025 •

edited

Loading

Uh oh!

wingertge left a comment •

edited

Loading

Uh oh!

ariawisp commented Sep 11, 2025

Uh oh!

nathanielsimard commented Sep 15, 2025

Uh oh!

ariawisp commented Sep 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

runtimes: strided (pitched) buffer I/O + centralized stride helpers #879

Are you sure you want to change the base?

runtimes: strided (pitched) buffer I/O + centralized stride helpers #879

Uh oh!

Conversation

ariawisp commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wingertge left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ariawisp commented Sep 11, 2025

Uh oh!

nathanielsimard commented Sep 15, 2025

Uh oh!

ariawisp commented Sep 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ariawisp commented Sep 11, 2025 •

edited

Loading

wingertge left a comment •

edited

Loading