
Reorganize Intel CuTe docs (index.rst, intel_overview.md, intel_gemm_companion.md, intel_performance_guide.md)#745

Open
manvendrasingh21 wants to merge 8 commits into intel:main from manvendrasingh21:copilot/add-intel-first-cute-docs

Conversation

@manvendrasingh21

Description

This PR adds an Intel-first CuTe documentation layer to improve navigation and usability for Intel Xe GPU targets. It restructures the CuTe docs entry point and introduces Intel-specific overview, performance, and GEMM companion guides.

Copilot AI and others added 7 commits March 2, 2026 08:05
…mpanion, update index.rst

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…mance_guide.md

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
… launch API)

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…yout algebra callout, experimental launch, epilogue wiring, FP8 detail, CollectiveBuilder)

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…lectiveBuilder near top, cross-links, de-duplicate tile/pipeline section

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…acy.cpp in companion

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>

@vidyasiv left a comment


Thanks for the documentation, very useful!
Just some minor comments, will do another pass shortly.

| Pitfall | Description | Fix |
|---------|-------------|-----|
| **Alignment** | `XE_2D_*` loads require the base pointer to be 64-byte aligned and the row stride to be a multiple of 16 elements. Unaligned access silently produces garbage. | Pad matrices to alignment boundaries. |
| **Over-synchronisation** | Inserting `barrier()` after every copy wastes throughput. The epilogue often only needs one barrier. | Audit barrier placement; consolidate where possible. |


@manvendrasingh21 (Author)


fixed.

Consider a "residue" kernel or smaller tiles for non-multiple sizes.

3. **Pipeline depth sufficient?**
   If memory latency is high and XMX utilisation is low, increase `PipelineStages` by one and re-benchmark.


@manvendrasingh21 (Author)


fixed.


**Rules of thumb:**
- Start with `(256, 256, 32)` for BF16 on BMG/PVC.
- Reduce M or N if register spill is observed (check with Intel VTune or compiler `-v` output).


│ │ │ XE_2D_U16x32x32_LD_V (B)
│ │ └──────────────────────────── make_shape(Int<256>{}, Int<256>{}, Int<32>{})
│ └────────────────────────────────────── make_tensor(gmem_ptr, shape, stride)
└────────────────────────────────────────────────── make_stride(Int<1>{}, ldA)


Why not use `make_layout()` here? A layout is typically the (shape, stride) pair.

> **Prerequisite:** If you are brand new to CuTe, read the
> [quickstart](00_quickstart.md) first for a high-level orientation.
> Note: the quickstart currently uses CUDA/NVCC terminology inherited from upstream CUTLASS —
> the concepts apply identically to SYCL. Substitute `sub_group` for `warp`,


Optional: if possible, can we add a table similar to the one below, but with updated Intel terms?

[attached image]


> **Legacy vs. new 2D copy API:** The table above lists both the legacy and new copy headers.
>
> - **Legacy API** (`copy_xe_legacy_U16.hpp`, `copy_xe_legacy_U32.hpp`): Uses named structs per


Remove the legacy links. It is okay to have a separate section for the legacy API, with a note that these APIs might be deprecated in the future.


### Intel Xe MMA atoms

Xe MMA atoms follow the naming convention `XE_8x16x16_<AccumType><AType><BType><CType>_<Layout>`.


I believe this information is incorrect. Please refer to xe_architecture.md and to our MMA (`XE_DPAS`) and copy atoms.

2. **This page** — Intel-specific context and concept map
3. **[01_layout.md](01_layout.md)** → **[02_layout_algebra.md](02_layout_algebra.md)** — The foundation (layout algebra is the most critical concept)
4. **[03_tensor.md](03_tensor.md)** → **[04_algorithms.md](04_algorithms.md)** — Tensors and copy/gemm algorithms
5. **[0x_gemm_tutorial.md](0x_gemm_tutorial.md)** — How GEMM works in CuTe


This needs to point to an Intel example.

> For new kernel development, check whether the new API covers your use case. For understanding
> existing code and examples, refer to the legacy headers.

### Intel Xe MMA atoms


It may be better to rework this section to cover all Xe-based atoms, along with the helper functions needed for GEMMs and kernels.

@@ -0,0 +1,218 @@
# Intel SYCL GEMM Companion


I believe this section (the whole .md) does not flow well. I would recommend picking one GEMM example and creating an .md file that explains the APIs and the flow based on the Xe architecture and the available APIs.

| FP8 | `Shape<_256, _256, _32>` | See `include/cute/arch/mma_xe_legacy.hpp` for available FP8 atoms |
| INT8 | `Shape<_32, _128, _32>` | Mixed-precision; smaller tile is common |

> **FP8 implementation detail:** There is no dedicated FP8 MMA atom in the current codebase.


Remove the FP8 section.

doubling the K-tile in the QK GEMM stage meaningfully improves performance by amortizing 2D block
load overhead over more XMX compute.

**Before** (conservative K-tile = 32):


Was this tested?


Increasing to 3 or 4 can help on high-latency HBM systems, but raises register pressure.

## Common pitfalls


Multiple APIs are not correct in this table. Please verify.

## Fast diagnosis — what to check first

1. **Bandwidth-bound or compute-bound?**
Run with Intel VTune "GPU Hotspot" analysis. Compare achieved memory bandwidth to HBM peak and


If memory latency is high and XMX utilization is low, increase `PipelineStages` by one and
re-benchmark.

4. **Alignment verified?**


I am not sure about this.

@tdeng5 added the v0.8 label on Mar 16, 2026
