
Reorganize Intel CuTe docs (index.rst, intel_overview.md, intel_gemm_companion.md, intel_performance_guide.md)#745

Open
manvendrasingh21 wants to merge 8 commits into intel:main from manvendrasingh21:copilot/add-intel-first-cute-docs

Conversation

@manvendrasingh21

Description

This PR adds an Intel-first CuTe documentation layer to improve navigation and usability for Intel Xe GPU targets. It restructures the CuTe docs entry point and introduces Intel-specific overview, performance, and GEMM companion guides.

Copilot AI and others added 7 commits March 2, 2026 08:05
…mpanion, update index.rst

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…mance_guide.md

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
… launch API)

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…yout algebra callout, experimental launch, epilogue wiring, FP8 detail, CollectiveBuilder)

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…lectiveBuilder near top, cross-links, de-duplicate tile/pipeline section

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…acy.cpp in companion

Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>

@vidyasiv left a comment


Thanks for the documentation, very useful!
Just some minor comments, will do another pass shortly.

| Pitfall | Description | Fix |
|---------|-------------|-----|
| **Alignment** | `XE_2D_*` loads require the base pointer to be 64-byte aligned and the row stride to be a multiple of 16 elements. Unaligned access silently produces garbage. | Pad matrices to alignment boundaries. |
| **Over-synchronisation** | Inserting `barrier()` after every copy wastes throughput. The epilogue often only needs one barrier. | Audit barrier placement; consolidate where possible. |


@manvendrasingh21 (Author)


fixed.

Consider a "residue" kernel or smaller tiles for non-multiple sizes.

3. **Pipeline depth sufficient?**
   If memory latency is high and XMX utilisation is low, increase `PipelineStages` by one and re-benchmark.


@manvendrasingh21 (Author)


fixed.


**Rules of thumb:**
- Start with `(256, 256, 32)` for BF16 on BMG/PVC.
- Reduce M or N if register spill is observed (check with Intel VTune or compiler `-v` output).


│ │ │ XE_2D_U16x32x32_LD_V (B)
│ │ └──────────────────────────── make_shape(Int<256>{}, Int<256>{}, Int<32>{})
│ └────────────────────────────────────── make_tensor(gmem_ptr, shape, stride)
└────────────────────────────────────────────────── make_stride(Int<1>{}, ldA)


Why not use `make_layout()` here? A layout is typically the (shape, stride) pair.

> **Prerequisite:** If you are brand new to CuTe, read the
> [quickstart](00_quickstart.md) first for a high-level orientation.
> Note: the quickstart currently uses CUDA/NVCC terminology inherited from upstream CUTLASS —
> the concepts apply identically to SYCL. Substitute `sub_group` for `warp`,


Optional: if possible, can we add a table similar to the one below, but with updated Intel terms?

[attached image]


> **Legacy vs. new 2D copy API:** The table above lists both the legacy and new copy headers.
>
> - **Legacy API** (`copy_xe_legacy_U16.hpp`, `copy_xe_legacy_U32.hpp`): Uses named structs per


Remove the legacy links. It is okay to have a separate section for the legacy API, with a note that these APIs might be deprecated in the future.


### Intel Xe MMA atoms

Xe MMA atoms follow the naming convention `XE_8x16x16_<AccumType><AType><BType><CType>_<Layout>`.


I believe this information is incorrect. Please refer to xe_architecture.md and to our MMA (`XE_DPAS`) and copy atoms.

2. **This page** — Intel-specific context and concept map
3. **[01_layout.md](01_layout.md)** → **[02_layout_algebra.md](02_layout_algebra.md)** — The foundation (layout algebra is the most critical concept)
4. **[03_tensor.md](03_tensor.md)** → **[04_algorithms.md](04_algorithms.md)** — Tensors and copy/gemm algorithms
5. **[0x_gemm_tutorial.md](0x_gemm_tutorial.md)** — How GEMM works in CuTe


This needs to point to an Intel example.

> For new kernel development, check whether the new API covers your use case. For understanding
> existing code and examples, refer to the legacy headers.

### Intel Xe MMA atoms


It may be better to rework this section to cover all Xe-based atoms, along with the helper functions needed for GEMMs and kernels.

@@ -0,0 +1,218 @@
# Intel SYCL GEMM Companion


I believe this section (the whole .md) does not flow well. I would recommend picking one GEMM example and creating an .md file that explains the APIs and the flow based on the Xe architecture and the available APIs.

| FP8 | `Shape<_256, _256, _32>` | See `include/cute/arch/mma_xe_legacy.hpp` for available FP8 atoms |
| INT8 | `Shape<_32, _128, _32>` | Mixed-precision; smaller tile is common |

> **FP8 implementation detail:** There is no dedicated FP8 MMA atom in the current codebase.


Remove the FP8 section.

doubling the K-tile in the QK GEMM stage meaningfully improves performance by amortizing 2D block
load overhead over more XMX compute.

**Before** (conservative K-tile = 32):


Was this tested?


Increasing to 3 or 4 can help on high-latency HBM systems, but raises register pressure.

## Common pitfalls


Multiple APIs are not correct in this table. Please verify.

## Fast diagnosis — what to check first

1. **Bandwidth-bound or compute-bound?**
Run with Intel VTune "GPU Hotspot" analysis. Compare achieved memory bandwidth to HBM peak and


If memory latency is high and XMX utilization is low, increase `PipelineStages` by one and
re-benchmark.

4. **Alignment verified?**


I am not sure about this.

@tdeng5 added the v0.8 label on Mar 16, 2026
