Reorganize Intel CuTe docs (index.rst, intel_overview.md, intel_gemm_companion.md, intel_performance_guide.md) #745
…mpanion, update index.rst Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…mance_guide.md Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
… launch API) Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…yout algebra callout, experimental launch, epilogue wiring, FP8 detail, CollectiveBuilder) Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…lectiveBuilder near top, cross-links, de-duplicate tile/pipeline section Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
…acy.cpp in companion Co-authored-by: manvendrasingh21 <208962721+manvendrasingh21@users.noreply.github.com>
vidyasiv
left a comment
Thanks for the documentation, very useful!
Just some minor comments, will do another pass shortly.
| Pitfall | Description | Fix |
|---------|-------------|-----|
| **Alignment** | `XE_2D_*` loads require the base pointer to be 64-byte aligned and the row stride to be a multiple of 16 elements. Unaligned access silently produces garbage. | Pad matrices to alignment boundaries. |
| **Over-synchronisation** | Inserting `barrier()` after every copy wastes throughput. The epilogue often only needs one barrier. | Audit barrier placement; consolidate where possible. |
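To make the alignment row concrete, here is a minimal C++ sketch of a host-side check that could run before launching a kernel. It assumes the two constraints exactly as quoted in the table (64-byte base alignment, row stride a multiple of 16 elements). Those values are not verified here, and `pad_leading_dim`, `is_aligned`, and `meets_block_load_constraints` are hypothetical helper names, not part of CuTe or SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Constraint values as quoted in the pitfall table (assumptions, not verified).
constexpr std::size_t kBaseAlignBytes = 64;
constexpr std::size_t kStrideMultipleElems = 16;

// Round a leading dimension (in elements) up to the next multiple of `multiple`.
constexpr std::size_t pad_leading_dim(std::size_t ld,
                                      std::size_t multiple = kStrideMultipleElems) {
    return (ld + multiple - 1) / multiple * multiple;
}

// True if `ptr` sits on an `align`-byte boundary.
inline bool is_aligned(const void* ptr, std::size_t align = kBaseAlignBytes) {
    return reinterpret_cast<std::uintptr_t>(ptr) % align == 0;
}

// Hypothetical pre-launch check combining both quoted conditions.
inline bool meets_block_load_constraints(const void* base, std::size_t ld_elems) {
    return is_aligned(base) && ld_elems % kStrideMultipleElems == 0;
}
```

Calling `meets_block_load_constraints` on each operand before launch turns a silent wrong-result failure into an explicit one, and `pad_leading_dim` gives the padded row stride to allocate when the check fails.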
This comment was marked as resolved.
Consider a "residue" kernel or smaller tiles for non-multiple sizes.

3. **Pipeline depth sufficient?**
   If memory latency is high and XMX utilisation is low, increase `PipelineStages` by one and
This comment was marked as resolved.
**Rules of thumb:**
- Start with `(256, 256, 32)` for BF16 on BMG/PVC.
- Reduce M or N if register spill is observed (check with Intel VTune or compiler `-v` output).
This comment was marked as outdated.
│ │ │ XE_2D_U16x32x32_LD_V (B)
│ │ └──────────────────────────── make_shape(Int<256>{}, Int<256>{}, Int<32>{})
│ └────────────────────────────────────── make_tensor(gmem_ptr, shape, stride)
└────────────────────────────────────────────────── make_stride(Int<1>{}, ldA)
Why not use `make_layout()` for the layout here, since a layout is typically a shape plus a stride?
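The point can be illustrated with a toy model in plain C++: a layout is a shape paired with a stride, and it maps a logical coordinate to a linear offset. The `Layout2D` struct and `make_layout_2d` helper below are hypothetical stand-ins written for this sketch, not the CuTe types. In CuTe itself the corresponding call is `make_layout(shape, stride)`, and the resulting layout can be passed to `make_tensor` together with the pointer.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a 2D layout. This is NOT cute::Layout, only the same idea.
struct Layout2D {
    std::size_t shape[2];   // extents: (rows, cols)
    std::size_t stride[2];  // element step for a unit move in each mode

    // Map a logical coordinate (i, j) to a linear element offset.
    constexpr std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * stride[0] + j * stride[1];
    }

    // Number of logical elements described by the shape.
    constexpr std::size_t size() const { return shape[0] * shape[1]; }
};

// Hypothetical helper that bundles shape and stride into one object.
constexpr Layout2D make_layout_2d(std::size_t rows, std::size_t cols,
                                  std::size_t stride_row, std::size_t stride_col) {
    return Layout2D{{rows, cols}, {stride_row, stride_col}};
}
```

With a unit stride in the first mode and a leading dimension in the second, as in the diagram's `make_stride(Int<1>{}, ldA)`, this describes a column-major tile. Bundling shape and stride into one object keeps the two from drifting apart when the tensor is passed around.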
> **Prerequisite:** If you are brand new to CuTe, read the
> [quickstart](00_quickstart.md) first for a high-level orientation.
> Note: the quickstart currently uses CUDA/NVCC terminology inherited from upstream CUTLASS —
> the concepts apply identically to SYCL. Substitute `sub_group` for `warp`,

> **Legacy vs. new 2D copy API:** The table above lists both the legacy and new copy headers.
>
> - **Legacy API** (`copy_xe_legacy_U16.hpp`, `copy_xe_legacy_U32.hpp`): Uses named structs per
Remove the legacy links. It is fine to have a separate section for the legacy API, with a note that those APIs might be deprecated in the future.
### Intel Xe MMA atoms

Xe MMA atoms follow the naming convention `XE_8x16x16_<AccumType><AType><BType><CType>_<Layout>`.
I believe this information is wrong. Please refer to xe_architecture.md and to our MMA (XE_DPAS) and copy atoms.
2. **This page** — Intel-specific context and concept map
3. **[01_layout.md](01_layout.md)** → **[02_layout_algebra.md](02_layout_algebra.md)** — The foundation (layout algebra is the most critical concept)
4. **[03_tensor.md](03_tensor.md)** → **[04_algorithms.md](04_algorithms.md)** — Tensors and copy/gemm algorithms
5. **[0x_gemm_tutorial.md](0x_gemm_tutorial.md)** — How GEMM works in CuTe
> For new kernel development, check whether the new API covers your use case. For understanding
> existing code and examples, refer to the legacy headers.

### Intel Xe MMA atoms
It may be better to rework this section so that it covers all Xe-based atoms and the helper functions needed for GEMMs and kernels.
@@ -0,0 +1,218 @@
# Intel SYCL GEMM Companion
I believe this whole .md file lacks flow. I would recommend picking one GEMM example and writing an .md file that explains the APIs and the flow based on the Xe architecture and the available APIs.
| FP8 | `Shape<_256, _256, _32>` | See `include/cute/arch/mma_xe_legacy.hpp` for available FP8 atoms |
| INT8 | `Shape<_32, _128, _32>` | Mixed-precision; smaller tile is common |

> **FP8 implementation detail:** There is no dedicated FP8 MMA atom in the current codebase.
doubling the K-tile in the QK GEMM stage meaningfully improves performance by amortizing 2D block
load overhead over more XMX compute.

**Before** (conservative K-tile = 32):

Increasing to 3 or 4 can help on high-latency HBM systems, but raises register pressure.

## Common pitfalls
Several APIs in this table are incorrect. Please verify them.
## Fast diagnosis — what to check first

1. **Bandwidth-bound or compute-bound?**
   Run with Intel VTune "GPU Hotspot" analysis. Compare achieved memory bandwidth to HBM peak and
We recommend https://github.com/intel/pti-gpu/tree/master
   If memory latency is high and XMX utilization is low, increase `PipelineStages` by one and
   re-benchmark.

4. **Alignment verified?**

Description
This PR adds an Intel‑first CuTe documentation layer to improve navigation and usability for Intel Xe GPU targets. It restructures the CuTe docs entry point, introduces Intel‑specific overview, performance, and GEMM companion guides