21 changes: 21 additions & 0 deletions media/docs/cpp/cute/index.rst
@@ -3,6 +3,17 @@
CuTe
====================

Start Here (Intel-first)
------------------------

.. toctree::
:maxdepth: 1

Intel CuTe Overview<intel_overview.md>

Core CuTe Tutorial
------------------

.. toctree::
:maxdepth: 1

@@ -15,3 +26,13 @@ CuTe
0x_gemm_tutorial<0x_gemm_tutorial.md>
0y_predication<0y_predication.md>
0z_tma_tensors<0z_tma_tensors.md>

Intel GPU Extensions
--------------------

.. toctree::
:maxdepth: 1

Xe 2D Copy Operations<xe_2d_copy.md>
Intel Performance Tuning Guide<intel_performance_guide.md>
Intel SYCL GEMM Companion<intel_gemm_companion.md>
218 changes: 218 additions & 0 deletions media/docs/cpp/cute/intel_gemm_companion.md
@@ -0,0 +1,218 @@
# Intel SYCL GEMM Companion

*Review comment:* I believe this section (the whole .md file) does not have much flow. I would recommend a single GEMM example, plus a separate .md file explaining the APIs and flow based on the Xe architecture and available APIs.


## Purpose

This document provides **SYCL execution model notes** that complement the CuTe GEMM tutorial.
Read the tutorial first:

> 📖 [0x_gemm_tutorial.md](0x_gemm_tutorial.md)
>
> **Also useful:**
> [intel_overview.md](intel_overview.md) for the Intel-specific component map
> · [xe_2d_copy.md](xe_2d_copy.md) for the copy atom naming reference

This companion does **not** repeat tutorial content. It explains how to translate each tutorial
concept into a SYCL submission structure for Intel Xe and shows where Intel-specific copy and MMA
primitives plug in.

---

## Translating tutorial concepts to SYCL

The GEMM tutorial uses CUDA terminology. The table below maps each concept to its SYCL equivalent
as used in SYCL\*TLA examples (e.g., `examples/cute/tutorial/sgemm_1_sycl.cpp`,
`examples/cute/tutorial/bgemm_bmg_legacy.cpp`).

| Tutorial concept | SYCL\*TLA equivalent | Notes |
|-----------------|---------------------|-------|
| `__global__ void kernel(...)` | SYCL kernel submitted via `compat::launch<gemm_kernel, GemmKernelName>(grid, block, ...)` | Two template params: kernel function and kernel name type (for SYCL named kernels). Queue defaults to `get_default_queue()`; pass explicitly as 3rd arg if needed. |
| `blockIdx.x`, `blockIdx.y` | `BlockIdxX()`, `BlockIdxY()` (from `include/cutlass/gpu_generics.h`) | Portable wrappers over `compat::work_group_id::x/y()`. Used in all Xe kernel examples. |
| `threadIdx.x` | `ThreadIdxX()` (from `include/cutlass/gpu_generics.h`) | Portable wrapper over `compat::local_id::x()`. |
| `__shared__ T smem[N]` | `sycl::ext::oneapi::experimental::work_group_scratch_size<sizeof(T[N])>` | Declared via a kernel property; accessed through a pointer obtained from `sycl::ext::oneapi::experimental::get_work_group_scratch_memory()`. |
| `dim3 grid(gx, gy)` | `compat::dim3{gx, gy, 1}` | Passed to `compat::launch<>`. |
| `__launch_bounds__(N)` | `sycl::ext::oneapi::experimental::sub_group_size<16>` + work-group size in `nd_range` | Intel Xe always uses 16-wide subgroups for XMX dispatch. |

### Kernel submission skeleton

```cpp
// Host side — two template parameters: kernel function + kernel name type
auto grid = compat::dim3{ceil_div(N, BLK_N), ceil_div(M, BLK_M), 1};
auto block = compat::dim3{SubgroupSize * SubgroupsPerGroup, 1, 1};

// Without explicit queue (uses get_default_queue()):
compat::launch<gemm_kernel, GemmKernelName>(grid, block, args...);

// With explicit queue:
compat::launch<gemm_kernel, GemmKernelName>(grid, block, queue, args...);
```
```cpp
// Device side (kernel properties)
auto props = sycl::ext::oneapi::experimental::properties{
sycl::ext::oneapi::experimental::sub_group_size<16>
};
```

### Production kernel launch (with kernel properties)

The skeleton above uses the tutorial-style `compat::launch` API, which is suitable for simple
kernels like `sgemm_1_sycl.cpp`. Production Intel Xe GEMM kernels (e.g.,
`examples/00_bmg_gemm/`, all `GemmUniversalAdapter`-based examples) use the **experimental
launch API** which bundles kernel properties (subgroup size, scratch memory) into a launch policy:

```cpp
namespace sycl_exp = sycl::ext::oneapi::experimental;

auto sycl_grid = compat::dim3{grid_x, grid_y, 1};
auto sycl_block = compat::dim3{block_x, 1, 1};

compat::experimental::launch_properties launch_props{
sycl_exp::work_group_scratch_size(shared_mem_bytes),
};
auto kernel_props = compat::experimental::kernel_properties{
sycl_exp::sub_group_size<16>
};
compat::experimental::launch_policy policy{sycl_grid, sycl_block, launch_props, kernel_props};

// Two template params: device_kernel wrapper + kernel name type
auto event = compat::experimental::launch<
cutlass::device_kernel<GemmKernel>, GemmKernel>(policy, kernel_params);
```

**When to use which:**

| Pattern | Use when |
|---------|----------|
| `compat::launch<K, KName>(grid, block, args...)` | Simple CuTe tutorial kernels, custom one-file kernels |
| `compat::experimental::launch<device_kernel<K>, K>(policy, params)` | `GemmUniversalAdapter`-based kernels, production GEMM/attention, any kernel needing SLM scratch or kernel properties |

The experimental API is what all `examples/00_bmg_gemm/` through `examples/13_bmg_gemm_bias/`
and Flash Attention examples use internally via `GemmUniversalAdapter`.

---

## Where Intel-specific copy primitives plug in

The standard CuTe GEMM flow is:

```
make_tiled_copy → partition → cute::copy → TiledMMA → cute::gemm → epilogue store
```

For Intel Xe, the copy atoms and MMA atom are replaced with Xe hardware primitives.
The pattern from `examples/cute/tutorial/bgemm_bmg_legacy.cpp`:
```cpp
// A-matrix: load with transposed layout (LD_T) for XMX A operand
TiledCopy copyA = make_tiled_copy(
Copy_Atom<Copy_Traits<XE_2D_U16x16x16_LD_T, TA>, TA>{}, ...);
// B-matrix: load with VNNI-packed layout (LD_V) for XMX B operand
TiledCopy copyB = make_tiled_copy(
Copy_Atom<Copy_Traits<XE_2D_U16x32x32_LD_V, TB>, TB>{}, ...);
// C/D-matrix: store with row-major layout (ST_N)
TiledCopy copyC = make_tiled_copy(
Copy_Atom<Copy_Traits<XE_2D_U32x8x16_ST_N, TC>, TC>{}, ...);
// MMA: Xe Matrix Extension atom
TiledMMA mmaC = TiledMMAHelper<
MMA_Atom<XE_8x16x16_F32BF16BF16F32_TT>, ...>::TiledMMA{};
```

### Layout conventions for Xe operands

| Operand | Recommended layout | Reason |
|---------|--------------------|--------|
| A matrix | `LD_T` (column-major / transposed) | XMX expects A in column-major order |
| B matrix | `LD_V` (VNNI-packed, row-major) | XMX requires VNNI packing for B |
| C / D matrix | `ST_N` (row-major) | Output is row-major |

Using `LD_N` for the B matrix is a common mistake that produces incorrect results or severe
performance degradation. Always use `LD_V` for B on Intel Xe.
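
To make the VNNI requirement concrete, here is an illustrative host-side sketch of the index arithmetic (the `LD_V` copy atom performs this repacking in hardware on load; this is a model of the layout, not repository code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// VNNI packing for a 16-bit B operand: pairs of K-adjacent elements of each
// column are stored contiguously, so one 32-bit dword holds B[k][n] and
// B[k+1][n].
std::vector<uint16_t> vnni_pack(const std::vector<uint16_t>& B, int K, int N) {
  std::vector<uint16_t> out(B.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      // Row-major B[K][N]  ->  packed B[K/2][N][2]  (K assumed even)
      out[(k / 2) * (N * 2) + n * 2 + (k % 2)] = B[k * N + n];
  return out;
}
```

Reading the packed buffer linearly visits two K-consecutive elements of each column at a time, which is exactly the operand order XMX consumes.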

### Epilogue wiring

The CuTe GEMM flow ends with an epilogue that reads the accumulator, optionally loads C, applies
a fusion (e.g., linear combination, bias, activation), and stores D. The standard Intel Xe
epilogue pattern from `examples/00_bmg_gemm/legacy/00_bmg_gemm.cpp`:

```cpp
using EpilogueDispatchPolicy = cutlass::epilogue::IntelXeXMX16;

// Epilogue fusion — D = alpha * acc + beta * C
using EpilogueOp = cutlass::epilogue::fusion::LinearCombination<
ElementOutput, ElementComputeEpilogue,
ElementAccumulator, ElementAccumulator,
cutlass::FloatRoundStyle::round_to_nearest>;

using FusionCallBacks = cutlass::epilogue::fusion::FusionCallbacks<
EpilogueDispatchPolicy, // dispatch policy — must match mainloop
EpilogueOp,
TileShape,
decltype(tile_shape(TiledMma()))>;

using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveEpilogue<
EpilogueDispatchPolicy,
TileShape,
ElementAccumulator, cutlass::gemm::TagToStrideC_t<LayoutC>, // C
ElementOutput, cutlass::gemm::TagToStrideC_t<LayoutD>, // D
FusionCallBacks,
XE_2D_U32x8x16_LD_N, void, void, // C load atom (for beta * C)
XE_2D_U32x8x16_ST_N, void, void>; // D store atom
```
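
As a scalar reference, the `LinearCombination` fusion above computes `D = alpha * acc + beta * C` elementwise over the output tile. A minimal sketch of the semantics (not the CUTLASS implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the LinearCombination epilogue fusion:
// D[i] = alpha * acc[i] + beta * C[i].
std::vector<float> linear_combination(float alpha, const std::vector<float>& acc,
                                      float beta, const std::vector<float>& C) {
  std::vector<float> D(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i)
    D[i] = alpha * acc[i] + beta * C[i];
  return D;
}
```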
Key points:

- The epilogue dispatch policy (`IntelXeXMX16`) must match the mainloop dispatch policy.
- `FusionCallbacks` wraps the epilogue operation and connects it to the tile shape.
- C is loaded with `XE_2D_U32x8x16_LD_N` (row-major read) and D is stored with
  `XE_2D_U32x8x16_ST_N` (row-major write).
- For grouped GEMM, use `IntelXeXMX16Group` and the array epilogue variant.
- For fused epilogues (bias + activation, softmax, etc.), replace `LinearCombination` with
  the appropriate fusion op from `cutlass/epilogue/fusion/xe_callbacks.hpp`.

See `examples/05_bmg_gemm_with_epilogues/` for fused epilogue examples (bias+ReLU, split-K,
dequantization).
---

## GEMM flow diagram with Intel primitives

```
Layout ──► Tensor ──► Tile ──► Copy ──► MMA ──► Store
  │          │         │        │        │        │
  │          │         │        │        │        └─ XE_2D_U32x8x16_ST_N
  │          │         │        │        └─ XE_8x16x16_F32BF16BF16F32_TT
  │          │         │        ├─ XE_2D_U16x16x16_LD_T (A)
  │          │         │        └─ XE_2D_U16x32x32_LD_V (B)
  │          │         └─ make_shape(Int<256>{}, Int<256>{}, Int<32>{})
  │          └─ make_tensor(gmem_ptr, shape, stride)
  └─ make_stride(Int<1>{}, ldA)

```

*Review comment:* Why not `make_layout()` for the Layout stage, since a layout is typically a shape plus a stride?

---
## Performance notes
For detailed tile-size selection, pipeline depth guidance, subgroup sizing, and common pitfalls, see the
[Intel Performance Tuning Guide](intel_performance_guide.md).

**Quick reference for this GEMM flow:**

- **Tile shape:** Start with `Shape<_256, _256, _32>` for BF16 on BMG/PVC.
- **Subgroup layout:** `Layout<Shape<_8, _4, _1>, Stride<_4, _1, _0>>` gives 32 subgroups per
work-group (512 threads with `SubgroupSize = 16`).
- **Pipeline stages:** `PipelineStages = 2` is the standard starting point.
- **CollectiveBuilder:** For standard GEMM, prefer `CollectiveBuilder` over manual wiring —
see the [performance guide](intel_performance_guide.md#start-simple-collectivebuilder).
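
The subgroup-count arithmetic above can be sanity-checked in a few lines (standalone sketch; the per-subgroup sub-tile split assumes the subgroup grid tiles M by 8 and N by 4, which is an illustrative reading of the layout):

```cpp
#include <array>
#include <cassert>

struct SubgroupLayout { int m, n, k; };  // e.g. Shape<_8, _4, _1>

constexpr int subgroups_per_wg(SubgroupLayout s) { return s.m * s.n * s.k; }

constexpr int workitems_per_wg(SubgroupLayout s, int subgroup_size) {
  return subgroups_per_wg(s) * subgroup_size;
}

// Per-subgroup sub-tile of the work-group tile, assuming the subgroup grid
// splits M by s.m and N by s.n (hypothetical split for illustration).
constexpr std::array<int, 2> subtile(SubgroupLayout s, int blk_m, int blk_n) {
  return {blk_m / s.m, blk_n / s.n};
}
```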

---
## Further reading
- [xe_2d_copy.md](xe_2d_copy.md) — Full reference for all `XE_2D_*` copy atoms
- [intel_performance_guide.md](intel_performance_guide.md) — Tuning checklist, CollectiveBuilder, and common pitfalls
- [0t_mma_atom.md](0t_mma_atom.md) — CuTe MMA atom concept background
- [examples/README.md](../../../../examples/README.md) — Full SYCL\*TLA example directory (Intel GPU, device-agnostic, NVIDIA SYCL)
- [examples/cute/tutorial/](../../../../examples/cute/tutorial/) — CuTe tutorial examples (including `sgemm_1_sycl.cpp`)
- [test/unit/cute/intel_xe/](../../../../test/unit/cute/intel_xe/) — CuTe unit tests for Xe copy and MMA atoms
105 changes: 105 additions & 0 deletions media/docs/cpp/cute/intel_overview.md
@@ -0,0 +1,105 @@
# CuTe in SYCL\*TLA — Intel Overview

## CuTe in SYCL\*TLA (What it is)

> **Prerequisite:** If you are brand new to CuTe, read the
> [quickstart](00_quickstart.md) first for a high-level orientation.
> Note: the quickstart currently uses CUDA/NVCC terminology inherited from upstream CUTLASS —
> the concepts apply identically to SYCL. Substitute `sub_group` for `warp`,

*Review comment (optional):* If possible, can we add a table similar to the one in the attached image, but with updated Intel terms? *(image attachment not reproduced)*

> `work-group` for `threadblock`, and Intel DPC++ for NVCC. A SYCL-first rewrite of the
> quickstart is planned.

CuTe in SYCL\*TLA is a collection of C++ SYCL template abstractions for defining and operating on
hierarchically multidimensional layouts of threads and data.

The two central objects are:

- **`Layout`**: a compile-time mapping from a logical coordinate space to a flat index.
Layouts compose naturally — slicing, tiling, and transposing are pure algebra.
- **`Tensor`**: a `Layout` paired with a pointer to storage. CuTe `Tensor`s handle all
index arithmetic for you; you work in logical coordinates.

Together these building blocks let you express complex GEMM tiling hierarchies (global→SLM→register)
and epilogue fusions without hand-writing index calculations.
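
As a toy model of these two objects (real CuTe layouts are hierarchical and resolved at compile time, but the index arithmetic is the same idea):

```cpp
#include <cassert>
#include <vector>

// A Layout is a shape plus a stride mapping a logical coordinate to a flat
// index (toy 2-D version).
struct Layout2D {
  int shape[2];   // (rows, cols)
  int stride[2];  // index = r * stride[0] + c * stride[1]
  int operator()(int r, int c) const { return r * stride[0] + c * stride[1]; }
};

// Column-major layout for an M x N matrix: stride (1, M).
inline Layout2D col_major(int M, int N) { return Layout2D{{M, N}, {1, M}}; }

// A Tensor pairs a Layout with storage; all index arithmetic lives in the layout.
template <class T>
struct Tensor2D {
  T* data;
  Layout2D layout;
  T& operator()(int r, int c) { return data[layout(r, c)]; }
};
```

Writing `t(2, 1)` on a 4×3 column-major tensor touches flat element `2*1 + 1*4 = 6`; the caller never computes that index by hand.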

## Concept map

```
Layout ──► Layout Algebra ──► Tensor ──► Algorithms ──► Atoms ──► GEMM tutorial
                                                          │
                                                          ▼
                                               Intel Xe Extensions
                                            (xe_2d_copy, XMX atoms)
```

> See the [Intel SYCL GEMM Companion](intel_gemm_companion.md#gemm-flow-diagram-with-intel-primitives)
> for this flow annotated with specific Intel Xe atom names.

## What's Intel-specific

The following components are unique to the Intel Xe path in this repository and are **not** part of
the upstream NVIDIA CUTLASS CuTe:

| Component | Location | Purpose |
|-----------|----------|---------|
| **Xe 2D block loads/stores/prefetch** | `xe_2d_copy.md`, `include/cute/arch/copy_xe_legacy_U16.hpp`, `include/cute/arch/copy_xe_legacy_U32.hpp`, `include/cute/arch/copy_xe_2d.hpp` (new unified API) | Hardware 2D block operations — see [xe_2d_copy.md](xe_2d_copy.md) for naming reference, [intel_gemm_companion.md](intel_gemm_companion.md) for usage patterns |
| **XMX MMA atoms** (`XE_8x16x16_*`) | `include/cute/arch/mma_xe_legacy.hpp` | Xe Matrix Extension compute atoms — see [intel_gemm_companion.md](intel_gemm_companion.md) for wiring patterns |
| **`SubgroupTensor`** | `include/cute/tensor_sg.hpp` | Intel-specific tensor type that scatters/gathers across subgroup lanes |
| **`TiledMMAHelper`** | `include/cute/atom/mma_atom.hpp` | Helper that constructs a `TiledMMA` from an Xe MMA atom and subgroup tile shape |

> **Legacy vs. new 2D copy API:** The table above lists both the legacy and new copy headers.
>
> - **Legacy API** (`copy_xe_legacy_U16.hpp`, `copy_xe_legacy_U32.hpp`): Uses named structs per
>   size/type/layout combination — e.g., `XE_2D_U16x32x32_LD_V`, `XE_2D_U32x8x16_ST_N`.
>   All existing examples and tests in this repository use the legacy API.
> - **New unified API** (`copy_xe_2d.hpp`): Parameterized templates —
>   e.g., `XE_LOAD_2D<Bits, Height, Width>`. This is the future direction and supports
>   new atom features like subtiling and size-1 fragments.
>
> For new kernel development, check whether the new API covers your use case. For understanding
> existing code and examples, refer to the legacy headers.

*Review comment:* Remove the legacy links. It is okay to have a separate section for legacy APIs, with a note that they might be deprecated in the future.
### Intel Xe MMA atoms

*Review comment:* It may be better to change this section to add information for all Xe-based atoms, plus the helper functions necessary for GEMMs and kernels.


Xe MMA atoms follow the naming convention `XE_8x16x16_<AccumType><AType><BType><CType>_<Layout>`.

*Review comment:* I believe this is wrong information. Please refer to `xe_architecture.md` and to our MMA (`XE_DPAS`) and copy atoms.

For example, `XE_8x16x16_F32BF16BF16F32_TT` accumulates in FP32 from BF16 A and BF16 B operands.
These are defined in `include/cute/arch/mma_xe_legacy.hpp`.
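
Assuming the M=8, N=16, K=16 reading of the name (see the review comment above disputing the convention), the atom's reference semantics are a small matrix multiply-accumulate. A scalar sketch, with plain `float` standing in for the BF16 A/B operands:

```cpp
#include <cassert>

// Reference semantics of an 8x16x16 atom with an F32 accumulator:
// C[m][n] += sum_k A[m][k] * B[k][n]. The real atom executes this as a
// single XMX (DPAS) instruction per subgroup.
constexpr int kM = 8, kN = 16, kK = 16;

void mma_8x16x16_ref(float (&C)[kM][kN], const float (&A)[kM][kK],
                     const float (&B)[kK][kN]) {
  for (int m = 0; m < kM; ++m)
    for (int n = 0; n < kN; ++n)
      for (int k = 0; k < kK; ++k)
        C[m][n] += A[m][k] * B[k][n];
}
```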

### SubgroupTensor

`SubgroupTensor` (from `include/cute/tensor_sg.hpp`) distributes tensor storage across the lanes of
an Intel subgroup. It is the Intel equivalent of the per-thread register tile used in CUDA CUTLASS.
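
A toy model of the lane-distribution idea only (the real `SubgroupTensor` layout is dictated by the MMA atom; this round-robin split is purely illustrative):

```cpp
#include <cassert>

// N tensor elements spread round-robin across the 16 lanes of a subgroup;
// each lane holds a private fragment indexed by "slot".
constexpr int kLanes = 16;

constexpr int owner_lane(int i) { return i % kLanes; }  // lane holding element i
constexpr int lane_slot(int i)  { return i / kLanes; }  // position in that lane's fragment
constexpr int element_of(int lane, int slot) { return lane + slot * kLanes; }
```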

### TiledMMAHelper

`TiledMMAHelper` (from `include/cute/atom/mma_atom.hpp`) wraps the low-level `MMA_Atom` with
subgroup tile size information to produce the `TiledMMA` object used in GEMM kernels.

## Recommended reading order

For engineers new to SYCL\*TLA CuTe, we recommend this sequence:

1. **[00_quickstart.md](00_quickstart.md)** — What CuTe is (see CUDA-first note above)
2. **This page** — Intel-specific context and concept map
3. **[01_layout.md](01_layout.md)** → **[02_layout_algebra.md](02_layout_algebra.md)** — The foundation (layout algebra is the most critical concept)
4. **[03_tensor.md](03_tensor.md)** → **[04_algorithms.md](04_algorithms.md)** — Tensors and copy/gemm algorithms
5. **[0x_gemm_tutorial.md](0x_gemm_tutorial.md)** — How GEMM works in CuTe

*Review comment:* This needs to point to an Intel example.

6. **[intel_gemm_companion.md](intel_gemm_companion.md)** — Translating the tutorial to SYCL / Intel Xe
7. **[xe_2d_copy.md](xe_2d_copy.md)** — Intel copy atom reference
8. **[intel_performance_guide.md](intel_performance_guide.md)** — Tuning and optimization

## Quick navigation (jump to any topic)

| Goal | Start here |
|------|-----------|
| **Learn CuTe concepts** | [01_layout.md](01_layout.md) → [02_layout_algebra.md](02_layout_algebra.md) → [03_tensor.md](03_tensor.md) → [04_algorithms.md](04_algorithms.md) |
| **Implement a GEMM** | [0x_gemm_tutorial.md](0x_gemm_tutorial.md) |
| **Explore compute atoms** | [0t_mma_atom.md](0t_mma_atom.md) |
| **Optimize memory movement on Intel** | [xe_2d_copy.md](xe_2d_copy.md) |
| **Tune for Intel GPU performance** | [intel_performance_guide.md](intel_performance_guide.md) |
| **SYCL GEMM companion notes** | [intel_gemm_companion.md](intel_gemm_companion.md) |

> **Key concept:** Layout algebra ([02_layout_algebra.md](02_layout_algebra.md)) is the most important
> concept in CuTe — it powers all tiling, partitioning, and thread-to-data mapping. Functions like
> `logical_divide`, `composition`, and `complement` are how CuTe slices a global problem into
> per-subgroup work. If you read only one concept page, make it that one.