RFC: Add xegpu transform ops #1
Conversation
I did an initial pass and will go through it again later. It currently looks to me like we need to generalize the layout setting: the implementation is limited to a few specific cases. Can we run an analysis inside __transform_main and query the analysis result for each Op or Value?
mlir/include/mlir/Dialect/XeGPU/TransformOps/XeGPUTransformOps.td
```
auto sgLayout = getSgLayout();
if (sgLayout.size() != 2) {
```
What is the reason to limit the rank to 2?
So far I have only considered 2D inputs, namely the 2D matmul op. This can be generalized as needed. My goal here was to demonstrate the transform ops with a 2D matmul and use that as the first CI test. Generalization can be added in the same PR or in a follow-up with more tests.
Why would you need such an analysis? Normally, I think, it is sufficient to inspect the payload op handle and the transform op arguments for, say, verification purposes.

I mean from the transform perspective: how do we systematically assign layouts to each OpResult and OpOperand in a kernel?
Rename tileIndex to operandIndex; remove all references to dpas ops where possible.
Updates:
let summary = "Hoists xegpu tile descriptor ops outside the containing loop"; | ||
let description = [{ | ||
Hoists `xepu.create_nd_tdesc` out of the loop. If the |
This pass may become unnecessary as we are transitioning to a new create_nd_tdesc definition: the nd_tdesc is created without an offset, and the offset moves to load_nd. create_nd_tdesc would then become loop-invariant.
Referring to these PRs:
a.1. make offset optional for create_nd_tdesc (llvm#148335)
a.2. add optional offsets for load_nd and store_nd/prefetch_nd (llvm#149424)
You may look at the Imex innersource GitHub issue #1151 for more background info.
Thanks, yes, I'm aware of this planned change. It implies some changes to the transform ops; in fact, it should make the logic simpler in most cases. Hoisting the desc ops is still needed, but we might indeed be able to use existing hoist patterns instead of an xegpu-specific method. We can address this issue once the new load_nd-offset pipeline is complete. In the meantime, for my part, we could upstream these transform ops so that we can support linalg.matmul lowering.
let summary = "Adds xegpu prefetch ops to matmul operand tiles."; | ||
let description = [{ | ||
Given an xegpu operation residing in a `scf.for` loop, this transform inserts cooperative `xegpu.prefetch` operations for the A (index = 0) or B (index = 1) operand. The prefetch tile size is determined by the `sg_layout` and `sg_data` attributes. |
Do you mean the input is an xegpu DPAS op?
Yes, the implementation only supports the DPAS op at the moment.
```
auto layoutAttr =
    createLayoutAttr(rewriter.getContext(), sgLayout, sgData, instData);
descOp = setDescLayout(rewriter, descOp, layoutAttr);
if (operandIndex == 2) {
```
Does the current implementation still assume the operation is a dpasOp? If so, maybe you can add a TODO note.
Upstreaming these ops is deferred due to the ongoing changes in the xegpu dialect. Closing.
xegpu transform ops for matrix multiplication

Purpose

This document outlines new `transform.xegpu` transform operations. There is currently no direct path for lowering `linalg` operations to the `xegpu` dialect, although such a capability would be useful in a number of user applications. The proposed XeGPU transform operations aim to fill the gaps for lowering `linalg.matmul` operations. They also address the tiling, prefetching, etc. optimizations necessary to achieve good performance on Xe GPUs. Going forward, the XeGPU transform ops can be extended to support more workloads. The transform ops also take handles to individual payload ops (e.g., a specific `scf.for` op), which allows defining differentiated transforms for each op (e.g., a main loop and a remainder loop after tiling).

New Operations
The new transform ops are:

* `transform.xegpu.set_operand_layout`: Given a handle to an anchor op, like `xegpu.dpas`, sets `xegpu.layout` attributes on its operands. Currently only supports DPAS ops. The DPAS op must have been tiled to the workgroup (WG) size and the reduction loop K size. This op sets the `sg_layout`, `sg_data`, and `inst_data` layout attributes.
* `transform.xegpu.insert_prefetch`: Inserts prefetch operations for an xegpu op's operands. Currently only supports the DPAS op. Sets the `sg_layout` and `sg_data` attributes, emits prefetch ops, and inserts them in the reduction loop.
* `transform.xegpu.hoist_desc_ops`: Hoists `xegpu.create_nd_tdesc` ops out of the loop.
* `transform.xegpu.set_gpu_launch_threads`: Given a handle to a `gpu.launch` op, sets the number of GPU threads. This op is a workaround to ensure the correct number of threads in the launch op.
Example: 4k matrix multiplication payload

Consider the following 4k `linalg.matmul` payload function defined with `tensor`s.
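The payload listing did not carry over to this page; a minimal sketch of such a function, where the names and the choice of f16 inputs with an f32 accumulator are assumptions, could look like:

```mlir
// Hypothetical 4k matmul payload; names and element types are illustrative.
func.func @matmul(%A: tensor<4096x4096xf16>, %B: tensor<4096x4096xf16>,
                  %C: tensor<4096x4096xf32>) -> tensor<4096x4096xf32> {
  // C += A * B on 4096x4096 ("4k") operands.
  %0 = linalg.matmul ins(%A, %B : tensor<4096x4096xf16>, tensor<4096x4096xf16>)
                     outs(%C : tensor<4096x4096xf32>) -> tensor<4096x4096xf32>
  return %0 : tensor<4096x4096xf32>
}
```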
Applying existing transforms

We can apply workgroup (WG) and reduction dimension (K) tiling using the following upstream transform operations on the matched `linalg.matmul` op handle:
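The original listing is not reproduced above; a sketch using upstream transform ops (tile sizes follow the 256x256 WG tile and K = 32 discussed below; result order and handle types follow the upstream op definitions and may differ between MLIR versions):

```mlir
// %root is assumed to be a handle to the payload module or function.
%matmul = transform.structured.match ops{["linalg.matmul"]} in %root
    : (!transform.any_op) -> !transform.any_op
// Workgroup (WG) tiling to 256x256 tiles, producing an scf.forall loop.
%wg_tiled, %wg_loop = transform.structured.tile_using_forall %matmul tile_sizes [256, 256]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
// Reduction (K) tiling with K = 32, producing an scf.for loop.
%k_tiled, %k_loop = transform.structured.tile_using_for %wg_tiled tile_sizes [0, 0, 32]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
```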
This produces an `scf.forall` loop for the WG tiling, followed by an `scf.for` reduction loop. The tiled matmul op has shape `(256x32, 32x256) -> 256x256`.
We can now vectorize the `linalg.matmul` op and hoist the loop-invariant C tile read/store ops. Hoisting can be safely applied as we are working on tensors, thus avoiding any memory side effects.

Next we bufferize the payload function and drop the redundant function return value.

The matrix multiplication is now defined with `vector` ops and `memref`s.

We can now apply existing `gpu` dialect passes to map this loop nest to GPU blocks and threads (WG and SG). We first convert the `scf.forall` loop to `scf.parallel`. The `gpu-map-parallel-loops` pass expects two `scf.parallel` loops, one for the WG level and one for the SG level. At this stage, however, we only have the WG loop, so the pass assumes a single GPU thread. We will fix this later.

We can now apply the `convert-vector-to-xegpu` pass to convert the `vector` dialect ops to `xegpu` ops and fold `memref.subview` ops into the `xegpu` descriptor op. The reduction loop now reads:

Applying xegpu transform ops

The above `xegpu` IR must be further optimized to get good performance. This is where the new `xegpu` transform ops come into play.

The `transform.xegpu.set_operand_layout` operation
The DPAS op is defined at the WG level without any indication of how it should be distributed to the subgroups. To this end, we apply the `transform.xegpu.set_operand_layout` op, which sets the `xegpu.layout` attributes. We first match the DPAS op, and then apply the desired `sg_layout`, `sg_data`, and `inst_data` attributes for the A tile (operand `index = 0`):
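The exact syntax of the proposed op is defined in the patch itself and is not reproduced here; an illustrative invocation, with attribute values chosen only as an example consistent with a 256x256 WG tile and an 8x16x16 DPAS tile (`%func` denotes an assumed handle to the payload function):

```mlir
%dpas = transform.structured.match ops{["xegpu.dpas"]} in %func
    : (!transform.any_op) -> !transform.any_op
// A tile (operand index = 0): example 8x4 subgroup layout, 32x32 per subgroup,
// 8x16 instruction tile. Values are illustrative, not the verbatim schedule.
transform.xegpu.set_operand_layout %dpas
    index = 0, sg_layout = [8, 4], sg_data = [32, 32], inst_data = [8, 16]
    : !transform.any_op
```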
The B and C tiles are handled analogously:
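With the same illustrative syntax for the B (`index = 1`) and C (`index = 2`) tiles:

```mlir
// Example values only; the real schedule's attributes may differ.
transform.xegpu.set_operand_layout %dpas
    index = 1, sg_layout = [8, 4], sg_data = [32, 64], inst_data = [16, 16]
    : !transform.any_op
transform.xegpu.set_operand_layout %dpas
    index = 2, sg_layout = [8, 4], sg_data = [32, 64], inst_data = [8, 16]
    : !transform.any_op
```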
Setting the layout for the C tile also sets the `layout_result_0` attribute on the `xegpu.dpas` op. The final reduction loop with layout attributes is:
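Schematically, the `xegpu.dpas` op now carries a result layout; the attribute spelling below follows the xegpu layout attribute, but the concrete values and shapes are illustrative:

```mlir
// Illustrative only: WG-level dpas with a result layout annotation.
%d = xegpu.dpas %a, %b, %acc
       {layout_result_0 = #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 64], inst_data = [8, 16]>}
       : vector<256x32xf16>, vector<32x256xf16>, vector<256x256xf32> -> vector<256x256xf32>
```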
The `transform.xegpu.hoist_desc_ops` operation

The above IR still has the A and B descriptor ops within the reduction loop. These can be hoisted with the `transform.xegpu.hoist_desc_ops` op:
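An illustrative invocation; consistent with the description below, the op is assumed to consume the reduction-loop handle and return a handle to the rewritten loop:

```mlir
// %k_loop is the handle to the scf.for reduction loop from the tiling step.
%new_loop = transform.xegpu.hoist_desc_ops %k_loop
    : (!transform.any_op) -> !transform.any_op
```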
The descriptor op is moved out of the loop, adding the descriptor to the loop's `iter_args` and adding an offset update op in the loop.

This op replaces the `scf.for` op; the loop handle is therefore invalidated, and a handle to the new loop is returned.

The resulting IR can now be lowered further using the `xegpu-wg-to-sg-distribute` and `xegpu-blocking` passes.
The `transform.xegpu.insert_prefetch` operation

Cooperative prefetching can be added using the `transform.xegpu.insert_prefetch` op. The op takes a handle to the reduction loop and the DPAS op whose operands we want to prefetch. For the A tile, we prefetch the `256x32` tile using 32 threads along the first dimension, i.e., each thread fetches an `8x32` tile:
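An illustrative invocation with schematic syntax; the `[32, 1]` layout and `[8, 32]` per-thread tile mirror the prose above, and `%dpas`/`%new_loop` are the handles from the earlier sketches:

```mlir
// Prefetch the A operand (index = 0) cooperatively in the reduction loop.
transform.xegpu.insert_prefetch %dpas in %new_loop
    index = 0, sg_layout = [32, 1], sg_data = [8, 32]
    : !transform.any_op, !transform.any_op
```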
This emits the descriptor, update offset, and prefetch ops in the reduction loop:

The B tile prefetches are handled analogously. Here we choose to prefetch the `32x256` tile using 32 threads in an `[8, 4]` layout, each thread again fetching an `8x32` tile:
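And, with the same schematic syntax, for the B operand:

```mlir
// Prefetch the B operand (index = 1) with an 8x4 thread layout, 8x32 per thread.
transform.xegpu.insert_prefetch %dpas in %new_loop
    index = 1, sg_layout = [8, 4], sg_data = [8, 32]
    : !transform.any_op, !transform.any_op
```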
The `transform.xegpu.set_gpu_launch_threads` operation

Finally, we fix the number of threads in the `gpu.launch` op with the following op:
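Illustratively; the thread counts here are an example matching an 8x4 subgroup layout rather than the exact values of the original schedule:

```mlir
%launch = transform.structured.match ops{["gpu.launch"]} in %root
    : (!transform.any_op) -> !transform.any_op
// Example thread counts only.
transform.xegpu.set_gpu_launch_threads %launch threads = [8, 4, 1]
    : !transform.any_op
```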
Full lowering schedule

Combining the above transformations, we can now write the full lowering schedule for the matmul operation:
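The complete schedule from the original post is not reproduced here; a condensed sketch of its overall shape follows, where the syntax of the new `transform.xegpu.*` ops is illustrative and several intermediate steps (vectorization, bufferization, GPU mapping) are only indicated in comments:

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    // 1. Tile the matched linalg.matmul: 256x256 WG tiles, K = 32.
    %mm = transform.structured.match ops{["linalg.matmul"]} in %root
        : (!transform.any_op) -> !transform.any_op
    %wg_tiled, %wg_loop = transform.structured.tile_using_forall %mm tile_sizes [256, 256]
        : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %k_tiled, %k_loop = transform.structured.tile_using_for %wg_tiled tile_sizes [0, 0, 32]
        : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    // 2. Vectorize, hoist, bufferize, and map to gpu blocks/threads (omitted here).
    // 3. Apply the new xegpu ops (illustrative syntax, A operand shown only).
    %dpas = transform.structured.match ops{["xegpu.dpas"]} in %root
        : (!transform.any_op) -> !transform.any_op
    transform.xegpu.set_operand_layout %dpas
        index = 0, sg_layout = [8, 4], sg_data = [32, 32], inst_data = [8, 16]
        : !transform.any_op
    %loop = transform.structured.match ops{["scf.for"]} in %root
        : (!transform.any_op) -> !transform.any_op
    %new_loop = transform.xegpu.hoist_desc_ops %loop
        : (!transform.any_op) -> !transform.any_op
    transform.xegpu.insert_prefetch %dpas in %new_loop
        index = 0, sg_layout = [32, 1], sg_data = [8, 32]
        : !transform.any_op, !transform.any_op
    %launch = transform.structured.match ops{["gpu.launch"]} in %root
        : (!transform.any_op) -> !transform.any_op
    transform.xegpu.set_gpu_launch_threads %launch threads = [8, 4, 1]
        : !transform.any_op
    transform.yield
  }
}
```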
The above schedule exposes the following parameters:
The output IR after the above schedule has been applied can be found here (now outdated).
Performance
The above schedule yields ~200 TFLOP/s on a single PVC tile and passes the correctness test.
Discussion / Future work
* Extend the `xegpu.set_operand_layout` and `xegpu.insert_prefetch` ops to support other ops than the `xegpu.dpas` op.
* The current layout setting assumes the same `inst_data` tile between a load and its use. In the long term, we could have `xegpu.set_operand_layout` and `xegpu.set_result_layout` ops that set attrs for individual ops and use the XeGPU layout propagation mechanism (under development) to handle layout conversions.
* `xegpu.set_gpu_launch_threads` should be handled differently in the future, preferably using suitable `gpu` dialect transform ops. It is included for the time being so that the IR can be executed correctly.
* For comparison, there is an equivalent pass-based implementation, `linalg-matmul-to-xegpu{wg-tile=256,256 sg-tile=32,64 k-tile=32 dpas-tile=8,16,16 a-prefetch=8,32 b-prefetch=8,32 a-load=32,16 b-load=32,16}`. This pass applies the same transforms to all DPAS ops.