Use inline VISA to optimize horizontal batched subgroup reduce #4171
base: main
Conversation
Pull Request Overview
This PR introduces an experimental inline VISA mechanism to optimize horizontal batched subgroup reduce in the Intel backend, with stub implementations for the NVIDIA and AMD backends. Key changes include:
- Adding a new warpBatchReduce function implementation with inline VISA in Intel's TargetInfo.cpp.
- Updating header files across the Intel, NVIDIA, and AMD backends, plus the base interface, to declare and expose the new function.
- Integrating the new warpBatchReduce call into ReduceOpToLLVM.cpp so the conversion can return early when the optimized path applies (a sketch of the interface change follows the file table below).
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.h | Declare the new warpBatchReduce function. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.cpp | Implement experimental inline VISA-based warpBatchReduce logic. |
| third_party/intel/lib/TritonIntelGPUToLLVM/ReduceOpToLLVM.cpp | Integrate warpBatchReduce into reduce op conversion. |
| third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h | Add pure virtual declaration for warpBatchReduce. |
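The table implies a small virtual hook on the target-info interface. Below is a hedged sketch of what that could look like; the parameter list is an assumption stitched together from the snippets quoted later in this review (an accumulator map keyed by offset vectors, plus the reduceOp/numLaneToReduce/warpSize triple passed to isSupportedWarpReduceOp), not the PR's actual signature.

```cpp
// Hedged sketch only: parameter names and types are assumptions inferred
// from the review snippets, not the PR's actual declaration.
#include "mlir/Transforms/DialectConversion.h"
#include "triton/Dialect/Triton/IR/Dialect.h"
#include "llvm/ADT/SmallVector.h"
#include <map>

// include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h
class TargetInfoBase {
public:
  // Returns true when the target emitted a batched warp reduction for the
  // accumulator values in `acc`; callers fall back to the generic
  // shuffle-based lowering when it returns false.
  virtual bool warpBatchReduce(
      mlir::ConversionPatternRewriter &rewriter, mlir::Location loc,
      std::map<llvm::SmallVector<unsigned>, llvm::SmallVector<mlir::Value>>
          &acc,
      mlir::triton::ReduceOp reduceOp, unsigned numLaneToReduce,
      unsigned warpSize) const = 0;
  virtual ~TargetInfoBase() = default;
};

// third_party/nvidia/.../TargetInfo.h and third_party/amd/.../TargetInfo.h:
// stubs that opt out, so only the Intel backend takes the fast path.
struct TargetInfo : public TargetInfoBase {
  bool warpBatchReduce(
      mlir::ConversionPatternRewriter &, mlir::Location,
      std::map<llvm::SmallVector<unsigned>, llvm::SmallVector<mlir::Value>> &,
      mlir::triton::ReduceOp, unsigned, unsigned) const override {
    return false;
  }
};
```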
```cpp
for (auto it : acc) {
  const SmallVector<unsigned> &key = it.first;
  SmallVector<Value> &val = acc[key];
```
[nitpick] Iterating over 'acc' using 'auto it' and then accessing 'acc[key]' results in redundant lookups; consider using structured bindings (e.g., 'for (auto &pair : acc)') to improve clarity and efficiency.
Suggested change:
```diff
- for (auto it : acc) {
-   const SmallVector<unsigned> &key = it.first;
-   SmallVector<Value> &val = acc[key];
+ for (auto &[key, val] : acc) {
```
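As a standalone illustration of the point (plain C++, not Triton code): the copy-and-reindex loop pays for a pair copy plus a second map lookup per element, while a structured binding over a reference visits each entry exactly once.

```cpp
#include <map>
#include <string>
#include <vector>

int main() {
  std::map<std::string, std::vector<int>> acc{{"a", {1}}, {"b", {2}}};

  // Redundant: `it` copies each pair, then acc[key] searches the map again.
  for (auto it : acc) {
    const std::string &key = it.first;
    std::vector<int> &val = acc[key];
    val.push_back(0);
  }

  // Preferred: key and val bind directly to the map entry, no extra lookup.
  for (auto &[key, val] : acc) {
    val.push_back(0);
  }
}
```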
```cpp
if (!isSupportedWarpReduceOp(reduceOp, numLaneToReduce, warpSize))
  return false;

// It is only experimental code supports threads_per_warp=16
```
[nitpick] The hard-coded check for warpSize == 16 limits the function to experimental scenarios; consider adding a comment or an assert to clarify the dependency on this constraint.
Suggested change:
```diff
- // It is only experimental code supports threads_per_warp=16
+ // This code is experimental and currently supports only threads_per_warp=16.
+ assert(warpSize == 16 && "This experimental code supports only warpSize of 16.");
```
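A design note on that suggestion: since warpBatchReduce signals failure by returning false, an early opt-out is arguably friendlier than an assert, which would abort compilation for every non-16 warp size. A minimal sketch, using a hypothetical helper rather than the PR's code:

```cpp
// Hypothetical capability probe (not the PR's code): returning false lets
// callers fall back to the generic lowering instead of aborting.
bool canUseBatchedSubgroupReduce(unsigned warpSize, unsigned numLaneToReduce) {
  // Experimental path: only threads_per_warp == 16 is handled so far.
  if (warpSize != 16)
    return false;
  // Assume the whole subgroup participates in the reduction.
  return numLaneToReduce == warpSize;
}
```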
Force-pushed from 0ef1308 to f925709.
Where is the GitHub issue for this work?
```diff
@@ -176,6 +176,14 @@ struct ReduceOpConversion
   unsigned sizeIntraWarps = helper.getIntraWarpSizeWithUniqueData();
   unsigned threadOffsetOnReductionAxis =
       helper.getThreadOffsetOnReductionAxis();
+
+  auto ret =
```
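The hunk is truncated in this view. Given the overview's note that the conversion returns early when the optimized path applies, the integration presumably resembles the hedged sketch below; every identifier other than `ret` is an assumption, not the PR's actual code.

```cpp
// Hedged sketch of the truncated call site; names besides `ret` are guesses.
auto ret = targetInfo.warpBatchReduce(rewriter, loc, accs, op,
                                      numLaneToReduce, warpSize);
if (ret)
  return; // the target emitted a batched subgroup reduce; skip the
          // generic shuffle-based lowering that follows
```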
Do we need to add the method to the global target info if we are the only ones using it, inside files we control?
We are still investigating how to implement this upstream.
Maybe we can use the in-tree MLIR op: https://mlir.llvm.org/docs/Dialects/GPU/#gpusubgroup_reduce-gpusubgroupreduceop
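For reference, routing through that op from C++ might look like the sketch below. The textual form in the comment follows the linked documentation; the builder arguments are an assumption, since gpu::SubgroupReduceOp's build signature has changed across MLIR versions.

```cpp
#include "mlir/Dialect/GPU/IR/GPUDialect.h"

using namespace mlir;

// Emits the equivalent of:  %r = gpu.subgroup_reduce add %v : (f32) -> f32
// Builder arguments are an assumption; check the MLIR version in use.
Value emitSubgroupSum(OpBuilder &b, Location loc, Value v) {
  return b.create<gpu::SubgroupReduceOp>(
      loc, v, gpu::AllReduceOperation::ADD, /*uniform=*/false);
}
```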
Force-pushed from f925709 to 8d4e6e0.
This is a large PR which is going to be split into several small ones.
Force-pushed from 8d4e6e0 to 98ff036.
Signed-off-by: Lu,Chengjun <[email protected]>
While developing a kernel, I was given the error message "AssertionError()" without much helpful context on how to proceed with debugging. I could only solve it by understanding that part of the Triton source code, which cost me half a day. That's why I'm (1) adding an error message to this part of the code, and (2) making the error message above it clearer (as is done in visit_While). This should allow end users to debug this error without needing to dive into the Triton source code.
Use inline VISA to optimize horizontal batched subgroup reduce. Supports `float32` and `float16`. Run the unit test CI.
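The description leaves the VISA itself to the diff. For readers unfamiliar with the mechanism, a backend typically attaches raw assembly during LLVM lowering via an llvm.inline_asm op; the sketch below shows that plumbing only. The VISA string is a deliberate placeholder and the constraint string is an assumption, not what the PR emits.

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

using namespace mlir;

// Plumbing sketch only: wraps an assembly string in llvm.inline_asm. The
// VISA text is a placeholder; the PR's actual reduction sequence is not
// shown in this review.
Value emitInlineVisa(OpBuilder &b, Location loc, Type resultTy, Value input) {
  StringRef visa = "<batched subgroup reduce VISA elided>"; // placeholder
  StringRef constraints = "=rw,rw"; // one output, one input (assumption)
  auto asmOp = b.create<LLVM::InlineAsmOp>(
      loc, TypeRange{resultTy}, ValueRange{input}, visa, constraints,
      /*has_side_effects=*/true, /*is_align_stack=*/false,
      LLVM::AsmDialectAttr::get(b.getContext(), LLVM::AsmDialect::AD_ATT),
      /*operand_attrs=*/ArrayAttr());
  return asmOp.getRes();
}
```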