Use inline VISA to optimize horizontal batched subgroup reduce #4171
base: main
Conversation
Pull Request Overview
This PR introduces an experimental inline VISA mechanism to optimize horizontal batched subgroup reduce in the Intel backend, with stub implementations for the NVIDIA and AMD backends. Key changes include:
- Adding a new warpBatchReduce function implementation with inline VISA in Intel's TargetInfo.cpp.
- Updating header files across the Intel, NVIDIA, and AMD backends, plus the base interface, to declare and expose the new function.
- Integrating the new warpBatchReduce call into ReduceOpToLLVM.cpp so the conversion can return early when the optimized path applies (a sketch of the interface change follows the file table below).
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.h | Declare the new warpBatchReduce function. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.cpp | Implement experimental inline VISA-based warpBatchReduce logic. |
| third_party/intel/lib/TritonIntelGPUToLLVM/ReduceOpToLLVM.cpp | Integrate warpBatchReduce into reduce op conversion. |
| third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h | Add pure virtual declaration for warpBatchReduce. |
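The table implies a small virtual hook on the target-info interface. Below is a hedged sketch of what that could look like; the parameter list is an assumption stitched together from the snippets quoted later in this review (an accumulator map keyed by offset vectors, plus the reduceOp/numLaneToReduce/warpSize triple passed to isSupportedWarpReduceOp), not the PR's actual signature.

```cpp
// Hedged sketch only: parameter names and types are assumptions inferred
// from the review snippets, not the PR's actual declaration.
#include "mlir/Transforms/DialectConversion.h"
#include "triton/Dialect/Triton/IR/Dialect.h"
#include "llvm/ADT/SmallVector.h"
#include <map>

// include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h
class TargetInfoBase {
public:
  // Returns true when the target emitted a batched warp reduction for the
  // accumulator values in `acc`; callers fall back to the generic
  // shuffle-based lowering when it returns false.
  virtual bool warpBatchReduce(
      mlir::ConversionPatternRewriter &rewriter, mlir::Location loc,
      std::map<llvm::SmallVector<unsigned>, llvm::SmallVector<mlir::Value>>
          &acc,
      mlir::triton::ReduceOp reduceOp, unsigned numLaneToReduce,
      unsigned warpSize) const = 0;
  virtual ~TargetInfoBase() = default;
};

// third_party/nvidia/.../TargetInfo.h and third_party/amd/.../TargetInfo.h:
// stubs that opt out, so only the Intel backend takes the fast path.
struct TargetInfo : public TargetInfoBase {
  bool warpBatchReduce(
      mlir::ConversionPatternRewriter &, mlir::Location,
      std::map<llvm::SmallVector<unsigned>, llvm::SmallVector<mlir::Value>> &,
      mlir::triton::ReduceOp, unsigned, unsigned) const override {
    return false;
  }
};
```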
```cpp
for (auto it : acc) {
  const SmallVector<unsigned> &key = it.first;
  SmallVector<Value> &val = acc[key];
```
[nitpick] Iterating over 'acc' using 'auto it' and then accessing 'acc[key]' results in redundant lookups; consider using structured bindings (e.g., 'for (auto &pair : acc)') to improve clarity and efficiency.
Suggested change:
```diff
- for (auto it : acc) {
-   const SmallVector<unsigned> &key = it.first;
-   SmallVector<Value> &val = acc[key];
+ for (auto &[key, val] : acc) {
```
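As a standalone illustration of the point (plain C++, not Triton code): the copy-and-reindex loop pays for a pair copy plus a second map lookup per element, while a structured binding over a reference visits each entry exactly once.

```cpp
#include <map>
#include <string>
#include <vector>

int main() {
  std::map<std::string, std::vector<int>> acc{{"a", {1}}, {"b", {2}}};

  // Redundant: `it` copies each pair, then acc[key] searches the map again.
  for (auto it : acc) {
    const std::string &key = it.first;
    std::vector<int> &val = acc[key];
    val.push_back(0);
  }

  // Preferred: key and val bind directly to the map entry, no extra lookup.
  for (auto &[key, val] : acc) {
    val.push_back(0);
  }
}
```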
```cpp
if (!isSupportedWarpReduceOp(reduceOp, numLaneToReduce, warpSize))
  return false;

// It is only experimental code supports threads_per_warp=16
```
[nitpick] The hard-coded check for warpSize == 16 limits the function to experimental scenarios; consider adding a comment or an assert to clarify the dependency on this constraint.
Suggested change:
```diff
- // It is only experimental code supports threads_per_warp=16
+ // This code is experimental and currently supports only threads_per_warp=16.
+ assert(warpSize == 16 && "This experimental code supports only warpSize of 16.");
```
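A design note on that suggestion: since warpBatchReduce signals failure by returning false, an early opt-out is arguably friendlier than an assert, which would abort compilation for every non-16 warp size. A minimal sketch, using a hypothetical helper rather than the PR's code:

```cpp
// Hypothetical capability probe (not the PR's code): returning false lets
// callers fall back to the generic lowering instead of aborting.
bool canUseBatchedSubgroupReduce(unsigned warpSize, unsigned numLaneToReduce) {
  // Experimental path: only threads_per_warp == 16 is handled so far.
  if (warpSize != 16)
    return false;
  // Assume the whole subgroup participates in the reduction.
  return numLaneToReduce == warpSize;
}
```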
Force-pushed from 0ef1308 to f925709.
Where is the GitHub issue for this work?
```diff
@@ -176,6 +176,14 @@ struct ReduceOpConversion
   unsigned sizeIntraWarps = helper.getIntraWarpSizeWithUniqueData();
   unsigned threadOffsetOnReductionAxis =
       helper.getThreadOffsetOnReductionAxis();
+
+  auto ret =
```
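The hunk is truncated in this view. Given the overview's note that the conversion returns early when the optimized path applies, the integration presumably resembles the hedged sketch below; every identifier other than `ret` is an assumption, not the PR's actual code.

```cpp
// Hedged sketch of the truncated call site; names besides `ret` are guesses.
auto ret = targetInfo.warpBatchReduce(rewriter, loc, accs, op,
                                      numLaneToReduce, warpSize);
if (ret)
  return; // the target emitted a batched subgroup reduce; skip the
          // generic shuffle-based lowering that follows
```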
Do we need to add the method to the global target info if we are the only ones using it, inside files we control?
We are still investigating how to implement this upstream.
Maybe we can use the in-tree MLIR op: https://mlir.llvm.org/docs/Dialects/GPU/#gpusubgroup_reduce-gpusubgroupreduceop
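For reference, routing through that op from C++ might look like the sketch below. The textual form in the comment follows the linked documentation; the builder arguments are an assumption, since gpu::SubgroupReduceOp's build signature has changed across MLIR versions.

```cpp
#include "mlir/Dialect/GPU/IR/GPUDialect.h"

using namespace mlir;

// Emits the equivalent of:  %r = gpu.subgroup_reduce add %v : (f32) -> f32
// Builder arguments are an assumption; check the MLIR version in use.
Value emitSubgroupSum(OpBuilder &b, Location loc, Value v) {
  return b.create<gpu::SubgroupReduceOp>(
      loc, v, gpu::AllReduceOperation::ADD, /*uniform=*/false);
}
```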
Force-pushed from f925709 to 8d4e6e0.
This is a large PR which is going to be split into several small ones.
Force-pushed from 8d4e6e0 to 98ff036.
Signed-off-by: Lu,Chengjun <[email protected]>
While developing a kernel, I was given the error message "AssertionError()" without much helpful context on how to proceed with debugging. I could only solve it by understanding that part of the Triton source code, which cost me half a day. That's why I'm (1) adding an error message to this part of the code, and (2) making the error message above it clearer (as is done in visit_While). This should allow end users to debug this error without needing to dive into the Triton source code.
Use inline VISA to optimize horizontal batched subgroup reduce. Supports `float32` and `float16`. Run the unit test CI.
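The description leaves the VISA itself to the diff. For readers unfamiliar with the mechanism, a backend typically attaches raw assembly during LLVM lowering via an llvm.inline_asm op; the sketch below shows that plumbing only. The VISA string is a deliberate placeholder and the constraint string is an assumption, not what the PR emits.

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

using namespace mlir;

// Plumbing sketch only: wraps an assembly string in llvm.inline_asm. The
// VISA text is a placeholder; the PR's actual reduction sequence is not
// shown in this review.
Value emitInlineVisa(OpBuilder &b, Location loc, Type resultTy, Value input) {
  StringRef visa = "<batched subgroup reduce VISA elided>"; // placeholder
  StringRef constraints = "=rw,rw"; // one output, one input (assumption)
  auto asmOp = b.create<LLVM::InlineAsmOp>(
      loc, TypeRange{resultTy}, ValueRange{input}, visa, constraints,
      /*has_side_effects=*/true, /*is_align_stack=*/false,
      LLVM::AsmDialectAttr::get(b.getContext(), LLVM::AsmDialect::AD_ATT),
      /*operand_attrs=*/ArrayAttr());
  return asmOp.getRes();
}
```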