[AMDGPU] fold memref.subview into amdgpu.gather_to_lds #149851


Open
lialan wants to merge 7 commits into main

Conversation

@lialan (Member) commented Jul 21, 2025

This PR adds a new optimization pass that folds memref.subview operations into amdgpu.gather_to_lds operations, simplifying the resulting IR and potentially improving performance. The pass identifies when a GatherToLDSOp has a memref.subview as its source and folds the subview away by adjusting the gather indices accordingly (a small before/after sketch follows the list below).

  • Implements a new pass AmdgpuFoldSubviewOpsPass with pattern FoldSubviewIntoGatherToLDSOp
  • Adds corresponding folding test
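For illustration, here is a minimal before/after sketch of the zero-offset case, mirroring the new test (value names are illustrative only):

  // Before: the gather reads through a zero-offset subview.
  %subview = memref.subview %mem[0, 0][32, 64][1, 1]
      : memref<64x128xf16> to memref<32x64xf16, strided<[128, 1]>>
  amdgpu.gather_to_lds %subview[%i, %j], %alloc[%c0, %c0]
      : vector<8xf16>, memref<32x64xf16, strided<[128, 1]>>, memref<64x64xf16, 3>

  // After: the subview is folded away and the base memref is addressed directly.
  amdgpu.gather_to_lds %mem[%i, %j], %alloc[%c0, %c0]
      : vector<8xf16>, memref<64x128xf16>, memref<64x64xf16, 3>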

@lialan requested a review from Copilot July 21, 2025 16:59

Copilot: This comment was marked as outdated.

@lialan requested a review from Copilot July 21, 2025 17:07

Copilot: This comment was marked as outdated.

@lialan requested a review from Copilot July 21, 2025 18:21

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR implements a new optimization pass to fold memref.subview operations into amdgpu.gather_to_lds operations, which can simplify IR and improve performance by eliminating intermediate subview operations.

  • Adds AmdgpuFoldSubviewOpsPass with pattern FoldSubviewIntoGatherToLDSOp that identifies and folds subview sources
  • Implements index resolution using affine maps to adjust indices when folding subviews with offsets (sketched after this list)
  • Adds comprehensive test coverage for both zero-offset and non-zero offset subview folding scenarios
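For the non-zero offset case, each source index is rewritten as offset + stride * index; a sketch of the expected output for a subview at offsets [32, 64] with unit strides (illustrative, matching the test in this PR):

  %idx0 = affine.apply affine_map<()[s0] -> (s0 + 32)>()[%offset_i]
  %idx1 = affine.apply affine_map<()[s0] -> (s0 + 64)>()[%offset_j]
  amdgpu.gather_to_lds %mem[%idx0, %idx1], %alloc[%c0, %c0]
      : vector<8xf16>, memref<64x128xf16>, memref<64x64xf16, 3>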

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

mlir/lib/Dialect/AMDGPU/Transforms/FoldSubviewOps.cpp: Core implementation of the folding pass and pattern
mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td: Pass definition and documentation
mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h: Pass declarations and pattern population function
mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt: Build system integration for new source file
mlir/test/Dialect/AMDGPU/amdgpu-fold-subviews.mlir: Test cases validating the folding optimization


github-actions bot commented Jul 21, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@lialan marked this pull request as ready for review July 21, 2025 18:49
@lialan requested review from krzysz00 and qedawkins July 21, 2025 18:50
@llvmbot (Member) commented Jul 21, 2025

@llvm/pr-subscribers-mlir-memref

@llvm/pr-subscribers-mlir-gpu

Author: Alan Li (lialan)

Changes

This PR adds a new optimization pass to fold memref.subview operations into amdgpu.gather_to_lds operations, simplifying the overall operation and potentially improving performance. The pass identifies when a GatherToLDSOp has a memref.subview as its source and attempts to fold the subview by adjusting the indices accordingly.

  • Implements a new pass AmdgpuFoldSubviewOpsPass with pattern FoldSubviewIntoGatherToLDSOp
  • Adds corresponding folding test

Full diff: https://github.com/llvm/llvm-project/pull/149851.diff

5 Files Affected:

  • (modified) mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h (+5-1)
  • (modified) mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td (+12)
  • (modified) mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt (+2-1)
  • (added) mlir/lib/Dialect/AMDGPU/Transforms/FoldSubviewOps.cpp (+67)
  • (added) mlir/test/Dialect/AMDGPU/amdgpu-fold-subviews.mlir (+50)
diff --git a/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h b/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h
index cc2f543e79f69..a61903609aaff 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h
+++ b/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.h
@@ -22,8 +22,9 @@ class ConversionTarget;
 namespace amdgpu {
 
 #define GEN_PASS_DECL_AMDGPUEMULATEATOMICSPASS
-#define GEN_PASS_DECL_AMDGPURESOLVESTRIDEDMETADATAPASS
+#define GEN_PASS_DECL_AMDGPUFOLDSUBVIEWOPSPASS
 #define GEN_PASS_DECL_AMDGPUMASKEDLOADTOLOADPASS
+#define GEN_PASS_DECL_AMDGPURESOLVESTRIDEDMETADATAPASS
 #define GEN_PASS_REGISTRATION
 #include "mlir/Dialect/AMDGPU/Transforms/Passes.h.inc"
 
@@ -38,6 +39,9 @@ void populateAmdgpuResolveStridedMetadataPatterns(RewritePatternSet &patterns,
 void populateAmdgpuMaskedloadToLoadPatterns(RewritePatternSet &patterns,
                                             PatternBenefit benefit = 1);
 
+void populateAmdgpuFoldSubviewOpsPatterns(RewritePatternSet &patterns,
+                                          PatternBenefit benefit = 1);
+
 } // namespace amdgpu
 } // namespace mlir
 
diff --git a/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td b/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td
index 8d0e6829ab0cc..fad939ced9877 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/Transforms/Passes.td
@@ -70,4 +70,16 @@ def AmdgpuMaskedloadToLoadPass : Pass<"amdgpu-maskedload-to-load"> {
     "memref::MemRefDialect"
   ];
 }
+
+def AmdgpuFoldSubviewOpsPass : Pass<"amdgpu-fold-subview-ops"> {
+  let summary = "Fold subview operations into their parent operations";
+  let description = [{
+    This pass identifies `memref.subview` sources of `GatherToLDSOp` and
+    attempts to fold the source ops, potentially simplifying the overall
+    operation and improving performance.
+  }];
+  let dependentDialects = [
+    "memref::MemRefDialect"
+  ];
+}
 #endif // MLIR_DIALECT_AMDGPU_TRANSFORMS_PASSES_TD_
diff --git a/mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt b/mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt
index 17bbe54ea6c0c..20621ec0d55a4 100644
--- a/mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/AMDGPU/Transforms/CMakeLists.txt
@@ -1,7 +1,8 @@
 add_mlir_dialect_library(MLIRAMDGPUTransforms
   EmulateAtomics.cpp
-  ResolveStridedMetadata.cpp
+  FoldSubviewOps.cpp
   MaskedloadToLoad.cpp
+  ResolveStridedMetadata.cpp
 
   ADDITIONAL_HEADER_DIRS
   {$MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/AMDGPU/Transforms
diff --git a/mlir/lib/Dialect/AMDGPU/Transforms/FoldSubviewOps.cpp b/mlir/lib/Dialect/AMDGPU/Transforms/FoldSubviewOps.cpp
new file mode 100644
index 0000000000000..adbdf4b856bd5
--- /dev/null
+++ b/mlir/lib/Dialect/AMDGPU/Transforms/FoldSubviewOps.cpp
@@ -0,0 +1,67 @@
+//===- FoldSubviewOps.cpp - AMDGPU fold subview ops ---------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Dialect/AMDGPU/Transforms/Passes.h"
+
+#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
+#include "mlir/Dialect/Affine/ViewLikeInterfaceUtils.h"
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
+#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
+
+namespace mlir::amdgpu {
+#define GEN_PASS_DEF_AMDGPUFOLDSUBVIEWOPSPASS
+#include "mlir/Dialect/AMDGPU/Transforms/Passes.h.inc"
+} // namespace mlir::amdgpu
+
+using namespace mlir;
+using namespace mlir::amdgpu;
+
+namespace {
+struct AmdgpuFoldSubviewOpsPass
+    : public amdgpu::impl::AmdgpuFoldSubviewOpsPassBase<
+          AmdgpuFoldSubviewOpsPass> {
+  void runOnOperation() override {
+    RewritePatternSet patterns(&getContext());
+    populateAmdgpuFoldSubviewOpsPatterns(patterns);
+    if (failed(applyPatternsGreedily(getOperation(), std::move(patterns))))
+      signalPassFailure();
+  }
+};
+
+struct FoldSubviewIntoGatherToLDSOp : public OpRewritePattern<GatherToLDSOp> {
+  using OpRewritePattern<GatherToLDSOp>::OpRewritePattern;
+  LogicalResult matchAndRewrite(GatherToLDSOp op,
+                                PatternRewriter &rewriter) const override {
+    Location loc = op.getLoc();
+
+    // Check if the source is a subview operation:
+    auto subviewOp = dyn_cast<memref::SubViewOp>(op.getSrc().getDefiningOp());
+    if (!subviewOp)
+      return rewriter.notifyMatchFailure(
+          loc, "GatherToLDSOp folding is currently supported only when the "
+               "source is a SubviewOp. This is one specific pattern, and other "
+               "scenarios may be added in the future.");
+
+    SmallVector<Value> sourceIndices;
+    mlir::affine::resolveIndicesIntoOpWithOffsetsAndStrides(
+        rewriter, loc, subviewOp.getMixedOffsets(), subviewOp.getMixedStrides(),
+        subviewOp.getDroppedDims(), op.getSrcIndices(), sourceIndices);
+
+    rewriter.replaceOpWithNewOp<GatherToLDSOp>(
+        op, subviewOp.getSource(), sourceIndices, op.getDst(),
+        op.getDstIndices(), op.getTransferType());
+
+    return success();
+  }
+};
+} // namespace
+
+void mlir::amdgpu::populateAmdgpuFoldSubviewOpsPatterns(
+    RewritePatternSet &patterns, PatternBenefit benefit) {
+  patterns.add<FoldSubviewIntoGatherToLDSOp>(patterns.getContext(), benefit);
+}
diff --git a/mlir/test/Dialect/AMDGPU/amdgpu-fold-subviews.mlir b/mlir/test/Dialect/AMDGPU/amdgpu-fold-subviews.mlir
new file mode 100644
index 0000000000000..d582991c3622f
--- /dev/null
+++ b/mlir/test/Dialect/AMDGPU/amdgpu-fold-subviews.mlir
@@ -0,0 +1,50 @@
+// RUN: mlir-opt -amdgpu-fold-subview-ops -split-input-file %s | FileCheck %s
+
+#gpu_lds_addrspace = 3
+
+// CHECK: func @test_memref
+// CHECK-SAME: %[[ARG0:.*]]: index, %[[ARG1:.*]]: index
+func.func @test_memref(%offset_i: index, %offset_j: index) {
+  // CHECK: %[[C0:.*]] = arith.constant 0 : index
+  // CHECK: %[[LOCAL:.*]] = memref.alloc() : memref<64x64xf16, 3>
+  // CHECK: %[[MEM:.*]] = memref.alloc() : memref<64x128xf16>
+  // CHECK:  %[[MEM]][%arg0, %arg1], %[[LOCAL]][%[[C0]], %[[C0]]]
+  // CHECK-SAME: vector<8xf16>, memref<64x128xf16>, memref<64x64xf16, 3>
+
+  %alloc = memref.alloc() : memref<64x64xf16, #gpu_lds_addrspace>
+  %mem = memref.alloc() : memref<64x128xf16>
+  %subview = memref.subview %mem[0, 0][32, 64][1, 1] : memref<64x128xf16> to memref<32x64xf16, strided<[128, 1]>>
+  %c0 = arith.constant 0 : index
+  amdgpu.gather_to_lds %subview[%offset_i, %offset_j], %alloc[%c0, %c0]
+    : vector<8xf16>, memref<32x64xf16, strided<[128, 1]>>, memref<64x64xf16, #gpu_lds_addrspace>
+  func.return
+}
+
+// -----
+
+#gpu_lds_addrspace = 3
+
+// CHECK: #[[MAP:.*]] = affine_map<()[s0] -> (s0 + 32)>
+// CHECK: #[[MAP1:.*]] = affine_map<()[s0] -> (s0 + 64)>
+
+// CHECK: func @subview_folding_offset
+// CHECK-SAME: %[[ARG0:.*]]: index, %[[ARG1:.*]]: index
+func.func @subview_folding_offset(%offset_i: index, %offset_j: index) {
+  // CHECK: %[[C0:.*]] = arith.constant 0 : index
+  // CHECK: %[[LOCAL:.*]] = memref.alloc() : memref<64x64xf16, 3>
+  // CHECK: %[[MEM:.*]] = memref.alloc() : memref<64x128xf16>
+
+  // CHECK: %[[IDX0:.*]] = affine.apply #[[MAP]]()[%[[ARG0]]]
+  // CHECK: %[[IDX1:.*]] = affine.apply #[[MAP1]]()[%[[ARG1]]]
+
+  // CHECK:  %[[MEM]][%[[IDX0]], %[[IDX1]]], %[[LOCAL]][%[[C0]], %[[C0]]]
+  // CHECK-SAME: vector<8xf16>, memref<64x128xf16>, memref<64x64xf16, 3>
+
+  %alloc = memref.alloc() : memref<64x64xf16, #gpu_lds_addrspace>
+  %mem = memref.alloc() : memref<64x128xf16>
+  %subview = memref.subview %mem[32, 64][32, 64][1, 1] : memref<64x128xf16> to memref<32x64xf16, strided<[128, 1], offset: 4160>>
+  %c0 = arith.constant 0 : index
+  amdgpu.gather_to_lds %subview[%offset_i, %offset_j], %alloc[%c0, %c0]
+    : vector<8xf16>, memref<32x64xf16, strided<[128, 1], offset: 4160>>, memref<64x64xf16, #gpu_lds_addrspace>
+  func.return
+}

@llvmbot (Member) commented Jul 21, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-amdgpu

@krzysz00 (Contributor)

High-level comment: this doesn't need to be a new pass. Just go add logic to FoldMemRefAliasOps to handle this one

@krzysz00 (Contributor)

(That is, tentative reject for being overcomplicated)

(This'll all work nicely if I ever have the time to get my interfaces for DMA-like ops and load/store-like ops up, but for now, this goes in with the rest of the subview folding logic)

@qedawkins (Contributor) left a comment


Thanks for adding the populate function! We'll want to add a FoldMemRefAliasOps pass variant that runs these patterns so we aren't applying them on their own, and that'll help.

auto subviewOp = dyn_cast<memref::SubViewOp>(op.getSrc().getDefiningOp());
if (!subviewOp)
return rewriter.notifyMatchFailure(
loc, "GatherToLDSOp folding is currently supported only when the "

Most of the time the producer won't be foldable so just a simple message clarifying that the producer isn't a subview is good enough and won't clutter --debug output as much (although that's kind of a lost cause anyway). As we add support for folding more ops (expand/collapse_shape) we can expand the message.

Suggested change
loc, "GatherToLDSOp folding is currently supported only when the "
loc, "non-subview source producer."

@lialan (Member, Author) commented Jul 21, 2025

High-level comment: this doesn't need to be a new pass. Just go add logic to FoldMemRefAliasOps to handle this one

Ahh they have NVGPU patterns in the same file so I think it is okay to just do it there. I will move things there.

@qedawkins (Contributor)

High-level comment: this doesn't need to be a new pass. Just go add logic to FoldMemRefAliasOps to handle this one

Ahh they have NVGPU patterns in the same file so I think it is okay to just do it there. I will move things there.

No, we should keep it out. The NVGPU patterns should not be there either.

@qedawkins (Contributor) left a comment


I disagree. MemRef transforms should not depend on target specific dialects like NVGPU and AMDGPU. Repeating the mistake for AMD doesn't help improve the poor state of dependency management.

@lialan (Member, Author) commented Jul 21, 2025

I disagree. MemRef transforms should not depend on target specific dialects like NVGPU and AMDGPU. Repeating the mistake for AMD doesn't help improve the poor state of dependency management.

I actually think you have a point: by design, folding a source op into a consumer requires the consumer op's dependencies. Isolating it to AMDGPU avoids polluting other target pipelines.

@krzysz00 @kuhar we need a consensus here.

@krzysz00 (Contributor)

Re FoldMemRefSubviewOps ... there should be an interface there.

main...krzysz00:llvm-project:indexed-access-op-interface is a very rough, not-done-yet draft definition of said interface.

However, since there isn't an interface, I think that the subview folder should go in FoldMemRefAliasOps to match the existing precedent of NVGPU.

Or, if we really don't want to do that, there should be comments indicating that this pass is temporary until FoldMemRefAliasOps is generalized.

@krzysz00 (Contributor)

I'm trying to avoid code duplication with FoldMemRefAliasOps by just copying the relevant NVGPU pattern, and think that makes more sense than a wholly separate pass/pattern

@qedawkins (Contributor)

I'm trying to avoid code duplication with FoldMemRefAliasOps by just copying the relevant NVGPU pattern, and think that makes more sense than a wholly separate pass/pattern

The pattern here doesn't share any code with the existing stuff, so I don't see much code duplication happening. I see the pass added here as solely for testing purposes. Then the users of these patterns can decide which dialects to populate from. An interface makes sense if we want a version of FoldMemRefAliasOps that applies independently of the target, but there is no example pipeline in-tree to reference for designing the interface, AFAICT.

@Hardcode84 (Contributor)

Instead of a dedicated pass, you could register the populate* function with the transform dialect and test it using the transform interpreter.
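A rough sketch of how that could look, assuming the patterns were exposed through a hypothetical transform.apply_patterns.amdgpu.fold_subview_ops descriptor (not something this PR adds):

  module attributes {transform.with_named_sequence} {
    transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
      %f = transform.structured.match ops{["func.func"]} in %root
          : (!transform.any_op) -> !transform.any_op
      transform.apply_patterns to %f {
        // Hypothetical descriptor that would be backed by
        // populateAmdgpuFoldSubviewOpsPatterns.
        transform.apply_patterns.amdgpu.fold_subview_ops
      } : !transform.any_op
      transform.yield
    }
  }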

@krzysz00 (Contributor) commented Jul 21, 2025

The pattern here doesn't share any code with the existing stuff

It should though. This should just be a copy of the code for the NVGPU async load ops handling

... and FoldMemRefAliasOps (and DecomposeMemRefOps and the linearizer) are the examples for that interface

@qedawkins dismissed their stale review July 21, 2025 23:02

Ok I'm going to bow out here since I don't want to hold a line on code I'm not directly maintaining, but better dependency management would be appreciated in general.

@kuhar (Member) left a comment


I have similar concerns to @qedawkins -- it seems weird for memref transforms to know about amdgpu ops. This entails a build-system dep (for bazel, cmake is not that strict) and seems like a future headache

Location loc = op.getLoc();

// Check if the source is a subview operation:
auto subviewOp = dyn_cast<memref::SubViewOp>(op.getSrc().getDefiningOp());

Suggested change
auto subviewOp = dyn_cast<memref::SubViewOp>(op.getSrc().getDefiningOp());
auto subviewOp = op.getSrc().getDefiningOp<memref::SubViewOp>();

@krzysz00 (Contributor)

Ok, given the general consensus from everyone else ... I can see why people are trying to avoid the build system coupling issues. Let's go with adding an AMDGPU-specific pass and populate function, especially since there's already precedent for that sort of split.

We can refactor this later and it'll provide good motivation for an interface

@krzysz00 (Contributor)

Sorry about all the back-and-forth here.

@krzysz00 (Contributor)

... But also, if we're creating a new file etc., can we go ahead and handle memref.expand_shape and memref.collapse_shape folding?
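For reference, the kind of input such an extension would need to handle, sketched with hypothetical shapes (the rewrite would also have to linearize the expanded indices, which this PR does not do):

  %expanded = memref.expand_shape %mem [[0, 1], [2]] output_shape [8, 8, 128]
      : memref<64x128xf16> into memref<8x8x128xf16>
  amdgpu.gather_to_lds %expanded[%i, %j, %k], %alloc[%c0, %c0]
      : vector<8xf16>, memref<8x8x128xf16>, memref<64x64xf16, 3>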
