
[RISCV][TTI] Enable masked interleave access #151665


Merged
merged 4 commits into llvm:main from Mel-Chen:reland-enable-masked-interleave on Aug 5, 2025

Conversation

@Mel-Chen (Contributor) commented Aug 1, 2025

Now that support for masked loads/stores of interleave groups has landed, we can enable the loop vectorizer to generate masked interleave access where applicable.

This improves vectorization in several ways:

  • Internal predication support: This enables interleave group vectorization for loops with internal control flow predication, provided all members of the group share the same predicate. Gaps in interleave groups are still not efficiently handled by masking, so masking for gaps remains disabled for now.
  • Tail folding: This allows tail folding of loops with interleave groups by using masking. Without this, vectorized loops with interleaves would fall back to using separate gather/scatter accesses, which can be significantly less efficient.

"[RISCV][TTI] Enable masked interleave access for scalable vector (#149981)" was reverted by 5294793 due to triggering an assertion. The issue has been addressed in the patch "[LV] Fix gap mask requirement for interleaved access (#151105)". On the other hand, this patch also enable fixed-length masked interleave access (#150624) since support for fixed-length has also been landed 992118c.

@llvmbot (Member) commented Aug 1, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-backend-risc-v

Author: Mel Chen (Mel-Chen)

Changes

The previous attempt was reverted in 5294793 because it triggered an assertion. The issue has been addressed by the patch "[LV] Fix gap mask requirement for interleaved access" (#151105).

Now that support for masked loads/stores of interleave groups has landed, we can enable the loop vectorizer to generate masked interleave access where applicable.

This improves vectorization in several ways:

  • Internal predication support: This enables interleave group vectorization for loops with internal control flow predication, provided all members of the group share the same predicate. Gaps in interleave groups are still not efficiently handled by masking, so masking for gaps remains disabled for now.
  • Tail folding: This allows tail folding of loops with interleave groups by using masking. Without this, vectorized loops with interleaves would fall back to using separate gather/scatter accesses, which can be significantly less efficient.
  • Scalable vector support: Currently, only scalable vector types are supported for masked interleave lowering. Fixed-length vector support will be enabled in the future.

As interleaved accesses are not yet supported with EVL tail folding, that combination is temporarily disabled; a follow-up patch will add support for it.


Patch is 57.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151665.diff

5 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+6-4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+2)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll (+152-182)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/tail-folding-interleave.ll (+23-33)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 0d5eb86bf899c..a49a1291e0507 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -979,10 +979,12 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
     Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
     bool UseMaskForCond, bool UseMaskForGaps) const {
 
-  // The interleaved memory access pass will lower interleaved memory ops (i.e
-  // a load and store followed by a specific shuffle) to vlseg/vsseg
-  // intrinsics.
-  if (!UseMaskForCond && !UseMaskForGaps &&
+  // The interleaved memory access pass will lower (de)interleave ops combined
+  // with an adjacent appropriate memory to vlseg/vsseg intrinsics. vlseg/vsseg
+  // only support masking per-iteration (i.e. condition), not per-segment (i.e.
+  // gap).
+  // TODO: Support masked interleaved access for fixed length vector.
+  if ((isa<ScalableVectorType>(VecTy) || !UseMaskForCond) && !UseMaskForGaps &&
       Factor <= TLI->getMaxSupportedInterleaveFactor()) {
     auto *VTy = cast<VectorType>(VecTy);
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(VTy);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index d62d99cf31899..05d504cbcb6bb 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -398,6 +398,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {
 
   bool enableInterleavedAccessVectorization() const override { return true; }
 
+  bool enableMaskedInterleavedAccessVectorization() const override {
+    return ST->hasVInstructions();
+  }
+
   unsigned getMinTripCountTailFoldingThreshold() const override;
 
   enum RISCVRegisterClass { GPRRC, FPRRC, VRRC };
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 850c4a11edc67..583f8b48a1583 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1378,7 +1378,9 @@ class LoopVectorizationCostModel {
       return;
     // Override EVL styles if needed.
     // FIXME: Investigate opportunity for fixed vector factor.
+    // FIXME: Support interleave accesses.
     bool EVLIsLegal = UserIC <= 1 && IsScalableVF &&
+                      !InterleaveInfo.hasGroups() &&
                       TTI.hasActiveVectorLength() && !EnableVPlanNativePath;
     if (EVLIsLegal)
       return;
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
index 6f20376e08d85..2dc6f806458f0 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
@@ -30,26 +30,25 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; SCALAR_EPILOGUE-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP5]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; SCALAR_EPILOGUE-NEXT:    [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
-; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP9]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; SCALAR_EPILOGUE-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; SCALAR_EPILOGUE-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP12]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; SCALAR_EPILOGUE-NEXT:    [[TMP13:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; SCALAR_EPILOGUE-NEXT:    [[TMP14:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP14]]
-; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP13]], <vscale x 16 x ptr> [[TMP15]], i32 1, <vscale x 16 x i1> [[TMP6]])
-; SCALAR_EPILOGUE-NEXT:    [[TMP16:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP13]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP17:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP17]]
-; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP16]], <vscale x 16 x ptr> [[TMP18]], i32 1, <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP7:%.*]] = shl i32 [[INDEX]], 1
+; SCALAR_EPILOGUE-NEXT:    [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
+; SCALAR_EPILOGUE-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP8]]
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP9]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; SCALAR_EPILOGUE-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP10:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 0
+; SCALAR_EPILOGUE-NEXT:    [[TMP11:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 1
+; SCALAR_EPILOGUE-NEXT:    [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[TMP10]], <vscale x 16 x i8> [[TMP11]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP13:%.*]] = sext i32 [[TMP7]] to i64
+; SCALAR_EPILOGUE-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[Q]], i64 [[TMP13]]
+; SCALAR_EPILOGUE-NEXT:    [[TMP15:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP12]]
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.vector.interleave2.nxv32i8(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x i8> [[TMP15]])
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_MASK3:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.store.nxv32i8.p0(<vscale x 32 x i8> [[INTERLEAVED_VEC]], ptr [[TMP14]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK3]])
 ; SCALAR_EPILOGUE-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP4]]
 ; SCALAR_EPILOGUE-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; SCALAR_EPILOGUE-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; SCALAR_EPILOGUE-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; SCALAR_EPILOGUE-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; SCALAR_EPILOGUE:       middle.block:
 ; SCALAR_EPILOGUE-NEXT:    [[CMP_N:%.*]] = icmp eq i32 [[N_MOD_VF]], 0
 ; SCALAR_EPILOGUE-NEXT:    br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
@@ -80,26 +79,25 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; PREDICATED_DATA-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 [[INDEX]], i32 1024)
 ; PREDICATED_DATA-NEXT:    [[TMP5:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
 ; PREDICATED_DATA-NEXT:    [[TMP6:%.*]] = select <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i1> [[TMP5]], <vscale x 16 x i1> zeroinitializer
-; PREDICATED_DATA-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; PREDICATED_DATA-NEXT:    [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_DATA-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
-; PREDICATED_DATA-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP9]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; PREDICATED_DATA-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; PREDICATED_DATA-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_DATA-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; PREDICATED_DATA-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP12]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; PREDICATED_DATA-NEXT:    [[TMP13:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; PREDICATED_DATA-NEXT:    [[TMP14:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_DATA-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP14]]
-; PREDICATED_DATA-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP13]], <vscale x 16 x ptr> [[TMP15]], i32 1, <vscale x 16 x i1> [[TMP6]])
-; PREDICATED_DATA-NEXT:    [[TMP16:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP13]]
-; PREDICATED_DATA-NEXT:    [[TMP17:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_DATA-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP17]]
-; PREDICATED_DATA-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP16]], <vscale x 16 x ptr> [[TMP18]], i32 1, <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_DATA-NEXT:    [[TMP7:%.*]] = shl i32 [[INDEX]], 1
+; PREDICATED_DATA-NEXT:    [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
+; PREDICATED_DATA-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP8]]
+; PREDICATED_DATA-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_DATA-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP9]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; PREDICATED_DATA-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; PREDICATED_DATA-NEXT:    [[TMP10:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 0
+; PREDICATED_DATA-NEXT:    [[TMP11:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 1
+; PREDICATED_DATA-NEXT:    [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[TMP10]], <vscale x 16 x i8> [[TMP11]])
+; PREDICATED_DATA-NEXT:    [[TMP13:%.*]] = sext i32 [[TMP7]] to i64
+; PREDICATED_DATA-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[Q]], i64 [[TMP13]]
+; PREDICATED_DATA-NEXT:    [[TMP15:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP12]]
+; PREDICATED_DATA-NEXT:    [[INTERLEAVED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.vector.interleave2.nxv32i8(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x i8> [[TMP15]])
+; PREDICATED_DATA-NEXT:    [[INTERLEAVED_MASK3:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_DATA-NEXT:    call void @llvm.masked.store.nxv32i8.p0(<vscale x 32 x i8> [[INTERLEAVED_VEC]], ptr [[TMP14]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK3]])
 ; PREDICATED_DATA-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP3]]
 ; PREDICATED_DATA-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
-; PREDICATED_DATA-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; PREDICATED_DATA-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; PREDICATED_DATA-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; PREDICATED_DATA-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; PREDICATED_DATA:       middle.block:
 ; PREDICATED_DATA-NEXT:    br label [[FOR_END:%.*]]
 ; PREDICATED_DATA:       scalar.ph:
@@ -107,44 +105,49 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; PREDICATED_DATA-WITH-EVL-LABEL: define void @masked_strided_factor2
 ; PREDICATED_DATA-WITH-EVL-SAME: (ptr noalias readonly captures(none) [[P:%.*]], ptr noalias captures(none) [[Q:%.*]], i8 zeroext [[GUARD:%.*]]) #[[ATTR0:[0-9]+]] {
 ; PREDICATED_DATA-WITH-EVL-NEXT:  entry:
-; PREDICATED_DATA-WITH-EVL-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; PREDICATED_DATA-WITH-EVL:       vector.ph:
 ; PREDICATED_DATA-WITH-EVL-NEXT:    [[CONV:%.*]] = zext i8 [[GUARD]] to i32
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ugt i32 [[TMP0]], 64
+; PREDICATED_DATA-WITH-EVL-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; PREDICATED_DATA-WITH-EVL:       vector.ph:
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP1:%.*]] = call i32 @llvm.vscale.i32()
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP2:%.*]] = shl nuw i32 [[TMP1]], 4
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[N_MOD_VF:%.*]] = urem i32 1024, [[TMP2]]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[N_VEC:%.*]] = sub nuw nsw i32 1024, [[N_MOD_VF]]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP3:%.*]] = call i32 @llvm.vscale.i32()
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP4:%.*]] = shl nuw i32 [[TMP3]], 4
 ; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[CONV]], i64 0
 ; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP0:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP5:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP4]], i64 0
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
 ; PREDICATED_DATA-WITH-EVL-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; PREDICATED_DATA-WITH-EVL:       vector.body:
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[EVL_BASED_IV:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP0]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[AVL:%.*]] = phi i32 [ 1024, [[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP1:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 16, i1 true)
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP1]], i64 0
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP2:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP3:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP4:%.*]] = zext nneg <vscale x 16 x i32> [[TMP3]] to <vscale x 16 x i64>
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP4]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP5]], <vscale x 16 x i1> [[TMP2]], i32 [[TMP1]])
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP6:%.*]] = or disjoint <vscale x 16 x i32> [[TMP3]], splat (i32 1)
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP7:%.*]] = zext nneg <vscale x 16 x i32> [[TMP6]] to <vscale x 16 x i64>
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP7]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP8]], <vscale x 16 x i1> [[TMP2]], i32 [[TMP1]])
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP9:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP10:%.*]] = zext nneg <vscale x 16 x i32> [[TMP3]] to <vscale x 16 x i64>
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP10]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP9]], <vscale x 16 x ptr> align 1 [[TMP11]], <vscale x 16 x i1> [[TMP2]], i32 [[TMP1]])
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP12:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP9]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP13:%.*]] = zext nneg <vscale x 16 x i32> [[TMP6]] to <vscale x 16 x i64>
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP13]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x ptr> align 1 [[TMP14]], <vscale x 16 x i1> [[TMP2]], i32 [[TMP1]])
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP1]], [[EVL_BASED_IV]]
-; PREDICATED_DATA-WITH-EVL-NEXT:    [[AVL_NEXT]] = sub nuw i32 [[AVL]], [[TMP1]]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP5]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP7:%.*]] = shl i32 [[INDEX]], 1
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP8]]
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP9]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; PREDICATED_DATA-WITH-EVL-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; PREDICATED_D...
[truncated]

// only support masking per-iteration (i.e. condition), not per-segment (i.e.
// gap).
// TODO: Support masked interleaved access for fixed length vector.
if ((isa<ScalableVectorType>(VecTy) || !UseMaskForCond) && !UseMaskForGaps &&
Contributor:

Fixed vectors should be supported now; I don't think we need this check: https://github.com/llvm/llvm-project/pull/150624/files

Contributor Author:
Yes, I saw it. @preames Do you mind if I enable both scalable and fixed-length masked interleaved access in this patch?
305e75f
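
For reference, once fixed-length masked interleave lowering is relied upon as well, the scalable-only guard could presumably be dropped again, leaving only the gap-mask and interleave-factor checks. A rough sketch of that relaxed condition in getInterleavedMemoryOpCost, offered as an assumption about what the follow-up commit does rather than its verbatim contents:

// Sketch (assumption, not the exact contents of 305e75f): with fixed-length
// masked segment accesses also supported, only gap masking and the interleave
// factor still disqualify the vlseg/vsseg costing path.
if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor()) {
  auto *VTy = cast<VectorType>(VecTy);
  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(VTy);
  // ... cost the group as one segment load/store per legalized part ...
}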

bool EVLIsLegal = UserIC <= 1 && IsScalableVF &&
!InterleaveInfo.hasGroups() &&
Contributor:
I don't think we need to bail on EVL anymore either; #150202 has landed and should fix the concern about the header mask.

Contributor Author:
Hmm, I can't really think of any more cases that might cause issues. Hopefully it's safe enough. (pray)
767de32
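
In other words, removing the interleave-group bail-out restores the legality check to its earlier shape. A sketch of the expected condition after that commit, as an assumption based on the diff above rather than on the commit itself:

// Sketch (assumption): with #150202 fixing the header-mask concern, EVL
// legality no longer needs to bail on loops containing interleave groups.
bool EVLIsLegal = UserIC <= 1 && IsScalableVF &&
                  TTI.hasActiveVectorLength() && !EnableVPlanNativePath;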

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Aug 1, 2025
We have been tracking the performance of EVL tail folding in the loop vectorizer on RISC-V for a while now, and after much hard work from various contributors we think it should now be generally profitable to enable by default.

With tail folding there is a 21% improvement on 525.x264_r on SPEC CPU 2017 on the BPI-F3 (-march=rva22u64_v -O3 -flto), as well as a 30% geomean codesize reduction on SPEC and TSVC, with no significant regressions detected: https://lnt.lukelau.me/db_default/v4/nts/805?compare_to=804

Now that we are early in the LLVM 22.x development cycle, it seems like a good time to enable it and catch any issues. There are still more EVL-related items of work being tracked in llvm#123069, which should continue to improve performance.

For the test diffs in this PR, I've avoided changing RUN lines where possible so that the tests accurately reflect the default configuration.
Most test diffs are what we would expect from turning on tail folding, but I've gone through and taken some notes on the interesting changes:

- blend-any-of-reduction-cost.ll: This test was added to fix a legacy and vplan cost model mismatch assert. The costing path is still being exercised, but the VPlan in the second test is no longer considered profitable.
- fminimumnum.ll: Most tests fall back to a scalar epilogue because of a failing EVL legality check. I don't know why the umax changed from 15 to 16, but it shouldn't make a difference if the minimum trip count is 4096 (which seems way too high to begin with?)
- interleaved-accesses.ll: These tests fall back to a gather/scatter but should be fixed soon by llvm#151665
- partial-reduce.ll: Given that partial reductions bail out with EVL tail folding, I've disabled tail folding in these tests for now and left a TODO. Tracked by llvm#151438
- short-trip-count.ll: These are no longer vectorized as they're below TTI().getMinTripCountTailFoldingThreshold, which was set to 5 in llvm#132176. Previously we continued to vectorize even these low trip counts if they were a known power of 2, due to existing logic in the loop vectorizer overriding the tail folding style to CM_ScalarEpilogueNotAllowedLowTripLoop whenever there was no tail folding. I believe this is now working as intended, even though the loop is no longer vectorized.
Mel-Chen and others added 3 commits August 4, 2025 02:30
Reland "[RISCV][TTI] Enable masked interleave access for scalable vector (llvm#149981)"

Now that support for masked loads/stores of interleave groups has landed, we can enable the loop vectorizer to generate masked interleave access where applicable.

This improves vectorization in several ways:
* Internal predication support: This enables interleave group vectorization for loops with internal control flow predication, provided all members of the group share the same predicate. Gaps in interleave groups are still not efficiently handled by masking, so masking for gaps remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups by using masking. Without this, vectorized loops with interleaves would fall back to using separate gather/scatter accesses, which can be significantly less efficient.
* Scalable vector support: Currently, only scalable vector types are supported for masked interleave lowering. Fixed-length vector support will be enabled in the future.

As interleave access is not yet supported with tail folding by EVL, that functionality is temporarily disabled. We are going to create another patch to support it.

Co-authored-by: Philip Reames <[email protected]>
@Mel-Chen force-pushed the reland-enable-masked-interleave branch from f03ffb1 to 767de32 on August 4, 2025, 10:01
@lukel97 (Contributor) left a comment

LGTM, I tested it out locally and can confirm it's generating masked vlseg/vsseg as expected from the loop vectorizer, both for internal predication and with EVL tail folding.

I think the last two restrictions in the PR description just need to be updated now that they've been removed.

@Mel-Chen changed the title from Reland "[RISCV][TTI] Enable masked interleave access for scalable vector (#149981)" to [RISCV][TTI] Enable masked interleave access on Aug 5, 2025
@Mel-Chen (Contributor Author) commented Aug 5, 2025

LGTM, I tested it out locally and can confirm it's generating masked vlseg/vsseg as expected from the loop vectorizer, both for internal predication and with EVL tail folding.

I think the last two restrictions in the PR description just need to be updated now that they've been removed.

Updated the title and description. Rebased and waiting for the test results.

@Mel-Chen merged commit 76d98cf into llvm:main on Aug 5, 2025
9 checks passed
@lukel97 (Contributor) commented Aug 7, 2025

This gives some nice improvements across llvm-test-suite; uuencode is 33% faster now, and there are some other big improvements to TSVC: https://lnt.lukelau.me/db_default/v4/nts/821

@Mel-Chen (Contributor Author) commented Aug 8, 2025

This gives some nice improvements across llvm-test-suite; uuencode is 33% faster now, and there are some other big improvements to TSVC: https://lnt.lukelau.me/db_default/v4/nts/821

Thanks to @preames for the help. :)
