[RISCV][TTI] Enable masked interleave vectorization #150074

Closed

Conversation

@preames (Collaborator) commented Jul 22, 2025

Now that masked.load/store support has landed for interleaved accesses, we can let the loop vectorizer generate masked interleave groups for the supported cases. At the moment, only scalable VFs are supported by the lowering; that restriction will be removed once the pending work on the shuffle-vector path has all landed.
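For orientation, here is the shape the vectorizer now emits for a factor-2 group, condensed from the SCALAR_EPILOGUE check lines in the test diff below (value names shortened); the InterleavedAccessPass then lowers this pattern to a masked vlseg2/vsseg2:

  %interleaved.mask = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> %mask, <vscale x 16 x i1> %mask)
  %wide = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr %base, i32 1, <vscale x 32 x i1> %interleaved.mask, <vscale x 32 x i8> poison)
  %parts = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> %wide)
  %v0 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } %parts, 0
  %v1 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } %parts, 1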

A couple things to highlight in terms of codegen.

First, this improves vectorization of non-tail-folded loops when the loop contains internal predication via control flow. It should only kick in when all elements of the interleave group are predicated together, since we still don't support using masking for gaps in the interleave group. (By support, I mean support efficiently; if such a group gets generated by some quirk of the cost model, we still lower it correctly.)
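As a concrete illustration, a minimal scalar loop with this shape (my own sketch, mirroring the masked_strided_factor2 test below: both members of the stride-2 group sit under the same condition, and there is no gap):

define void @guarded_pair(ptr noalias %p, ptr noalias %q, i32 %guard, i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %pred = icmp ugt i32 %i, %guard
  br i1 %pred, label %if, label %latch   ; internal predication via control flow
if:
  %base = shl nuw nsw i32 %i, 1          ; 2*i -> member 0 of the group
  %idx0 = zext nneg i32 %base to i64
  %g0 = getelementptr inbounds i8, ptr %p, i64 %idx0
  %a = load i8, ptr %g0
  %idx1 = add nuw nsw i64 %idx0, 1       ; 2*i+1 -> member 1, no gap
  %g1 = getelementptr inbounds i8, ptr %p, i64 %idx1
  %b = load i8, ptr %g1
  %m = call i8 @llvm.smax.i8(i8 %a, i8 %b)
  %s0 = getelementptr inbounds i8, ptr %q, i64 %idx0
  store i8 %m, ptr %s0
  %neg = sub i8 0, %m
  %s1 = getelementptr inbounds i8, ptr %q, i64 %idx1
  store i8 %neg, ptr %s1
  br label %latch
latch:
  %i.next = add nuw nsw i32 %i, 1
  %done = icmp eq i32 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
declare i8 @llvm.smax.i8(i8, i8)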

Second, this enables tail folding of loops with interleave groups. Without this change, any tail-folded loop is forced to use NF independent strided loads or stores. This closes a moderately large performance gap with the non-tail-folded version when the interleave group is otherwise unconditional.
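Roughly, the before/after for a factor-2 (NF = 2) group under tail folding, paraphrasing the check-line changes in the test diff below:

; Before: one masked gather per member (a strided access after lowering),
; plus the matching scatters on the store side:
;   %v0 = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> %addrs.even, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> poison)
;   %v1 = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> %addrs.odd,  i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> poison)
; After: a single masked wide load for the whole group, deinterleaved in registers:
;   %wide = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr %base, i32 1, <vscale x 32 x i1> %interleaved.mask, <vscale x 32 x i8> poison)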

Third, for the EVL predication cases, we are not EVL-predicating the masked interleaves. Instead, we hit the codegen for masked predication. This is still a huge improvement over the prior codegen, but it is also quite suboptimal. Rewriting interleave groups for EVL is a separate project in the vectorizer.
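For contrast, a hypothetical EVL-native form (my sketch of that future work, not what this patch emits; %evl2 is an assumed name for the factor-scaled EVL) would thread the explicit vector length through the wide access itself:

  ; %evl2 = 2 * %evl, since the wide vector covers both members of the group
  %wide = call <vscale x 32 x i8> @llvm.vp.load.nxv32i8.p0(ptr %base, <vscale x 32 x i1> %interleaved.mask, i32 %evl2)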

@llvmbot (Member) commented Jul 22, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes

Patch is 50.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/150074.diff

4 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+7-5)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll (+120-158)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-interleave.ll (+26-15)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 56ead92187b04..1ae14b5e565c3 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -979,12 +979,14 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
     Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
     bool UseMaskForCond, bool UseMaskForGaps) const {
 
-  // The interleaved memory access pass will lower interleaved memory ops (i.e
-  // a load and store followed by a specific shuffle) to vlseg/vsseg
-  // intrinsics.
-  if (!UseMaskForCond && !UseMaskForGaps &&
+  auto *VTy = cast<VectorType>(VecTy);
+
+  // The interleaved memory access pass will lower (de)interleave ops combined
+  // with an adjacent appropriate memory to vlseg/vsseg intrinsics.  We
+  // currently only support masking for the scalable path. vlseg/vsseg only
+  // support masking per-iteration (i.e. condition), not per-segment (i.e. gap).
+  if ((VTy->isScalableTy() || !UseMaskForCond) && !UseMaskForGaps &&
       Factor <= TLI->getMaxSupportedInterleaveFactor()) {
-    auto *VTy = cast<VectorType>(VecTy);
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(VTy);
     // Need to make sure type has't been scalarized
     if (LT.second.isVector()) {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 12bf8c1b4de70..24b148619c771 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -398,6 +398,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {
 
   bool enableInterleavedAccessVectorization() const override { return true; }
 
+  bool enableMaskedInterleavedAccessVectorization() const override {
+    return ST->hasVInstructions();
+  }
+
   unsigned getMinTripCountTailFoldingThreshold() const override;
 
   enum RISCVRegisterClass { GPRRC, FPRRC, VRRC };
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
index 45357dd6bf0d6..131ccca2fe298 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-masked-access.ll
@@ -30,26 +30,25 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; SCALAR_EPILOGUE-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP5]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; SCALAR_EPILOGUE-NEXT:    [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
-; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP9]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; SCALAR_EPILOGUE-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; SCALAR_EPILOGUE-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP12]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; SCALAR_EPILOGUE-NEXT:    [[TMP13:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; SCALAR_EPILOGUE-NEXT:    [[TMP14:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP14]]
-; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP13]], <vscale x 16 x ptr> [[TMP15]], i32 1, <vscale x 16 x i1> [[TMP6]])
-; SCALAR_EPILOGUE-NEXT:    [[TMP16:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP13]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP17:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP17]]
-; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP16]], <vscale x 16 x ptr> [[TMP18]], i32 1, <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP7:%.*]] = shl i32 [[INDEX]], 1
+; SCALAR_EPILOGUE-NEXT:    [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
+; SCALAR_EPILOGUE-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP8]]
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP9]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; SCALAR_EPILOGUE-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP10:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 0
+; SCALAR_EPILOGUE-NEXT:    [[TMP11:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 1
+; SCALAR_EPILOGUE-NEXT:    [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[TMP10]], <vscale x 16 x i8> [[TMP11]])
+; SCALAR_EPILOGUE-NEXT:    [[TMP13:%.*]] = sext i32 [[TMP7]] to i64
+; SCALAR_EPILOGUE-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[Q]], i64 [[TMP13]]
+; SCALAR_EPILOGUE-NEXT:    [[TMP15:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP12]]
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.vector.interleave2.nxv32i8(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x i8> [[TMP15]])
+; SCALAR_EPILOGUE-NEXT:    [[INTERLEAVED_MASK3:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; SCALAR_EPILOGUE-NEXT:    call void @llvm.masked.store.nxv32i8.p0(<vscale x 32 x i8> [[INTERLEAVED_VEC]], ptr [[TMP14]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK3]])
 ; SCALAR_EPILOGUE-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP4]]
 ; SCALAR_EPILOGUE-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; SCALAR_EPILOGUE-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; SCALAR_EPILOGUE-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; SCALAR_EPILOGUE-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; SCALAR_EPILOGUE:       middle.block:
 ; SCALAR_EPILOGUE-NEXT:    [[CMP_N:%.*]] = icmp eq i32 [[N_MOD_VF]], 0
 ; SCALAR_EPILOGUE-NEXT:    br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
@@ -80,26 +79,25 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; PREDICATED_TAIL_FOLDING-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 [[INDEX]], i32 1024)
 ; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP5:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
 ; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP6:%.*]] = select <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i1> [[TMP5]], <vscale x 16 x i1> zeroinitializer
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
-; PREDICATED_TAIL_FOLDING-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP9]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; PREDICATED_TAIL_FOLDING-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP12]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP13:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP14:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP14]]
-; PREDICATED_TAIL_FOLDING-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP13]], <vscale x 16 x ptr> [[TMP15]], i32 1, <vscale x 16 x i1> [[TMP6]])
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP16:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP13]]
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP17:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP17]]
-; PREDICATED_TAIL_FOLDING-NEXT:    call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP16]], <vscale x 16 x ptr> [[TMP18]], i32 1, <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP7:%.*]] = shl i32 [[INDEX]], 1
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP8]]
+; PREDICATED_TAIL_FOLDING-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_TAIL_FOLDING-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP9]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; PREDICATED_TAIL_FOLDING-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP10:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 0
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP11:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 1
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[TMP10]], <vscale x 16 x i8> [[TMP11]])
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP13:%.*]] = sext i32 [[TMP7]] to i64
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[Q]], i64 [[TMP13]]
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP15:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP12]]
+; PREDICATED_TAIL_FOLDING-NEXT:    [[INTERLEAVED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.vector.interleave2.nxv32i8(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x i8> [[TMP15]])
+; PREDICATED_TAIL_FOLDING-NEXT:    [[INTERLEAVED_MASK3:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP6]])
+; PREDICATED_TAIL_FOLDING-NEXT:    call void @llvm.masked.store.nxv32i8.p0(<vscale x 32 x i8> [[INTERLEAVED_VEC]], ptr [[TMP14]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK3]])
 ; PREDICATED_TAIL_FOLDING-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP3]]
 ; PREDICATED_TAIL_FOLDING-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
-; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; PREDICATED_TAIL_FOLDING-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; PREDICATED_TAIL_FOLDING-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; PREDICATED_TAIL_FOLDING-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; PREDICATED_TAIL_FOLDING:       middle.block:
 ; PREDICATED_TAIL_FOLDING-NEXT:    br label [[FOR_END:%.*]]
 ; PREDICATED_TAIL_FOLDING:       scalar.ph:
@@ -129,28 +127,29 @@ define void @masked_strided_factor2(ptr noalias nocapture readonly %p, ptr noali
 ; PREDICATED_EVL-NEXT:    [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 16, i1 true)
 ; PREDICATED_EVL-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP5]], i64 0
 ; PREDICATED_EVL-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
-; PREDICATED_EVL-NEXT:    [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; PREDICATED_EVL-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 1)
-; PREDICATED_EVL-NEXT:    [[TMP8:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_EVL-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP8]]
-; PREDICATED_EVL-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP9]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
-; PREDICATED_EVL-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; PREDICATED_EVL-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_EVL-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; PREDICATED_EVL-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 16 x i8> @llvm.vp.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> align 1 [[TMP12]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
-; PREDICATED_EVL-NEXT:    [[TMP13:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_GATHER]], <vscale x 16 x i8> [[WIDE_MASKED_GATHER3]])
-; PREDICATED_EVL-NEXT:    [[TMP14:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; PREDICATED_EVL-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP14]]
-; PREDICATED_EVL-NEXT:    call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP13]], <vscale x 16 x ptr> align 1 [[TMP15]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
-; PREDICATED_EVL-NEXT:    [[TMP16:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP13]]
-; PREDICATED_EVL-NEXT:    [[TMP17:%.*]] = zext nneg <vscale x 16 x i32> [[TMP10]] to <vscale x 16 x i64>
-; PREDICATED_EVL-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[Q]], <vscale x 16 x i64> [[TMP17]]
-; PREDICATED_EVL-NEXT:    call void @llvm.vp.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP16]], <vscale x 16 x ptr> align 1 [[TMP18]], <vscale x 16 x i1> [[TMP6]], i32 [[TMP5]])
+; PREDICATED_EVL-NEXT:    [[TMP6:%.*]] = icmp ult <vscale x 16 x i32> [[VEC_IND]], splat (i32 1024)
+; PREDICATED_EVL-NEXT:    [[TMP7:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; PREDICATED_EVL-NEXT:    [[TMP8:%.*]] = select <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i1> [[TMP7]], <vscale x 16 x i1> zeroinitializer
+; PREDICATED_EVL-NEXT:    [[TMP9:%.*]] = shl i32 [[EVL_BASED_IV]], 1
+; PREDICATED_EVL-NEXT:    [[TMP10:%.*]] = sext i32 [[TMP9]] to i64
+; PREDICATED_EVL-NEXT:    [[TMP11:%.*]] = getelementptr i8, ptr [[P]], i64 [[TMP10]]
+; PREDICATED_EVL-NEXT:    [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP8]], <vscale x 16 x i1> [[TMP8]])
+; PREDICATED_EVL-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.masked.load.nxv32i8.p0(ptr [[TMP11]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i8> poison)
+; PREDICATED_EVL-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[WIDE_MASKED_VEC]])
+; PREDICATED_EVL-NEXT:    [[TMP12:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 0
+; PREDICATED_EVL-NEXT:    [[TMP13:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC]], 1
+; PREDICATED_EVL-NEXT:    [[TMP14:%.*]] = call <vscale x 16 x i8> @llvm.smax.nxv16i8(<vscale x 16 x i8> [[TMP12]], <vscale x 16 x i8> [[TMP13]])
+; PREDICATED_EVL-NEXT:    [[TMP15:%.*]] = sext i32 [[TMP9]] to i64
+; PREDICATED_EVL-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[Q]], i64 [[TMP15]]
+; PREDICATED_EVL-NEXT:    [[TMP17:%.*]] = sub <vscale x 16 x i8> zeroinitializer, [[TMP14]]
+; PREDICATED_EVL-NEXT:    [[INTERLEAVED_VEC:%.*]] = call <vscale x 32 x i8> @llvm.vector.interleave2.nxv32i8(<vscale x 16 x i8> [[TMP14]], <vscale x 16 x i8> [[TMP17]])
+; PREDICATED_EVL-NEXT:    [[INTERLEAVED_MASK3:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP8]], <vscale x 16 x i1> [[TMP8]])
+; PREDICATED_EVL-NEXT:    call void @llvm.masked.store.nxv32i8.p0(<vscale x 32 x i8> [[INTERLEAVED_VEC]], ptr [[TMP16]], i32 1, <vscale x 32 x i1> [[INTERLEAVED_MASK3]])
 ; PREDICATED_EVL-NEXT:    [[INDEX_EVL_NEXT]] = add nuw i32 [[TMP5]], [[EVL_BASED_IV]]
 ; PREDICATED_EVL-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP3]]
 ; PREDICATED_EVL-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
-; PREDICATED_EVL-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; PREDICATED_EVL-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; PREDICATED_EVL-NEXT:    [[TMP18:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; PREDICATED_EVL-NEXT:    br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; PREDICATED_EVL:       middle.block:
 ; PREDICATED_EVL-NEXT:    br label [[FOR_END:%.*]]
 ; PREDICATED_EVL:       scalar.ph:
@@ -215,42 +214,29 @@ define void @masked_strided_factor4(ptr noalias nocapture readonly %p, ptr noali
 ; SCALAR_EPILOGUE-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 16 x i32> [ [[TMP5]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SCALAR_EPILOGUE-NEXT:    [[TMP6:%.*]] = icmp ugt <vscale x 16 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; SCALAR_EPILOGUE-NEXT:    [[TMP7:%.*]] = shl nuw nsw <vscale x 16 x i32> [[VEC_IND]], splat (i32 2)
-; SCALAR_EPILOGUE-NEXT:    [[TMP8:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 1)
-; SCALAR_EPILOGUE-NEXT:    [[TMP9:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 2)
-; SCALAR_EPILOGUE-NEXT:    [[TMP10:%.*]] = or disjoint <vscale x 16 x i32> [[TMP7]], splat (i32 3)
-; SCALAR_EPILOGUE-NEXT:    [[TMP11:%.*]] = zext nneg <vscale x 16 x i32> [[TMP7]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[P]], <vscale x 16 x i64> [[TMP11]]
-; SCALAR_EPILOGUE-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 16 x i8> @llvm.masked.gather.nxv16i8.nxv16p0(<vscale x 16 x ptr> [[TMP12]], i32 1, <vscale x 16 x i1> [[TMP6]], <vscale x 16 x i8> poison)
-; SCALAR_EPILOGUE-NEXT:    [[TMP13:%.*]] = zext nneg <vscale x 16 x i32> [[TMP8]] to <vscale x 16 x i64>
-; SCALAR_EPILOGUE-NEXT:    [[TMP14:%.*]] = getelementptr inbound...
[truncated]

@preames mentioned this pull request Jul 22, 2025
@lukel97 (Contributor) left a comment

Just checking, this sets out to do the same thing as #149981 right?

@preames (Collaborator, Author) commented Jul 23, 2025

Just checking, this sets out to do the same thing as #149981 right?

It appears to mostly cover the same ground. I prefer my version for a number of small reasons, but I wouldn't have posted it if I'd seen that one first; I'd just have replied there with detailed comments.

// with an adjacent appropriate memory to vlseg/vsseg intrinsics. We
// currently only support masking for the scalable path. vlseg/vsseg only
// support masking per-iteration (i.e. condition), not per-segment (i.e. gap).
if ((VTy->isScalableTy() || !UseMaskForCond) && !UseMaskForGaps &&
@lukel97 (Contributor) commented Jul 23, 2025

To make sure I'm understanding this right, we do support fixed-length deinterleave/interleave intrinsics, i.e.

define {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @masked_load_factor4_mask(ptr %p, <8 x i1> %mask) {
; CHECK-LABEL: masked_load_factor4_mask:
; CHECK:       # %bb.0:
; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
; CHECK-NEXT:    vlseg4e8.v v8, (a0), v0.t
; CHECK-NEXT:    ret
  %interleaved.mask = tail call <32 x i1> @llvm.vector.interleave4.v32i1(<8 x i1> %mask, <8 x i1> %mask, <8 x i1> %mask, <8 x i1> %mask)
  %vec = call <32 x i8> @llvm.masked.load.v32i8.p0(ptr %p, i32 4, <32 x i1> %interleaved.mask, <32 x i8> poison)
  %deinterleaved.results = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave4.v32i8(<32 x i8> %vec)
  ret {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} %deinterleaved.results
}

will get lowered to a vlseg. It's just that we don't currently match a masked.load/store paired with shufflevector [de]interleaves, which is what the loop vectorizer emits for fixed-length vectors?
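For reference, a sketch of that shufflevector-based fixed-length form (my own example, assuming VF = 8 and factor 2; this is the pattern that isn't matched yet):

%wide = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr %p, i32 1, <16 x i1> %interleaved.mask, <16 x i8> poison)
%even = shufflevector <16 x i8> %wide, <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
%odd  = shufflevector <16 x i8> %wide, <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>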

Contributor

For fixed-length VFs, shuffles are used instead of the interleave intrinsics.
However, I believe getMask in the InterleavedAccessPass already handles shuffle masks:

  static Value *getMask(Value *WideMask, unsigned Factor,
                        ElementCount LeafValueEC) {
    if (auto *IMI = dyn_cast<IntrinsicInst>(WideMask)) {
      ...
    }
    ...
    if (LeafValueEC.isFixed()) {
      unsigned LeafMaskLen = LeafValueEC.getFixedValue();
      SmallVector<Constant *, 8> LeafMask(LeafMaskLen, nullptr);
      // If this is a fixed-length constant mask, each lane / leaf has to
      // use the same mask. This is done by checking if every group with
      // Factor number of elements in the interleaved mask has homogeneous
      // values.
      for (unsigned Idx = 0U; Idx < LeafMaskLen * Factor; ++Idx) {
        Constant *C = ConstMask->getAggregateElement(Idx);
        if (LeafMask[Idx / Factor] && LeafMask[Idx / Factor] != C)
          return nullptr;
        LeafMask[Idx / Factor] = C;
      }
      return ConstantVector::get(LeafMask);
    }
    ...
    return nullptr;
  }

Fixed-length should be supported.
Therefore, I think doing it as shown in #149981 should be good enough, right?

Suggested change:
-  if ((VTy->isScalableTy() || !UseMaskForCond) && !UseMaskForGaps &&
+  if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor())
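To make the homogeneity check above concrete, a worked example (my own, with Factor = 2 and a leaf length of 4): every group of Factor adjacent elements in the interleaved constant mask must agree.

; <8 x i1> <1, 1, 0, 0, 1, 1, 1, 1>  -> groups (1,1) (0,0) (1,1) (1,1) -> leaf mask <1, 0, 1, 1>
; <8 x i1> <1, 0, 0, 0, 1, 1, 1, 1>  -> group 0 is mixed (1,0)         -> getMask returns nullptr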

Contributor

I think it's because we don't handle llvm.masked.{load,store} with shufflevectors, only load and vp.load:

if (match(&I, m_CombineOr(m_Load(m_Value()),
              m_Intrinsic<Intrinsic::vp_load>())))
  Changed |= lowerInterleavedLoad(&I, DeadInsts);
if (match(&I, m_CombineOr(m_Store(m_Value(), m_Value()),
              m_Intrinsic<Intrinsic::vp_store>())))
  Changed |= lowerInterleavedStore(&I, DeadInsts);

But yes, it seems a shame that we already have this fixed-length mask functionality in getMask. Hopefully it's not too difficult to extend InterleavedAccessPass to handle llvm.masked.{load,store} with shufflevectors. I think that would round out support for all the different ways of expressing interleaves.

As a side note, I think that means there are 12 possible kinds of [de]interleave: either shufflevector- or intrinsic-based, for each of load, store, masked.load, masked.store, vp.load, and vp.store. That's a lot!

Contributor

Ah, I see. Then I think it does make sense to initially limit support to scalable only. Thanks!

Collaborator Author

Luke is correct here. I am actively working on fully supporting the shuffle path with masked.load/store, but we're not there yet. The next change is #150241, and we've got at least one more needed after that before we could reasonably enable fixed vectors by default.

lukel97 previously approved these changes Jul 23, 2025
@lukel97 (Contributor) left a comment

LGTM

@@ -398,6 +398,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {

bool enableInterleavedAccessVectorization() const override { return true; }

bool enableMaskedInterleavedAccessVectorization() const override {
return ST->hasVInstructions();
Contributor
This comment isn't opposing the use of hasVInstructions() here; in fact, I think it's a good thing.
I was just curious during my own implementation: why does enableInterleavedAccessVectorization return true directly, without checking hasVInstructions()?

Collaborator Author

I think this is a case where the existing code probably should be checking for vector instructions, but the difference is essentially stylistic. The cost model results will penalize trying to vectorize without V heavily enough that the feature being enabled won't really matter. The major long-term question is what we want to do if P ever stabilizes, but that's definitely future work.

@lukel97 dismissed their stale review July 23, 2025 09:21

An issue with how header masks behave on the second-to-last iteration was found by Mel. I think we should hold off on landing this until it is fixed, because it may introduce issues whenever we have a masked recipe that isn't transformed into a VP intrinsic.

@lukel97 (Contributor) commented Jul 23, 2025

In the other PR @Mel-Chen found a latent correctness issue with how we leave around header masks, which with this PR is probably enough to cause miscompiles if we don't convert the VPInterleaveRecipe loads/stores to VP intrinsics. I've filed an issue for it here: #150197.

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Jul 23, 2025
@Mel-Chen (Contributor) commented:

Just checking, this sets out to do the same thing as #149981 right?

It appears to mostly cover the same ground. I prefer my version for a number of small reasons, but I'd not have posted it if I'd seen that first. I'd have just replied there with detailed comments.

I think it's totally fine to prioritize merging this patch; you deserve it. :)
I even hope it can land as soon as possible, so that my local VPInterleaveEVLRecipe can start to be reviewed. #149981 was only created because I realized it couldn't be converted to VPInterleaveEVLRecipe after I had finished that implementation.

@preames (Collaborator, Author) commented Jul 23, 2025

In the other PR @Mel-Chen found a latent correctness issue with how we leave around header masks, which with this PR is probably enough to cause miscompiles if we don't convert the VPInterleaveRecipe loads/stores to VP intrinsics. I've filed an issue for it here: #150197.

I haven't dug through the overnight activity in full yet, but if we have a suspected miscompile, I'll definitely hold this. We want to fix the miscompile, then enable more features. :)

Mel-Chen added a commit to Mel-Chen/llvm-project that referenced this pull request Jul 24, 2025
@preames (Collaborator, Author) commented Jul 24, 2025

Closed in favor of #149981.

@preames closed this Jul 24, 2025
lukel97 added a commit that referenced this pull request Jul 30, 2025
With EVL tail folding, the EVL may not always be VF on the
second-to-last iteration.

Recipes that have been converted to VP intrinsics via optimizeMaskToEVL
account for this, but recipes that are left behind will still use the
old header mask which may end up having a different vector length.

This is effectively the same as #95368, and fixes this by converting
header masks from icmp ule wide-canonical-iv, backedge-trip-count ->
icmp ult step-vector, evl. Without it, recipes that fall through
optimizeMaskToEVL may use the wrong vector length, e.g. in #150074 and
#149981.

We really need to split off optimizeMaskToEVL into
VPlanTransforms::optimize and move transformRecipestoEVLRecipes into
tryToBuildVPlanWithVPRecipes, so we don't mix up what is needed for
correctness and what is needed to optimize away the mask computations.
We should be able to still generate a correct albeit suboptimal VPlan
without running optimizeMaskToEVL. I've added a TODO for this, which I
think we can do after #148274

Fixes #150197
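A sketch of that header-mask rewrite in IR terms (illustrative types and names, adapted from the commit description above):

; Before: active lanes derived from the trip count; wrong when EVL < VF
; on the second-to-last iteration.
;   %mask = icmp ule <vscale x 4 x i32> %wide.canonical.iv, %btc.splat
; After: active lanes derived from the EVL actually chosen for this iteration.
;   %mask = icmp ult <vscale x 4 x i32> %step.vector, %evl.splat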