[RISCV][TTI] Enable masked interleave vectorization #150074

Closed
12 changes: 7 additions & 5 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
```diff
@@ -979,12 +979,14 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
     Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
     bool UseMaskForCond, bool UseMaskForGaps) const {

-  // The interleaved memory access pass will lower interleaved memory ops (i.e
-  // a load and store followed by a specific shuffle) to vlseg/vsseg
-  // intrinsics.
-  if (!UseMaskForCond && !UseMaskForGaps &&
+  auto *VTy = cast<VectorType>(VecTy);
+
+  // The interleaved memory access pass will lower (de)interleave ops combined
+  // with an adjacent appropriate memory to vlseg/vsseg intrinsics. We
+  // currently only support masking for the scalable path. vlseg/vsseg only
+  // support masking per-iteration (i.e. condition), not per-segment (i.e. gap).
+  if ((VTy->isScalableTy() || !UseMaskForCond) && !UseMaskForGaps &&
```
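To make the condition/gap distinction in the new comment concrete, here is a standalone sketch (plain C++, hypothetical helper names, not LLVM code) of the two mask shapes: a per-iteration "condition" mask replicates one bit per iteration across all `Factor` segment members (the only masking shape vlseg/vsseg can express, through `v0.t`), while a "gap" mask disables fixed member positions in every iteration.

```cpp
#include <vector>

// Sketch only: a condition mask contributes one bit per iteration, and all
// Factor members of that iteration's segment share the same bit.
std::vector<bool> expandConditionMask(const std::vector<bool> &Cond,
                                      unsigned Factor) {
  std::vector<bool> Wide;
  for (bool B : Cond)
    for (unsigned F = 0; F < Factor; ++F)
      Wide.push_back(B); // every member of one segment shares the bit
  return Wide;
}

// Sketch only: a gap mask fixes which members are active, and that same
// per-member pattern repeats in every iteration; vlseg/vsseg cannot
// express this shape.
std::vector<bool> expandGapMask(unsigned Iterations,
                                const std::vector<bool> &MemberActive) {
  std::vector<bool> Wide;
  for (unsigned I = 0; I < Iterations; ++I)
    for (bool B : MemberActive)
      Wide.push_back(B); // same member pattern each iteration
  return Wide;
}
```

For `Factor == 2`, a condition mask `{1, 0}` expands to `{1, 1, 0, 0}`, whereas a gap mask `{1, 0}` over two iterations expands to `{1, 0, 1, 0}`; the two wide masks are generally not equal, which is why the cost hook has to treat them differently.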
@lukel97 (Contributor) commented on Jul 23, 2025:
To make sure I'm understanding this right, we do support fixed-length deinterleave/interleave intrinsics, i.e.

```llvm
define {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @masked_load_factor4_mask(ptr %p, <8 x i1> %mask) {
; CHECK-LABEL: masked_load_factor4_mask:
; CHECK:       # %bb.0:
; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
; CHECK-NEXT:    vlseg4e8.v v8, (a0), v0.t
; CHECK-NEXT:    ret
  %interleaved.mask = tail call <32 x i1> @llvm.vector.interleave4.nxv32i1(<8 x i1> %mask, <8 x i1> %mask, <8 x i1> %mask, <8 x i1> %mask)
  %vec = call <32 x i8> @llvm.masked.load(ptr %p, i32 4, <32 x i1> %interleaved.mask, <32 x i8> poison)
  %deinterleaved.results = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave4.nxv32i8(<32 x i8> %vec)
  ret {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} %deinterleaved.results
}
```

will get lowered to a vlseg. It's just that we don't currently match a masked.load/store with shufflevector [de]interleaves, which is what the loop vectorizer emits for fixed-length vectors?
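For reference, the shufflevector-based [de]interleave mentioned here is just a strided lane selection. A minimal sketch of the index math (plain C++ with a hypothetical helper name, not the vectorizer's code): member `M` of a factor-`F` deinterleave takes lanes `M, M+F, M+2F, ...` of the wide vector, which is exactly the constant shuffle mask the loop vectorizer emits for fixed-length VFs.

```cpp
#include <vector>

// Sketch of what a shufflevector-based deinterleave computes: member M of
// the result gathers every Factor-th lane of the wide vector, starting at M.
// E.g. for Factor = 2, member 0 is the shuffle mask <0, 2, 4, 6> and
// member 1 is <1, 3, 5, 7>.
std::vector<int> deinterleaveMember(const std::vector<int> &Wide,
                                    unsigned Factor, unsigned Member) {
  std::vector<int> Out;
  for (unsigned I = Member; I < Wide.size(); I += Factor)
    Out.push_back(Wide[I]);
  return Out;
}
```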

Contributor:
For fixed-length VF, shuffles are used instead of interleave intrinsics. However, I believe getMask in the InterleavedAccessPass should already handle a shuffle mask:

```cpp
static Value *getMask(Value *WideMask, unsigned Factor,
                      ElementCount LeafValueEC) {
  if (auto *IMI = dyn_cast<IntrinsicInst>(WideMask)) {
  ...
    if (LeafValueEC.isFixed()) {
      unsigned LeafMaskLen = LeafValueEC.getFixedValue();
      SmallVector<Constant *, 8> LeafMask(LeafMaskLen, nullptr);
      // If this is a fixed-length constant mask, each lane / leaf has to
      // use the same mask. This is done by checking if every group with
      // Factor number of elements in the interleaved mask has homogeneous
      // values.
      for (unsigned Idx = 0U; Idx < LeafMaskLen * Factor; ++Idx) {
        Constant *C = ConstMask->getAggregateElement(Idx);
        if (LeafMask[Idx / Factor] && LeafMask[Idx / Factor] != C)
          return nullptr;
        LeafMask[Idx / Factor] = C;
      }

      return ConstantVector::get(LeafMask);
    }
  }

  return nullptr;
}
```

Fixed-length should be supported.
Therefore, I think doing it as shown in #149981 should be good enough, right?

Suggested change:

```diff
-  if ((VTy->isScalableTy() || !UseMaskForCond) && !UseMaskForGaps &&
+  if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor())
```
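The homogeneity check quoted above can be modeled standalone. This is a simplified sketch over `bool` vectors (hypothetical function name, not the LLVM API): a wide interleaved mask of `LeafLen * Factor` lanes narrows to a single per-iteration mask of `LeafLen` lanes only if every group of `Factor` consecutive lanes holds the same value.

```cpp
#include <optional>
#include <vector>

// Sketch of getMask's fixed-length constant-mask narrowing: lane Idx of the
// wide mask belongs to group Idx / Factor, and all members of a group must
// agree for the narrowing to succeed.
std::optional<std::vector<bool>>
deinterleaveConstMask(const std::vector<bool> &WideMask, unsigned Factor) {
  if (Factor == 0 || WideMask.size() % Factor != 0)
    return std::nullopt;
  unsigned LeafLen = WideMask.size() / Factor;
  std::vector<bool> Leaf(LeafLen);
  std::vector<bool> Seen(LeafLen, false);
  for (unsigned Idx = 0; Idx < WideMask.size(); ++Idx) {
    bool V = WideMask[Idx];
    unsigned Lane = Idx / Factor;
    if (Seen[Lane] && Leaf[Lane] != V)
      return std::nullopt; // non-homogeneous group: cannot narrow the mask
    Leaf[Lane] = V;
    Seen[Lane] = true;
  }
  return Leaf;
}
```

For example, with `Factor == 4` the wide mask `{1,1,1,1, 0,0,0,0}` narrows to `{1, 0}`, while `{1,0,1,1}` has a mixed group and narrows to nothing, mirroring the `return nullptr` path above.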

Contributor:

I think it's because we don't handle llvm.masked.{load,store} with shufflevectors, only load and vp.load:

```cpp
if (match(&I, m_CombineOr(m_Load(m_Value()),
              m_Intrinsic<Intrinsic::vp_load>())))
  Changed |= lowerInterleavedLoad(&I, DeadInsts);
if (match(&I, m_CombineOr(m_Store(m_Value(), m_Value()),
              m_Intrinsic<Intrinsic::vp_store>())))
  Changed |= lowerInterleavedStore(&I, DeadInsts);
```

But yes, it seems a shame that we already have this fixed-length mask functionality in getMask. Hopefully it's not too difficult to extend InterleavedAccessPass to handle llvm.masked.{load,store} with shufflevectors. I think that would round out support for all the different ways of expressing interleaves.

As a side note, I think that would mean there are 12 possible different kinds of [de]interleaves: either shufflevector- or intrinsic-based, for each of load, store, masked.load, masked.store, vp.load, and vp.store. That's a lot!

Contributor:

Ah, I see. Then I think it does make sense to initially limit support to scalable only. Thanks!

Collaborator (Author):

Luke is correct here. I am actively working on fully supporting the shuffle path with masked.load/store, but we're not there yet. The next change is #150241, and we've got at least one more needed after that before we could reasonably enable fixed vectors by default.

```diff
       Factor <= TLI->getMaxSupportedInterleaveFactor()) {
-    auto *VTy = cast<VectorType>(VecTy);
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(VTy);
     // Need to make sure type hasn't been scalarized
     if (LT.second.isVector()) {
```
4 changes: 4 additions & 0 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
```diff
@@ -398,6 +398,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {

   bool enableInterleavedAccessVectorization() const override { return true; }

+  bool enableMaskedInterleavedAccessVectorization() const override {
+    return ST->hasVInstructions();
```
Contributor:
This comment isn't opposing the use of hasVInstructions() here — in fact, I think it's a good thing.
I was just curious during my own implementation: why does enableInterleavedAccessVectorization return true directly without checking hasVInstructions()?

Collaborator (Author):

I think this is a case where the existing code probably should be checking for vector instructions, but that the difference is essentially stylistic. The cost model results will penalize trying to vectorize without V enough that the feature being enabled won't really matter. The major question (long term) is what we want to do if P ever stabilizes, but that's definitely future work.

```diff
+  }

   unsigned getMinTripCountTailFoldingThreshold() const override;

   enum RISCVRegisterClass { GPRRC, FPRRC, VRRC };
```