[IA][RISCV] Recognize deinterleaved loads that could lower to strided segmented loads #151612

Open · wants to merge 6 commits into main
6 changes: 4 additions & 2 deletions llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3209,10 +3209,12 @@ class LLVM_ABI TargetLoweringBase {
/// \p Shuffles is the shufflevector list to DE-interleave the loaded vector.
/// \p Indices is the corresponding indices for each shufflevector.
/// \p Factor is the interleave factor.
/// \p MaskFactor is the interleave factor after taking the mask into
/// account; it can be smaller than \p Factor.
virtual bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
unsigned Factor) const {
ArrayRef<unsigned> Indices, unsigned Factor,
unsigned MaskFactor) const {
return false;
Comment on lines 3214 to 3218
Contributor

Would it be a bit easier for the targets if we instead passed the stride in bytes in? That way they wouldn't have to worry about the difference between the MaskFactor and Factor.

Targets that don't support strided interleaved loads would check that Stride == DL.getTypeStoreSize(VTy->getElementType())

Member Author

Targets that don't support strided interleaved loads would check that Stride == DL.getTypeStoreSize(VTy->getElementType())

I believe stride is relative to the current start address, so in the case of skipping fields, the stride will always be Factor * DL.getTypeStoreSize(VTy->getElementType()) regardless of how many fields you wanna skip.
But I guess my more high-level question would be: for those targets that don't support strided interleaved loads, what is the benefit of replacing a check between Factor and MaskFactor with another check on Stride?
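A quick arithmetic illustration of that point (an editorial sketch, not from the patch; the factor-3 / i32 numbers are made up):

// For a factor-3 interleave of i32 fields, field i of group k lives at byte
// offset (k * Factor + i) * EltSize, so the distance between two consecutive
// reads of the same field is Factor * EltSize, no matter how many trailing
// fields are skipped.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t Factor = 3;  // nominal interleave factor
  const uint64_t EltSize = 4; // DL.getTypeStoreSize(i32)
  const uint64_t StrideInBytes = Factor * EltSize; // always 12 bytes
  std::printf("stride = %llu bytes\n",
              static_cast<unsigned long long>(StrideInBytes));
  return 0;
}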

Contributor

Oh whoops yes, that should be multiplied by factor.

To me MaskFactor feels like a concept internal to InterleavedAccessPass that's leaking through.

I'm not strongly opinionated about this though, just thought I'd throw the idea out there, happy to go with what you prefer.

I guess an alternative is that we could also add a separate "lowerStridedInterleaved" TTI hook. But maybe that will lead to hook explosion again

Member Author

add a separate "lowerStridedInterleaved" TTI hook

(I believe you meant TLI hooks) Yeah I'm also worried about the fact that it will double the number of hooks, as all four of them could have a strided version.

Collaborator

An alternate suggestion: pass in GapMask as an APInt, then have the target filter out which set of gaps it can handle.
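A rough sketch of what that alternative could look like (hypothetical; the GapMask parameter and the default body are illustrative and not part of this patch):

virtual bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
                                  ArrayRef<ShuffleVectorInst *> Shuffles,
                                  ArrayRef<unsigned> Indices, unsigned Factor,
                                  const APInt &GapMask) const {
  // A target without strided segment loads would only proceed when every one
  // of the Factor fields is live, e.g. GapMask.popcount() == Factor, while a
  // target such as RISC-V could accept a trailing run of cleared bits.
  return false;
}

The default stays conservative (return false); each target decides which gap patterns, if any, it can profit from.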

}

120 changes: 97 additions & 23 deletions llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -268,13 +268,19 @@ static Value *getMaskOperand(IntrinsicInst *II) {
}
}

// Return the corresponded deinterleaved mask, or nullptr if there is no valid
// mask.
static Value *getMask(Value *WideMask, unsigned Factor,
ElementCount LeafValueEC);

static Value *getMask(Value *WideMask, unsigned Factor,
VectorType *LeafValueTy) {
// Return a pair of
Collaborator

We talked about this offline, but I'm more and more coming to the view we should have made these a set of utility routines (usable by each target), and simply passed the mask operand through (or maybe not even that.) More of an aside for longer term consideration than a comment on this review.

// (1) The corresponding deinterleaved mask, or nullptr if there is no valid
// mask.
// (2) If the mask effectively skips some fields, this element contains the
// factor after taking such contraction into account. Note that we currently
// only support skipping trailing fields: if the "nominal" factor is 5, you
// cannot skip only fields 1 and 2, but you can skip fields 3 and 4.
static std::pair<Value *, unsigned> getMask(Value *WideMask, unsigned Factor,
ElementCount LeafValueEC);

static std::pair<Value *, unsigned> getMask(Value *WideMask, unsigned Factor,
VectorType *LeafValueTy) {
return getMask(WideMask, Factor, LeafValueTy->getElementCount());
}

@@ -379,22 +385,25 @@ bool InterleavedAccessImpl::lowerInterleavedLoad(
replaceBinOpShuffles(BinOpShuffles.getArrayRef(), Shuffles, Load);

Value *Mask = nullptr;
unsigned GapMaskFactor = Factor;
if (LI) {
LLVM_DEBUG(dbgs() << "IA: Found an interleaved load: " << *Load << "\n");
} else {
// Check mask operand. Handle both all-true/false and interleaved mask.
Mask = getMask(getMaskOperand(II), Factor, VecTy);
std::tie(Mask, GapMaskFactor) = getMask(getMaskOperand(II), Factor, VecTy);
if (!Mask)
return false;

LLVM_DEBUG(dbgs() << "IA: Found an interleaved vp.load or masked.load: "
<< *Load << "\n");
LLVM_DEBUG(dbgs() << "IA: With nominal factor " << Factor
<< " and mask factor " << GapMaskFactor << "\n");
}

// Try to create target specific intrinsics to replace the load and
// shuffles.
if (!TLI->lowerInterleavedLoad(cast<Instruction>(Load), Mask, Shuffles,
Indices, Factor))
Indices, Factor, GapMaskFactor))
// If Extracts is not empty, tryReplaceExtracts made changes earlier.
return !Extracts.empty() || BinOpShuffleChanged;

@@ -531,15 +540,20 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
"number of stored element should be a multiple of Factor");

Value *Mask = nullptr;
unsigned GapMaskFactor = Factor;
if (SI) {
LLVM_DEBUG(dbgs() << "IA: Found an interleaved store: " << *Store << "\n");
} else {
// Check mask operand. Handle both all-true/false and interleaved mask.
unsigned LaneMaskLen = NumStoredElements / Factor;
Mask = getMask(getMaskOperand(II), Factor,
ElementCount::getFixed(LaneMaskLen));
std::tie(Mask, GapMaskFactor) = getMask(
getMaskOperand(II), Factor, ElementCount::getFixed(LaneMaskLen));
if (!Mask)
return false;
// We shouldn't transform stores even if they have a gap mask. And since we
// might have already changed the IR, we return true here.
if (GapMaskFactor != Factor)
return true;

LLVM_DEBUG(dbgs() << "IA: Found an interleaved vp.store or masked.store: "
<< *Store << "\n");
@@ -556,34 +570,87 @@ static Value *getMask(Value *WideMask, unsigned Factor,
return true;
}

static Value *getMask(Value *WideMask, unsigned Factor,
ElementCount LeafValueEC) {
// A wide mask <1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0> could be used to skip the
// last field in a factor-of-three interleaved store or deinterleaved load (in
// which case LeafMaskLen is 4). Such a (wide) mask is also known as a gap
// mask. This helper function tries to detect this pattern and return the
// actual factor we're accessing, which is 2 in this example.
static unsigned getGapMaskFactor(const Constant &MaskConst, unsigned Factor,
unsigned LeafMaskLen) {
APInt FactorMask(Factor, 0);
FactorMask.setAllBits();
for (unsigned F = 0U; F < Factor; ++F) {
bool AllZero = true;
for (unsigned Idx = 0U; Idx < LeafMaskLen; ++Idx) {
Constant *C = MaskConst.getAggregateElement(F + Idx * Factor);
if (!C->isZeroValue()) {
AllZero = false;
break;
}
}
// All mask bits on this field are zero, skipping it.
if (AllZero)
FactorMask.clearBit(F);
}
// We currently only allow gaps in the "trailing" factors / fields. So
// given the original factor being 4, we can skip fields 2 and 3, but we
// cannot only skip fields 1 and 2. If FactorMask does not match such
// pattern, reset it.
if (!FactorMask.isMask())
FactorMask.setAllBits();

return FactorMask.popcount();
}
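// Editorial illustration (not part of this patch): the example from the
// comment above. A factor-3 gap mask <1,1,0, 1,1,0, 1,1,0, 1,1,0> with
// LeafMaskLen = 4 contracts to a factor of 2.
static void exampleGapMaskFactor(LLVMContext &Ctx) {
  Type *I1 = Type::getInt1Ty(Ctx);
  SmallVector<Constant *, 12> Bits;
  for (unsigned Leaf = 0; Leaf < 4; ++Leaf) {
    Bits.push_back(ConstantInt::get(I1, 1)); // field 0 is live
    Bits.push_back(ConstantInt::get(I1, 1)); // field 1 is live
    Bits.push_back(ConstantInt::get(I1, 0)); // field 2 is skipped
  }
  Constant *GapMask = ConstantVector::get(Bits);
  assert(getGapMaskFactor(*GapMask, /*Factor=*/3, /*LeafMaskLen=*/4) == 2);
  (void)GapMask;
}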

static std::pair<Value *, unsigned> getMask(Value *WideMask, unsigned Factor,
ElementCount LeafValueEC) {
using namespace PatternMatch;

if (auto *IMI = dyn_cast<IntrinsicInst>(WideMask)) {
if (unsigned F = getInterleaveIntrinsicFactor(IMI->getIntrinsicID());
F && F == Factor && llvm::all_equal(IMI->args())) {
Collaborator

You can handle the case where the tail elements in the interleave are zero. Might be easier to start with this one, as it's the minimum code change. (This combines with my macro comment.)
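A rough sketch of that idea (hypothetical, not in this patch): count how many trailing operands of the interleave intrinsic are all-zero constants and contract the factor by that amount.

// Assumes IMI and Factor as in the surrounding code; LiveFields would then
// play the same role as the gap-mask factor used elsewhere in this patch.
unsigned LiveFields = Factor;
for (Value *Arg : llvm::reverse(IMI->args())) {
  auto *C = dyn_cast<Constant>(Arg);
  if (!C || !C->isZeroValue())
    break;
  --LiveFields; // a trailing all-zero lane mask means this field is skipped
}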

return IMI->getArgOperand(0);
return {IMI->getArgOperand(0), Factor};
}
}

// Try to match `and <interleaved mask>, <gap mask>`. The WideMask here is
Collaborator

This can be dropped and done as a follow up commit.

// expected to be a fixed vector and gap mask should be a constant mask.
Value *AndMaskLHS;
Constant *AndMaskRHS;
if (match(WideMask, m_c_And(m_Value(AndMaskLHS), m_Constant(AndMaskRHS))) &&
LeafValueEC.isFixed()) {
assert(!isa<Constant>(AndMaskLHS) &&
"expect constants to be folded already");
return {getMask(AndMaskLHS, Factor, LeafValueEC).first,
getGapMaskFactor(*AndMaskRHS, Factor, LeafValueEC.getFixedValue())};
}

if (auto *ConstMask = dyn_cast<Constant>(WideMask)) {
if (auto *Splat = ConstMask->getSplatValue())
// All-ones or all-zeros mask.
return ConstantVector::getSplat(LeafValueEC, Splat);
return {ConstantVector::getSplat(LeafValueEC, Splat), Factor};

if (LeafValueEC.isFixed()) {
unsigned LeafMaskLen = LeafValueEC.getFixedValue();
// First, check if we use a gap mask to skip some of the factors / fields.
const unsigned GapMaskFactor =
getGapMaskFactor(*ConstMask, Factor, LeafMaskLen);
assert(GapMaskFactor <= Factor);

SmallVector<Constant *, 8> LeafMask(LeafMaskLen, nullptr);
// If this is a fixed-length constant mask, each lane / leaf has to
// use the same mask. This is done by checking if every group with Factor
// number of elements in the interleaved mask has homogeneous values.
for (unsigned Idx = 0U; Idx < LeafMaskLen * Factor; ++Idx) {
if (Idx % Factor >= GapMaskFactor)
continue;
Constant *C = ConstMask->getAggregateElement(Idx);
if (LeafMask[Idx / Factor] && LeafMask[Idx / Factor] != C)
return nullptr;
return {nullptr, Factor};
LeafMask[Idx / Factor] = C;
}

return ConstantVector::get(LeafMask);
return {ConstantVector::get(LeafMask), GapMaskFactor};
}
}

@@ -603,12 +670,13 @@ static Value *getMask(Value *WideMask, unsigned Factor,
auto *LeafMaskTy =
VectorType::get(Type::getInt1Ty(SVI->getContext()), LeafValueEC);
IRBuilder<> Builder(SVI);
return Builder.CreateExtractVector(LeafMaskTy, SVI->getOperand(0),
uint64_t(0));
return {Builder.CreateExtractVector(LeafMaskTy, SVI->getOperand(0),
uint64_t(0)),
Factor};
}
}

return nullptr;
return {nullptr, Factor};
}

bool InterleavedAccessImpl::lowerDeinterleaveIntrinsic(
@@ -639,9 +707,12 @@ bool InterleavedAccessImpl::lowerDeinterleaveIntrinsic(
return false;

// Check mask operand. Handle both all-true/false and interleaved mask.
Mask = getMask(getMaskOperand(II), Factor, getDeinterleavedVectorType(DI));
unsigned GapMaskFactor;
std::tie(Mask, GapMaskFactor) =
getMask(getMaskOperand(II), Factor, getDeinterleavedVectorType(DI));
if (!Mask)
return false;
assert(GapMaskFactor == Factor);
Collaborator

Took me a sec to figure out why this assert held, add a && "why this is true"

Member Author

Originally I added this assertion because there is no way one can synthesize a gap mask for scalable vector. But reading this again, I realized this part of the code (vp.load/masked.load + deinterleave intrinsic) could also handle fixed vectors. So I'm going to turn this into a check instead.
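i.e., roughly the following (a sketch of the follow-up the author describes, not part of the current diff):

// Gap masks are not handled on the deinterleave-intrinsic path yet, so bail
// out instead of asserting when the factor was contracted.
if (GapMaskFactor != Factor)
  return false;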


LLVM_DEBUG(dbgs() << "IA: Found a vp.load or masked.load with deinterleave"
<< " intrinsic " << *DI << " and factor = "
@@ -680,10 +751,13 @@ bool InterleavedAccessImpl::lowerInterleaveIntrinsic(
II->getIntrinsicID() != Intrinsic::vp_store)
return false;
// Check mask operand. Handle both all-true/false and interleaved mask.
Mask = getMask(getMaskOperand(II), Factor,
cast<VectorType>(InterleaveValues[0]->getType()));
unsigned GapMaskFactor;
std::tie(Mask, GapMaskFactor) =
getMask(getMaskOperand(II), Factor,
cast<VectorType>(InterleaveValues[0]->getType()));
if (!Mask)
return false;
assert(GapMaskFactor == Factor);

LLVM_DEBUG(dbgs() << "IA: Found a vp.store or masked.store with interleave"
<< " intrinsic " << *IntII << " and factor = "
5 changes: 4 additions & 1 deletion llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -17254,7 +17254,7 @@ static Function *getStructuredStoreFunction(Module *M, unsigned Factor,
/// %vec1 = extractelement { <4 x i32>, <4 x i32> } %ld2, i32 1
bool AArch64TargetLowering::lowerInterleavedLoad(
Instruction *Load, Value *Mask, ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices, unsigned Factor) const {
ArrayRef<unsigned> Indices, unsigned Factor, unsigned MaskFactor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");
assert(!Shuffles.empty() && "Empty shufflevector input");
@@ -17266,6 +17266,9 @@ bool AArch64TargetLowering::lowerInterleavedLoad(
return false;
assert(!Mask && "Unexpected mask on a load");

if (Factor != MaskFactor)
Collaborator

This should be an assert (same for most targets), since LoadInst isn't masked by definition.
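i.e., something like (hypothetical):

// A plain LoadInst carries no mask, so no fields can be masked off and the
// two factors must agree.
assert(Factor == MaskFactor && "unmasked loads cannot have gap masks");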

return false;

const DataLayout &DL = LI->getDataLayout();

VectorType *VTy = Shuffles[0]->getType();
4 changes: 2 additions & 2 deletions llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -220,8 +220,8 @@ class AArch64TargetLowering : public TargetLowering {

bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
unsigned Factor) const override;
ArrayRef<unsigned> Indices, unsigned Factor,
unsigned MaskFactor) const override;
bool lowerInterleavedStore(Instruction *Store, Value *Mask,
ShuffleVectorInst *SVI,
unsigned Factor) const override;
5 changes: 4 additions & 1 deletion llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -21599,7 +21599,7 @@ unsigned ARMTargetLowering::getMaxSupportedInterleaveFactor() const {
/// %vec1 = extractelement { <4 x i32>, <4 x i32> } %vld2, i32 1
bool ARMTargetLowering::lowerInterleavedLoad(
Instruction *Load, Value *Mask, ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices, unsigned Factor) const {
ArrayRef<unsigned> Indices, unsigned Factor, unsigned MaskFactor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");
assert(!Shuffles.empty() && "Empty shufflevector input");
@@ -21611,6 +21611,9 @@ bool ARMTargetLowering::lowerInterleavedLoad(
return false;
assert(!Mask && "Unexpected mask on a load");

if (Factor != MaskFactor)
return false;

auto *VecTy = cast<FixedVectorType>(Shuffles[0]->getType());
Type *EltTy = VecTy->getElementType();

4 changes: 2 additions & 2 deletions llvm/lib/Target/ARM/ARMISelLowering.h
@@ -683,8 +683,8 @@ class VectorType;

bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
unsigned Factor) const override;
ArrayRef<unsigned> Indices, unsigned Factor,
unsigned MaskFactor) const override;
bool lowerInterleavedStore(Instruction *Store, Value *Mask,
ShuffleVectorInst *SVI,
unsigned Factor) const override;
4 changes: 2 additions & 2 deletions llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -431,8 +431,8 @@ class RISCVTargetLowering : public TargetLowering {

bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
unsigned Factor) const override;
ArrayRef<unsigned> Indices, unsigned Factor,
unsigned MaskFactor) const override;

bool lowerInterleavedStore(Instruction *Store, Value *Mask,
ShuffleVectorInst *SVI,
41 changes: 34 additions & 7 deletions llvm/lib/Target/RISCV/RISCVInterleavedAccess.cpp
@@ -63,6 +63,12 @@ static const Intrinsic::ID FixedVlsegIntrIds[] = {
Intrinsic::riscv_seg6_load_mask, Intrinsic::riscv_seg7_load_mask,
Intrinsic::riscv_seg8_load_mask};

static const Intrinsic::ID FixedVlssegIntrIds[] = {
Intrinsic::riscv_sseg2_load_mask, Intrinsic::riscv_sseg3_load_mask,
Intrinsic::riscv_sseg4_load_mask, Intrinsic::riscv_sseg5_load_mask,
Intrinsic::riscv_sseg6_load_mask, Intrinsic::riscv_sseg7_load_mask,
Intrinsic::riscv_sseg8_load_mask};

static const Intrinsic::ID ScalableVlsegIntrIds[] = {
Intrinsic::riscv_vlseg2_mask, Intrinsic::riscv_vlseg3_mask,
Intrinsic::riscv_vlseg4_mask, Intrinsic::riscv_vlseg5_mask,
@@ -197,9 +203,13 @@ static bool getMemOperands(unsigned Factor, VectorType *VTy, Type *XLenTy,
/// %vec1 = extractelement { <4 x i32>, <4 x i32> } %ld2, i32 1
bool RISCVTargetLowering::lowerInterleavedLoad(
Instruction *Load, Value *Mask, ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices, unsigned Factor) const {
ArrayRef<unsigned> Indices, unsigned Factor, unsigned MaskFactor) const {
assert(Indices.size() == Shuffles.size());
assert(MaskFactor <= Factor);

// TODO: Lower to strided load when MaskFactor = 1.
if (MaskFactor < 2)
return false;
IRBuilder<> Builder(Load);

const DataLayout &DL = Load->getDataLayout();
@@ -208,20 +218,37 @@

Value *Ptr, *VL;
Align Alignment;
if (!getMemOperands(Factor, VTy, XLenTy, Load, Ptr, Mask, VL, Alignment))
if (!getMemOperands(MaskFactor, VTy, XLenTy, Load, Ptr, Mask, VL, Alignment))
return false;

Type *PtrTy = Ptr->getType();
unsigned AS = PtrTy->getPointerAddressSpace();
if (!isLegalInterleavedAccessType(VTy, Factor, Alignment, AS, DL))
if (!isLegalInterleavedAccessType(VTy, MaskFactor, Alignment, AS, DL))
return false;

CallInst *VlsegN = Builder.CreateIntrinsic(
FixedVlsegIntrIds[Factor - 2], {VTy, PtrTy, XLenTy}, {Ptr, Mask, VL});
CallInst *SegLoad = nullptr;
if (MaskFactor < Factor) {
// Lower to strided segmented load.
unsigned ScalarSizeInBytes = DL.getTypeStoreSize(VTy->getElementType());
Value *Stride = ConstantInt::get(XLenTy, Factor * ScalarSizeInBytes);
SegLoad = Builder.CreateIntrinsic(FixedVlssegIntrIds[MaskFactor - 2],
{VTy, PtrTy, XLenTy, XLenTy},
{Ptr, Stride, Mask, VL});
} else {
// Lower to normal segmented load.
SegLoad = Builder.CreateIntrinsic(FixedVlsegIntrIds[Factor - 2],
{VTy, PtrTy, XLenTy}, {Ptr, Mask, VL});
}

for (unsigned i = 0; i < Shuffles.size(); i++) {
Value *SubVec = Builder.CreateExtractValue(VlsegN, Indices[i]);
Shuffles[i]->replaceAllUsesWith(SubVec);
unsigned FactorIdx = Indices[i];
if (FactorIdx >= MaskFactor) {
// Replace masked-off factors (that are still extracted) with poison.
Shuffles[i]->replaceAllUsesWith(PoisonValue::get(VTy));
} else {
Value *SubVec = Builder.CreateExtractValue(SegLoad, FactorIdx);
Shuffles[i]->replaceAllUsesWith(SubVec);
}
}

return true;
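// Editorial illustration (not part of this patch): the values the strided
// branch above produces for a factor-4 load of i32 fields where only fields
// 0 and 1 are live (MaskFactor = 2).
static void exampleStridedSelection() {
  const unsigned Factor = 4, MaskFactor = 2;
  const unsigned ScalarSizeInBytes = 4; // DL.getTypeStoreSize(i32)
  const uint64_t StrideInBytes = uint64_t(Factor) * ScalarSizeInBytes; // 16
  const Intrinsic::ID IID = FixedVlssegIntrIds[MaskFactor - 2];
  // IID is Intrinsic::riscv_sseg2_load_mask; any shuffle whose index is >= 2
  // is rewritten to poison because the corresponding field is masked off.
  (void)StrideInBytes;
  (void)IID;
}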
4 changes: 2 additions & 2 deletions llvm/lib/Target/X86/X86ISelLowering.h
@@ -1663,8 +1663,8 @@ namespace llvm {
/// instructions/intrinsics.
bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
unsigned Factor) const override;
ArrayRef<unsigned> Indices, unsigned Factor,
unsigned MaskFactor) const override;

/// Lower interleaved store(s) into target specific
/// instructions/intrinsics.
Expand Down
Loading
Loading