[AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer #150167

Status: Closed — wants to merge 3 commits
33 changes: 33 additions & 0 deletions llvm/lib/Target/AMDGPU/SIDefines.h
@@ -10,6 +10,7 @@
#ifndef LLVM_LIB_TARGET_AMDGPU_SIDEFINES_H
#define LLVM_LIB_TARGET_AMDGPU_SIDEFINES_H

#include "llvm/ADT/BitmaskEnum.h"
#include "llvm/MC/MCInstrDesc.h"

namespace llvm {
@@ -419,6 +420,38 @@ enum CPol {

} // namespace CPol

/// The atomic synchronization scopes supported by the AMDGPU target.
enum class SIAtomicScope {
NONE,
SINGLETHREAD,
WAVEFRONT,
WORKGROUP,
AGENT,
SYSTEM
};

/// The distinct address spaces supported by the AMDGPU target for
/// atomic memory operation. Can be ORed together.
enum class SIAtomicAddrSpace {
NONE = 0u,
GLOBAL = 1u << 0,
LDS = 1u << 1,
SCRATCH = 1u << 2,
GDS = 1u << 3,
OTHER = 1u << 4,

/// The address spaces that can be accessed by a FLAT instruction.
FLAT = GLOBAL | LDS | SCRATCH,

/// The address spaces that support atomic instructions.
ATOMIC = GLOBAL | LDS | SCRATCH | GDS,

/// All address spaces.
ALL = GLOBAL | LDS | SCRATCH | GDS | OTHER,

LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
};

namespace SendMsg { // Encoding of SIMM16 used in s_sendmsg* insns.

enum Id { // Message ID, width(4) [3:0].
34 changes: 34 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -37,7 +37,9 @@
#include "llvm/CodeGen/MachinePostDominators.h"
#include "llvm/Support/DebugCounter.h"
#include "llvm/TargetParser/TargetParser.h"

using namespace llvm;
using namespace AMDGPU;

#define DEBUG_TYPE "si-insert-waitcnts"

@@ -1381,6 +1383,32 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
Modified = true;
} else
WaitcntInstr = &II;
} else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
// Each direct load to LDS is also a store to LDS, but we do not have a
// separate counter for it. Instead these operations increment LOAD_CNT
// and need to be waited for at a release fence. So we treat a release
// fence as if it depends on any previous LDS DMA stores.
unsigned Ordering =
TII->getNamedOperand(II, AMDGPU::OpName::Ordering)->getImm();
unsigned Scope =
TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
unsigned AddrSpace =
TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
if (isReleaseOrStronger((AtomicOrdering)Ordering) &&
Review comment (Contributor):

Thinking about it, this part bothers me a bit, because now InsertWaitCnt has to be aware of atomic orderings and deal with them accordingly. It blurs the separation of concerns between this pass and the MemoryLegalizer.

I know there is a good argument for doing that, but I think this is too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).
Can we consider adding a simple s_wait_lds_dma_soft instead, targeted exactly at this use case, and emit that? I would prefer doing the minimum amount of changes and then removing that pseudo later in favor of a generic one, rather than locking us into a specific approach right now.

I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.

Review comment (@ssahasra, Collaborator — Author, Jul 24, 2025):

> I know there is a good argument for doing that, but I think this is too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).

Quoting from #147257:

> Something still doesn't feel right with this PR for me, I feel like this isn't the right approach but I struggle to suggest something better.
>
> Longer term we should really just have a single waitcnt pseudo for the MemoryLegalizer that is target-independent, it'd fix issues like these if we had special sentinel values for different things.

In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation? What could be more effective as "a single waitcnt pseudo" than a fence? What is an example of something less generic than a fence and yet effective for all uses?

> Can we consider adding a simple s_wait_lds_dma_soft instead, targeted exactly at this use case, and emit that?

I consider that too specific. I contend that the distinction between the memory legalizer and the waitcount inserter absolutely needs to be blurred. They cannot exist separately; they implement the memory model together and complement each other in that process. One specific problem with having an S_WAIT_LDS_DMA_soft is that it is needed only on release operations but not on acquire operations.

> I would prefer doing the minimum amount of changes and then removing that pseudo later in favor of a generic one, rather than locking us into a specific approach right now.

Do you have any specific examples of potential concerns? This approach is not locking us into anything more than information that is already relevant to the memory model, which is orderings, scopes, and address spaces. It can't possibly lock us into anything incompatible with future work.

> I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.

That is precisely my intention. It is a mistake to think that only the memory legalizer is relevant to the memory model. It can produce "safe cache operations and waits", but it can't produce efficient ones. The real memory model has to be spread across two files, or perhaps we should merge those two files. But I don't see this as a major blocker for what I am proposing here.

Review comment (Contributor):

> In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation?

Yes, definitely. Sorry about that.

I'm going to try to lay out my thoughts in a simpler way, so I don't start contradicting myself again:

  • I agree the MemoryLegalizer and InsertWaitCnt are inseparable, but there is still some separation of concerns: InsertWaitCnt doesn't look at atomic orderings, for example. That can be blurred in the future, but as the owner of the MemoryLegalizer I would like that to be a separate, more carefully planned task, rather than something done to fix a specific issue.
    • Furthermore, I see the legalizer's role as "implement the memory model in a conservative way", while InsertWaitCnt's is "optimize the waitcnts while preserving semantics" (plus inserting new waits, of course).
  • When I imagined a generic pseudo for waitcnt insertion, I did not imagine something that includes information like the atomic ordering. I imagined something with ad-hoc bits that carry only the information we need, and nothing else.
    • For example, we have wondered a few times whether the memory legalizer needed to insert waits on vm_vsrc. That can't be conveyed using the AS/Ordering alone. We need something more specific, something where we can feel free to add new flags for any reason we see fit.
  • My worry is that if the operation is too generic, and carries info like atomic ordering, it opens the door to implementing some memory model fixes outside the legalizer (e.g. waitcnts for specific fences would now be handled by InsertWaitCnt without the legalizer's knowledge), which I do not want, as this is best kept in one place.
    • I guess it's valid to see this as irrational, since I don't have proof that it could or will happen.

So in my opinion, @kerbowa's approach in #138802 fits best.
Yes, it's not ideal, but there are a lot of things not ideal with the way things are laid out right now. I'd rather keep on the same trajectory by adding a specific pseudo, and then refactor it all in one batch, than try something new to fix a specific problem.

Again, sorry for derailing this a bit. The discussion spread over weeks and multiple PRs, so I lost context and contradicted myself a few times.

Scope >= (unsigned)AMDGPU::SIAtomicScope::WORKGROUP &&
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
LLVM_DEBUG(dbgs() << "Processing S_WAITCNT_FENCE_soft: " << II
<< "Before: " << Wait.LoadCnt << '\n';);
ScoreBrackets.determineWait(LOAD_CNT, FIRST_LDS_VGPR, Wait);
LLVM_DEBUG(dbgs() << "After: " << Wait.LoadCnt << '\n';);
}
// It is possible (but unlikely) that this is the only wait instruction,
// in which case, we exit this loop without a WaitcntInstr to consume
// `Wait`. But that works because `Wait` was passed in by reference, and
// the callee eventually calls createNewWaitcnt on it. We test this
// possibility in an artificial MIR test since such a situation cannot be
// recreated by running the memory legalizer.
II.eraseFromParent();
} else {
assert(Opcode == AMDGPU::S_WAITCNT_VSCNT);
assert(II.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
@@ -1552,6 +1580,11 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
ScoreBrackets.simplifyWaitcnt(OldWait);
Wait = Wait.combined(OldWait);
UpdatableInstr = &CombinedStoreDsCntInstr;
} else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
// Architectures higher than GFX10 do not have direct loads to
// LDS, so no work required here yet.
II.eraseFromParent();
continue;
} else {
std::optional<InstCounterType> CT = counterTypeForInstr(Opcode);
assert(CT.has_value());
@@ -2444,6 +2477,7 @@ static bool isWaitInstr(MachineInstr &Inst) {
Inst.getOperand(0).getReg() == AMDGPU::SGPR_NULL) ||
Opcode == AMDGPU::S_WAIT_LOADCNT_DSCNT ||
Opcode == AMDGPU::S_WAIT_STORECNT_DSCNT ||
Opcode == AMDGPU::S_WAITCNT_FENCE_soft ||
counterTypeForInstr(Opcode).has_value();
}

71 changes: 39 additions & 32 deletions llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -57,38 +57,6 @@ enum class Position {
AFTER
};

/// The atomic synchronization scopes supported by the AMDGPU target.
enum class SIAtomicScope {
NONE,
SINGLETHREAD,
WAVEFRONT,
WORKGROUP,
AGENT,
SYSTEM
};

/// The distinct address spaces supported by the AMDGPU target for
/// atomic memory operation. Can be ORed together.
enum class SIAtomicAddrSpace {
NONE = 0u,
GLOBAL = 1u << 0,
LDS = 1u << 1,
SCRATCH = 1u << 2,
GDS = 1u << 3,
OTHER = 1u << 4,

/// The address spaces that can be accessed by a FLAT instruction.
FLAT = GLOBAL | LDS | SCRATCH,

/// The address spaces that support atomic instructions.
ATOMIC = GLOBAL | LDS | SCRATCH | GDS,

/// All address spaces.
ALL = GLOBAL | LDS | SCRATCH | GDS | OTHER,

LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
};

class SIMemOpInfo final {
private:

@@ -1160,6 +1128,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}

// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
// later add additional waits. To minimize clutter, we do this only when
// required. For now this just means a release operation at workgroup scope
// that synchronizes LDS, required by direct loads to LDS.
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
Review comment (Contributor): that should go into some helper function.

any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
.addImm((unsigned)Order)
.addImm((unsigned)Scope)
.addImm((unsigned)AddrSpace);
Changed = true;
}

if (Pos == Position::AFTER)
--MI;

@@ -2068,6 +2049,19 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}

// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
// later add additional waits. To minimize clutter, we do this only when
// required. For now this just means a release operation at workgroup scope
// that synchronizes LDS, required by direct loads to LDS.
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
.addImm((unsigned)Order)
.addImm((unsigned)Scope)
.addImm((unsigned)AddrSpace);
Changed = true;
}

if (VSCnt) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
.addReg(AMDGPU::SGPR_NULL, RegState::Undef)
@@ -2385,6 +2379,19 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}

// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
// later add additional waits. To minimize clutter, we do this only when
// required. For now this just means a release operation at workgroup scope
// that synchronizes LDS, required by direct loads to LDS.
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
.addImm((unsigned)Order)
.addImm((unsigned)Scope)
.addImm((unsigned)AddrSpace);
Changed = true;
}

if (Pos == Position::AFTER)
--MI;

6 changes: 6 additions & 0 deletions llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1621,6 +1621,12 @@ let OtherPredicates = [HasImageInsts] in {
def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
}

def S_WAITCNT_FENCE_soft : SPseudoInstSI <
(outs), (ins i32imm:$Ordering, i32imm:$Scope, i32imm:$AddrSpace)> {
let hasSideEffects = 0;
let UseNamedOperandTable = 1;
}

def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
[(int_amdgcn_s_sethalt timm:$simm16)]>;
def S_SETKILL : SOPP_Pseudo <"s_setkill" , (ins i16imm:$simm16), "$simm16">;