[AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer #150167
The new instruction represents any wait counts resulting from a fence, which the memory legalizer cannot determine on its own. After lowering a fence to the appropriate cache operations and any necessary S_WAIT* instructions, the legalizer hands over further work to SIInsertWaitcnts, using this instruction as a placeholder. The waitcnt inserter can then make use of the additional information the placeholder carries. Currently this is used to implement efficient waitcnts for direct loads to LDS, based on the ordering, scope, and address space specified on the soft fence.
@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes:

The new instruction represents any wait counts resulting from a fence, which the memory legalizer cannot determine on its own. After lowering a fence to the appropriate cache operations and any necessary S_WAIT* instructions, the legalizer hands over further work to SIInsertWaitcnts using this instruction as a placeholder. Currently this is used to implement efficient waitcnts for direct loads to LDS, based on the ordering, scope and addrspace specified on the soft fence.

Patch is 57.03 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/150167.diff

8 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIDefines.h b/llvm/lib/Target/AMDGPU/SIDefines.h
index 3902d4c3b1027..9d30951cac1a3 100644
--- a/llvm/lib/Target/AMDGPU/SIDefines.h
+++ b/llvm/lib/Target/AMDGPU/SIDefines.h
@@ -10,6 +10,7 @@
#ifndef LLVM_LIB_TARGET_AMDGPU_SIDEFINES_H
#define LLVM_LIB_TARGET_AMDGPU_SIDEFINES_H
+#include "llvm/ADT/BitmaskEnum.h"
#include "llvm/MC/MCInstrDesc.h"
namespace llvm {
@@ -419,6 +420,38 @@ enum CPol {
} // namespace CPol
+/// The atomic synchronization scopes supported by the AMDGPU target.
+enum class SIAtomicScope {
+ NONE,
+ SINGLETHREAD,
+ WAVEFRONT,
+ WORKGROUP,
+ AGENT,
+ SYSTEM
+};
+
+/// The distinct address spaces supported by the AMDGPU target for
+/// atomic memory operation. Can be ORed together.
+enum class SIAtomicAddrSpace {
+ NONE = 0u,
+ GLOBAL = 1u << 0,
+ LDS = 1u << 1,
+ SCRATCH = 1u << 2,
+ GDS = 1u << 3,
+ OTHER = 1u << 4,
+
+ /// The address spaces that can be accessed by a FLAT instruction.
+ FLAT = GLOBAL | LDS | SCRATCH,
+
+ /// The address spaces that support atomic instructions.
+ ATOMIC = GLOBAL | LDS | SCRATCH | GDS,
+
+ /// All address spaces.
+ ALL = GLOBAL | LDS | SCRATCH | GDS | OTHER,
+
+ LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
+};
+
namespace SendMsg { // Encoding of SIMM16 used in s_sendmsg* insns.
enum Id { // Message ID, width(4) [3:0].
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 9faf4974e3fd6..ca1c534896b5a 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -37,7 +37,9 @@
#include "llvm/CodeGen/MachinePostDominators.h"
#include "llvm/Support/DebugCounter.h"
#include "llvm/TargetParser/TargetParser.h"
+
using namespace llvm;
+using namespace AMDGPU;
#define DEBUG_TYPE "si-insert-waitcnts"
@@ -1381,6 +1383,32 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
Modified = true;
} else
WaitcntInstr = &II;
+ } else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
+ // Each direct load to LDS is also a store to LDS, but we do not have a
+ // separate counter for it. Instead these operations increment LOAD_CNT
+ // and need to be waited for at a release fence. So we treat a release
+ // fence as if it depends on any previous LDS DMA stores.
+ unsigned Ordering =
+ TII->getNamedOperand(II, AMDGPU::OpName::Ordering)->getImm();
+ unsigned Scope =
+ TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
+ unsigned AddrSpace =
+ TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
+ if (isReleaseOrStronger((AtomicOrdering)Ordering) &&
+ Scope >= (unsigned)AMDGPU::SIAtomicScope::WORKGROUP &&
+ any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
+ LLVM_DEBUG(dbgs() << "Processing S_WAITCNT_FENCE_soft: " << II
+ << "Before: " << Wait.LoadCnt << '\n';);
+ ScoreBrackets.determineWait(LOAD_CNT, FIRST_LDS_VGPR, Wait);
+ LLVM_DEBUG(dbgs() << "After: " << Wait.LoadCnt << '\n';);
+ }
+ // It is possible (but unlikely) that this is the only wait instruction,
+ // in which case, we exit this loop without a WaitcntInstr to consume
+ // `Wait`. But that works because `Wait` was passed in by reference, and
+ // the callee eventually calls createNewWaitcnt on it. We test this
+ // possibility in an artificial MIR test since such a situation cannot be
+ // recreated by running the memory legalizer.
+ II.eraseFromParent();
} else {
assert(Opcode == AMDGPU::S_WAITCNT_VSCNT);
assert(II.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
@@ -1552,6 +1580,11 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
ScoreBrackets.simplifyWaitcnt(OldWait);
Wait = Wait.combined(OldWait);
UpdatableInstr = &CombinedStoreDsCntInstr;
+ } else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
+ // Architectures higher than GFX10 do not have direct loads to
+ // LDS, so no work required here yet.
+ II.eraseFromParent();
+ continue;
} else {
std::optional<InstCounterType> CT = counterTypeForInstr(Opcode);
assert(CT.has_value());
@@ -2444,6 +2477,7 @@ static bool isWaitInstr(MachineInstr &Inst) {
Inst.getOperand(0).getReg() == AMDGPU::SGPR_NULL) ||
Opcode == AMDGPU::S_WAIT_LOADCNT_DSCNT ||
Opcode == AMDGPU::S_WAIT_STORECNT_DSCNT ||
+ Opcode == AMDGPU::S_WAITCNT_FENCE_soft ||
counterTypeForInstr(Opcode).has_value();
}
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..6e998098a1ab2 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -57,38 +57,6 @@ enum class Position {
AFTER
};
-/// The atomic synchronization scopes supported by the AMDGPU target.
-enum class SIAtomicScope {
- NONE,
- SINGLETHREAD,
- WAVEFRONT,
- WORKGROUP,
- AGENT,
- SYSTEM
-};
-
-/// The distinct address spaces supported by the AMDGPU target for
-/// atomic memory operation. Can be ORed together.
-enum class SIAtomicAddrSpace {
- NONE = 0u,
- GLOBAL = 1u << 0,
- LDS = 1u << 1,
- SCRATCH = 1u << 2,
- GDS = 1u << 3,
- OTHER = 1u << 4,
-
- /// The address spaces that can be accessed by a FLAT instruction.
- FLAT = GLOBAL | LDS | SCRATCH,
-
- /// The address spaces that support atomic instructions.
- ATOMIC = GLOBAL | LDS | SCRATCH | GDS,
-
- /// All address spaces.
- ALL = GLOBAL | LDS | SCRATCH | GDS | OTHER,
-
- LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
-};
-
class SIMemOpInfo final {
private:
@@ -1160,6 +1128,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}
+ // Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
+ // later add additional waits. To minimize clutter, we do this only when
+ // required. For now this just means a release operation at workgroup scope
+ // that synchronizes LDS, required by direct loads to LDS.
+ if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
+ any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
+ .addImm((unsigned)Order)
+ .addImm((unsigned)Scope)
+ .addImm((unsigned)AddrSpace);
+ Changed = true;
+ }
+
if (Pos == Position::AFTER)
--MI;
@@ -2068,6 +2049,19 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}
+ // Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
+ // later add additional waits. To minimize clutter, we do this only when
+ // required. For now this just means a release operation at workgroup scope
+ // that synchronizes LDS, required by direct loads to LDS.
+ if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
+ any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
+ .addImm((unsigned)Order)
+ .addImm((unsigned)Scope)
+ .addImm((unsigned)AddrSpace);
+ Changed = true;
+ }
+
if (VSCnt) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
.addReg(AMDGPU::SGPR_NULL, RegState::Undef)
@@ -2385,6 +2379,19 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
Changed = true;
}
+ // Emit a soft wait count as a place holder for SIInsertWaitcnts, which will
+ // later add additional waits. To minimize clutter, we do this only when
+ // required. For now this just means a release operation at workgroup scope
+ // that synchronizes LDS, required by direct loads to LDS.
+ if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
+ any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) {
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft))
+ .addImm((unsigned)Order)
+ .addImm((unsigned)Scope)
+ .addImm((unsigned)AddrSpace);
+ Changed = true;
+ }
+
if (Pos == Position::AFTER)
--MI;
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index e103ccc2f00e6..df1a7bd4424bc 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1621,6 +1621,12 @@ let OtherPredicates = [HasImageInsts] in {
def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
}
+def S_WAITCNT_FENCE_soft : SPseudoInstSI <
+ (outs), (ins i32imm:$Ordering, i32imm:$Scope, i32imm:$AddrSpace)> {
+ let hasSideEffects = 0;
+ let UseNamedOperandTable = 1;
+}
+
def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
[(int_amdgcn_s_sethalt timm:$simm16)]>;
def S_SETKILL : SOPP_Pseudo <"s_setkill" , (ins i16imm:$simm16), "$simm16">;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..1f01c64de546c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -536,30 +536,36 @@ entry:
define amdgpu_kernel void @workgroup_one_as_release() #0 {
; GFX6-LABEL: name: workgroup_one_as_release
; GFX6: bb.0.entry:
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_one_as_release
; GFX8: bb.0.entry:
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_one_as_release
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 16240
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: S_ENDPGM 0
;
; GFX10CU-LABEL: name: workgroup_one_as_release
; GFX10CU: bb.0.entry:
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_one_as_release
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 1015
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: S_ENDPGM 0
;
; GFX11CU-LABEL: name: workgroup_one_as_release
; GFX11CU: bb.0.entry:
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup-one-as") release
@@ -569,32 +575,38 @@ entry:
define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
; GFX6-LABEL: name: workgroup_one_as_acq_rel
; GFX6: bb.0.entry:
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_one_as_acq_rel
; GFX8: bb.0.entry:
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 16240
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
;
; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
; GFX10CU: bb.0.entry:
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 1015
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
;
; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
; GFX11CU: bb.0.entry:
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup-one-as") acq_rel
@@ -604,32 +616,38 @@ entry:
define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
; GFX6-LABEL: name: workgroup_one_as_seq_cst
; GFX6: bb.0.entry:
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_one_as_seq_cst
; GFX8: bb.0.entry:
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 16240
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
;
; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
; GFX10CU: bb.0.entry:
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 1015
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
;
; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
; GFX11CU: bb.0.entry:
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup-one-as") seq_cst
@@ -1283,33 +1301,39 @@ define amdgpu_kernel void @workgroup_release() #0 {
; GFX6-LABEL: name: workgroup_release
; GFX6: bb.0.entry:
; GFX6-NEXT: S_WAITCNT_soft 127
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_release
; GFX8: bb.0.entry:
; GFX8-NEXT: S_WAITCNT_soft 127
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_release
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 112
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: S_ENDPGM 0
;
; GFX10CU-LABEL: name: workgroup_release
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_release
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: S_ENDPGM 0
;
; GFX11CU-LABEL: name: workgroup_release
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") release
@@ -1320,16 +1344,19 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX6-LABEL: name: workgroup_acq_rel
; GFX6: bb.0.entry:
; GFX6-NEXT: S_WAITCNT_soft 127
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_acq_rel
; GFX8: bb.0.entry:
; GFX8-NEXT: S_WAITCNT_soft 127
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_acq_rel
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 112
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
@@ -1337,11 +1364,13 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX10CU-LABEL: name: workgroup_acq_rel
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_acq_rel
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
@@ -1349,6 +1378,7 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX11CU-LABEL: name: workgroup_acq_rel
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") acq_rel
@@ -1359,16 +1389,19 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX6-LABEL: name: workgroup_seq_cst
; GFX6: bb.0.entry:
; GFX6-NEXT: S_WAITCNT_soft 127
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_seq_cst
; GFX8: bb.0.entry:
; GFX8-NEXT: S_WAITCNT_soft 127
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_seq_cst
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 112
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
@@ -1376,11 +1409,13 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX10CU-LABEL: name: workgroup_seq_cst
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_seq_cst
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
@@ -1388,6 +1423,7 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX11CU-LABEL: name: workgroup_seq_cst
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") seq_cst
diff --git a/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir b/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir
new file mode 100644
index 0000000000000..b5f5a359ee49b
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir
@@ -0,0 +1,153 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn -mcpu=gfx942 -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck -check-prefix=GCN %s
+
+
+# Expected vmcnt(0) since the direct load is the only load.
+---
+name: dma_then_fence
+body: |
+ bb.0:
+ ; GCN-LABEL: name: dma_then_fence
+ ; GCN: S_WAITCNT 0
+ ; GCN-NEXT: $m0 = S_MOV_B32 0
+ ; GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4, addrspace 1), (store (s32) into `ptr addrspace(3) poison` + 4, addrspace 3)
+ ; GCN-NEXT: S_WAITCNT 3952
+ ; GCN-NEXT: $vgpr1 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+ ; GCN-NEXT: S_ENDPGM 0
+ $m0 = S_MOV_B32 0
+ BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+ S_WAITCNT_FENCE_soft 5, 3, 15
+ $vgpr1 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+ S_ENDPGM 0
+
+...
+
+# Expected vmcnt(1) since the global load is not processed by SIInsertWaitcnts.
+
+---
+name: dma_then_global_load
+body: |
+ bb.0:
+ ; GCN-LABEL: name: dma_then_global_load
+ ; GCN: S_WAITCNT 0
+ ; GCN-NEXT: $m0 = S_MOV_B32 0
+ ; GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4, addrspace 1), (store (s32) into `ptr addrspace(3) poison` + 4, addrspace 3)
+ ; GCN-NEXT: $vgpr2 = GLOBAL_LOAD_DWORD $vgpr4_vgpr5, 0, 0, implicit $exec
+ ; GCN-NEXT: S_WAITCNT 3953
+ ; GCN-NEXT: $vgpr1 = V_AD...
[truncated]
@llvm/pr-subscribers-llvm-globalisel
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_release
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: S_ENDPGM 0
;
; GFX11CU-LABEL: name: workgroup_release
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") release
@@ -1320,16 +1344,19 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX6-LABEL: name: workgroup_acq_rel
; GFX6: bb.0.entry:
; GFX6-NEXT: S_WAITCNT_soft 127
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_acq_rel
; GFX8: bb.0.entry:
; GFX8-NEXT: S_WAITCNT_soft 127
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_acq_rel
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 112
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
@@ -1337,11 +1364,13 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX10CU-LABEL: name: workgroup_acq_rel
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_acq_rel
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
@@ -1349,6 +1378,7 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
; GFX11CU-LABEL: name: workgroup_acq_rel
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") acq_rel
@@ -1359,16 +1389,19 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX6-LABEL: name: workgroup_seq_cst
; GFX6: bb.0.entry:
; GFX6-NEXT: S_WAITCNT_soft 127
+ ; GFX6-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX6-NEXT: S_ENDPGM 0
;
; GFX8-LABEL: name: workgroup_seq_cst
; GFX8: bb.0.entry:
; GFX8-NEXT: S_WAITCNT_soft 127
+ ; GFX8-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX8-NEXT: S_ENDPGM 0
;
; GFX10WGP-LABEL: name: workgroup_seq_cst
; GFX10WGP: bb.0.entry:
; GFX10WGP-NEXT: S_WAITCNT_soft 112
+ ; GFX10WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX10WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX10WGP-NEXT: S_ENDPGM 0
@@ -1376,11 +1409,13 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX10CU-LABEL: name: workgroup_seq_cst
; GFX10CU: bb.0.entry:
; GFX10CU-NEXT: S_WAITCNT_soft 49279
+ ; GFX10CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX10CU-NEXT: S_ENDPGM 0
;
; GFX11WGP-LABEL: name: workgroup_seq_cst
; GFX11WGP: bb.0.entry:
; GFX11WGP-NEXT: S_WAITCNT_soft 7
+ ; GFX11WGP-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11WGP-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
; GFX11WGP-NEXT: BUFFER_GL0_INV implicit $exec
; GFX11WGP-NEXT: S_ENDPGM 0
@@ -1388,6 +1423,7 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
; GFX11CU-LABEL: name: workgroup_seq_cst
; GFX11CU: bb.0.entry:
; GFX11CU-NEXT: S_WAITCNT_soft 64519
+ ; GFX11CU-NEXT: S_WAITCNT_FENCE_soft 5, 3, 15
; GFX11CU-NEXT: S_ENDPGM 0
entry:
fence syncscope("workgroup") seq_cst
diff --git a/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir b/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir
new file mode 100644
index 0000000000000..b5f5a359ee49b
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/insert-waitcnts-fence-soft.mir
@@ -0,0 +1,153 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn -mcpu=gfx942 -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck -check-prefix=GCN %s
+
+
+# Expected vmcnt(0) since the direct load is the only load.
+---
+name: dma_then_fence
+body: |
+ bb.0:
+ ; GCN-LABEL: name: dma_then_fence
+ ; GCN: S_WAITCNT 0
+ ; GCN-NEXT: $m0 = S_MOV_B32 0
+ ; GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4, addrspace 1), (store (s32) into `ptr addrspace(3) poison` + 4, addrspace 3)
+ ; GCN-NEXT: S_WAITCNT 3952
+ ; GCN-NEXT: $vgpr1 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+ ; GCN-NEXT: S_ENDPGM 0
+ $m0 = S_MOV_B32 0
+ BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+ S_WAITCNT_FENCE_soft 5, 3, 15
+ $vgpr1 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+ S_ENDPGM 0
+
+...
+
+# Expected vmcnt(1) since the global load is not processed by SIInsertWaitcnts.
+
+---
+name: dma_then_global_load
+body: |
+ bb.0:
+ ; GCN-LABEL: name: dma_then_global_load
+ ; GCN: S_WAITCNT 0
+ ; GCN-NEXT: $m0 = S_MOV_B32 0
+ ; GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4, addrspace 1), (store (s32) into `ptr addrspace(3) poison` + 4, addrspace 3)
+ ; GCN-NEXT: $vgpr2 = GLOBAL_LOAD_DWORD $vgpr4_vgpr5, 0, 0, implicit $exec
+ ; GCN-NEXT: S_WAITCNT 3953
+ ; GCN-NEXT: $vgpr1 = V_AD...
[truncated]
- fence does not specify LDS - fence is not a release
If I understand correctly, this does not (in its current form) defer waitcnt insertion entirely to InsertWaitCnt; it just adds a context pseudo before the waitcnts that are usually inserted, right? That means we're one small patch away from moving all of the waitcnt insertion from the legalizer to InsertWaitCnt?
// later add additional waits. To minimize clutter, we do this only when
// required. For now this just means a release operation at workgroup scope
// that synchronizes LDS, required by direct loads to LDS.
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP &&
that should go into some helper function
    TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
unsigned AddrSpace =
    TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
if (isReleaseOrStronger((AtomicOrdering)Ordering) && |
Thinking about it, this part bothers me a bit because now InsertWaitCnt has to be aware of atomic orderings and deal with them accordingly. It blurs the separation of concerns between this pass and the MemoryLegalizer.
I know there is a good argument for doing that, but I think this being too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).
Can we consider adding a simple `s_wait_lds_dma_soft` instead, targeted exactly at this use case, and emit that? I would prefer doing the minimum amount of changes, and then removing that pseudo later in favor of a generic one, rather than locking us into a specific approach right now.
I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.
I know there is a good argument for doing that, but I think this being too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).
Quoting from #147257:
Something still doesn't feel right with this PR for me, I feel like this isn't the right approach but I struggle to suggest something better.
Longer term we should really just have a single waitcnt pseudo for the MemoryLegalizer that is target-independent, it'd fix issues like these if we had special sentinel values for different things.
In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation? What could be more effective as "a single waitcnt pseudo" than a fence? What is an example of something less generic than a fence and yet effective for all uses?
Can we consider adding a simple s_wait_lds_dma_soft instead, targeted exactly for this use case, and emit that ?
I consider that too specific. I contend that the distinction between the memory legalizer and the waitcount inserter absolutely needs to be blurred. They cannot exist separately, they implement the memory model together and complement each other in that process. One specific problem with having an S_WAIT_LDS_DMA_soft is that it is needed only on release operations but not on acquire operations.
I would prefer doing the minimum amount of changes, and then removing that pseudo later in favor of a generic one, than locking us into a specific approach right now.
Do you have any specific examples of potential concerns? This approach is not locking us into anything more than information that is already relevant to the memory model, which is orderings, scopes and address spaces. It can't possibly lock us into anything incompatible with future work.
I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.
That is precisely my intention. It is a mistake to think that only the memory legalizer is relevant to the memory model. It can produce "safe cache operations and waits", but it can't produce efficient ones. The real memory model has to be spread across two files, or perhaps we should merge those two files. But I don't see this as a major blocker for what I am proposing here.
In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation?
Yes, definitely. Sorry about that.
I'm going to try to lay out my thoughts in a simpler way, so I don't start contradicting myself again:
- I agree the MemoryLegalizer and InsertWaitCnt are inseparable, but there is still some separation of concern: InsertWaitCnt doesn't look at atomic orderings for example. That can be blurred in the future, but as the owner of the MemoryLegalizer I would like it to be a separate task that's more carefully planned, rather than done to fix a specific issue.
- Furthermore, I see the legalizer's role as "implement the memory model in a conservative way" while InsertWaitCnt is "optimize the waitcnts while preserving semantics" (+ insert new waits ofc)
- When I imagined a generic pseudo for waitcnt insertion, I did not imagine something that includes information like the atomic ordering. I imagined something with ad-hoc bits that carry only the information we need, and nothing else.
- For example, we had a few times when we wondered if the memory legalizer needed to insert waits on vm_vsrc. That can't be conveyed using the AS/Ordering alone. We need something more specific, something where we can feel free to add new flags for any reason we see fit.
- My worry is that if the operation is too generic, and carries info like atomic ordering, it opens the door to implementing some memory model fixes outside the legalizer (e.g. waitcnts for specific fences would now be handled by InsertWaitCnt without the legalizer's knowledge), which I do not want as it's best kept in one place.
- I guess it's valid to see it as irrational as I don't have proof that it could/will happen.
So in my opinion, @kerbowa's approach in #138802 fits best.
Yes it's not ideal, but there's a lot of things not ideal with the current way things are laid out right now. I'd rather keep on the same trajectory by adding a specific pseudo, and then refactor it all in one batch, than try something new to fix a specific problem.
Again, sorry for derailing this a bit. The discussion spread over weeks and multiple PRs so I lost context and contradicted myself a few times.
Yes, I would love to submit a patch that moves all wait count insertion to the waitcnt inserter (because after all, that is its name). But there is no hurry in fixing what isn't broken right now.
Despite my rant about moving all waitcnt insertion to the waitcnt inserter pass, this specific thing smells wrong to me. S_WAITCNT_FENCE_soft might be a powerful tool, but I don't see any significant use of it other than fixing this one bug about direct loads to LDS, and that only exists on some specific older architectures. Looks a lot like a solution looking for problems to justify its existence. I would rather just go back to what @kerbowa had originally proposed, with #138802 and #142018, except with new insight, we can do it in a much more precise surgery.
The new instruction represents any wait counts resulting from a fence, which the
memory legalizer cannot determine on its own. After lowering a fence to the
appropriate cache operations and any necessary S_WAIT* instructions, the
legalizer hands over further work to SIInsertWaitcnts using this instruction as
a placeholder. The waitcnt inserter can then use the additional information the
placeholder carries.
Currently this is used to implement efficient waitcnts for direct loads to LDS,
based on the ordering, scope and addrspace specified on the soft fence.