-
Notifications
You must be signed in to change notification settings - Fork 14.7k
[AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer #150167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,38 +57,6 @@ enum class Position { | |
AFTER | ||
}; | ||
|
||
/// The atomic synchronization scopes supported by the AMDGPU target. | ||
enum class SIAtomicScope { | ||
NONE, | ||
SINGLETHREAD, | ||
WAVEFRONT, | ||
WORKGROUP, | ||
AGENT, | ||
SYSTEM | ||
}; | ||
|
||
/// The distinct address spaces supported by the AMDGPU target for | ||
/// atomic memory operation. Can be ORed together. | ||
enum class SIAtomicAddrSpace { | ||
NONE = 0u, | ||
GLOBAL = 1u << 0, | ||
LDS = 1u << 1, | ||
SCRATCH = 1u << 2, | ||
GDS = 1u << 3, | ||
OTHER = 1u << 4, | ||
|
||
/// The address spaces that can be accessed by a FLAT instruction. | ||
FLAT = GLOBAL | LDS | SCRATCH, | ||
|
||
/// The address spaces that support atomic instructions. | ||
ATOMIC = GLOBAL | LDS | SCRATCH | GDS, | ||
|
||
/// All address spaces. | ||
ALL = GLOBAL | LDS | SCRATCH | GDS | OTHER, | ||
|
||
LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL) | ||
}; | ||
|
||
class SIMemOpInfo final { | ||
private: | ||
|
||
|
@@ -1160,6 +1128,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI, | |
Changed = true; | ||
} | ||
|
||
// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will | ||
// later add additional waits. To minimize clutter, we do this only when | ||
// required. For now this just means a release operation at workgroup scope | ||
// that synchronizes LDS, required by direct loads to LDS. | ||
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP && | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that should go into some helper function |
||
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) { | ||
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft)) | ||
.addImm((unsigned)Order) | ||
.addImm((unsigned)Scope) | ||
.addImm((unsigned)AddrSpace); | ||
Changed = true; | ||
} | ||
|
||
if (Pos == Position::AFTER) | ||
--MI; | ||
|
||
|
@@ -2068,6 +2049,19 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI, | |
Changed = true; | ||
} | ||
|
||
// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will | ||
// later add additional waits. To minimize clutter, we do this only when | ||
// required. For now this just means a release operation at workgroup scope | ||
// that synchronizes LDS, required by direct loads to LDS. | ||
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP && | ||
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) { | ||
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft)) | ||
.addImm((unsigned)Order) | ||
.addImm((unsigned)Scope) | ||
.addImm((unsigned)AddrSpace); | ||
Changed = true; | ||
} | ||
|
||
if (VSCnt) { | ||
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft)) | ||
.addReg(AMDGPU::SGPR_NULL, RegState::Undef) | ||
|
@@ -2385,6 +2379,19 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI, | |
Changed = true; | ||
} | ||
|
||
// Emit a soft wait count as a place holder for SIInsertWaitcnts, which will | ||
// later add additional waits. To minimize clutter, we do this only when | ||
// required. For now this just means a release operation at workgroup scope | ||
// that synchronizes LDS, required by direct loads to LDS. | ||
if (isReleaseOrStronger(Order) && Scope == SIAtomicScope::WORKGROUP && | ||
any((SIAtomicAddrSpace)AddrSpace & SIAtomicAddrSpace::LDS)) { | ||
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_FENCE_soft)) | ||
.addImm((unsigned)Order) | ||
.addImm((unsigned)Scope) | ||
.addImm((unsigned)AddrSpace); | ||
Changed = true; | ||
} | ||
|
||
if (Pos == Position::AFTER) | ||
--MI; | ||
|
||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, this part bothers me a bit because now InsertWaitCnt has to be aware of atomic orderings and deal with them accordingly. It blurs the separation of concerns between this pass and the MemoryLegalizer.
I know there is a good argument for doing that, but I think this being too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).
Can we consider adding a simple
s_wait_lds_dma_soft
instead, targeted exactly for this use case, and emit that ? I would prefer doing the minimum amount of changes, and then removing that pseudo later in favor of a generic one, than locking us into a specific approach right now.I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting from #147257:
In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation? What could be more effective as "a single waitcnt pseudo" than a fence? What is an example of something less generic than a fence and yet effective for all uses?
I consider that too specific. I contend that the distinction between the memory legalizer and the waitcount inserter absolutely needs to be blurred. They cannot exist separately, they implement the memory model together and complement each other in that process. One specific problem with having an S_WAIT_LDS_DMA_soft is that it is needed only on release operations but not on acquire operations.
Do you have any specific examples of potential concerns? This approach is not locking us into anything more than information that is already relevant to the memory model, which is orderings, scopes and address spaces. It can't possibly lock us into anything incompatible with future work.
That is precisely my intention. It is a mistake to think that only the memory legalizer is relevant to the memory model. It can produce "safe cache operations and waits", but it can't produce efficient ones. The real memory model has to be spread across two files, or perhaps we should merge those two files. But I don't see this as a major blocker for what I am proposing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, definitely. Sorry about that.
I'm going to try to lay out my thoughts in a simpler way, so I don't start contradicting myself again:
So in my opinion, @kerbowa's approach in #138802 fits best.
Yes it's not ideal, but there's a lot of things not ideal with the current way things are laid out right now. I'd rather keep on the same trajectory by adding a specific pseudo, and then refactor it all in one batch, than try something new to fix a specific problem.
Again, sorry for derailing this a bit. The discussion spread over weeks and multiple PRs so I lost context and contradicted myself a few times.