Skip to content

Conversation

@martien-de-jong
Copy link
Collaborator

This adds an alternative to the latency-aware heuristic in WAWRegisterRewriter and selects between the two based on LoopClass.

// The initial graph will have ordering edges induced by hasSideEffects of the
// locks/DONE.
class LockDelays : public ScheduleDAGMutation {
bool ExactLatencies = true;
Copy link
Collaborator

@andcarminati andcarminati Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this:

 class LockDelays : public ScheduleDAGMutation {
-  bool ExactLatencies = true;
+  bool ExactLatencies;
   void apply(ScheduleDAGInstrs *DAG) override {
     const auto *TII = static_cast<const AIEBaseInstrInfo *>(DAG->TII);
     const int CoreStallCycle = TII->getCoreStallCycleAfterLock();
@@ -243,7 +243,7 @@ class LockDelays : public ScheduleDAGMutation {
   }
 
 public:
-  LockDelays(bool ExactLatencies) : ExactLatencies(ExactLatencies) {};
+  LockDelays(bool ExactLatencies = true) : ExactLatencies(ExactLatencies) {};
 };

We don't need any change in the current instantiation (getPostRAMutationsImpl) of the mutators.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's true. But I have bad feelings about default parameters in constructors. It restricts future constructors which would be ambiguous.

std::optional<int> MemLat = TII->getMemoryLatency(
SrcMI.getDesc().getSchedClass(), MI.getDesc().getSchedClass());
if (!MemLat.has_value()) {
int Latency = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK: if we don't need exact latencies, and we don't have them, we use a very optimistic value instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very optimistic. For a RAW ST -> LD pair, the latency is actually 1, since the LOAD reads memory late.. It's also the default that was assumed before Gaetan made it defined by a target hook.

@andcarminati
Copy link
Collaborator

nit: check this commit message: [AIE][NFC] Use equivalant LoopUtils function.

}

bool AIEWawRegRewriter::renameMBBPhysRegs(const MachineBasicBlock *MBB) {
// We do this mainly for the postpipeliner, and it will want to see ZOL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could have some verifier problems by accepting ZOL loops before MachineBlockPlacement, right?

It is better to choose one direction here:

  • Implement another hook, like isPreLayoutZOLBody.
  • Extend isZOLBody and target machine verifier.
  • Clearly state in the comment why we are not relying isZOLBody here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, sometimes we can see some unexpected gains related to PreSWP + WAWRewriter. Are we preventing gains in some pipelined decrement/branch loop here? Could we guard this behavior with a command line option, to be able to run additional tests in the future?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Under these assumptions, we should perhaps not constrain it to single-block loops at all, because the only advantage we can have from PreSWP loops is the scheduler, and for the scheduler anything might work.
I don't think isPreLayoutZOLBody is very clear concept, it is so tied to the flow.
If you have evidence of the combination of PreSWP, CountedLoops and WAWRegRewriter paying off, I would suggest to just remove the check. My main motivation was that I saw it kick in on loops with calls inside, which is concisely prevented by the ZOL check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even AIEBaseInstrInfo::isZOLBody(const MachineBasicBlock &MBB, bool OnlyCheckLastInstr = true)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I found evidence in AvgPool. Removed the new check


static cl::opt<bool>
LatencyAware("aie-realloc-latencyaware", cl::Hidden, cl::init(true),
static cl::opt<int>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out the rationale behind this change...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm overloading here, I agree. for tuning, I want to select false, true, or autoselect.

Martien de Jong added 4 commits November 10, 2025 13:56
We pass some flags down which control the asserts that are meant to check
completeness of the scheduler model for postscheduler's sake.
That allows us to create a more approximate ddg in earlier stages.

These would throw, mainly on occurrences of PseudoInstructions
Reorder the allocation order of the candidates based on an approximate
pipeline schedule.
Martien de Jong added 3 commits November 10, 2025 16:51
The enable flags are now integers, with 0 meaning false, 1 meaning true,
anything else meaning auto-select based on LoopClass.

This selection avoids running swpaware if latencyaware has run

Add some tuning based on LoopClass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants