Skip to content

Commit 9d72e03

Browse files
committed
[LoopPeel] Fix branch weights' effect on block frequencies
For example: ``` declare void @f(i32) define void @test(i32 %n) { entry: br label %do.body do.body: %i = phi i32 [ 0, %entry ], [ %inc, %do.body ] %inc = add i32 %i, 1 call void @f(i32 %i) %c = icmp sge i32 %inc, %n br i1 %c, label %do.end, label %do.body, !prof !0 do.end: ret void } !0 = !{!"branch_weights", i32 1, i32 9} ``` Given those branch weights, once any loop iteration is actually reached, the probability of the loop exiting at the iteration's end is 1/(1+9). That is, the loop is likely to exit every 10 iterations and thus has an estimated trip count of 10. `opt -passes='print<block-freq>'` shows that 10 is indeed the frequency of the loop body: ``` Printing analysis results of BFI for function 'test': block-frequency-info: test - entry: float = 1.0, int = 1801439852625920 - do.body: float = 10.0, int = 18014398509481984 - do.end: float = 1.0, int = 1801439852625920 ``` Key Observation: The frequency of reaching any particular iteration is less than for the previous iteration because the previous iteration has a non-zero probability of exiting the loop. This observation holds even though every loop iteration, once actually reached, has exactly the same probability of exiting and thus exactly the same branch weights. Now we use `opt -unroll-force-peel-count=2 -passes=loop-unroll` to peel 2 iterations and insert them before the remaining loop. We expect the key observation above not to change, but it does under the implementation without this patch. The block frequency becomes 1.0 for the first iteration, 0.9 for the second, and 6.4 for the main loop body. Again, a decreasing frequency is expected, but it decreases too much: the total frequency of the original loop body becomes 8.3. The new branch weights reveal the problem: ``` !0 = !{!"branch_weights", i32 1, i32 9} !1 = !{!"branch_weights", i32 1, i32 8} !2 = !{!"branch_weights", i32 1, i32 7} ``` The exit probability is now 1/10 for the first peeled iteration, 1/9 for the second, and 1/8 for the remaining loop iterations. It seems this behavior was trying to ensure a decreasing block frequency. However, as in the key observation above for the original loop, that happens correctly without decreasing the branch weights across iterations. This patch changes the peeling implementation not to decrease the branch weights across loop iterations so that the frequency for every iteration is the same as it was in the original loop. The total frequency of the loop body, summed across all its occurrences, thus remains 10 after peeling. Unfortunately, that change means a later analysis cannot accurately estimate the trip count of the remaining loop while examining the remaining loop in isolation without considering the probability of actually reaching it. For that purpose, this patch stores the new trip count as separate metadata named `llvm.loop.estimated_trip_count` and extends `llvm::getLoopEstimatedTripCount` to prefer it, if present, over branch weights. An alternative fix is for `llvm::getLoopEstimatedTripCount` to subtract the `llvm.loop.peeled.count` metadata from the trip count estimated by a loop's branch weights. However, there might be other loop transformations that still corrupt block frequencies in a similar manner and require a similar fix. `llvm.loop.estimated_trip_count` is intended to provide a general way to store estimated trip counts when branch weights cannot directly store them. This patch introduces several fixme comments that need to be addressed before it can land.
1 parent e619073 commit 9d72e03

File tree

8 files changed

+208
-154
lines changed

8 files changed

+208
-154
lines changed

llvm/include/llvm/Transforms/Utils/LoopUtils.h

Lines changed: 17 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -315,7 +315,8 @@ TransformationMode hasLICMVersioningTransformation(const Loop *L);
315315
void addStringMetadataToLoop(Loop *TheLoop, const char *MDString,
316316
unsigned V = 0);
317317

318-
/// Returns a loop's estimated trip count based on branch weight metadata.
318+
/// Returns a loop's estimated trip count based on
319+
/// llvm.loop.estimated_trip_count metadata or, if none, branch weight metadata.
319320
/// In addition if \p EstimatedLoopInvocationWeight is not null it is
320321
/// initialized with weight of loop's latch leading to the exit.
321322
/// Returns a valid positive trip count, saturated at UINT_MAX, or std::nullopt
@@ -324,13 +325,21 @@ std::optional<unsigned>
324325
getLoopEstimatedTripCount(Loop *L,
325326
unsigned *EstimatedLoopInvocationWeight = nullptr);
326327

327-
/// Set a loop's branch weight metadata to reflect that loop has \p
328-
/// EstimatedTripCount iterations and \p EstimatedLoopInvocationWeight exits
329-
/// through latch. Returns true if metadata is successfully updated, false
330-
/// otherwise. Note that loop must have a latch block which controls loop exit
331-
/// in order to succeed.
332-
bool setLoopEstimatedTripCount(Loop *L, unsigned EstimatedTripCount,
333-
unsigned EstimatedLoopInvocationWeight);
328+
/// Set a loop's llvm.loop.estimated_trip_count metadata and, if \p
329+
/// EstimatedLoopInvocationWeight, branch weight metadata to reflect that loop
330+
/// has \p EstimatedTripCount iterations and \p EstimatedLoopInvocationWeight
331+
/// exit weight through latch. Returns true if metadata is successfully updated,
332+
/// false otherwise. Note that loop must have a latch block which controls loop
333+
/// exit in order to succeed.
334+
///
335+
/// The use case for not setting branch weight metadata is when the original
336+
/// branch weight metadata is correct for computing block frequencies but the
337+
/// trip count has changed due to a loop transformation. The branch weight
338+
/// metadata cannot be adjusted to reflect the new trip count, so we store the
339+
/// new trip count separately.
340+
bool setLoopEstimatedTripCount(
341+
Loop *L, unsigned EstimatedTripCount,
342+
std::optional<unsigned> EstimatedLoopInvocationWeight);
334343

335344
/// Check inner loop (L) backedge count is known to be invariant on all
336345
/// iterations of its outer loop. If the loop has no parent, this is trivially

llvm/lib/Transforms/Utils/LoopPeel.cpp

Lines changed: 52 additions & 93 deletions
Original file line numberDiff line numberDiff line change
@@ -655,84 +655,6 @@ void llvm::computePeelCount(Loop *L, unsigned LoopSize,
655655
}
656656
}
657657

658-
struct WeightInfo {
659-
// Weights for current iteration.
660-
SmallVector<uint32_t> Weights;
661-
// Weights to subtract after each iteration.
662-
const SmallVector<uint32_t> SubWeights;
663-
};
664-
665-
/// Update the branch weights of an exiting block of a peeled-off loop
666-
/// iteration.
667-
/// Let F is a weight of the edge to continue (fallthrough) into the loop.
668-
/// Let E is a weight of the edge to an exit.
669-
/// F/(F+E) is a probability to go to loop and E/(F+E) is a probability to
670-
/// go to exit.
671-
/// Then, Estimated ExitCount = F / E.
672-
/// For I-th (counting from 0) peeled off iteration we set the weights for
673-
/// the peeled exit as (EC - I, 1). It gives us reasonable distribution,
674-
/// The probability to go to exit 1/(EC-I) increases. At the same time
675-
/// the estimated exit count in the remainder loop reduces by I.
676-
/// To avoid dealing with division rounding we can just multiple both part
677-
/// of weights to E and use weight as (F - I * E, E).
678-
static void updateBranchWeights(Instruction *Term, WeightInfo &Info) {
679-
setBranchWeights(*Term, Info.Weights, /*IsExpected=*/false);
680-
for (auto [Idx, SubWeight] : enumerate(Info.SubWeights))
681-
if (SubWeight != 0)
682-
// Don't set the probability of taking the edge from latch to loop header
683-
// to less than 1:1 ratio (meaning Weight should not be lower than
684-
// SubWeight), as this could significantly reduce the loop's hotness,
685-
// which would be incorrect in the case of underestimating the trip count.
686-
Info.Weights[Idx] =
687-
Info.Weights[Idx] > SubWeight
688-
? std::max(Info.Weights[Idx] - SubWeight, SubWeight)
689-
: SubWeight;
690-
}
691-
692-
/// Initialize the weights for all exiting blocks.
693-
static void initBranchWeights(DenseMap<Instruction *, WeightInfo> &WeightInfos,
694-
Loop *L) {
695-
SmallVector<BasicBlock *> ExitingBlocks;
696-
L->getExitingBlocks(ExitingBlocks);
697-
for (BasicBlock *ExitingBlock : ExitingBlocks) {
698-
Instruction *Term = ExitingBlock->getTerminator();
699-
SmallVector<uint32_t> Weights;
700-
if (!extractBranchWeights(*Term, Weights))
701-
continue;
702-
703-
// See the comment on updateBranchWeights() for an explanation of what we
704-
// do here.
705-
uint32_t FallThroughWeights = 0;
706-
uint32_t ExitWeights = 0;
707-
for (auto [Succ, Weight] : zip(successors(Term), Weights)) {
708-
if (L->contains(Succ))
709-
FallThroughWeights += Weight;
710-
else
711-
ExitWeights += Weight;
712-
}
713-
714-
// Don't try to update weights for degenerate case.
715-
if (FallThroughWeights == 0)
716-
continue;
717-
718-
SmallVector<uint32_t> SubWeights;
719-
for (auto [Succ, Weight] : zip(successors(Term), Weights)) {
720-
if (!L->contains(Succ)) {
721-
// Exit weights stay the same.
722-
SubWeights.push_back(0);
723-
continue;
724-
}
725-
726-
// Subtract exit weights on each iteration, distributed across all
727-
// fallthrough edges.
728-
double W = (double)Weight / (double)FallThroughWeights;
729-
SubWeights.push_back((uint32_t)(ExitWeights * W));
730-
}
731-
732-
WeightInfos.insert({Term, {std::move(Weights), std::move(SubWeights)}});
733-
}
734-
}
735-
736658
/// Clones the body of the loop L, putting it between \p InsertTop and \p
737659
/// InsertBot.
738660
/// \param IterNumber The serial number of the iteration currently being
@@ -1006,11 +928,6 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1006928
Instruction *LatchTerm =
1007929
cast<Instruction>(cast<BasicBlock>(Latch)->getTerminator());
1008930

1009-
// If we have branch weight information, we'll want to update it for the
1010-
// newly created branches.
1011-
DenseMap<Instruction *, WeightInfo> Weights;
1012-
initBranchWeights(Weights, L);
1013-
1014931
// Identify what noalias metadata is inside the loop: if it is inside the
1015932
// loop, the associated metadata must be cloned for each iteration.
1016933
SmallVector<MDNode *, 6> LoopLocalNoAliasDeclScopes;
@@ -1038,11 +955,6 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1038955
assert(DT.verify(DominatorTree::VerificationLevel::Fast));
1039956
#endif
1040957

1041-
for (auto &[Term, Info] : Weights) {
1042-
auto *TermCopy = cast<Instruction>(VMap[Term]);
1043-
updateBranchWeights(TermCopy, Info);
1044-
}
1045-
1046958
// Remove Loop metadata from the latch branch instruction
1047959
// because it is not the Loop's latch branch anymore.
1048960
auto *LatchTermCopy = cast<Instruction>(VMap[LatchTerm]);
@@ -1068,15 +980,62 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1068980
PHI->setIncomingValueForBlock(NewPreHeader, NewVal);
1069981
}
1070982

1071-
for (const auto &[Term, Info] : Weights) {
1072-
setBranchWeights(*Term, Info.Weights, /*IsExpected=*/false);
1073-
}
1074-
1075983
// Update Metadata for count of peeled off iterations.
1076984
unsigned AlreadyPeeled = 0;
1077985
if (auto Peeled = getOptionalIntLoopAttribute(L, PeeledCountMetaData))
1078986
AlreadyPeeled = *Peeled;
1079-
addStringMetadataToLoop(L, PeeledCountMetaData, AlreadyPeeled + PeelCount);
987+
unsigned TotalPeeled = AlreadyPeeled + PeelCount;
988+
addStringMetadataToLoop(L, PeeledCountMetaData, TotalPeeled);
989+
990+
// Update metadata for the estimated trip count. The original branch weight
991+
// metadata is already correct for both the remaining loop and the peeled loop
992+
// iterations, so don't adjust it.
993+
//
994+
// For example, consider what happens when peeling 2 iterations from a loop
995+
// with an estimated trip count of 10 and inserting them before the remaining
996+
// loop. Each of the peeled iterations and each iteration in the remaining
997+
// loop still has the same probability of exiting the *entire original* loop
998+
// as it did when in the original loop, and thus it should still have the same
999+
// branch weights. The peeled iterations' non-zero probabilities of exiting
1000+
// already appropriately reduce the probability of reaching the remaining
1001+
// iterations just as they did in the original loop. Trying to also adjust
1002+
// the remaining loop's branch weights to reflect its new trip count of 8 will
1003+
// erroneously further reduce its block frequencies. However, in case an
1004+
// analysis later needs to determine the trip count of the remaining loop
1005+
// while examining it in isolation without considering the probability of
1006+
// actually reaching it, we store the new trip count as separate metadata.
1007+
//
1008+
// FIXME: getLoopEstimatedTripCount and setLoopEstimatedTripCount skip loops
1009+
// that don't match the restrictions of getExpectedExitLoopLatchBranch in
1010+
// LoopUtils.cpp. For example,
1011+
// llvm/tests/Transforms/LoopUnroll/peel-branch-weights.ll (introduced by
1012+
// b43a4d0850d5) has multiple exits. Should we try to extend them to handle
1013+
// such cases? For now, we just don't try to record
1014+
// llvm.loop.estimated_trip_count for such cases, so the original branch
1015+
// weights will have to do.
1016+
if (auto EstimatedTripCount = getLoopEstimatedTripCount(L)) {
1017+
// FIXME: The previous updateBranchWeights implementation had this
1018+
// comment:
1019+
//
1020+
// Don't set the probability of taking the edge from latch to loop header
1021+
// to less than 1:1 ratio (meaning Weight should not be lower than
1022+
// SubWeight), as this could significantly reduce the loop's hotness,
1023+
// which would be incorrect in the case of underestimating the trip count.
1024+
//
1025+
// See e8d5db206c2f commit log for further discussion. That seems to
1026+
// suggest that we should avoid ever setting a trip count of < 2 here
1027+
// (equal chance of continuing and exiting means the loop will likely
1028+
// continue once and then exit once). Or is keeping the original branch
1029+
// weights already a sufficient improvement for whatever analysis cares
1030+
// about this case?
1031+
unsigned EstimatedTripCountNew = *EstimatedTripCount;
1032+
if (EstimatedTripCountNew < TotalPeeled) // FIXME: TotalPeeled + 2?
1033+
EstimatedTripCountNew = 0; // FIXME: = 2?
1034+
else
1035+
EstimatedTripCountNew -= TotalPeeled;
1036+
setLoopEstimatedTripCount(L, EstimatedTripCountNew,
1037+
/*EstimatedLoopInvocationWeight=*/std::nullopt);
1038+
}
10801039

10811040
if (Loop *ParentLoop = L->getParentLoop())
10821041
L = ParentLoop;

llvm/lib/Transforms/Utils/LoopUtils.cpp

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,8 @@ using namespace llvm::PatternMatch;
5353

5454
static const char *LLVMLoopDisableNonforced = "llvm.loop.disable_nonforced";
5555
static const char *LLVMLoopDisableLICM = "llvm.licm.disable";
56+
static const char *LLVMLoopEstimatedTripCount =
57+
"llvm.loop.estimated_trip_count";
5658

5759
bool llvm::formDedicatedExitBlocks(Loop *L, DominatorTree *DT, LoopInfo *LI,
5860
MemorySSAUpdater *MSSAU,
@@ -864,27 +866,39 @@ llvm::getLoopEstimatedTripCount(Loop *L,
864866
getEstimatedTripCount(LatchBranch, L, ExitWeight)) {
865867
if (EstimatedLoopInvocationWeight)
866868
*EstimatedLoopInvocationWeight = ExitWeight;
869+
// FIXME: Where else are branch weights directly used for estimating loop
870+
// trip counts? They should also be updated to use
871+
// LLVMLoopEstimatedTripCount when present... or to just call this
872+
// function.
873+
if (auto EstimatedTripCount =
874+
getOptionalIntLoopAttribute(L, LLVMLoopEstimatedTripCount))
875+
return EstimatedTripCount;
867876
return *EstTripCount;
868877
}
869878
}
870879
return std::nullopt;
871880
}
872881

873-
bool llvm::setLoopEstimatedTripCount(Loop *L, unsigned EstimatedTripCount,
874-
unsigned EstimatedloopInvocationWeight) {
882+
bool llvm::setLoopEstimatedTripCount(
883+
Loop *L, unsigned EstimatedTripCount,
884+
std::optional<unsigned> EstimatedloopInvocationWeight) {
875885
// At the moment, we currently support changing the estimate trip count of
876886
// the latch branch only. We could extend this API to manipulate estimated
877887
// trip counts for any exit.
878888
BranchInst *LatchBranch = getExpectedExitLoopLatchBranch(L);
879889
if (!LatchBranch)
880890
return false;
881891

892+
addStringMetadataToLoop(L, LLVMLoopEstimatedTripCount, EstimatedTripCount);
893+
if (!EstimatedloopInvocationWeight)
894+
return true;
895+
882896
// Calculate taken and exit weights.
883897
unsigned LatchExitWeight = 0;
884898
unsigned BackedgeTakenWeight = 0;
885899

886900
if (EstimatedTripCount > 0) {
887-
LatchExitWeight = EstimatedloopInvocationWeight;
901+
LatchExitWeight = *EstimatedloopInvocationWeight;
888902
BackedgeTakenWeight = (EstimatedTripCount - 1) * LatchExitWeight;
889903
}
890904

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals
2+
; RUN: opt < %s -S -passes=loop-unroll -unroll-force-peel-count=2 2>&1 | FileCheck %s
3+
4+
declare void @f(i32)
5+
6+
; Test branch weights and estimated trip count metadata for simple loop after
7+
; peeling.
8+
define void @test(i32 %n) {
9+
; CHECK-LABEL: @test(
10+
; CHECK-NEXT: entry:
11+
; CHECK-NEXT: br label [[DO_BODY_PEEL_BEGIN:%.*]]
12+
; CHECK: do.body.peel.begin:
13+
; CHECK-NEXT: br label [[DO_BODY_PEEL:%.*]]
14+
; CHECK: do.body.peel:
15+
; CHECK-NEXT: [[INC_PEEL:%.*]] = add i32 0, 1
16+
; CHECK-NEXT: call void @f(i32 0)
17+
; CHECK-NEXT: [[C_PEEL:%.*]] = icmp sge i32 [[INC_PEEL]], [[N:%.*]]
18+
; CHECK-NEXT: br i1 [[C_PEEL]], label [[DO_END:%.*]], label [[DO_BODY_PEEL_NEXT:%.*]], !prof [[PROF0:![0-9]+]]
19+
; CHECK: do.body.peel.next:
20+
; CHECK-NEXT: br label [[DO_BODY_PEEL2:%.*]]
21+
; CHECK: do.body.peel2:
22+
; CHECK-NEXT: [[INC_PEEL3:%.*]] = add i32 [[INC_PEEL]], 1
23+
; CHECK-NEXT: call void @f(i32 [[INC_PEEL]])
24+
; CHECK-NEXT: [[C_PEEL4:%.*]] = icmp sge i32 [[INC_PEEL3]], [[N]]
25+
; CHECK-NEXT: br i1 [[C_PEEL4]], label [[DO_END]], label [[DO_BODY_PEEL_NEXT1:%.*]], !prof [[PROF0]]
26+
; CHECK: do.body.peel.next1:
27+
; CHECK-NEXT: br label [[DO_BODY_PEEL_NEXT5:%.*]]
28+
; CHECK: do.body.peel.next5:
29+
; CHECK-NEXT: br label [[ENTRY_PEEL_NEWPH:%.*]]
30+
; CHECK: entry.peel.newph:
31+
; CHECK-NEXT: br label [[DO_BODY:%.*]]
32+
; CHECK: do.body:
33+
; CHECK-NEXT: [[I:%.*]] = phi i32 [ [[INC_PEEL3]], [[ENTRY_PEEL_NEWPH]] ], [ [[INC:%.*]], [[DO_BODY]] ]
34+
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I]], 1
35+
; CHECK-NEXT: call void @f(i32 [[I]])
36+
; CHECK-NEXT: [[C:%.*]] = icmp sge i32 [[INC]], [[N]]
37+
; CHECK-NEXT: br i1 [[C]], label [[DO_END_LOOPEXIT:%.*]], label [[DO_BODY]], !prof [[PROF0]], !llvm.loop [[LOOP1:![0-9]+]]
38+
; CHECK: do.end.loopexit:
39+
; CHECK-NEXT: br label [[DO_END]]
40+
; CHECK: do.end:
41+
; CHECK-NEXT: ret void
42+
;
43+
44+
entry:
45+
br label %do.body
46+
47+
do.body:
48+
%i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
49+
%inc = add i32 %i, 1
50+
call void @f(i32 %i)
51+
%c = icmp sge i32 %inc, %n
52+
br i1 %c, label %do.end, label %do.body, !prof !0
53+
54+
do.end:
55+
ret void
56+
}
57+
58+
!0 = !{!"branch_weights", i32 1, i32 9}
59+
60+
;.
61+
; CHECK: [[PROF0]] = !{!"branch_weights", i32 1, i32 9}
62+
; CHECK: [[LOOP1]] = distinct !{[[LOOP1]], [[META2:![0-9]+]], [[META3:![0-9]+]], [[META4:![0-9]+]]}
63+
; CHECK: [[META2]] = !{!"llvm.loop.peeled.count", i32 2}
64+
; CHECK: [[META3]] = !{!"llvm.loop.estimated_trip_count", i32 8}
65+
; CHECK: [[META4]] = !{!"llvm.loop.unroll.disable"}
66+
;.

0 commit comments

Comments
 (0)