Skip to content

Commit cec331a

Browse files
committed
[LoopPeel] Fix branch weights' effect on block frequencies
For example: ``` declare void @f(i32) define void @test(i32 %n) { entry: br label %do.body do.body: %i = phi i32 [ 0, %entry ], [ %inc, %do.body ] %inc = add i32 %i, 1 tail call void @f(i32 %i) #3 %c = icmp sge i32 %inc, %n br i1 %c, label %do.end, label %do.body, !prof !0 do.end: ret void } !0 = !{!"branch_weights", i32 1, i32 9} ``` Given those branch weights, once any loop iteration is actually reached, the probability of the loop exiting at the iteration's end is 1/(1+9). That is, the loop is likely to exit every 10 iterations and thus has an estimated trip count of 10. `opt -passes='print<block-freq>'` shows that 10 is indeed the frequency of the loop body: ``` Printing analysis results of BFI for function 'test': block-frequency-info: test - entry: float = 1.0, int = 1801439852625920 - do.body: float = 10.0, int = 18014398509481984 - do.end: float = 1.0, int = 1801439852625920 ``` Key Observation: The frequency of reaching any particular iteration is less than for the previous iteration because the previous iteration has a non-zero probability of exiting the loop. This observation holds even though every loop iteration, once actually reached, has exactly the same probability of exiting and thus exactly the same branch weights. Now we use `opt -unroll-force-peel-count=2 -passes=loop-unroll` to peel 2 iterations and insert them before the remainder loop. We expect the above observations not to change, but they do under the implementation without this patch. The block frequency becomes 1.0 for the first iteration, 0.9 for the second, and 6.4 for the main loop body. Again, a decreasing frequency is expected, but it decreases too much: the total frequency of the original loop body becomes 8.3. The new branch weights reveal the problem: ``` !0 = !{!"branch_weights", i32 1, i32 9} !1 = !{!"branch_weights", i32 1, i32 8} !2 = !{!"branch_weights", i32 1, i32 7} ``` The exit probability is now 1/10 for the first peeled iteration, 1/9 for the second, and 1/8 for the remaining loop iterations. It seems this behavior was trying to ensure a decreasing block frequency. However, as in the key observation above for the original loop, that happens correctly without decreasing the branch weights across iterations. This patch changes the peeling implementation not to decrease the branch weights across loop iterations so that the frequency for every iteration is the same as it was in the original loop. The total frequency of the loop body, summed across all its occurrences, thus remains 10 after peeling. Unfortunately, that change means a later analysis cannot accurately estimate the trip count of the remainder loop while examining the remainder loop in isolation without considering the probability of actually reaching it. For that purpose, this patch stores the new trip count as separate metadata named `llvm.loop.estimated_trip_count` and extends `llvm::getLoopEstimatedTripCount` to prefer it, if present, over branch weights. An alternative fix is for `llvm::getLoopEstimatedTripCount` to subtract the `llvm.loop.peeled.count` metadata from the trip count estimated by a loop's branch weights. However, there might be other loop transformations that still corrupt block frequencies in a similar manner and require a similar fix. `llvm.loop.estimated_trip_count` is intended to provide a general way to store estimated trip counts when branch weights cannot directly store them. This patch introduces several fixme comments that need to be addressed before it can land.
1 parent 0be3f13 commit cec331a

File tree

8 files changed

+208
-154
lines changed

8 files changed

+208
-154
lines changed

llvm/include/llvm/Transforms/Utils/LoopUtils.h

Lines changed: 17 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -315,7 +315,8 @@ TransformationMode hasLICMVersioningTransformation(const Loop *L);
315315
void addStringMetadataToLoop(Loop *TheLoop, const char *MDString,
316316
unsigned V = 0);
317317

318-
/// Returns a loop's estimated trip count based on branch weight metadata.
318+
/// Returns a loop's estimated trip count based on
319+
/// llvm.loop.estimated_trip_count metadata or, if none, branch weight metadata.
319320
/// In addition if \p EstimatedLoopInvocationWeight is not null it is
320321
/// initialized with weight of loop's latch leading to the exit.
321322
/// Returns 0 when the count is estimated to be 0, or std::nullopt when a
@@ -324,13 +325,21 @@ std::optional<unsigned>
324325
getLoopEstimatedTripCount(Loop *L,
325326
unsigned *EstimatedLoopInvocationWeight = nullptr);
326327

327-
/// Set a loop's branch weight metadata to reflect that loop has \p
328-
/// EstimatedTripCount iterations and \p EstimatedLoopInvocationWeight exits
329-
/// through latch. Returns true if metadata is successfully updated, false
330-
/// otherwise. Note that loop must have a latch block which controls loop exit
331-
/// in order to succeed.
332-
bool setLoopEstimatedTripCount(Loop *L, unsigned EstimatedTripCount,
333-
unsigned EstimatedLoopInvocationWeight);
328+
/// Set a loop's llvm.loop.estimated_trip_count metadata and, if \p
329+
/// EstimatedLoopInvocationWeight, branch weight metadata to reflect that loop
330+
/// has \p EstimatedTripCount iterations and \p EstimatedLoopInvocationWeight
331+
/// exit weight through latch. Returns true if metadata is successfully updated,
332+
/// false otherwise. Note that loop must have a latch block which controls loop
333+
/// exit in order to succeed.
334+
///
335+
/// The use case for not setting branch weight metadata is when the original
336+
/// branch weight metadata is correct for computing block frequencies but the
337+
/// trip count has changed due to a loop transformation. The branch weight
338+
/// metadata cannot be adjusted to reflect the new trip count, so we store the
339+
/// new trip count separately.
340+
bool setLoopEstimatedTripCount(
341+
Loop *L, unsigned EstimatedTripCount,
342+
std::optional<unsigned> EstimatedLoopInvocationWeight);
334343

335344
/// Check inner loop (L) backedge count is known to be invariant on all
336345
/// iterations of its outer loop. If the loop has no parent, this is trivially

llvm/lib/Transforms/Utils/LoopPeel.cpp

Lines changed: 52 additions & 93 deletions
Original file line numberDiff line numberDiff line change
@@ -657,84 +657,6 @@ void llvm::computePeelCount(Loop *L, unsigned LoopSize,
657657
}
658658
}
659659

660-
struct WeightInfo {
661-
// Weights for current iteration.
662-
SmallVector<uint32_t> Weights;
663-
// Weights to subtract after each iteration.
664-
const SmallVector<uint32_t> SubWeights;
665-
};
666-
667-
/// Update the branch weights of an exiting block of a peeled-off loop
668-
/// iteration.
669-
/// Let F is a weight of the edge to continue (fallthrough) into the loop.
670-
/// Let E is a weight of the edge to an exit.
671-
/// F/(F+E) is a probability to go to loop and E/(F+E) is a probability to
672-
/// go to exit.
673-
/// Then, Estimated ExitCount = F / E.
674-
/// For I-th (counting from 0) peeled off iteration we set the weights for
675-
/// the peeled exit as (EC - I, 1). It gives us reasonable distribution,
676-
/// The probability to go to exit 1/(EC-I) increases. At the same time
677-
/// the estimated exit count in the remainder loop reduces by I.
678-
/// To avoid dealing with division rounding we can just multiple both part
679-
/// of weights to E and use weight as (F - I * E, E).
680-
static void updateBranchWeights(Instruction *Term, WeightInfo &Info) {
681-
setBranchWeights(*Term, Info.Weights, /*IsExpected=*/false);
682-
for (auto [Idx, SubWeight] : enumerate(Info.SubWeights))
683-
if (SubWeight != 0)
684-
// Don't set the probability of taking the edge from latch to loop header
685-
// to less than 1:1 ratio (meaning Weight should not be lower than
686-
// SubWeight), as this could significantly reduce the loop's hotness,
687-
// which would be incorrect in the case of underestimating the trip count.
688-
Info.Weights[Idx] =
689-
Info.Weights[Idx] > SubWeight
690-
? std::max(Info.Weights[Idx] - SubWeight, SubWeight)
691-
: SubWeight;
692-
}
693-
694-
/// Initialize the weights for all exiting blocks.
695-
static void initBranchWeights(DenseMap<Instruction *, WeightInfo> &WeightInfos,
696-
Loop *L) {
697-
SmallVector<BasicBlock *> ExitingBlocks;
698-
L->getExitingBlocks(ExitingBlocks);
699-
for (BasicBlock *ExitingBlock : ExitingBlocks) {
700-
Instruction *Term = ExitingBlock->getTerminator();
701-
SmallVector<uint32_t> Weights;
702-
if (!extractBranchWeights(*Term, Weights))
703-
continue;
704-
705-
// See the comment on updateBranchWeights() for an explanation of what we
706-
// do here.
707-
uint32_t FallThroughWeights = 0;
708-
uint32_t ExitWeights = 0;
709-
for (auto [Succ, Weight] : zip(successors(Term), Weights)) {
710-
if (L->contains(Succ))
711-
FallThroughWeights += Weight;
712-
else
713-
ExitWeights += Weight;
714-
}
715-
716-
// Don't try to update weights for degenerate case.
717-
if (FallThroughWeights == 0)
718-
continue;
719-
720-
SmallVector<uint32_t> SubWeights;
721-
for (auto [Succ, Weight] : zip(successors(Term), Weights)) {
722-
if (!L->contains(Succ)) {
723-
// Exit weights stay the same.
724-
SubWeights.push_back(0);
725-
continue;
726-
}
727-
728-
// Subtract exit weights on each iteration, distributed across all
729-
// fallthrough edges.
730-
double W = (double)Weight / (double)FallThroughWeights;
731-
SubWeights.push_back((uint32_t)(ExitWeights * W));
732-
}
733-
734-
WeightInfos.insert({Term, {std::move(Weights), std::move(SubWeights)}});
735-
}
736-
}
737-
738660
/// Clones the body of the loop L, putting it between \p InsertTop and \p
739661
/// InsertBot.
740662
/// \param IterNumber The serial number of the iteration currently being
@@ -1008,11 +930,6 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1008930
Instruction *LatchTerm =
1009931
cast<Instruction>(cast<BasicBlock>(Latch)->getTerminator());
1010932

1011-
// If we have branch weight information, we'll want to update it for the
1012-
// newly created branches.
1013-
DenseMap<Instruction *, WeightInfo> Weights;
1014-
initBranchWeights(Weights, L);
1015-
1016933
// Identify what noalias metadata is inside the loop: if it is inside the
1017934
// loop, the associated metadata must be cloned for each iteration.
1018935
SmallVector<MDNode *, 6> LoopLocalNoAliasDeclScopes;
@@ -1040,11 +957,6 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1040957
assert(DT.verify(DominatorTree::VerificationLevel::Fast));
1041958
#endif
1042959

1043-
for (auto &[Term, Info] : Weights) {
1044-
auto *TermCopy = cast<Instruction>(VMap[Term]);
1045-
updateBranchWeights(TermCopy, Info);
1046-
}
1047-
1048960
// Remove Loop metadata from the latch branch instruction
1049961
// because it is not the Loop's latch branch anymore.
1050962
auto *LatchTermCopy = cast<Instruction>(VMap[LatchTerm]);
@@ -1070,15 +982,62 @@ bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
1070982
PHI->setIncomingValueForBlock(NewPreHeader, NewVal);
1071983
}
1072984

1073-
for (const auto &[Term, Info] : Weights) {
1074-
setBranchWeights(*Term, Info.Weights, /*IsExpected=*/false);
1075-
}
1076-
1077985
// Update Metadata for count of peeled off iterations.
1078986
unsigned AlreadyPeeled = 0;
1079987
if (auto Peeled = getOptionalIntLoopAttribute(L, PeeledCountMetaData))
1080988
AlreadyPeeled = *Peeled;
1081-
addStringMetadataToLoop(L, PeeledCountMetaData, AlreadyPeeled + PeelCount);
989+
unsigned TotalPeeled = AlreadyPeeled + PeelCount;
990+
addStringMetadataToLoop(L, PeeledCountMetaData, TotalPeeled);
991+
992+
// Update metadata for the estimated trip count. The original branch weight
993+
// metadata is already correct for both the remainder loop and the peeled loop
994+
// iterations, so don't adjust it.
995+
//
996+
// For example, consider what happens when peeling 2 iterations from a loop
997+
// with an estimated trip count of 10 and inserting them before the remainder
998+
// loop. Each of the peeled iterations and each iteration in the remainder
999+
// loop still has the same probability of exiting the *entire original* loop
1000+
// as it did when in the original loop, and thus it should still have the same
1001+
// branch weights. The peeled iterations' non-zero probabilities of exiting
1002+
// already appropriately reduce the probability of reaching the remainder
1003+
// iterations just as they did in the original loop. Trying to also adjust
1004+
// the remainder loop's branch weights to reflect its new trip count of 8 will
1005+
// erroneously further reduce its block frequencies. However, in case an
1006+
// analysis later needs to determine the trip count of the remainder loop
1007+
// while examining it in isolation without considering the probability of
1008+
// actually reaching it, we store the new trip count as separate metadata.
1009+
//
1010+
// FIXME: getLoopEstimatedTripCount and setLoopEstimatedTripCount skip loops
1011+
// that don't match the restrictions of getExpectedExitLoopLatchBranch in
1012+
// LoopUtils.cpp. For example,
1013+
// llvm/tests/Transforms/LoopUnroll/peel-branch-weights.ll (introduced by
1014+
// b43a4d0850d5) has multiple exits. Should we try to extend them to handle
1015+
// such cases? For now, we just don't try to record
1016+
// llvm.loop.estimated_trip_count for such cases, so the original branch
1017+
// weights will have to do.
1018+
if (auto EstimatedTripCount = getLoopEstimatedTripCount(L)) {
1019+
// FIXME: The previous updateBranchWeights implementation had this
1020+
// comment:
1021+
//
1022+
// Don't set the probability of taking the edge from latch to loop header
1023+
// to less than 1:1 ratio (meaning Weight should not be lower than
1024+
// SubWeight), as this could significantly reduce the loop's hotness,
1025+
// which would be incorrect in the case of underestimating the trip count.
1026+
//
1027+
// See e8d5db206c2f commit log for further discussion. That seems to
1028+
// suggest that we should avoid ever setting a trip count of < 2 here
1029+
// (equal chance of continuing and exiting means the loop will likely
1030+
// continue once and then exit once). Should we then set a minimum of
1031+
// 2 or maybe 1 here? Or is keeping the original branch weights already a
1032+
// sufficient improvement for whatever analysis cares about this case?
1033+
unsigned EstimatedTripCountNew = *EstimatedTripCount;
1034+
if (EstimatedTripCountNew < TotalPeeled)
1035+
EstimatedTripCountNew = 0;
1036+
else
1037+
EstimatedTripCountNew -= TotalPeeled;
1038+
setLoopEstimatedTripCount(L, EstimatedTripCountNew,
1039+
/*EstimatedLoopInvocationWeight=*/std::nullopt);
1040+
}
10821041

10831042
if (Loop *ParentLoop = L->getParentLoop())
10841043
L = ParentLoop;

llvm/lib/Transforms/Utils/LoopUtils.cpp

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,8 @@ using namespace llvm::PatternMatch;
5353

5454
static const char *LLVMLoopDisableNonforced = "llvm.loop.disable_nonforced";
5555
static const char *LLVMLoopDisableLICM = "llvm.licm.disable";
56+
static const char *LLVMLoopEstimatedTripCount =
57+
"llvm.loop.estimated_trip_count";
5658

5759
bool llvm::formDedicatedExitBlocks(Loop *L, DominatorTree *DT, LoopInfo *LI,
5860
MemorySSAUpdater *MSSAU,
@@ -859,27 +861,39 @@ llvm::getLoopEstimatedTripCount(Loop *L,
859861
getEstimatedTripCount(LatchBranch, L, ExitWeight)) {
860862
if (EstimatedLoopInvocationWeight)
861863
*EstimatedLoopInvocationWeight = ExitWeight;
864+
// FIXME: Where else are branch weights directly used for estimating loop
865+
// trip counts? They should also be updated to use
866+
// LLVMLoopEstimatedTripCount when present... or to just call this
867+
// function.
868+
if (auto EstimatedTripCount =
869+
getOptionalIntLoopAttribute(L, LLVMLoopEstimatedTripCount))
870+
return EstimatedTripCount;
862871
return *EstTripCount;
863872
}
864873
}
865874
return std::nullopt;
866875
}
867876

868-
bool llvm::setLoopEstimatedTripCount(Loop *L, unsigned EstimatedTripCount,
869-
unsigned EstimatedloopInvocationWeight) {
877+
bool llvm::setLoopEstimatedTripCount(
878+
Loop *L, unsigned EstimatedTripCount,
879+
std::optional<unsigned> EstimatedloopInvocationWeight) {
870880
// At the moment, we currently support changing the estimate trip count of
871881
// the latch branch only. We could extend this API to manipulate estimated
872882
// trip counts for any exit.
873883
BranchInst *LatchBranch = getExpectedExitLoopLatchBranch(L);
874884
if (!LatchBranch)
875885
return false;
876886

887+
addStringMetadataToLoop(L, LLVMLoopEstimatedTripCount, EstimatedTripCount);
888+
if (!EstimatedloopInvocationWeight)
889+
return true;
890+
877891
// Calculate taken and exit weights.
878892
unsigned LatchExitWeight = 0;
879893
unsigned BackedgeTakenWeight = 0;
880894

881895
if (EstimatedTripCount > 0) {
882-
LatchExitWeight = EstimatedloopInvocationWeight;
896+
LatchExitWeight = *EstimatedloopInvocationWeight;
883897
BackedgeTakenWeight = (EstimatedTripCount - 1) * LatchExitWeight;
884898
}
885899

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals
2+
; RUN: opt < %s -S -passes=loop-unroll -unroll-force-peel-count=2 2>&1 | FileCheck %s
3+
4+
declare void @f(i32)
5+
6+
; Test branch weights and estimated trip count metadata for simple loop after
7+
; peeling.
8+
define void @test(i32 %n) {
9+
; CHECK-LABEL: @test(
10+
; CHECK-NEXT: entry:
11+
; CHECK-NEXT: br label [[DO_BODY_PEEL_BEGIN:%.*]]
12+
; CHECK: do.body.peel.begin:
13+
; CHECK-NEXT: br label [[DO_BODY_PEEL:%.*]]
14+
; CHECK: do.body.peel:
15+
; CHECK-NEXT: [[INC_PEEL:%.*]] = add i32 0, 1
16+
; CHECK-NEXT: tail call void @f(i32 0)
17+
; CHECK-NEXT: [[C_PEEL:%.*]] = icmp sge i32 [[INC_PEEL]], [[N:%.*]]
18+
; CHECK-NEXT: br i1 [[C_PEEL]], label [[DO_END:%.*]], label [[DO_BODY_PEEL_NEXT:%.*]], !prof [[PROF0:![0-9]+]]
19+
; CHECK: do.body.peel.next:
20+
; CHECK-NEXT: br label [[DO_BODY_PEEL2:%.*]]
21+
; CHECK: do.body.peel2:
22+
; CHECK-NEXT: [[INC_PEEL3:%.*]] = add i32 [[INC_PEEL]], 1
23+
; CHECK-NEXT: tail call void @f(i32 [[INC_PEEL]])
24+
; CHECK-NEXT: [[C_PEEL4:%.*]] = icmp sge i32 [[INC_PEEL3]], [[N]]
25+
; CHECK-NEXT: br i1 [[C_PEEL4]], label [[DO_END]], label [[DO_BODY_PEEL_NEXT1:%.*]], !prof [[PROF0]]
26+
; CHECK: do.body.peel.next1:
27+
; CHECK-NEXT: br label [[DO_BODY_PEEL_NEXT5:%.*]]
28+
; CHECK: do.body.peel.next5:
29+
; CHECK-NEXT: br label [[ENTRY_PEEL_NEWPH:%.*]]
30+
; CHECK: entry.peel.newph:
31+
; CHECK-NEXT: br label [[DO_BODY:%.*]]
32+
; CHECK: do.body:
33+
; CHECK-NEXT: [[I:%.*]] = phi i32 [ [[INC_PEEL3]], [[ENTRY_PEEL_NEWPH]] ], [ [[INC:%.*]], [[DO_BODY]] ]
34+
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I]], 1
35+
; CHECK-NEXT: tail call void @f(i32 [[I]])
36+
; CHECK-NEXT: [[C:%.*]] = icmp sge i32 [[INC]], [[N]]
37+
; CHECK-NEXT: br i1 [[C]], label [[DO_END_LOOPEXIT:%.*]], label [[DO_BODY]], !prof [[PROF0]], !llvm.loop [[LOOP1:![0-9]+]]
38+
; CHECK: do.end.loopexit:
39+
; CHECK-NEXT: br label [[DO_END]]
40+
; CHECK: do.end:
41+
; CHECK-NEXT: ret void
42+
;
43+
44+
entry:
45+
br label %do.body
46+
47+
do.body:
48+
%i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
49+
%inc = add i32 %i, 1
50+
tail call void @f(i32 %i) #3
51+
%c = icmp sge i32 %inc, %n
52+
br i1 %c, label %do.end, label %do.body, !prof !0
53+
54+
do.end:
55+
ret void
56+
}
57+
58+
!0 = !{!"branch_weights", i32 1, i32 9}
59+
60+
;.
61+
; CHECK: [[PROF0]] = !{!"branch_weights", i32 1, i32 9}
62+
; CHECK: [[LOOP1]] = distinct !{[[LOOP1]], [[META2:![0-9]+]], [[META3:![0-9]+]], [[META4:![0-9]+]]}
63+
; CHECK: [[META2]] = !{!"llvm.loop.peeled.count", i32 2}
64+
; CHECK: [[META3]] = !{!"llvm.loop.estimated_trip_count", i32 8}
65+
; CHECK: [[META4]] = !{!"llvm.loop.unroll.disable"}
66+
;.

0 commit comments

Comments
 (0)