fix(dispatch): remove waiting ag routines #4632
Conversation
This change significantly reduces the number of sleeping goroutines created per aggregation group that wait for a timer tick. Instead, use time.AfterFunc to schedule the next call to flush. Closes #4503. Signed-off-by: Siavash Safi <[email protected]>
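For context, a minimal sketch of the pattern being changed (hypothetical names and fields, not the actual dispatch.aggrGroup code): previously each aggregation group kept a goroutine parked on its timer between flushes; with time.AfterFunc the callback only occupies a goroutine while it runs.

```go
// Sketch only: contrasts the two scheduling patterns. Types and fields are
// placeholders, not the real dispatch.aggrGroup implementation.
package dispatchsketch

import (
	"context"
	"sync"
	"time"
)

type aggrGroup struct {
	ctx      context.Context
	mtx      sync.Mutex
	timer    *time.Timer
	interval time.Duration
}

// Before: run owns one goroutine per group that sleeps on the timer channel
// between ticks (1500 groups => roughly 1500 parked goroutines).
func (ag *aggrGroup) run() {
	ag.timer = time.NewTimer(ag.interval)
	defer ag.timer.Stop()
	for {
		select {
		case <-ag.ctx.Done():
			return
		case <-ag.timer.C:
			ag.flush()
			ag.timer.Reset(ag.interval)
		}
	}
}

// After: no long-lived goroutine; the runtime invokes onTimer only when the
// timer fires, and the group reschedules itself.
func (ag *aggrGroup) start() {
	ag.mtx.Lock()
	ag.timer = time.AfterFunc(ag.interval, ag.onTimer)
	ag.mtx.Unlock()
}

func (ag *aggrGroup) onTimer() {
	select {
	case <-ag.ctx.Done():
		return
	default:
	}
	// Flush before resetting the timer, as in this PR's diff.
	ag.flush()
	ag.mtx.Lock()
	ag.timer.Reset(ag.interval)
	ag.mtx.Unlock()
}

func (ag *aggrGroup) flush() { /* send notifications for the group (elided) */ }
```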
Do you have any profile captured to show the before/after effects of this change?
SuperQ left a comment
Nice
Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.
Here are some metrics. In both cases I ran the same config for Prometheus and Alertmanager, which results in 1500 unique alerts and aggregation groups:

From `main`:

```
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532
```

From this branch:

```
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32
```

Looking at the goroutine profiles:

From `main`:

```
goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
# 0x1016c247f github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff alertmanager/dispatch/dispatch.go:446
...
```

From this branch, no such goroutines show up. (Note that when a flush happens we still see a lot of goroutines, but those are from ...)
It's less about how many goroutines than about how much this impacts CPU and memory churn. For example, ...
I think this is a safe change to run in our production canary, which usually has ~8k AGs, so I'll back-port it to v0.27 and then compare metrics.
(Review comment on these lines of dispatch/dispatch.go:)

```go
ctx = notify.WithMuteTimeIntervals(ctx, ag.opts.MuteTimeIntervals)
ctx = notify.WithActiveTimeIntervals(ctx, ag.opts.ActiveTimeIntervals)
ctx = notify.WithRouteID(ctx, ag.routeID)
// Flush before resetting timer to maintain backpressure.
```
I don't think we can make this change without breaking high availability.
High availability requires that the same aggregation group in each Alertmanager in a highly available cluster ticks not just at the same interval but at the same instant; i.e. their timers must be in sync.
If we reset the timer after the flush, the timers drift out of sync. The amount they drift depends on the duration of the flush, which in turn is affected by the duration of the integration (i.e. a webhook).
To show this I added a simple fmt.Println to the start of onTimer that prints the current time:
```diff
diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 9e565b74..185915fe 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -450,6 +450,7 @@ func (ag *aggrGroup) String() string {
 }
 
 func (ag *aggrGroup) onTimer() {
+	fmt.Println("on timer", time.Now())
 	// Check if context is done before processing
 	select {
 	case <-ag.ctx.Done():
```

And I created a webhook with a 5 second delay (of course in the real world this delay can be highly random).
With a group_wait of 15s and a group_interval of 30s, the first tick should be at 09:05:45 and the second tick at 09:06:15, but in fact the second tick occurred at 09:06:20:
```
on timer 2025-10-28 09:05:45.440439 +0000 GMT m=+19.361033418
time=2025-10-28T09:05:45.440Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]
time=2025-10-28T09:05:50.445Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=test integration=webhook[0] aggrGroup={}:{} attempts=1 duration=5.004619917s alerts=[[3fe32c2][active]]
on timer 2025-10-28 09:06:20.44742 +0000 GMT m=+54.367559459
time=2025-10-28T09:06:20.447Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]
```
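A toy calculation (using the numbers from the example above, not code from this PR) showing where the 5 seconds of drift come from when the timer is only reset after the flush returns:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	groupInterval := 30 * time.Second
	flushDuration := 5 * time.Second // the artificially delayed webhook

	firstTick := time.Date(2025, 10, 28, 9, 5, 45, 0, time.UTC)

	// Timers kept on a fixed cadence: the second tick lands exactly one
	// group_interval after the first.
	expected := firstTick.Add(groupInterval)

	// Timer reset only after the flush returns: the cadence stretches by
	// however long the flush (webhook) took.
	observed := firstTick.Add(flushDuration).Add(groupInterval)

	fmt.Println("expected second tick:", expected.Format("15:04:05")) // 09:06:15
	fmt.Println("observed second tick:", observed.Format("15:04:05")) // 09:06:20
}
```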
Good catch. I think we can reschedule in the top select instead, so a slow flush doesn't affect the next schedule.
And we should add an acceptance test to capture this behaviour.
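A rough sketch of that suggestion (same hypothetical placeholders as above, not the final code): reset the timer right after the top select, before flushing, so the flush duration never shifts the next tick and replicas stay in sync.

```go
// Sketch only: reschedule before flushing so the cadence stays fixed.
package dispatchsketch

import (
	"context"
	"sync"
	"time"
)

type aggrGroup struct {
	ctx      context.Context
	mtx      sync.Mutex
	timer    *time.Timer
	interval time.Duration // group_interval
}

func (ag *aggrGroup) onTimer() {
	// Bail out if the group has been stopped.
	select {
	case <-ag.ctx.Done():
		return
	default:
	}

	// Reschedule first: the next tick is always one interval after this
	// one, regardless of how long the flush below takes.
	ag.mtx.Lock()
	ag.timer.Reset(ag.interval)
	ag.mtx.Unlock()

	ag.flush()
}

func (ag *aggrGroup) flush() { /* send notifications (elided) */ }
```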