
Conversation

@siavashs (Contributor) commented Oct 27, 2025

This change significantly reduces the number of sleeping goroutines: previously, one was created per aggregation group to wait for a timer tick.

Instead, use time.AfterFunc to schedule the next call to flush.

Closes #4503
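
For context, here is a minimal sketch of the two patterns (the group type, its fields, and the flush callback are illustrative assumptions, not the actual Alertmanager code): one parked goroutine per aggregation group versus a time.AfterFunc callback that leaves nothing running between ticks.

package sketch

import (
	"sync"
	"time"
)

// Illustrative aggregation group; not the real dispatch.aggrGroup.
type group struct {
	mu       sync.Mutex
	interval time.Duration
	timer    *time.Timer
	done     chan struct{}
}

// Old pattern: one goroutine per group parks on the timer channel,
// so N groups cost N sleeping goroutines and their stacks.
func (g *group) runLoop(flush func()) {
	go func() {
		t := time.NewTimer(g.interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				flush()
				t.Reset(g.interval)
			case <-g.done:
				return
			}
		}
	}()
}

// New pattern: time.AfterFunc only spawns a goroutine when the timer
// fires; between ticks nothing is parked for this group.
func (g *group) schedule(flush func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.timer = time.AfterFunc(g.interval, func() {
		select {
		case <-g.done:
			return
		default:
		}
		flush()
		g.schedule(flush) // arm the next tick
	})
}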

@rajagopalanand (Contributor):

Do you have a profile captured showing the before/after effects of this change?

@SuperQ (Member) left a comment:

Nice

@SuperQ (Member) commented Oct 27, 2025

Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.

@siavashs (Contributor, Author) commented Oct 27, 2025

Here are some metrics. In both cases I ran the same config for Prometheus and Alertmanager, which results in 1500 unique alerts and aggregation groups:

From main:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532

From this branch:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32

Looking at pprof/goroutines?debug=1:

From main:

goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
#	0x1016c247f	github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff	alertmanager/dispatch/dispatch.go:446
...

On this branch there are no dispatch.(*aggrGroup).run goroutines left to show.

(Note that when a flush happens we still see a lot of goroutines, but those come from notify, which we will fix in #4633.)

@SuperQ (Member) commented Oct 28, 2025

It's less about how many goroutines there are and more about how much this impacts CPU and memory churn. For example, rate(go_memstats_alloc_bytes_total[5m]) can show how much memory is being allocated. Fewer allocations mean less GC and less CPU use.

@siavashs (Contributor, Author):

I think this is a safe change to run in our production canary, which usually has ~8k aggregation groups, so I'll backport it to v0.27 and then compare metrics.

ctx = notify.WithMuteTimeIntervals(ctx, ag.opts.MuteTimeIntervals)
ctx = notify.WithActiveTimeIntervals(ctx, ag.opts.ActiveTimeIntervals)
ctx = notify.WithRouteID(ctx, ag.routeID)
// Flush before resetting timer to maintain backpressure.
A Collaborator commented on these lines:

I don't think we can make this change without breaking high availability.

High availability requires that the same aggregation group in each Alertmanager in a highly available cluster ticks not just at the same interval but at the same instant, i.e. their timers must be in sync.

If we reset the timer after the flush, the timers drift out of sync. The amount they drift depends on the duration of the flush, which in turn depends on the duration of the integration (e.g. a webhook).

To show this, I added a simple fmt.Println at the start of onTimer that prints the current time:

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 9e565b74..185915fe 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -450,6 +450,7 @@ func (ag *aggrGroup) String() string {
 }

 func (ag *aggrGroup) onTimer() {
+       fmt.Println("on timer", time.Now())
        // Check if context is done before processing
        select {
        case <-ag.ctx.Done():

I also created a webhook with a 5-second delay (of course, in the real world this delay can vary widely).

With a group_wait of 15s and a group_interval of 30s, the first tick should be at 09:05:45 and the second tick at 09:06:15, but in fact the second tick occurred at 09:06:20:

on timer 2025-10-28 09:05:45.440439 +0000 GMT m=+19.361033418
time=2025-10-28T09:05:45.440Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]
time=2025-10-28T09:05:50.445Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=test integration=webhook[0] aggrGroup={}:{} attempts=1 duration=5.004619917s alerts=[[3fe32c2][active]]
on timer 2025-10-28 09:06:20.44742 +0000 GMT m=+54.367559459
time=2025-10-28T09:06:20.447Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]

@siavashs (Contributor, Author) replied:

Good catch. I think we can reschedule in the top select instead, to avoid a slow flush affecting the next schedule (see the sketch below). We should also add an acceptance test to capture this behaviour.
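
A hedged sketch of that idea (the nextTick, timer, and opts.GroupInterval fields are illustrative assumptions, not the actual patch): re-arm the timer before flushing and anchor it to the previous deadline, so the flush duration no longer shifts the next tick.

// Sketch only; the field names are assumptions, not the real aggrGroup struct.
func (ag *aggrGroup) onTimer() {
	// Check if context is done before processing.
	select {
	case <-ag.ctx.Done():
		return
	default:
	}

	// Re-arm first, anchored to the previous deadline rather than
	// time.Now(), so a slow integration (e.g. a 5s webhook) cannot push
	// the next tick out and all replicas keep ticking at the same instants.
	ag.nextTick = ag.nextTick.Add(ag.opts.GroupInterval)
	ag.timer = time.AfterFunc(time.Until(ag.nextTick), ag.onTimer)

	// ... flush as before; its duration no longer affects the schedule.
}

One trade-off to note: re-arming before the flush gives up the backpressure mentioned in the reviewed comment, since a flush that takes longer than group_interval can overlap with the next tick.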

@siavashs siavashs marked this pull request as draft October 28, 2025 16:56
@siavashs siavashs self-assigned this Nov 14, 2025

Development

Successfully merging this pull request may close these issues.

Aggregation Groups result in too many go routines
