
Conversation

@siavashs (Contributor) commented Oct 27, 2025

This change significantly reduces the number of sleeping goroutines: previously, one was created per aggregation group to wait for a timer tick.

Instead, use time.AfterFunc to schedule the next call to flush.

Closes #4503
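
For context, here is a minimal sketch of the two patterns (the group type, its fields, and the flush callback are illustrative assumptions, not the actual Alertmanager code): one parked goroutine per aggregation group versus a time.AfterFunc callback that leaves nothing running between ticks.

package sketch

import (
	"sync"
	"time"
)

// Illustrative aggregation group; not the real dispatch.aggrGroup.
type group struct {
	mu       sync.Mutex
	interval time.Duration
	timer    *time.Timer
	done     chan struct{}
}

// Old pattern: one goroutine per group parks on the timer channel,
// so N groups cost N sleeping goroutines and their stacks.
func (g *group) runLoop(flush func()) {
	go func() {
		t := time.NewTimer(g.interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				flush()
				t.Reset(g.interval)
			case <-g.done:
				return
			}
		}
	}()
}

// New pattern: time.AfterFunc only spawns a goroutine when the timer
// fires; between ticks nothing is parked for this group.
func (g *group) schedule(flush func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.timer = time.AfterFunc(g.interval, func() {
		select {
		case <-g.done:
			return
		default:
		}
		flush()
		g.schedule(flush) // arm the next tick
	})
}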

@rajagopalanand (Contributor):

Do you have a profile captured showing the before/after effects of this change?

@SuperQ (Member) left a comment:

Nice

@SuperQ (Member) commented Oct 27, 2025

Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.

@siavashs (Contributor, Author) commented Oct 27, 2025

Here are some metrics. In both cases I ran the same config for Prometheus and Alertmanager, which results in 1500 unique alerts and aggregation groups:

From main:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532

From this branch:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32

Looking at pprof/goroutines?debug=1:

From main:

goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
#	0x1016c247f	github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff	alertmanager/dispatch/dispatch.go:446
...

On this branch there are no dispatch.(*aggrGroup).run goroutines left to show.

(Note that when a flush happens we still see a lot of goroutines, but those come from notify, which we will fix in #4633.)

@SuperQ (Member) commented Oct 28, 2025

It's less about how many goroutines there are and more about how much this impacts CPU and memory churn. For example, rate(go_memstats_alloc_bytes_total[5m]) can show how much memory is being allocated. Fewer allocations mean less GC and less CPU use.

@siavashs (Contributor, Author):

I think this is a safe change to run in our production canary, which usually has ~8k aggregation groups, so I'll backport it to v0.27 and then compare metrics.

ctx = notify.WithMuteTimeIntervals(ctx, ag.opts.MuteTimeIntervals)
ctx = notify.WithActiveTimeIntervals(ctx, ag.opts.ActiveTimeIntervals)
ctx = notify.WithRouteID(ctx, ag.routeID)
// Flush before resetting timer to maintain backpressure.
A Collaborator commented on these lines:

I don't think we can make this change without breaking high availability.

High availability requires that the same aggregation group in each Alertmanager in a highly available cluster ticks not just at the same interval but at the same instant, i.e. their timers must be in sync.

If we reset the timer after the flush, the timers drift out of sync. The amount they drift depends on the duration of the flush, which in turn depends on the duration of the integration (e.g. a webhook).

To show this, I added a simple fmt.Println at the start of onTimer that prints the current time:

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 9e565b74..185915fe 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -450,6 +450,7 @@ func (ag *aggrGroup) String() string {
 }

 func (ag *aggrGroup) onTimer() {
+       fmt.Println("on timer", time.Now())
        // Check if context is done before processing
        select {
        case <-ag.ctx.Done():

I also created a webhook with a 5-second delay (of course, in the real world this delay can vary widely).

With a group_wait of 15s and a group_interval of 30s, the first tick should be at 09:05:45 and the second tick at 09:06:15, but in fact the second tick occurred at 09:06:20:

on timer 2025-10-28 09:05:45.440439 +0000 GMT m=+19.361033418
time=2025-10-28T09:05:45.440Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]
time=2025-10-28T09:05:50.445Z level=DEBUG source=notify.go:878 msg="Notify success" component=dispatcher receiver=test integration=webhook[0] aggrGroup={}:{} attempts=1 duration=5.004619917s alerts=[[3fe32c2][active]]
on timer 2025-10-28 09:06:20.44742 +0000 GMT m=+54.367559459
time=2025-10-28T09:06:20.447Z level=DEBUG source=dispatch.go:559 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[[3fe32c2][active]]

@siavashs (Contributor, Author) replied:

Good catch. I think we can reschedule in the top select instead, to avoid a slow flush affecting the next schedule (see the sketch below). We should also add an acceptance test to capture this behaviour.
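
A hedged sketch of that idea (the nextTick, timer, and opts.GroupInterval fields are illustrative assumptions, not the actual patch): re-arm the timer before flushing and anchor it to the previous deadline, so the flush duration no longer shifts the next tick.

// Sketch only; the field names are assumptions, not the real aggrGroup struct.
func (ag *aggrGroup) onTimer() {
	// Check if context is done before processing.
	select {
	case <-ag.ctx.Done():
		return
	default:
	}

	// Re-arm first, anchored to the previous deadline rather than
	// time.Now(), so a slow integration (e.g. a 5s webhook) cannot push
	// the next tick out and all replicas keep ticking at the same instants.
	ag.nextTick = ag.nextTick.Add(ag.opts.GroupInterval)
	ag.timer = time.AfterFunc(time.Until(ag.nextTick), ag.onTimer)

	// ... flush as before; its duration no longer affects the schedule.
}

One trade-off to note: re-arming before the flush gives up the backpressure mentioned in the reviewed comment, since a flush that takes longer than group_interval can overlap with the next tick.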

@siavashs siavashs marked this pull request as draft October 28, 2025 16:56
@siavashs siavashs self-assigned this Nov 14, 2025

Development

Successfully merging this pull request may close these issues.

Aggregation Groups result in too many go routines
