feat(promhttp): add CoalesceGather option to deduplicate concurrent Gather calls #1969
Open
kakkoyun wants to merge 3 commits into prometheus:main from
Conversation
Force-pushed from 840d68c to f9e46a2
feat(promhttp): add CoalesceGather option to deduplicate concurrent Gather calls

When a collector's Collect() is slower than the scrape interval, each incoming HTTP request triggers an independent Gather() that spawns its own goroutine pipeline. With no upper bound, this causes goroutine pile-up proportional to (scrape rate / collection time): the apparent "goroutine leak" reported in prometheus#1477.

Add HandlerOpts.CoalesceGather bool. When true, HandlerForTransactional wraps the underlying TransactionalGatherer in a coalescingGatherer that allows only one Gather to run at a time. Concurrent requests join the in-flight cycle and receive the same result once it completes. Reference counting ensures the TransactionalGatherer done() callback is called exactly once, after the last handler finishes encoding.

Design decisions:

- Per-cycle gatherCycle object (not fields on the struct) prevents the race where a new cycle overwrites result fields while prior waiters are still reading them.
- Mutex-based ref counting (not atomic) ensures c.cycle = nil is cleared before cy.done() is called, ruling out double-done on a stale pointer.
- close(cy.ready) happens-before <-cy.ready returns (Go memory model), so cy.mfs/err/done are safely readable without additional locking.
- Zero overhead when disabled: single if-branch in HandlerForTransactional setup, identical code path when CoalesceGather is false.

Fixes prometheus#1477

Signed-off-by: Kemal Akkoyun <kemal.akkoyun@datadoghq.com>
Force-pushed from f9e46a2 to 81c415a
@kakkoyun LGTM. I tested with roughly the same approach as my hack implementation (ref: ncabatoff/process-exporter@master...initialed85:process-exporter:kakkoyun/goroutine_leak) and it worked as expected, with a fake 1s sleep added to the scrape method in process-exporter.

So I think we're in good shape: lots of requests timing out, no climbing FDs; works nicely. Awesome work!
What problem does this solve?
Issue #1477 reports apparent goroutine leaks when using prometheus/client_golang. The root cause is not a WaitGroup bug; the `Registry.Gather()` logic is correct. The real problem: when a collector's `Collect()` is slower than the scrape interval, each incoming HTTP scrape triggers an independent `Gather()`. Each call spawns its own goroutine pipeline. With no upper bound, concurrent pipelines grow proportionally to `(scrape rate / collection time)`, a goroutine pile-up that looks like a leak but is really unbounded concurrent work.

This was originally investigated by @initialed85 in this comment, who validated the issue and prototyped a "piggybacking" (singleflight) approach. This PR implements that idea cleanly inside the HTTP handler, with zero new public types and zero overhead when disabled.
Solution
Add
HandlerOpts.CoalesceGather bool. Whentrue,HandlerForTransactionalwraps theTransactionalGathererin an unexportedcoalescingGathererthat allows only oneGather()to run at a time. Concurrent HTTP requests arriving while a gather is in-flight join the existing cycle and receive the same result.Usage
Design decisions
- Opt-in via a `HandlerOpts` field, not a public wrapper type
- Per-cycle `gatherCycle` object, so a new cycle never overwrites result fields while prior waiters are still reading them
- Mutex-based ref counting: `c.cycle = nil` is cleared before `cy.done()` is called, ruling out double-done on a stale pointer
- `close(cy.ready)` as the visibility barrier: `cy.mfs/err/done` are safely readable after `<-cy.ready` without additional locking
- Zero overhead when `false`

Changes
- `prometheus/promhttp/http.go`: add `coalescingGatherer` + `gatherCycle` types; add `HandlerOpts.CoalesceGather`; wire into `HandlerForTransactional`
- `prometheus/promhttp/http_test.go`

Test plan
- `go test -race -count=1 ./prometheus/promhttp/...`: all pass, no data races
- `go vet ./prometheus/...`: clean
- `TestCoalesceGatherGoroutineLeakFree` uses `goleak.VerifyNone` to assert no leaked goroutines under concurrent slow-collector load
- `TestCoalesceGatherDoneCalledExactlyOnce` verifies the `TransactionalGatherer` contract: `done()` calls == `Gather()` calls regardless of concurrency

Closes #1477