fix(reposerver): context-aware revision lock to prevent convoy deadlock #26867
marionebl wants to merge 4 commits into argoproj:master
Conversation
Make Lock() in reposerver/repository/lock.go context-aware to prevent
convoy deadlocks under rapid commit bursts. Previously, sync.Cond.Wait()
blocked indefinitely with no cancellation, causing goroutines for newer
revisions to pile up behind the current revision.
Replace sync.Cond with sync.Mutex + chan struct{} broadcast channel and
use select to wait on both the broadcast and ctx.Done(), allowing callers
to cancel waiting via context.
Resolves argoproj#26866
Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Force-pushed e99f72e to 979fd5c
Force-pushed 7990021 to 74a7567
Codecov Report — Patch coverage is
@@ Coverage Diff @@
##           master   #26867      +/-   ##
==========================================
+ Coverage   63.17%   63.20%    +0.02%
==========================================
  Files         414      414
  Lines       56461    56468        +7
==========================================
+ Hits        35669    35688       +19
+ Misses      17422    17413        -9
+ Partials     3370     3367        -3
dudinea left a comment
@marionebl thank you for working on this! The change to the locking mechanism looks good to me; I have not found any problems with the locking change itself.
Please see my comment for a minor change request.
Some questions:
Do you have any data/graphs on performance improvement that this PR gives?
As a result of this PR, some Repo Server methods might start returning context.Canceled or context.DeadlineExceeded errors that they never returned before. Have you seen any new or unexpected user-visible behavior when running a loaded Argo CD instance with this PR, such as unexpected error messages, transient errors in the UI, or changes in error statuses on Applications or ApplicationSets?
alexymantha left a comment
LGTM too, thanks for fixing this!
ppapapetrou76 left a comment
I have a couple of concerns/discussion points:

1. The current implementation creates a new channel on every broadcast/lock release. In high-contention scenarios, this could create GC pressure. WDYT?
2. When a lock is released, all waiting goroutines try to acquire the lock at the same time. Only one succeeds, and the others go back to waiting. This could be optimized with a semaphore or a queue; otherwise, in theory, a goroutine might be delayed for a long time if it is repeatedly unlucky in acquiring the lock. WDYT?
3. Consider adding UTs for the following cases:
   - context cancellation during the init() callback
   - context cancellation after lock acquisition

   and ideally a UT with 100+ concurrent goroutines.
if notify {
    state.cond.Broadcast()
    close(state.broadcast)
    state.broadcast = make(chan struct{})
}
state.mu.Unlock()
How about changing the code as follows to avoid a race condition? After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it's being closed. This timing window is small but exists.
if notify {
    close(state.broadcast)
    state.mu.Unlock()
    // Create new channel while others are processing
    state.mu.Lock()
    state.broadcast = make(chan struct{})
    state.mu.Unlock()
} else {
    state.mu.Unlock()
}
> After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it's being closed

In the PR code, all reads/updates of the state.broadcast field are between Lock and Unlock calls, so how can another goroutine read an obsolete reference?
To my understanding, the change in GC pressure should be negligible - about 112 bytes per allocation. This shouldn't matter compared to the memory consumed for managing the git operations themselves (200 KB to 1 MB+).
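For what it's worth, a stdlib micro-benchmark can sanity-check the per-release cost of creating a fresh unbuffered channel, which is what the broadcast pattern allocates on every lock release. This is a sketch; the exact byte count varies by Go version and platform, so no specific number is asserted:

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	// Measure allocations for creating (and immediately discarding)
	// an unbuffered chan struct{}.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan struct{})
			close(ch)
		}
	})
	fmt.Printf("%d allocs/op, %d B/op\n", res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

Since channels always escape to the heap in current Go, this should report one small allocation per release, consistent with the "negligible" assessment above.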
I opted to fix just the bug at hand for now; maybe we can improve the waiting / queuing logic in a follow up PR?
This would imply changing
I'm not sure how I'd achieve this. Once Lock returns (closer, nil), the caller owns the lock. The Lock function is done - it has no further interaction with the context. Cancellation after that point is the caller's responsibility (they call closer.Close()).
Will do, adapting the test.
Force-pushed 74a7567 to 9ccd941
- Move convoy tests from lock_convoy_test.go into lock_test.go
- Add early context cancellation check at start of Lock()

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 9ccd941 to 38ebbc9
No - the problem doesn't surface as a performance problem, but as a livelock. The affected repo server pod would stop consuming resources and all GenerateManifest calls would time out.
We haven't observed any changes in user-visible behaviour in our internal testing.
Fixes testifylint lint errors requiring require instead of assert for error assertions. Signed-off-by: Mario Nebl <hello@mario-nebl.de> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- Make Lock() in reposerver/repository/lock.go context-aware to prevent convoy deadlocks under rapid commit bursts
- Replace sync.Cond (blocks indefinitely) with sync.Mutex + chan struct{} broadcast channel and select on ctx.Done()
- Pass ctx through all 7 Lock() call sites in repository.go

Resolves #26866
Test plan

- Unit tests (TestLock_WaiterForDifferentRevision_CannotBeUnblocked, TestLock_ConvoyFormsUnderSequentialRevisions)
- go build ./reposerver/... succeeds
- go build ./controller/... succeeds