
fix(reposerver): context-aware revision lock to prevent convoy deadlock #26867

Open

marionebl wants to merge 4 commits into argoproj:master from marionebl:fix/reposerver-convoy-deadlock

Conversation

@marionebl

Summary

  • Make Lock() in reposerver/repository/lock.go context-aware to prevent convoy deadlocks under rapid commit bursts
  • Replace sync.Cond (blocks indefinitely) with sync.Mutex + chan struct{} broadcast channel and select on ctx.Done()
  • Pass ctx through all 7 Lock() call sites in repository.go

Resolves #26866

Test plan

  • New convoy deadlock tests pass (TestLock_WaiterForDifferentRevision_CannotBeUnblocked, TestLock_ConvoyFormsUnderSequentialRevisions)
  • All existing lock tests pass unchanged
  • go build ./reposerver/... succeeds
  • go build ./controller/... succeeds

@marionebl marionebl requested a review from a team as a code owner March 17, 2026 10:56
@bunnyshell bot commented Mar 17, 2026

🔴 Preview Environment stopped on Bunnyshell

Make Lock() in reposerver/repository/lock.go context-aware to prevent
convoy deadlocks under rapid commit bursts. Previously, sync.Cond.Wait()
blocked indefinitely with no cancellation, causing goroutines for newer
revisions to pile up behind the current revision.

Replace sync.Cond with sync.Mutex + chan struct{} broadcast channel and
use select to wait on both the broadcast and ctx.Done(), allowing callers
to cancel waiting via context.

Resolves argoproj#26866

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from e99f72e to 979fd5c on March 17, 2026 at 11:39
@marionebl marionebl requested a review from a team as a code owner March 17, 2026 12:08
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch 3 times, most recently from 7990021 to 74a7567 on March 17, 2026 at 12:39
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 91.42857% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.20%. Comparing base (6b35246) to head (318ef90).

Files with missing lines | Patch % | Lines
reposerver/repository/lock.go | 88.88% | 1 Missing and 1 partial ⚠️
reposerver/repository/repository.go | 94.11% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #26867      +/-   ##
==========================================
+ Coverage   63.17%   63.20%   +0.02%     
==========================================
  Files         414      414              
  Lines       56461    56468       +7     
==========================================
+ Hits        35669    35688      +19     
+ Misses      17422    17413       -9     
+ Partials     3370     3367       -3     

☔ View full report in Codecov by Sentry.

@dudinea (Member) left a comment


@marionebl thank you for working on this! The change to the locking mechanism looks good to me; I have not found any problems with the locking change itself.

Please see my comment for a minor change request.

Some questions:

Do you have any data/graphs on performance improvement that this PR gives?

As a result of the PR, some Repo Server methods might start returning context.Canceled or context.DeadlineExceeded errors which they never did before. Have you seen any new or unexpected user-visible behavior when running a loaded Argo CD instance with this PR, such as unexpected error messages or transient errors in the UI, or changes in error statuses in Applications or ApplicationSets?

@alexymantha (Member) left a comment


LGTM too, thanks for fixing this!

@ppapapetrou76 (Contributor) left a comment


I have a couple of concerns/discussion points

  1. The current implementation creates a new channel on every broadcast/lock release. In high-contention scenarios, this could create GC pressure. WDYT?

  2. When a lock is released, all waiting goroutines will try to acquire the lock at the same time. Only one succeeds, and the others go back to waiting. This could be optimized using a semaphore or a queue; otherwise, in theory, a goroutine might be delayed too long if it's unlucky to acquire the lock. WDYT?

  3. Consider adding UTs for the following cases

  • context cancellation during init() callback
  • context cancellation after lock acquisition

and ideally a UT with 100+ concurrent goroutines

Comment on lines 46 to +50

  if notify {
-     state.cond.Broadcast()
+     close(state.broadcast)
+     state.broadcast = make(chan struct{})
  }
  state.mu.Unlock()
Contributor

How about changing the code as follows to avoid a race condition? After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it is being closed. The timing window is small, but it exists.

if notify {
    close(state.broadcast)
    state.mu.Unlock()
    // Create new channel while others are processing
    state.mu.Lock()
    state.broadcast = make(chan struct{})
    state.mu.Unlock()
} else {
    state.mu.Unlock()
}

Member

@ppapapetrou76

After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it's being closed

In the PR code, all reads and updates of the state.broadcast field happen between Lock and Unlock calls, so how could another goroutine read an obsolete reference?

@marionebl (Author)

marionebl commented Mar 23, 2026

The current implementation creates a new channel on every broadcast/lock release
In high-contention scenarios, this could create GC pressure. WDYT?

To my understanding, the change in GC pressure should be negligible - roughly 112 bytes per allocation. This shouldn't matter compared to the memory consumed by the git operations themselves (200 KB to 1 MB+).

When a lock is released, all waiting goroutines will try to acquire the lock at the same time. Only one succeeds, and the others go back to waiting. This could be optimized using a semaphore or a queue; otherwise, in theory, a goroutine might be delayed too long if it's unlucky to acquire the lock. WDYT?

I opted to fix just the bug at hand for now; maybe we can improve the waiting / queuing logic in a follow up PR?

Consider adding UTs for the following cases

context cancellation during init() callback

This would imply changing init to receive context, which it does not today. Can we split this out into a separate PR?

context cancellation after lock acquisition

I'm not sure how I'd achieve this. Once Lock returns (closer, nil), the caller owns the lock. The Lock function is done - it has no further interaction with the context. Cancellation after that point is the caller's responsibility (they call closer.Close()).

and ideally a UT with 100+ concurrent goroutines

will do, adapting the test

@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from 74a7567 to 9ccd941 on March 23, 2026 at 09:49
- Move convoy tests from lock_convoy_test.go into lock_test.go
- Add early context cancellation check at start of Lock()

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from 9ccd941 to 38ebbc9 on March 23, 2026 at 09:51
@marionebl (Author)

Do you have any data/graphs on performance improvement that this PR gives?

No - the problem doesn't surface as a performance problem but as a livelock. The affected repo server pod would stop consuming resources and all GenerateManifest calls would time out.

As a result of the PR some Repo Server methods might start returning context Canceled or DeadLine exceeded errors which they never did before. Have you seen any new/unexpected user visible behavior when running a loaded Argo CD instance with this PR: like any unexpected error messages or transient errors in UI, changes in error statuses in Applications or ApplicationSets?

We haven't observed any changes in user-visible behaviour in our internal testing.

marionebl and others added 2 commits March 23, 2026 12:10
Fixes testifylint errors by using require instead of assert for error assertions.

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Dead lock in reposerver for commit bursts in quick succession

4 participants