
fix(reposerver): context-aware revision lock to prevent convoy deadlock #26867

Open

marionebl wants to merge 4 commits into argoproj:master from marionebl:fix/reposerver-convoy-deadlock

Conversation

@marionebl

Summary

  • Make Lock() in reposerver/repository/lock.go context-aware to prevent convoy deadlocks under rapid commit bursts
  • Replace sync.Cond (blocks indefinitely) with sync.Mutex + chan struct{} broadcast channel and select on ctx.Done()
  • Pass ctx through all 7 Lock() call sites in repository.go

Resolves #26866

Test plan

  • New convoy deadlock tests pass (TestLock_WaiterForDifferentRevision_CannotBeUnblocked, TestLock_ConvoyFormsUnderSequentialRevisions)
  • All existing lock tests pass unchanged
  • go build ./reposerver/... succeeds
  • go build ./controller/... succeeds

@marionebl marionebl requested a review from a team as a code owner March 17, 2026 10:56
@bunnyshell bot commented Mar 17, 2026

🔴 Preview Environment stopped on Bunnyshell

Make Lock() in reposerver/repository/lock.go context-aware to prevent
convoy deadlocks under rapid commit bursts. Previously, sync.Cond.Wait()
blocked indefinitely with no cancellation, causing goroutines for newer
revisions to pile up behind the current revision.

Replace sync.Cond with sync.Mutex + chan struct{} broadcast channel and
use select to wait on both the broadcast and ctx.Done(), allowing callers
to cancel waiting via context.

Resolves argoproj#26866

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from e99f72e to 979fd5c on March 17, 2026 at 11:39
@marionebl marionebl requested a review from a team as a code owner March 17, 2026 12:08
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch 3 times, most recently from 7990021 to 74a7567 on March 17, 2026 at 12:39
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 91.42857% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.20%. Comparing base (6b35246) to head (318ef90).

Files with missing lines | Patch % | Lines
reposerver/repository/lock.go | 88.88% | 1 Missing and 1 partial ⚠️
reposerver/repository/repository.go | 94.11% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #26867      +/-   ##
==========================================
+ Coverage   63.17%   63.20%   +0.02%     
==========================================
  Files         414      414              
  Lines       56461    56468       +7     
==========================================
+ Hits        35669    35688      +19     
+ Misses      17422    17413       -9     
+ Partials     3370     3367       -3     

☔ View full report in Codecov by Sentry.

@dudinea (Member) left a comment


@marionebl thank you for working on this! The change to the locking mechanism looks good to me; I have not found any problems with the locking change itself.

Please see my comment for a minor change request.

Some questions:

Do you have any data/graphs on performance improvement that this PR gives?

As a result of the PR, some Repo Server methods might start returning context.Canceled or context.DeadlineExceeded errors which they never did before. Have you seen any new or unexpected user-visible behavior when running a loaded Argo CD instance with this PR, such as unexpected error messages or transient errors in the UI, or changes in error statuses in Applications or ApplicationSets?

@alexymantha (Member) left a comment


LGTM too, thanks for fixing this!

@ppapapetrou76 (Contributor) left a comment


I have a couple of concerns/discussion points

  1. The current implementation creates a new channel on every broadcast/lock release. In high-contention scenarios, this could create GC pressure. WDYT?

  2. When a lock is released, all waiting goroutines will try to acquire the lock at the same time. Only one succeeds, and the others go back to waiting. This could be optimized using a semaphore or a queue; otherwise, in theory, a goroutine might be delayed too long if it's unlucky to acquire the lock. WDYT?

  3. Consider adding UTs for the following cases

  • context cancellation during init() callback
  • context cancellation after lock acquisition

and ideally a UT with 100+ concurrent goroutines

Comment on lines 46 to +50

  if notify {
-     state.cond.Broadcast()
+     close(state.broadcast)
+     state.broadcast = make(chan struct{})
  }
  state.mu.Unlock()
Contributor

How about changing the code as follows to avoid a race condition? After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it is being closed. The timing window is small, but it exists.

if notify {
    close(state.broadcast)
    state.mu.Unlock()
    // Create new channel while others are processing
    state.mu.Lock()
    state.broadcast = make(chan struct{})
    state.mu.Unlock()
} else {
    state.mu.Unlock()
}

Member

@ppapapetrou76

After unlocking state.mu, another goroutine could acquire the lock again and read the old broadcast channel reference just as it's being closed

In the PR code, all reads and updates of the state.broadcast field happen between Lock and Unlock calls, so how could another goroutine read an obsolete reference?

@marionebl (Author)

marionebl commented Mar 23, 2026

The current implementation creates a new channel on every broadcast/lock release
In high-contention scenarios, this could create GC pressure. WDYT?

To my understanding, the change in GC pressure should be negligible - roughly 112 bytes per allocation. This shouldn't matter compared to the memory consumed by the git operations themselves (200 KB to 1 MB+).

When a lock is released, all waiting goroutines will try to acquire the lock at the same time. Only one succeeds, and the others go back to waiting. This could be optimized using a semaphore or a queue; otherwise, in theory, a goroutine might be delayed too long if it's unlucky to acquire the lock. WDYT?

I opted to fix just the bug at hand for now; maybe we can improve the waiting / queuing logic in a follow up PR?

Consider adding UTs for the following cases

context cancellation during init() callback

This would imply changing init to receive context, which it does not today. Can we split this out into a separate PR?

context cancellation after lock acquisition

I'm not sure how I'd achieve this. Once Lock returns (closer, nil), the caller owns the lock. The Lock function is done - it has no further interaction with the context. Cancellation after that point is the caller's responsibility (they call closer.Close()).

and ideally a UT with 100+ concurrent goroutines

will do, adapting the test

@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from 74a7567 to 9ccd941 on March 23, 2026 at 09:49
- Move convoy tests from lock_convoy_test.go into lock_test.go
- Add early context cancellation check at start of Lock()

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marionebl marionebl force-pushed the fix/reposerver-convoy-deadlock branch from 9ccd941 to 38ebbc9 on March 23, 2026 at 09:51
@marionebl (Author)

Do you have any data/graphs on performance improvement that this PR gives?

No - the problem doesn't surface as a performance problem but as a livelock. The affected repo server pod would stop consuming resources and all GenerateManifest calls would time out.

As a result of the PR some Repo Server methods might start returning context Canceled or DeadLine exceeded errors which they never did before. Have you seen any new/unexpected user visible behavior when running a loaded Argo CD instance with this PR: like any unexpected error messages or transient errors in UI, changes in error statuses in Applications or ApplicationSets?

We haven't observed any changes in user-visible behaviour in our internal testing.

marionebl and others added 2 commits March 23, 2026 12:10
Fixes testifylint errors by using require instead of assert for error assertions.

Signed-off-by: Mario Nebl <hello@mario-nebl.de>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Dead lock in reposerver for commit bursts in quick succession

4 participants