Skip to content

fix: Send sync operation in a separate event to prevent overwriting#834

Open
chetan-rns wants to merge 4 commits intoargoproj-labs:mainfrom
chetan-rns:op-race
Open

fix: Send sync operation in a separate event to prevent overwriting#834
chetan-rns wants to merge 4 commits intoargoproj-labs:mainfrom
chetan-rns:op-race

Conversation

@chetan-rns
Copy link
Copy Markdown
Collaborator

@chetan-rns chetan-rns commented Mar 25, 2026

What does this PR do / why we need it:

Currently, changes to the operation field are sent via the regular spec update events. However, when the agent and the principal are disconnected, incoming spec update events will overwrite the old events in the queue, and the operation update may never reach the agent. In this PR, we send the operation updates in a separate event type so that it won't be overwritten by regular spec updates.

Which issue(s) this PR fixes:

Fixes #809

How to test changes / Special notes to the reviewer:
Steps to reproduce are provided in the issue

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

Summary by CodeRabbit

  • New Features
    • Introduced a SetOperation event to deliver Application operations separately from spec updates; trace context is included for both events.
  • Behavior Changes
    • Spec update events now have operations stripped; operation transitions (nil→non-nil) are emitted as a separate SetOperation event.
    • Managed agents validate incoming operation source identity and reject mismatches to prevent unintended overrides.
  • Tests
    • Added tests covering SetOperation emission and handling in managed and autonomous modes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d25d457b-c13c-435f-94c1-67c02c6b4ed2

📥 Commits

Reviewing files that changed from the base of the PR and between d1e4d61 and 0f153ea.

📒 Files selected for processing (6)
  • agent/inbound.go
  • agent/inbound_test.go
  • internal/event/event.go
  • internal/manager/application/application.go
  • principal/callbacks.go
  • principal/callbacks_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/event/event.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • principal/callbacks.go
  • principal/callbacks_test.go
  • internal/manager/application/application.go
  • agent/inbound_test.go

📝 Walkthrough

Walkthrough

Adds a new SetOperation event and end-to-end handling: the principal emits a separate SetOperation when an Application's .operation transitions nil→non-nil; the agent handles event.SetOperation and calls ApplicationManager.SetOperation, which patches only the .operation field.

Changes

Cohort / File(s) Summary
Event Infra
internal/event/event.go
Added SetOperation event type constant.
Principal callbacks & tests
principal/callbacks.go, principal/callbacks_test.go
On Application update nil→non-nil operation: emit SpecUpdate with Operation cleared, then emit SetOperation (trace injected); tests updated to expect both events.
Agent inbound handler & tests
agent/inbound.go, agent/inbound_test.go
Handle event.SetOperation in processIncomingApplication: validate source UID in managed mode, call appManager.SetOperation; added tests for managed/autonomous flows and UID-mismatch rejection.
Application manager
internal/manager/application/application.go
Added ApplicationManager.SetOperation(ctx, incoming *Application) to compute/apply a patch that updates only .operation (agent-role gating, namespace handling, ignore-change call).

Sequence Diagram

sequenceDiagram
    participant Principal as Principal
    participant Agent as Agent
    participant AppMgr as ApplicationManager

    Principal->>Principal: Detect Application.Operation nil→non-nil
    Principal->>Agent: Send SpecUpdate (Operation cleared) [trace]
    Principal->>Agent: Send SetOperation (Operation present) [trace]
    Agent->>Agent: processIncomingApplication(event.SetOperation)
    Agent->>AppMgr: SetOperation(ctx, incoming Application)
    AppMgr->>AppMgr: Compute patch updating only .operation
    AppMgr->>AppMgr: Apply update/patch
    AppMgr-->>Agent: Return updated Application / error
    Agent->>Agent: Log result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • jannfis
  • mikeshng

Poem

🐰 I split the ops and hopped away,
Spec first, then op to safely play.
No more lost sync when links disconnect,
A tiny patch to keep things correct,
Hoppity hop — the agent's saved the day! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the primary change: sending sync operation via a separate event to prevent overwriting.
Linked Issues check ✅ Passed The PR implements a solution for issue #809 by creating a separate SetOperation event to prevent operation updates from being overwritten by regular spec updates.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing issue #809 by introducing the SetOperation event handling and splitting operation updates from spec updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent/inbound.go`:
- Around line 230-236: The SetOperation branch currently routes autonomous
agents to a.appManager.UpdateOperation which copies annotations, labels, and
status and can clobber unrelated state; change the autonomous branch (case
types.AgentModeAutonomous in the SetOperation handling) to call a dedicated
operation-only updater on a.appManager (create a method named similarly to
SetManagedOperation, e.g., SetAutonomousOperation or SetOperationOnly) that only
writes the operation field and does not copy annotations/labels/status, then
call that new method instead of UpdateOperation so the SetOperation event
remains operation-scoped.

In `@principal/ha_integration.go`:
- Around line 561-563: The switch is wrongly sending event.SetOperation through
server.appManager.Upsert which clears .Operation for principal-role managers;
remove SetOperation from the Upsert case and add a dedicated branch that calls a
replica-specific persistence method (e.g., server.appManager.PersistSetOperation
or server.appManager.UpdateOperationReplica) after h.remapAppSetOwnerRefs(ctx,
app), ensuring that only the Operation field is stored and that the normal
Upsert path (server.appManager.Upsert) is not used for SetOperation so the
pending operation is retained on the standby; implement the new persistence
method to update/persist only the Operation on the replica without invoking the
principal-role clearing logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7676ebfe-50e3-413c-a953-b9238ecc647a

📥 Commits

Reviewing files that changed from the base of the PR and between b2c37f3 and 94058fb.

📒 Files selected for processing (7)
  • agent/inbound.go
  • agent/inbound_test.go
  • internal/event/event.go
  • internal/manager/application/application.go
  • principal/callbacks.go
  • principal/callbacks_test.go
  • principal/ha_integration.go

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 37.20930% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.53%. Comparing base (63a79be) to head (0f153ea).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
internal/manager/application/application.go 0.00% 23 Missing ⚠️
agent/inbound.go 60.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #834      +/-   ##
==========================================
- Coverage   46.55%   46.53%   -0.03%     
==========================================
  Files         122      122              
  Lines       17456    17495      +39     
==========================================
+ Hits         8127     8141      +14     
- Misses       8585     8609      +24     
- Partials      744      745       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/manager/application/application.go (1)

589-623: Add focused tests for SetOperation behavior.

Given this method is central to issue #809 and changed-line coverage is currently missing for this file, please add direct tests for: operation-only patching, namespace mapping behavior, and operationToUse termination guard handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/manager/application/application.go` around lines 589 - 623, Add
focused unit tests for ApplicationManager.SetOperation to cover (1)
operation-only patching by asserting update is invoked with only Operation
changed (exercise update callback and jsondiff behavior via update or a test
double), (2) namespace mapping when destinationBasedMapping is false by creating
incoming Application without namespace and verifying SetNamespace(m.namespace)
results in namespaced update, and (3) operationToUse termination guard by
testing cases where existing.Operation prevents overwriting (e.g., terminal
state) and ensuring behavior matches operationToUse logic; use symbols
SetOperation, operationToUse, update, IgnoreChange, destinationBasedMapping,
QualifiedName and ResourceVersion to locate and assert expected side-effects and
logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent/inbound.go`:
- Around line 230-235: In the SetOperation branch, prevent setting .operation
when the incoming app is from a different source UID in managed mode: before
calling a.appManager.SetOperation(a.context, incomingApp) verify the app exists
and that its source UID matches (use the existing exists/sourceUIDMatch logic or
helper) and only call SetOperation if the UIDs match; if the UIDs differ, skip
the SetOperation and log/return an error to block the mutation for incomingApp
in managed mode.

---

Nitpick comments:
In `@internal/manager/application/application.go`:
- Around line 589-623: Add focused unit tests for
ApplicationManager.SetOperation to cover (1) operation-only patching by
asserting update is invoked with only Operation changed (exercise update
callback and jsondiff behavior via update or a test double), (2) namespace
mapping when destinationBasedMapping is false by creating incoming Application
without namespace and verifying SetNamespace(m.namespace) results in namespaced
update, and (3) operationToUse termination guard by testing cases where
existing.Operation prevents overwriting (e.g., terminal state) and ensuring
behavior matches operationToUse logic; use symbols SetOperation, operationToUse,
update, IgnoreChange, destinationBasedMapping, QualifiedName and ResourceVersion
to locate and assert expected side-effects and logs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4361d30e-d84a-45ec-b4b2-b49c5899074f

📥 Commits

Reviewing files that changed from the base of the PR and between 94058fb and 7e88031.

📒 Files selected for processing (3)
  • agent/inbound.go
  • agent/inbound_test.go
  • internal/manager/application/application.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • agent/inbound_test.go

@drewbailey
Copy link
Copy Markdown
Collaborator

drewbailey commented Mar 27, 2026

thanks @chetan-rns I was encountering this in a slightly different scenario from principal disconnect in #840. I've tested this branch and it fixes the issue 🙌

s.ha.ForwardEventForReplication(event.New(ev, event.TargetApplication), agentName, replication.DirectionOutbound)
logCtx.WithField("event_type", ev.Type()).Tracef("Added app to send queue, total length now %d", q.Len())

// When a new operation appears (nil → non-nil), send it as a separate
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where the operation field is set while the principal controller is not running (pod is stopped due to OOM, node is down, operator is being upgraded, etc)?

If the controller is not running, it will not see the watch event which tells it that the .operation field of the Application was modified.

We will still reconcile the Application on startup, but the principal has no way of knowing whether or not the .operation field change was previously communicated to the agent.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be fair to remove any existing .operation field from the Application CR on startup?

If it has been transmitted to the agent prior, the sync will proceed on the agent's cluster and the status will be propagated to the principal eventually. For the UI, it would look like no sync in progress though.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On restart, the principal's informer sends a Create event to the agent. So, the operation will be propagated to the agent if it still exists on the principal. The agent will run the operation and clear the field in the next status update.

There can be two cases where the operation field could still exist when the principal restarts:

  1. The user triggered the operation when the principal was down. This is handled because the operation will be propagated via the Create event once the principal restarts.
  2. The principal sent the operation to the agent and shut down before receiving the status update (which clears the operation field). So, the operation field still exists when the principal restarts. There is a small window, where the principal will send the Create event (with operation) before it receives the status update. So, there is a chance the operation might run again on the agent cluster. This might be okay given it happens rarely?

Copy link
Copy Markdown
Member

@jgwest jgwest Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be okay given it happens rarely?

I'm fine with this.

Brainstorming a possible solution (maybe for another PR, if we like the idea)... If we wanted to keep track of whether we had already sent a specific operation request to agent, we could do something like this:

  1. User triggers an operation on principal by modifying operation field, as per usual.

    • When principal (in managed mode) sees that Application .operation field has changed (via informer), it injects a randomly generated UUID in .operation.info field, something like "agent-operation-uuid: (randomly generated uuid)".
      • That value is written to principal's Application CR .operation field.
      • AFAICT it is safe for us to inject custom values into this field, but my search was not exhaustive.
    • Next, after that new value is written, agent sends the full contents of that operation (including the new field) via SetOperation, as per this PR.
  2. Agent receives SetOperation from principal, from above.

    • When it receives SetOperation, it compares the agent-operation-uuid: (...) value in .operation.info from SetOperation (from above) with what is currently set on the agent's Application CR .operation.info field.
      • If the UUIDs match, then the SetOperation is ignored (since we assume it was already sent)
      • If the UUIDs don't match, then SetOperation is processed as usual.

Does this make sense?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of persisting the UUID with the resource since it survives the principal/agent restarts. But we should take care not to overwrite the user-added operation info. One more option is to use a dedicated label/annotation to track this operation ID.

Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

User-triggered sync operation is never sent when principal and managed-agent are temporarily disconnected

5 participants