fix: Send sync operation in a separate event to prevent overwriting#834
fix: Send sync operation in a separate event to prevent overwriting#834chetan-rns wants to merge 4 commits intoargoproj-labs:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds a new Changes
Sequence DiagramsequenceDiagram
participant Principal as Principal
participant Agent as Agent
participant AppMgr as ApplicationManager
Principal->>Principal: Detect Application.Operation nil→non-nil
Principal->>Agent: Send SpecUpdate (Operation cleared) [trace]
Principal->>Agent: Send SetOperation (Operation present) [trace]
Agent->>Agent: processIncomingApplication(event.SetOperation)
Agent->>AppMgr: SetOperation(ctx, incoming Application)
AppMgr->>AppMgr: Compute patch updating only .operation
AppMgr->>AppMgr: Apply update/patch
AppMgr-->>Agent: Return updated Application / error
Agent->>Agent: Log result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent/inbound.go`:
- Around line 230-236: The SetOperation branch currently routes autonomous
agents to a.appManager.UpdateOperation which copies annotations, labels, and
status and can clobber unrelated state; change the autonomous branch (case
types.AgentModeAutonomous in the SetOperation handling) to call a dedicated
operation-only updater on a.appManager (create a method named similarly to
SetManagedOperation, e.g., SetAutonomousOperation or SetOperationOnly) that only
writes the operation field and does not copy annotations/labels/status, then
call that new method instead of UpdateOperation so the SetOperation event
remains operation-scoped.
In `@principal/ha_integration.go`:
- Around line 561-563: The switch is wrongly sending event.SetOperation through
server.appManager.Upsert which clears .Operation for principal-role managers;
remove SetOperation from the Upsert case and add a dedicated branch that calls a
replica-specific persistence method (e.g., server.appManager.PersistSetOperation
or server.appManager.UpdateOperationReplica) after h.remapAppSetOwnerRefs(ctx,
app), ensuring that only the Operation field is stored and that the normal
Upsert path (server.appManager.Upsert) is not used for SetOperation so the
pending operation is retained on the standby; implement the new persistence
method to update/persist only the Operation on the replica without invoking the
principal-role clearing logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7676ebfe-50e3-413c-a953-b9238ecc647a
📒 Files selected for processing (7)
agent/inbound.goagent/inbound_test.gointernal/event/event.gointernal/manager/application/application.goprincipal/callbacks.goprincipal/callbacks_test.goprincipal/ha_integration.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #834 +/- ##
==========================================
- Coverage 46.55% 46.53% -0.03%
==========================================
Files 122 122
Lines 17456 17495 +39
==========================================
+ Hits 8127 8141 +14
- Misses 8585 8609 +24
- Partials 744 745 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/manager/application/application.go (1)
589-623: Add focused tests forSetOperationbehavior.Given this method is central to issue
#809and changed-line coverage is currently missing for this file, please add direct tests for: operation-only patching, namespace mapping behavior, andoperationToUsetermination guard handling.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/manager/application/application.go` around lines 589 - 623, Add focused unit tests for ApplicationManager.SetOperation to cover (1) operation-only patching by asserting update is invoked with only Operation changed (exercise update callback and jsondiff behavior via update or a test double), (2) namespace mapping when destinationBasedMapping is false by creating incoming Application without namespace and verifying SetNamespace(m.namespace) results in namespaced update, and (3) operationToUse termination guard by testing cases where existing.Operation prevents overwriting (e.g., terminal state) and ensuring behavior matches operationToUse logic; use symbols SetOperation, operationToUse, update, IgnoreChange, destinationBasedMapping, QualifiedName and ResourceVersion to locate and assert expected side-effects and logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent/inbound.go`:
- Around line 230-235: In the SetOperation branch, prevent setting .operation
when the incoming app is from a different source UID in managed mode: before
calling a.appManager.SetOperation(a.context, incomingApp) verify the app exists
and that its source UID matches (use the existing exists/sourceUIDMatch logic or
helper) and only call SetOperation if the UIDs match; if the UIDs differ, skip
the SetOperation and log/return an error to block the mutation for incomingApp
in managed mode.
---
Nitpick comments:
In `@internal/manager/application/application.go`:
- Around line 589-623: Add focused unit tests for
ApplicationManager.SetOperation to cover (1) operation-only patching by
asserting update is invoked with only Operation changed (exercise update
callback and jsondiff behavior via update or a test double), (2) namespace
mapping when destinationBasedMapping is false by creating incoming Application
without namespace and verifying SetNamespace(m.namespace) results in namespaced
update, and (3) operationToUse termination guard by testing cases where
existing.Operation prevents overwriting (e.g., terminal state) and ensuring
behavior matches operationToUse logic; use symbols SetOperation, operationToUse,
update, IgnoreChange, destinationBasedMapping, QualifiedName and ResourceVersion
to locate and assert expected side-effects and logs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4361d30e-d84a-45ec-b4b2-b49c5899074f
📒 Files selected for processing (3)
agent/inbound.goagent/inbound_test.gointernal/manager/application/application.go
🚧 Files skipped from review as they are similar to previous changes (1)
- agent/inbound_test.go
|
thanks @chetan-rns I was encountering this in a slightly different scenario from principal disconnect in #840. I've tested this branch and it fixes the issue 🙌 |
| s.ha.ForwardEventForReplication(event.New(ev, event.TargetApplication), agentName, replication.DirectionOutbound) | ||
| logCtx.WithField("event_type", ev.Type()).Tracef("Added app to send queue, total length now %d", q.Len()) | ||
|
|
||
| // When a new operation appears (nil → non-nil), send it as a separate |
There was a problem hiding this comment.
What about the case where the operation field is set while the principal controller is not running (pod is stopped due to OOM, node is down, operator is being upgraded, etc)?
If the controller is not running, it will not see the watch event which tells it that the .operation field of the Application was modified.
We will still reconcile the Application on startup, but the principal has no way of knowing whether or not the .operation field change was previously communicated to the agent.
There was a problem hiding this comment.
Would it be fair to remove any existing .operation field from the Application CR on startup?
If it has been transmitted to the agent prior, the sync will proceed on the agent's cluster and the status will be propagated to the principal eventually. For the UI, it would look like no sync in progress though.
There was a problem hiding this comment.
On restart, the principal's informer sends a Create event to the agent. So, the operation will be propagated to the agent if it still exists on the principal. The agent will run the operation and clear the field in the next status update.
There can be two cases where the operation field could still exist when the principal restarts:
- The user triggered the operation when the principal was down. This is handled because the operation will be propagated via the
Createevent once the principal restarts. - The principal sent the operation to the agent and shut down before receiving the status update (which clears the operation field). So, the operation field still exists when the principal restarts. There is a small window, where the principal will send the Create event (with operation) before it receives the status update. So, there is a chance the operation might run again on the agent cluster. This might be okay given it happens rarely?
There was a problem hiding this comment.
This might be okay given it happens rarely?
I'm fine with this.
Brainstorming a possible solution (maybe for another PR, if we like the idea)... If we wanted to keep track of whether we had already sent a specific operation request to agent, we could do something like this:
-
User triggers an operation on principal by modifying operation field, as per usual.
- When principal (in managed mode) sees that Application
.operationfield has changed (via informer), it injects a randomly generated UUID in.operation.infofield, something like "agent-operation-uuid: (randomly generated uuid)".- That value is written to principal's Application CR .operation field.
- AFAICT it is safe for us to inject custom values into this field, but my search was not exhaustive.
- Next, after that new value is written, agent sends the full contents of that operation (including the new field) via SetOperation, as per this PR.
- When principal (in managed mode) sees that Application
-
Agent receives SetOperation from principal, from above.
- When it receives SetOperation, it compares the
agent-operation-uuid: (...)value in.operation.infofrom SetOperation (from above) with what is currently set on the agent's Application CR.operation.infofield.- If the UUIDs match, then the SetOperation is ignored (since we assume it was already sent)
- If the UUIDs don't match, then SetOperation is processed as usual.
- When it receives SetOperation, it compares the
Does this make sense?
There was a problem hiding this comment.
I like the idea of persisting the UUID with the resource since it survives the principal/agent restarts. But we should take care not to overwrite the user-added operation info. One more option is to use a dedicated label/annotation to track this operation ID.
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
What does this PR do / why we need it:
Currently, changes to the operation field are sent via the regular spec update events. However, when the agent and the principal are disconnected, incoming spec update events will overwrite the old events in the queue, and the operation update may never reach the agent. In this PR, we send the operation updates in a separate event type so that it won't be overwritten by regular spec updates.
Which issue(s) this PR fixes:
Fixes #809
How to test changes / Special notes to the reviewer:
Steps to reproduce are provided in the issue
Checklist
Summary by CodeRabbit