Description
Sporadically, when the agent field of a workload is updated to move the workload to another agent, the new agent reports the wrong workload state Stopping(RequestedAtRuntime).
The wrong workload state is visible in the output of the ank-cli. For example, if you move the workload new_workload from agent_A to agent_B, agent_B forwards the wrong workload state Stopping(RequestedAtRuntime) even though new_workload is running on agent_B:
> ank set state desiredState.workloads.new_workload ./target/new_state.yaml
Successfully applied the manifest(s).
Waiting for workload(s) to reach desired states (press Ctrl+C to interrupt).
WORKLOAD NAME AGENT RUNTIME EXECUTION STATE ADDITIONAL INFO
new_workload agent_B podman Stopping(RequestedAtRuntime)
new_workload agent_A Removed
^C
Current Behavior
When a workload is moved to another agent (by updating the agent field of the workload), a wrong workload state is reported. This happens sporadically.
Expected Behavior
The new agent reports the actual workload state of the moved workload, e.g. Running(Ok) once the workload is running on agent_B.
Steps to Reproduce
- ank-server -k -c path/to/a/manifest/with/one/workload.yaml
- ank-agent -k -n agent_A
- ank-agent -k -n agent_B
- ank set state desiredState.workloads.new_workload state_with_updated_agent.yaml
In wait mode, you can see that a wrong workload state is reported for the moved workload.
Context (Environment)
All supported platforms
Logs
The logs show that the new agent of the workload forwards the wrong workload state, although the runtime returns Running(Ok) for the workload:
[2025-12-01T14:18:51Z DEBUG ank_agent::agent_manager] Storing and forwarding local workload state 'WorkloadStateSpec { instance_name: WorkloadInstanceNameSpec { workload_name: "new_workload", agent_name: "agent_B", id: "7d5537dfe2f8036c80908f50e2d35f77b1f67c3600eaff45fefb386fa9fe41f0" }, execution_state: ExecutionStateSpec { additional_info: "", execution_state_enum: Stopping(RequestedAtRuntime) } }'.
...
[2025-12-01T14:18:52Z TRACE ank_agent::runtime_connectors::podman::podman_runtime] Returning the state 'Running(Ok)' for the workload 'ffc1d6193c4d7d425677f7c98f4c9bcd5b2e02cb0f4da757adf6dfddc0e76f8d'
Additional Information
The issue is in the workload state transition implementation ankaios_api/src/ank_base/workload_state.rs, which does not validate the agent name of the workload state.
The condition there says that when the current workload state is Stopping(RequestedAtRuntime) and the incoming one is Running(Ok), the current workload state is kept, forwarded, and sent to the Ankaios server. The wrong state arises because the agent from which the workload is removed first forwards Stopping(RequestedAtRuntime). If the new agent deploys the workload very quickly, before it receives the Removed state from the old agent, it matches the new Running(Ok) against the Stopping(RequestedAtRuntime) of the other agent in its internal workload state map and returns Stopping(RequestedAtRuntime) again.
The solution is to also validate the workload states against the agent's own agent name.
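A minimal sketch of the proposed check, to make the race easier to discuss. It is not the actual code from workload_state.rs: the types ExecState and WorkloadState and the function transition are hypothetical, simplified stand-ins, and the sketch assumes that the stored state and the incoming state each carry the name of the agent that produced them.

```rust
// Hypothetical, simplified stand-ins for the Ankaios workload state types.
#[derive(Debug, Clone, PartialEq)]
enum ExecState {
    Running,           // stands in for Running(Ok)
    StoppingRequested, // stands in for Stopping(RequestedAtRuntime)
}

#[derive(Debug, Clone, PartialEq)]
struct WorkloadState {
    workload_name: String,
    agent_name: String,
    state: ExecState,
}

/// Decides which state to store and forward for an incoming state.
/// `current` is the entry already held in the (assumed) internal state map.
fn transition(current: Option<&WorkloadState>, incoming: WorkloadState) -> WorkloadState {
    match current {
        // Keep Stopping(RequestedAtRuntime) over an incoming Running(Ok)
        // only if the stored state belongs to the same agent as the
        // incoming one. Without the agent name comparison, a stale state
        // of the old agent would mask the Running(Ok) of the new agent.
        Some(cur)
            if cur.agent_name == incoming.agent_name
                && cur.state == ExecState::StoppingRequested
                && incoming.state == ExecState::Running =>
        {
            cur.clone()
        }
        // In all other cases the incoming state wins.
        _ => incoming,
    }
}
```

With this check, a stored Stopping(RequestedAtRuntime) from agent_A no longer overrides a Running(Ok) coming from agent_B, while a Stopping(RequestedAtRuntime) that agent_B recorded itself is still kept as before.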
Final result
To be filled by the one closing the issue.