Agent field update results in Stopping(RequestedAtRuntime) workload state reported from the new agent #634

@inf17101

Description

Sporadically, when the agent field of a workload is updated to move the workload to another agent, the new agent reports a wrong workload state, Stopping(RequestedAtRuntime), for that workload.

For example, if you switch the workload new_workload from agent_A to agent_B, agent_B forwards Stopping(RequestedAtRuntime) as the workload state even though new_workload is running on agent_B.

You can see the wrong workload state in the output of the ank-cli:

> ank set state desiredState.workloads.new_workload ./target/new_state.yaml 
Successfully applied the manifest(s).
Waiting for workload(s) to reach desired states (press Ctrl+C to interrupt).

WORKLOAD NAME   AGENT      RUNTIME     EXECUTION STATE                ADDITIONAL INFO
new_workload    agent_B    podman      Stopping(RequestedAtRuntime)                  
new_workload    agent_A                Removed                                       
^C

Current Behavior

When a workload is switched to another agent (by updating the agent field of the workload), a wrong workload state is reported. This happens sporadically.

Expected Behavior

Steps to Reproduce

  1. ank-server -k -c path/to/a/manifest/with/one/workload.yaml
  2. ank-agent -k -n agent_A
  3. ank-agent -k -n agent_B
  4. ank set state desiredState.workloads.new_workload state_with_updated_agent.yaml

In wait mode, you can see that a wrong workload state is reported for the switched workload.

Context (Environment)

All supported platforms

Logs

The logs show that the new agent of the workload forwards the wrong workload state, while the runtime connector reports Running(Ok) shortly afterwards:

[2025-12-01T14:18:51Z DEBUG ank_agent::agent_manager] Storing and forwarding local workload state 'WorkloadStateSpec { instance_name: WorkloadInstanceNameSpec { workload_name: "new_workload", agent_name: "agent_B", id: "7d5537dfe2f8036c80908f50e2d35f77b1f67c3600eaff45fefb386fa9fe41f0" }, execution_state: ExecutionStateSpec { additional_info: "", execution_state_enum: Stopping(RequestedAtRuntime) } }'.
...
[2025-12-01T14:18:52Z TRACE ank_agent::runtime_connectors::podman::podman_runtime] Returning the state 'Running(Ok)' for the workload 'ffc1d6193c4d7d425677f7c98f4c9bcd5b2e02cb0f4da757adf6dfddc0e76f8d'

Additional Information

The issue is in the workload state transition implementation in ankaios_api/src/ank_base/workload_state.rs, which does not take the agent name into account when comparing workload states.

The condition there says that when the current workload state is Stopping(RequestedAtRuntime) and the incoming one is Running(Ok), the current workload state is kept, forwarded, and sent to the Ankaios server. The wrong state appears because the agent from which the workload is removed forwards Stopping(RequestedAtRuntime) first. If the new agent deploys the workload quickly, before it receives the Removed state from the old agent, it matches its new Running(Ok) against the other agent's Stopping(RequestedAtRuntime) in its internal workload state map and returns Stopping(RequestedAtRuntime) again.

The solution is to also check the agent name when validating workload state transitions, so that an agent only compares an incoming state against states that belong to itself.
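
The race and the proposed check can be illustrated with a small, self-contained sketch. This is not the code from workload_state.rs: the types ExecutionState, WorkloadState and StateStore, the flat map keyed by workload name, and the check_agent flag are all assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the execution states mentioned in this issue.
#[derive(Clone, Debug, PartialEq)]
enum ExecutionState {
    RunningOk,                  // Running(Ok)
    StoppingRequestedAtRuntime, // Stopping(RequestedAtRuntime)
    Removed,
}

/// Hypothetical, reduced workload state with only the fields needed here.
#[derive(Clone, Debug, PartialEq)]
struct WorkloadState {
    workload_name: String,
    agent_name: String,
    execution_state: ExecutionState,
}

/// The rule described above: a stored Stopping(RequestedAtRuntime) wins
/// over an incoming Running(Ok); every other incoming state is accepted.
fn transition(current: &ExecutionState, incoming: ExecutionState) -> ExecutionState {
    match (current, &incoming) {
        (ExecutionState::StoppingRequestedAtRuntime, ExecutionState::RunningOk) => current.clone(),
        _ => incoming,
    }
}

/// Assumed flat state map of one agent, keyed by workload name.
struct StateStore {
    own_agent: String,
    states: HashMap<String, WorkloadState>,
}

impl StateStore {
    fn new(own_agent: &str) -> Self {
        StateStore { own_agent: own_agent.to_string(), states: HashMap::new() }
    }

    /// Records a state reported by any agent (own or remote).
    fn store(&mut self, state: WorkloadState) {
        self.states.insert(state.workload_name.clone(), state);
    }

    /// Computes and stores the state to forward for a local workload.
    /// With `check_agent == false` this reproduces the reported behavior.
    fn update_local(
        &mut self,
        workload_name: &str,
        incoming: ExecutionState,
        check_agent: bool,
    ) -> ExecutionState {
        let previous = self
            .states
            .get(workload_name)
            // Proposed fix: ignore stored states that belong to another agent.
            .filter(|s| !check_agent || s.agent_name == self.own_agent);
        let next = match previous {
            Some(s) => transition(&s.execution_state, incoming),
            None => incoming,
        };
        self.store(WorkloadState {
            workload_name: workload_name.to_string(),
            agent_name: self.own_agent.clone(),
            execution_state: next.clone(),
        });
        next
    }
}
```

In this sketch, an agent_B store that still holds Stopping(RequestedAtRuntime) from agent_A forwards that stale state when its own Running(Ok) arrives and check_agent is false. With check_agent set to true, the foreign entry is ignored and Running(Ok) is forwarded, while a Stopping(RequestedAtRuntime) that agent_B itself reported is still preserved.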

Final result

To be filled by the one closing the issue.

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working. Issue will appear in the change log "Bug Fixes"
