
Conversation

@dan-blanchard dan-blanchard commented Nov 20, 2025

Fixes #8024 by updating filter_pods to sort pods by name (and numeric worker suffix).

I believe the issue in #8024 was caused by inconsistent ordering of pods returned by the Kubernetes API and the lack of sorting for k8s clusters in CloudVmRayResourceHandle.update_cluster_ips.

The Kubernetes API makes no guarantees about the order of returned pods, so without explicit sorting the worker pods could come back in a different order than expected. Because dicts preserve insertion order in Python 3.7+, that incorrect ordering was carried all the way through to SSHConfigHelper.add_cluster.

This PR adds logic to filter_pods to ensure that pods are always sorted by name, with worker pods sorted by their numeric suffix (cluster-head, cluster-worker1, cluster-worker2, etc.).

The reason I did not just enable sorting in CloudVmRayResourceHandle.update_cluster_ips is that it currently sorts by IP address, which may not correspond to the desired pod ordering. It also seems more natural to have filter_pods return pods in a consistent order, which prevents other subtle issues like this one.
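To illustrate why the worker suffix is compared numerically rather than lexicographically (a standalone example for this description, not the code added in this PR), note that a plain string sort places worker10 before worker2:

# Plain lexicographic sort misorders workers once the index reaches 10:
names = ['c-worker2', 'c-worker10', 'c-head', 'c-worker1']
print(sorted(names))
# ['c-head', 'c-worker1', 'c-worker10', 'c-worker2']

# Sorting by a (group, numeric index) tuple keeps the head pod first and
# orders workers by their numeric suffix:
def sort_key(name):
    suffix = name.rsplit('-', 1)[-1]
    if suffix == 'head':
        return (0, 0)
    return (1, int(suffix[len('worker'):]))

print(sorted(names, key=sort_key))
# ['c-head', 'c-worker1', 'c-worker2', 'c-worker10']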

I added tests to verify that filter_pods returns pods in the expected order.

I also set up a 16-node GKE cluster, launched via a simple test template:

resources:
  cpus: 2+
  
num_nodes: 16

setup: |
  echo "Setup on $(hostname)"

run: |
  echo "Running on $(hostname) - Node $SKYPILOT_NODE_RANK"
  echo "Node IP: $(hostname -I | awk '{print $1}')"

I ran sky launch --infra k8s/gke_modern-rhythm-478817-v4_us-central1-c_skypilot-test-cluster test_multinode.yaml, and after the cluster was up I ran the following bash script, which uses the ssh aliases in the config to verify that all of the hostnames matched what was expected:

#!/bin/bash

echo "Testing head node..."
remote_hostname=$(ssh sky-3795-danielblanchard "hostname")
if [[ "$remote_hostname" == *"head" ]]; then
    echo "MATCHED sky-3795-danielblanchard -> $remote_hostname"
else
    echo "FAILED sky-3795-danielblanchard -> $remote_hostname (expected suffix: head)"
fi

for i in {1..15}; do
    alias="sky-3795-danielblanchard-worker$i"
    expected_suffix="worker$i"
    echo "Testing $alias..."
    remote_hostname=$(ssh "$alias" "hostname")
    if [[ "$remote_hostname" == *"$expected_suffix" ]]; then
        echo "MATCHED $alias -> $remote_hostname"
    else
        echo "FAILED $alias -> $remote_hostname (expected suffix: $expected_suffix)"
    fi
done

The output showed that everything matched:

Testing head node...
MATCHED sky-3795-danielblanchard -> sky-3795-danielblanchard-efc0a1e0-head
Testing sky-3795-danielblanchard-worker1...
MATCHED sky-3795-danielblanchard-worker1 -> sky-3795-danielblanchard-efc0a1e0-worker1
Testing sky-3795-danielblanchard-worker2...
MATCHED sky-3795-danielblanchard-worker2 -> sky-3795-danielblanchard-efc0a1e0-worker2
Testing sky-3795-danielblanchard-worker3...
MATCHED sky-3795-danielblanchard-worker3 -> sky-3795-danielblanchard-efc0a1e0-worker3
Testing sky-3795-danielblanchard-worker4...
MATCHED sky-3795-danielblanchard-worker4 -> sky-3795-danielblanchard-efc0a1e0-worker4
Testing sky-3795-danielblanchard-worker5...
MATCHED sky-3795-danielblanchard-worker5 -> sky-3795-danielblanchard-efc0a1e0-worker5
Testing sky-3795-danielblanchard-worker6...
MATCHED sky-3795-danielblanchard-worker6 -> sky-3795-danielblanchard-efc0a1e0-worker6
Testing sky-3795-danielblanchard-worker7...
MATCHED sky-3795-danielblanchard-worker7 -> sky-3795-danielblanchard-efc0a1e0-worker7
Testing sky-3795-danielblanchard-worker8...
MATCHED sky-3795-danielblanchard-worker8 -> sky-3795-danielblanchard-efc0a1e0-worker8
Testing sky-3795-danielblanchard-worker9...
MATCHED sky-3795-danielblanchard-worker9 -> sky-3795-danielblanchard-efc0a1e0-worker9
Testing sky-3795-danielblanchard-worker10...
MATCHED sky-3795-danielblanchard-worker10 -> sky-3795-danielblanchard-efc0a1e0-worker10
Testing sky-3795-danielblanchard-worker11...
MATCHED sky-3795-danielblanchard-worker11 -> sky-3795-danielblanchard-efc0a1e0-worker11
Testing sky-3795-danielblanchard-worker12...
MATCHED sky-3795-danielblanchard-worker12 -> sky-3795-danielblanchard-efc0a1e0-worker12
Testing sky-3795-danielblanchard-worker13...
MATCHED sky-3795-danielblanchard-worker13 -> sky-3795-danielblanchard-efc0a1e0-worker13
Testing sky-3795-danielblanchard-worker14...
MATCHED sky-3795-danielblanchard-worker14 -> sky-3795-danielblanchard-efc0a1e0-worker14
Testing sky-3795-danielblanchard-worker15...
MATCHED sky-3795-danielblanchard-worker15 -> sky-3795-danielblanchard-efc0a1e0-worker15

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Fixes skypilot-org#8024 by updating `filter_pods` to
sort pods by name (and numeric worker suffix).
This ordering is then used downstream
when creating command runners for each pod.
This ensures that the worker numbers used by users
match the actual pod numbering/names in Kubernetes.
@dan-blanchard (Author) commented:

I updated the description to show the end-to-end tests I ran, which confirm that this works with a 16-node GKE cluster.

@dan-blanchard (Author) commented:

I ran all of the Kubernetes-related smoke tests via python -m pytest tests/smoke_tests -m kubernetes --kubernetes -v, and all of them passed except test_kubernetes_context_failover (because I didn't have a second cluster set up).

@Michaelvll (Collaborator) left a comment:

Thanks @dan-blanchard! The PR looks good to me with a minor comment below : )

# worker2, worker3, ...) even when Kubernetes API returns them in
# arbitrary order. This works even if there were somehow pod names other
# than head/worker ones, but that may be overkill.
def get_pod_sort_key(pod: V1Pod) -> Tuple[int, Union[int, str]]:
Collaborator:

Hmm, this Union[int, str] seems a bit concerning, i.e. what if some pods return a str but others return an int? Would we get the following error during sorting?

>>> a = (0, 1)
>>> b = (0, 'my-worker')
>>> a > b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'int' and 'str'

Author:

The type of the second value in the tuple depends on the value of the first, so this would never happen with this function. We only ever have int values when the first item is 1. I've updated the type signature to be more precise about this fact (although a bit uglier to read).
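For illustration, one way to make that dependence explicit is a union of tuple types keyed by Literal first elements (a hypothetical sketch, not necessarily the exact annotation merged in the PR):

from typing import Literal, Tuple, Union

# Hypothetical sketch: the int-typed second element can only occur with a
# first element of 1, so mixed int/str comparisons never happen during
# sorting (tuples with different first elements never compare their
# second elements).
PodSortKey = Union[Tuple[Literal[1], int],  # '-workerN' pods, numeric index
                   Tuple[Literal[0], str],  # '-head' pod
                   Tuple[Literal[2], str]]  # anything else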

Comment on lines 3198 to 3206
if '-worker' in name:
    try:
        return (1, int(name.split('-worker')[-1]))
    except (ValueError, IndexError):
        return (2, name)
elif '-head' in name:
    return (0, name)
else:
    return (2, name)
Collaborator:

Actually, this may fail to keep the head node first if we have a cluster named my-workers?

Author:

That's a good point. I've updated the code so that we now look only at the name's final suffix and check what it starts with. I also updated the tests to cover this specific case, and I verified that the new tests failed before the fix and pass now.
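For reference, the suffix-based check described above looks roughly like this (a sketch assuming pod names of the form <cluster>-...-head / <cluster>-...-workerN; the merged code may differ in its details):

def get_pod_sort_key(name):
    # Only look at the final '-'-separated component, so a cluster named
    # 'my-workers' is not misclassified as a worker pod.
    suffix = name.rsplit('-', 1)[-1]
    if suffix.startswith('head'):
        return (0, name)
    if suffix.startswith('worker'):
        try:
            return (1, int(suffix[len('worker'):]))
        except ValueError:
            pass
    return (2, name)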

@Michaelvll (Collaborator) commented:

/smoke-test --kubernetes
/smoke-test --kubernetes --remote-server


Successfully merging this pull request may close these issues: [Kubernetes] Pod name and worker index mismatch
