-
Notifications
You must be signed in to change notification settings - Fork 857
[k8s] Fix pod name and worker index mismatch #8037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[k8s] Fix pod name and worker index mismatch #8037
Conversation
Fixes skypilot-org#8024 by updating `filter_pods` to sort pods by name (and numeric worker suffix). This ordering is then used downstream when creating command runners for each pod. This ensures that the worker numbers used by users match the actual pod numbering/names in Kubernetes.
|
I updated the description to show that I ran end-to-end tests that showed that this works with a 16-node GKE cluster. |
|
I ran all of the kubernetes-related smoke tests via |
Michaelvll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dan-blanchard! The PR looks good to me with a minor comment below : )
sky/provision/kubernetes/utils.py
Outdated
| # worker2, worker3, ...) even when Kubernetes API returns them in | ||
| # arbitrary order. This works even if there were somehow pod names other | ||
| # than head/worker ones, but that may be overkill. | ||
| def get_pod_sort_key(pod: V1Pod) -> Tuple[int, Union[int, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this Union[int, str] seems a bit concerning, i.e. what if some pods returns a str but some returns a int? We will get the following error during sorting?
>>> a = (0, 1)
>>> b = (0, 'my-worker')
>>> a > b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'int' and 'str'There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of the second value in the tuple depends on the value of the first, so this would never happen with this function. We only ever have int values when the first item is 1. I've updated the type signature to be more precise about this fact (although a big uglier to read).
sky/provision/kubernetes/utils.py
Outdated
| if '-worker' in name: | ||
| try: | ||
| return (1, int(name.split('-worker')[-1])) | ||
| except (ValueError, IndexError): | ||
| return (2, name) | ||
| elif '-head' in name: | ||
| return (0, name) | ||
| else: | ||
| return (2, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, this may fail to keep head node as the first if we have a cluster name my-workers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I've update the code so that we look only at the final suffixes for the name now and check what those suffixes start with. I also updated the tests so that we check for this specific case, and I verified that before the fix the new tests failed, whereas they pass now.
|
/smoke-test --kubernetes |
Fixes #8024 by updating
filter_podsto sort pods by name (and numeric worker suffix).I believe the issue in #8024 was caused by inconsistent ordering of pods returned by the Kubernetes API and the lack of sorting for k8s clusters in
CloudVmRayResourceHandle.update_cluster_ips.The Kubernetes API makes no guarantees on the order of pods returned, so without explicit sorting, the ordering of pods could end up with the worker pods in a different order than expected. Because insertion order is preserved with dicts in Python 3.7+, this ordering was preserved all the way through to
SSHConfigHelper.add_cluster.This PR add logic to
filter_podsto ensure that pods are always sorted by name, with worker pods sorted by their numeric suffix (cluster-head,cluster-worker1,cluster-worker2, etc.).The reason I did not just enable sorting in
CloudVmRayResourceHandle.update_cluster_ipsis that that currently sorts based on IP addresses, which may not always correspond to the desired pod ordering, and it seems more natural to havefilter_podsreturn pods in a consistent order to prevent other subtle issues like this one.I added tests to verify that
filter_podsreturns pods in the expected order.I also setup a 16-node GKE cluster that I launched via simple test template:
I ran
sky launch --infra k8s/gke_modern-rhythm-478817-v4_us-central1-c_skypilot-test-cluster test_multinode.yamland after the cluster was up I ran the following bash script to use thesshaliases in the config to verify that all of the hostnames matched what expected:The output of which showed everything was good:
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)