[Core] Ensure --retry-until-up Tries Launch After Checking All Zones
#8079
Conversation
We should make sure we have a smoke test for this also.

I don't see an easy path to adding a smoke test, since it would require having all of the zones of a cloud fail to provision and then work on a subsequent try. Open to suggestions, though.

/quicktest-core

Added a smoke test!
Michaelvll
left a comment
cc'ing @aylei here as well
/smoke-test -k test_launch_retry_until_up
aylei
left a comment
Thanks @lloyd-brown ! Good catch!
```python
'launch-retry-until-up',
[
    # Launch something we'll never get.
    f's=$(timeout {timeout} sky launch -c {cluster_name} --gpus B200:8 --infra aws echo hi -y -d --retry-until-up --use-spot 2>&1 || true) && '
```
I'm wondering why this works: if we will never get B200:8, wouldn't we block on this command?
The command will time out, and we will get the logs back; we just parse them to figure out the outcome!
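The pattern is roughly the sketch below (hypothetical helper names, assuming a Linux environment with coreutils `timeout` and `bash`; this is not the actual test code): let `timeout` kill the never-succeeding launch, swallow the non-zero exit with `|| true`, and assert on the captured logs rather than on the exit status.

```python
import shlex
import subprocess

def launch_logs(cmd: str, timeout_s: int) -> str:
    """Hypothetical helper: run a command that is expected to hang,
    returning whatever output it produced before `timeout` killed it."""
    # `|| true` swallows `timeout`'s non-zero exit status so the shell
    # pipeline stays green and we can inspect the partial logs instead.
    wrapped = (f's=$(timeout {timeout_s} bash -c {shlex.quote(cmd)} '
               f'2>&1 || true) && echo "$s"')
    return subprocess.run(['bash', '-c', wrapped],
                          capture_output=True, text=True).stdout

# The inner command prints a retry marker, then blocks "forever".
logs = launch_logs('echo "Retrying provisioning..."; sleep 60', 2)
assert 'Retrying provisioning' in logs  # outcome parsed from the logs
```

In the real test the inner command is the `sky launch ... --retry-until-up` invocation, and the assertions check for the retry messages in the captured output.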
Ah, I see, love this idea!
/smoke-test -k test_launch_retry_until_up

/quicktest-core
cg505
left a comment
Thanks @lloyd-brown!
[Core] Ensure --retry-until-up Tries Launch After Checking All Zones (#8079)

* Retry if launch failed.
* Fix at API server level.
* Add smoke test.
* Remove unnecessary statements.
* Assert instead of log.
Before this PR we had logic to reschedule a request that raised an `exceptions.ExecutionRetryableError` when launching failed in every zone and `--retry-until-up` was passed, but because the request status was `RUNNING` instead of `PENDING`, we did no further processing on the request. This bug was introduced in https://github.com/skypilot-org/skypilot/pull/7511/files, since before that point we wouldn't check whether the task was `PENDING`.

Now, when handling the retry, we reset the request status to `PENDING`, which enables further processing. I verified this by adding some code locally to fail provisioning on the first attempt in every zone and ensuring we eventually launch. I also added a unit test that makes sure that requests failing with this error (1) get put back onto the queue, and (2) on a subsequent get call are submitted to an executor. I also added a smoke test that attempts to launch spot instances that will never succeed and ensures we keep trying to launch.
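Schematically, the fix looks like the sketch below. All names here (`Request`, the queue, the string statuses) are made up for illustration and do not mirror SkyPilot's actual internals: when the retryable error surfaces and `--retry-until-up` is set, the status must go back to `PENDING` before re-queuing, since the scheduler only re-submits `PENDING` requests.

```python
from dataclasses import dataclass
from queue import Queue

PENDING, RUNNING = 'PENDING', 'RUNNING'

class ExecutionRetryableError(Exception):
    """Stand-in for the error raised when every zone fails to provision."""

@dataclass
class Request:
    request_id: str
    status: str = PENDING

request_queue: Queue = Queue()

def handle(request: Request, retry_until_up: bool) -> None:
    request.status = RUNNING
    try:
        # Simulate provisioning failing in all zones.
        raise ExecutionRetryableError('no capacity in any zone')
    except ExecutionRetryableError:
        if retry_until_up:
            # The fix: reset to PENDING before re-queuing. Leaving the
            # status as RUNNING meant the scheduler never picked it up.
            request.status = PENDING
            request_queue.put(request)

req = Request('r1')
handle(req, retry_until_up=True)
retried = request_queue.get_nowait()
assert retried is req and retried.status == PENDING
```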
Tested (run the relevant ones):

- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)