@lloyd-brown
Collaborator

@lloyd-brown lloyd-brown commented Nov 24, 2025

Before this PR, we had logic to reschedule a request if it raised an exceptions.ExecutionRetryableError, which happens when launching fails in every zone and --retry-until-up was passed. However, because the request status was RUNNING instead of PENDING, we did no further processing on the request. This bug was introduced in https://github.com/skypilot-org/skypilot/pull/7511/files; before that change we did not check whether the task was PENDING.

Now, when handling the retry, we reset the request status to PENDING, which allows the request to be processed again. I verified this by adding some code locally that fails provisioning on the first attempt in every zone and confirming that we eventually launch. I also added a unit test that makes sure that requests that fail with this error (1) are put back onto the queue and (2) are submitted to an executor on a subsequent get call.
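For illustration, here is a minimal sketch of the requeue behaviour described above. It is not the actual SkyPilot code: only the names ExecutionRetryableError, PENDING and RUNNING come from this PR, and everything else (the `Request` class, `process_one`, `run_until_done`, the `reset_to_pending` flag) is hypothetical and assumed for the example.

```python
import enum
import queue
from dataclasses import dataclass


class RequestStatus(enum.Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"


class ExecutionRetryableError(Exception):
    """Stand-in for the retryable error named in this PR."""


@dataclass
class Request:
    # Hypothetical request record; the real schema is not shown in this PR.
    request_id: str
    status: RequestStatus = RequestStatus.PENDING
    attempts: int = 0


def launch(request, fail_first_n):
    """Pretend launch: fails in 'all zones' for the first fail_first_n attempts."""
    request.attempts += 1
    if request.attempts <= fail_first_n:
        raise ExecutionRetryableError("launch failed in all zones")


def process_one(request_queue, fail_first_n, reset_to_pending=True):
    """Take one request off the queue and run it if it is PENDING."""
    request = request_queue.get()
    if request.status is not RequestStatus.PENDING:
        # Mirrors the bug: a requeued request still marked RUNNING is skipped.
        return
    request.status = RequestStatus.RUNNING
    try:
        launch(request, fail_first_n)
    except ExecutionRetryableError:
        if reset_to_pending:
            # The fix sketched here: reset the status before requeueing.
            request.status = RequestStatus.PENDING
        request_queue.put(request)
        return
    request.status = RequestStatus.SUCCEEDED


def run_until_done(request, fail_first_n, reset_to_pending=True, max_iterations=10):
    """Drive the queue until it is empty or max_iterations is reached."""
    request_queue = queue.Queue()
    request_queue.put(request)
    for _ in range(max_iterations):
        if request_queue.empty():
            break
        process_one(request_queue, fail_first_n, reset_to_pending)
    return request
```

With `reset_to_pending=True` a request that fails twice is retried and succeeds on the third attempt; with `reset_to_pending=False` it is requeued once, skipped, and left in RUNNING after a single attempt.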

I also added a smoke test that attempts to launch spot instances that will never be provisioned and ensures that we keep trying to launch.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cg505
Collaborator

cg505 commented Nov 24, 2025

We should make sure we have a smoke test for this also

@lloyd-brown lloyd-brown marked this pull request as ready for review November 24, 2025 21:33
@lloyd-brown
Collaborator Author

> We should make sure we have a smoke test for this also

I don't see an easy path to adding a smoke test, since it would require every zone of a cloud to fail to provision and then succeed on a subsequent try. Open to suggestions, though.

@lloyd-brown lloyd-brown requested review from aylei and cg505 November 24, 2025 21:36
@lloyd-brown
Collaborator Author

/quicktest-core

@lloyd-brown
Collaborator Author

> We should make sure we have a smoke test for this also

Added a smoke test!

Collaborator

@Michaelvll Michaelvll left a comment

cc'ing @aylei here as well

@lloyd-brown
Collaborator Author

/smoke-test -k test_launch_retry_until_up

@cg505 cg505 added this to the v0.11.0 milestone Nov 25, 2025
@lloyd-brown lloyd-brown changed the title from "[Core] Fix --retry-until-up" to "[Core] Ensure --retry-until-up Tries Launch After Checking All Zones" Nov 25, 2025
Collaborator

@aylei aylei left a comment

Thanks @lloyd-brown ! Good catch!

'launch-retry-until-up',
[
# Launch something we'll never get.
f's=$(timeout {timeout} sky launch -c {cluster_name} --gpus B200:8 --infra aws echo hi -y -d --retry-until-up --use-spot 2>&1 || true) && '
Collaborator

I'm wondering why this works. If we will never get B200:8, wouldn't we block on this command?

Collaborator Author

The command will time out, we get the logs back, and we just parse them to figure out the outcome!
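As a rough sketch of that idea in Python (the smoke test itself uses the shell `timeout` command): run a command that never finishes, stop it after a deadline, and inspect whatever it printed. The child script, the `attempt N` log format, and the helper names `run_with_timeout` and `count_attempts` are all made up for this example and are not SkyPilot output or APIs.

```python
import subprocess
import sys

# Hypothetical stand-in for a launch that retries forever and never succeeds.
NEVER_SUCCEEDS = (
    "import time\n"
    "n = 0\n"
    "while True:\n"
    "    n += 1\n"
    "    print(f'attempt {n}', flush=True)\n"
    "    time.sleep(0.1)\n"
)


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd, kill it after timeout_seconds, and return everything it printed."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Expected path: the command never finishes on its own.
        proc.kill()
        output, _ = proc.communicate()  # collects the output produced so far
    return output


def count_attempts(output):
    """Count log lines that look like a launch attempt (made-up format)."""
    return sum(1 for line in output.splitlines() if line.startswith("attempt "))
```

For example, `count_attempts(run_with_timeout([sys.executable, "-c", NEVER_SUCCEEDS], 2.0))` should report more than one attempt, which is the signal that the command kept retrying rather than giving up after the first failure.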

Collaborator

Ah, I see, love this idea!

@lloyd-brown
Collaborator Author

/smoke-test -k test_launch_retry_until_up

@lloyd-brown
Collaborator Author

/quicktest-core

Collaborator

@cg505 cg505 left a comment

Thanks @lloyd-brown!

@lloyd-brown lloyd-brown merged commit b50e80e into master Nov 25, 2025
21 of 22 checks passed
@lloyd-brown lloyd-brown deleted the lloyd/fix-retry-until-up branch November 25, 2025 01:55
cg505 pushed a commit that referenced this pull request Nov 25, 2025
#8079)

* Retry if launch failed.

* Fix at API server level.

* Add smoke test.

* Remove unnecessary statements.

* Assert instead of log.