[Feature] Add SkyPilot examples #422
/gemini review
Code Review
This pull request introduces first-class support for SkyPilot, enabling AReaL experiments to be run on cloud and Kubernetes infrastructure. The changes include a new `SkyPilotLauncherConfig`, the `SkyPilotLauncher` implementation, and extensive documentation and examples.
The overall implementation is solid and follows SkyPilot's best practices. The new launcher is well-structured, handling cluster provisioning, job submission, and state management correctly. The documentation is also comprehensive and will be very helpful for users.
I've found a few issues that should be addressed:
- There are hardcoded network ports in the launcher, which could cause conflicts.
- There's a bug in the calculation of trainer nodes, leading to incorrect resource allocation.
- The example `ray_cluster.yaml` and its corresponding documentation contain a shell script with syntax errors and a logic bug that would cause worker nodes to terminate prematurely.
Addressing these points will improve the robustness and correctness of the SkyPilot integration. Great work on adding this powerful feature!
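As an illustration of the hardcoded-port point above, here is a minimal Python sketch of one common alternative: asking the operating system for an unused port instead of fixing a number in code. The helper name `find_free_port` is hypothetical and is not taken from the launcher in this PR; how the launcher actually selects ports is not shown in this conversation.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port that was unused at the time of the call.

    Binding to port 0 lets the OS pick any free port. The socket is closed
    before returning, so another process could still claim the port before
    the caller binds to it; callers should be prepared to retry.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"Free port chosen by the OS: {find_free_port()}")
```

The trade-off is that a dynamically chosen port has to be communicated to whatever connects to it, whereas a fixed port does not.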
examples/skypilot/README.md
Outdated
future launches. | ||
```bash | ||
sky volumes apply storage-volume.yaml |
You need to make it clear where the user should execute the steps in this README from.
Here it assumes `examples/skypilot`, but later it assumes the root of the repo.
I think we could make contents about cloud buckets and volumes shorter, and refer to SkyPilot cloud bucket and volume guide.
Also, I have checked other places to ensure that users can execute these commands in the root of the repo.
examples/skypilot/README.md
Outdated
/storage: areal-shared-storage | ||
setup: | | ||
pip3 install -e . |
Maybe worth creating a virtual env instead of installing with pip as root
Using the AReaL repo root directory as the workdir, together with our image, ensures that we do not need `pip install -e .` (or any other installation) before launching the experiment. Therefore the `setup` section here is removed.
examples/skypilot/README.md
Outdated
|
||
```bash | ||
export WANDB_API_KEY=<your-wandb-api-key> | ||
sky launch -c areal --secret WANDB_API_KEY examples/skypilot/ray_cluster.yaml |
This command fails for me with this:
(head, rank=0, pid=4232) Executing training script on head node...
(worker1, rank=1, pid=3359, ip=10.170.27.163) Node setup complete for rank 1.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774 WARNING services.py:394 -- Found multiple active Ray instances: {'10.156.61.243:6380', '10.156.61.243:6379'}. Connecting to latest cluster at 10.156.61.243:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774 INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.156.61.243:6379...
(head, rank=0, pid=4232) 2025-10-11 02:34:06,785 INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(head, rank=0, pid=4232) Traceback (most recent call last):
(head, rank=0, pid=4232) File "<frozen runpy>", line 198, in _run_module_as_main
(head, rank=0, pid=4232) File "<frozen runpy>", line 88, in _run_code
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/launcher/ray.py", line 591, in <module>
(head, rank=0, pid=4232) main()
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/launcher/ray.py", line 330, in main
(head, rank=0, pid=4232) config, _ = parse_cli_args(sys.argv[1:])
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/api/cli_args.py", line 1308, in parse_cli_args
(head, rank=0, pid=4232) cfg = hydra_compose(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/compose.py", line 38, in compose
(head, rank=0, pid=4232) cfg = gh.hydra.compose_config(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 594, in compose_config
(head, rank=0, pid=4232) cfg = self.config_loader.load_configuration(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
(head, rank=0, pid=4232) return self._load_configuration_impl(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 244, in _load_configuration_impl
(head, rank=0, pid=4232) parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 228, in _parse_overrides_and_create_caching_repo
(head, rank=0, pid=4232) parsed_overrides = parser.parse_overrides(overrides=overrides)
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/core/override_parser/overrides_parser.py", line 99, in parse_overrides
(head, rank=0, pid=4232) raise OverrideParseException(
(head, rank=0, pid=4232) hydra.errors.OverrideParseException: mismatched input '=' expecting <EOF>
(head, rank=0, pid=4232) See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
(head, rank=0, pid=4232) Node setup complete for rank 0.
It is caused by `+trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY"`. This is a limitation of Hydra, which does not allow `=` to appear in the command-line arguments. Currently, users can only set environment variables in the YAML config file. We are looking for workarounds that let users set environment variables on the command line.
For now, I think we should just remove `WANDB_API_KEY` from the examples to keep them clear and runnable.
examples/skypilot/README.md
Outdated
--config examples/skypilot/gsm8k_grpo_ray.yaml \ | ||
experiment_name=<your experiment name> \ | ||
trial_name=<your trial name> \ | ||
trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY" |
This is wrong and needs to be replaced with `'+launcher.trainer_env_vars="WANDB_API_KEY='$WANDB_API_KEY'"'`, otherwise it fails.
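The quoting suggested in this comment has two layers: inner double quotes that are meant to reach Hydra, and outer single quotes that stop the shell from stripping them. The Python sketch below only shows how such an argument can be assembled safely; the helper name `quoted_override` is hypothetical, and whether Hydra then accepts the value is the reviewer's claim, not something verified here.

```python
import shlex


def quoted_override(key: str, value: str) -> str:
    """Build a Hydra-style append override whose value is double-quoted.

    Assumption: the value itself contains no double quotes or backslashes,
    so no escaping inside the quotes is attempted.
    """
    return f'+{key}="{value}"'


if __name__ == "__main__":
    # "abc" stands in for a real API key read from the environment.
    override = quoted_override("launcher.trainer_env_vars", "WANDB_API_KEY=abc")
    # shlex.quote adds the outer single quotes needed on a POSIX shell.
    print(shlex.quote(override))
```

When the command is launched from Python with an argument list (for example via `subprocess.run([...])`), the outer shell quoting is unnecessary because no shell parses the string.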
Same as above.
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with | ||
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with | ||
SkyPilot. For more options and details for SkyPilot, see the official | ||
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html). | ||
|
We need to link to this page on how to configure K8s to work with SkyPilot: https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html
Added a reference in the `Kubernetes setup` section above.
examples/skypilot/README.md
Outdated
resolve and distributed checkpointing. The following guideline shows how to use SkyPilot | ||
volumes to setup a high-performance shared storage. | ||
|
||
1. **Define the volume.** Create a YAML file describing the volume you want SkyPilot to |
While using volumes is fine, this is not required. And using cloud buckets could be simpler: https://docs.skypilot.co/en/latest/reference/storage.html
Added cloud bucket usage in the example.
docs/tutorial/installation.md
Outdated
``` | ||
|
||
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with | ||
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with |
This file doesn't exist
Here we changed this link to https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md to ensure the link is available in our documentation pages after this PR is merged into main.
docs/tutorial/installation.md
Outdated
```bash | ||
# Ensure your kubeconfig is at ~/.kube/config | ||
mkdir -p ~/.kube | ||
cp /path/to/kubeconfig ~/.kube/config |
It's unclear where `/path/to/kubeconfig` comes from.
Removed this and referred to skypilot k8s setup guide instead.
examples/skypilot/README.md
Outdated
```yaml | ||
resources: | ||
accelerators: H100:8 | ||
image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.3 |
It looks like this image is configured to use a custom PyPI index, https://pypi.antfin-inc.com/simple.
It doesn't work for me. Here's what I see:
(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) Obtaining file:///root/sky_workdir
(setup pid=4496) Installing build dependencies: started
(setup pid=3465, ip=10.170.27.38) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=3465, ip=10.170.27.38) Obtaining file:///root/sky_workdir
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: started
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: finished with status 'error'
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496)
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> [8 lines of output]
(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80fda300>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca2d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca480>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca5d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca780>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) ERROR: Could not find a version that satisfies the requirement setuptools>=61.0 (from versions: none)
(setup pid=4496) ERROR: No matching distribution found for setuptools>=61.0
(setup pid=4496) [end of output]
(setup pid=4496)
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496)
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> See above for output.
(setup pid=4496)
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Job 1's setup failed with return code list: [137, 1]
✓ Job finished (status: FAILED_SETUP).
command terminated with exit code 100
Running the experiment no longer requires any installation. However, our public image is still configured with the custom PyPI index. We will track this and fix it in our next image release.
examples/skypilot/README.md
Outdated
echo "Starting Ray head node..." | ||
ray start --head --port=6379 | ||
while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do |
The `ray nodes` command doesn't exist:
(head, rank=0, pid=4484) Usage: ray [OPTIONS] COMMAND [ARGS]...
(head, rank=0, pid=4484) Try 'ray --help' for help.
(head, rank=0, pid=4484)
(head, rank=0, pid=4484) Error: No such command 'nodes'.
Fixed this by using `ray status` instead.
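The head node's wait loop discussed in this thread amounts to polling a node count until it reaches the expected size. A minimal Python sketch of that pattern with a timeout follows; `wait_for_nodes` and the injected `count_nodes` callable are hypothetical names for illustration, and this is not the script used in `ray_cluster.yaml`.

```python
import time
from typing import Callable


def wait_for_nodes(
    count_nodes: Callable[[], int],
    expected: int,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> int:
    """Poll count_nodes() until it reports at least `expected` nodes.

    Returns the observed count, or raises TimeoutError once the deadline
    passes, so a missing worker does not block the head node forever.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        alive = count_nodes()
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {alive}/{expected} nodes joined in {timeout_s}s")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in counter for demonstration: pretends a second node joins on
    # the third poll. With Ray installed and connected, one option (an
    # assumption, not taken from this PR) is to count the entries of
    # ray.nodes() whose "Alive" field is true.
    fake_counts = iter([1, 1, 2])
    print(wait_for_nodes(lambda: next(fake_counts), expected=2, interval_s=0.0))
```

Injecting the counter keeps the loop independent of how nodes are counted, which avoids parsing CLI output whose format may change between Ray versions.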
examples/skypilot/README.md
Outdated
echo "Executing training script on head node..." | ||
python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \ | ||
--config examples/skypilot/gsm8k_grpo_ray.yaml \ | ||
experiment_name=<your experiment name> \ |
Use `envs` and `secrets` to set these as env vars:
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name
secrets:
  WANDB_API_KEY: null
and then:
experiment_name=$EXPERIMENT_NAME \
trial_name=$TRIAL_NAME \
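The same idea can be sketched in Python: read `EXPERIMENT_NAME` and `TRIAL_NAME` from the environment and turn them into launcher overrides, failing early if either is unset. The function name `launcher_args` is hypothetical; the command and config paths are copied from the snippet under review.

```python
import os
from typing import List, Mapping, Optional


def launcher_args(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the launcher command from EXPERIMENT_NAME and TRIAL_NAME."""
    env = os.environ if env is None else env
    required = ("EXPERIMENT_NAME", "TRIAL_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail before launching rather than passing empty overrides.
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return [
        "python3", "-m", "areal.launcher.ray", "examples/math/gsm8k_grpo.py",
        "--config", "examples/skypilot/gsm8k_grpo_ray.yaml",
        f"experiment_name={env['EXPERIMENT_NAME']}",
        f"trial_name={env['TRIAL_NAME']}",
    ]


if __name__ == "__main__":
    demo_env = {"EXPERIMENT_NAME": "my-areal-experiment", "TRIAL_NAME": "my-trial-name"}
    print(" ".join(launcher_args(demo_env)))
```

Passing the mapping explicitly makes the helper easy to exercise without modifying the real process environment.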
Done.
I think this PR is a bit raw.
The training job doesn't run due to incorrect syntax in several places:
- non-existent commands
- incorrect parameters
- etc.
Thanks for your review! We have GCP access now, so we will be able to test and debug this PR ourselves. We will start fixing this PR right away.
I have separated the SkyPilot examples and launcher into 2 separate PRs for a clearer view: #464. The SkyPilot launcher is currently under testing and will hopefully be ready next week. Currently the examples are tested on GCP, using two 1xA100 instances.
This pull request adds comprehensive support and documentation for running AReaL experiments with SkyPilot on cloud and Kubernetes infrastructures. It introduces example YAML configurations for both single-node and multi-node experiments, a detailed README for SkyPilot usage, and step-by-step installation instructions. These changes make it much easier to launch distributed AReaL experiments on GCP or Kubernetes using SkyPilot.
SkyPilot Integration and Documentation
- `docs/tutorial/installation.md` with step-by-step instructions for installing and verifying SkyPilot, including GCP and Kubernetes setup guidance.
- `examples/skypilot/README.md` providing detailed usage examples, explanations, and command lines for running AReaL experiments with SkyPilot, covering both single-node and multi-node setups.
Example Configurations for SkyPilot
- `examples/skypilot/local.yaml` as a template for launching a single-node AReaL experiment with SkyPilot on GCP, specifying resources, storage, and launch commands.
- `examples/skypilot/ray_cluster.yaml` for launching a multi-node Ray cluster with SkyPilot, including setup for distributed training and shared storage.
- `examples/skypilot/gsm8k_grpo_ray.yaml` as a sample AReaL experiment configuration for Ray-based distributed training, detailing experiment parameters and resource allocation.
UPDATE: Separated examples and launcher into 2 PRs: #464