Conversation

Collaborator

@nuzant nuzant commented Oct 9, 2025

This pull request adds comprehensive support and documentation for running AReaL experiments with SkyPilot on cloud and Kubernetes infrastructures. It introduces example YAML configurations for both single-node and multi-node experiments, a detailed README for SkyPilot usage, and step-by-step installation instructions. These changes make it much easier to launch distributed AReaL experiments on GCP or Kubernetes using SkyPilot.

SkyPilot Integration and Documentation

  • Added a new section to docs/tutorial/installation.md with step-by-step instructions for installing and verifying SkyPilot, including GCP and Kubernetes setup guidance.
  • Created examples/skypilot/README.md providing detailed usage examples, explanations, and command lines for running AReaL experiments with SkyPilot, covering both single-node and multi-node setups.

Example Configurations for SkyPilot

  • Added examples/skypilot/local.yaml as a template for launching a single-node AReaL experiment with SkyPilot on GCP, specifying resources, storage, and launch commands.
  • Added examples/skypilot/ray_cluster.yaml for launching a multi-node Ray cluster with SkyPilot, including setup for distributed training and shared storage.
  • Added examples/skypilot/gsm8k_grpo_ray.yaml as a sample AReaL experiment configuration for Ray-based distributed training, detailing experiment parameters and resource allocation.

UPDATE: Separated examples and launcher into 2 PRs: #464

@nuzant nuzant changed the title Mzy/skypilot [Feature] Add SkyPilot launcher and examples Oct 9, 2025
Collaborator Author

nuzant commented Oct 9, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces first-class support for SkyPilot, enabling AReaL experiments to be run on cloud and Kubernetes infrastructure. The changes include a new SkyPilotLauncherConfig, the SkyPilotLauncher implementation, and extensive documentation and examples.

The overall implementation is solid and follows SkyPilot's best practices. The new launcher is well-structured, handling cluster provisioning, job submission, and state management correctly. The documentation is also comprehensive and will be very helpful for users.

I've found a few issues that should be addressed:

  • There are hardcoded network ports in the launcher, which could cause conflicts.
  • There's a bug in the calculation of trainer nodes, leading to incorrect resource allocation.
  • The example ray_cluster.yaml and its corresponding documentation contain a shell script with syntax errors and a logic bug that would cause worker nodes to terminate prematurely.

Addressing these points will improve the robustness and correctness of the SkyPilot integration. Great work on adding this powerful feature!

future launches.
```bash
sky volumes apply storage-volume.yaml
```

Please make it clear which directory the user should run the steps in this README from.
This step assumes examples/skypilot, but later steps assume the root of the repo.

Collaborator Author

I think we could shorten the content about cloud buckets and volumes and refer to the SkyPilot cloud bucket and volume guides instead.

Also, I have checked other places to ensure that users can execute these commands in the root of the repo.
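
One way to make the expected working directory explicit is a small guard placed before the commands. The sketch below assumes the commands are meant to run from the repo root and uses examples/skypilot/ray_cluster.yaml (a path added in this PR) as the marker file. The function name require_repo_root is hypothetical and not part of the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the current directory looks like
# the AReaL repo root, judged by a file that this PR adds.
require_repo_root() {
  local marker="examples/skypilot/ray_cluster.yaml"
  if [ ! -f "$marker" ]; then
    echo "Run this from the AReaL repo root (missing $marker)." >&2
    return 1
  fi
}

# Usage: guard a launch command from the README.
# require_repo_root && sky launch -c areal examples/skypilot/ray_cluster.yaml
```

The guard only checks for one file, so it is a convenience rather than a guarantee that the rest of the tree is intact.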

  /storage: areal-shared-storage
setup: |
  pip3 install -e .

Maybe worth creating a virtual env instead of installing with pip as root

Collaborator Author

Using the AReaL repo root directory as the workdir, together with our image, means we do not need pip install -e . (or any other installation) before launching the experiment. The setup section has therefore been removed.


```bash
export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c areal --secret WANDB_API_KEY examples/skypilot/ray_cluster.yaml
```

This command fails for me with this:

(head, rank=0, pid=4232) Executing training script on head node...
(worker1, rank=1, pid=3359, ip=10.170.27.163) Node setup complete for rank 1.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774        WARNING services.py:394 -- Found multiple active Ray instances: {'10.156.61.243:6380', '10.156.61.243:6379'}. Connecting to latest cluster at 10.156.61.243:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774        INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.156.61.243:6379...
(head, rank=0, pid=4232) 2025-10-11 02:34:06,785        INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(head, rank=0, pid=4232) Traceback (most recent call last):
(head, rank=0, pid=4232)   File "<frozen runpy>", line 198, in _run_module_as_main
(head, rank=0, pid=4232)   File "<frozen runpy>", line 88, in _run_code
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/launcher/ray.py", line 591, in <module>
(head, rank=0, pid=4232)     main()
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/launcher/ray.py", line 330, in main
(head, rank=0, pid=4232)     config, _ = parse_cli_args(sys.argv[1:])
(head, rank=0, pid=4232)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/api/cli_args.py", line 1308, in parse_cli_args
(head, rank=0, pid=4232)     cfg = hydra_compose(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/compose.py", line 38, in compose
(head, rank=0, pid=4232)     cfg = gh.hydra.compose_config(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 594, in compose_config
(head, rank=0, pid=4232)     cfg = self.config_loader.load_configuration(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
(head, rank=0, pid=4232)     return self._load_configuration_impl(
(head, rank=0, pid=4232)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 244, in _load_configuration_impl
(head, rank=0, pid=4232)     parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
(head, rank=0, pid=4232)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 228, in _parse_overrides_and_create_caching_repo
(head, rank=0, pid=4232)     parsed_overrides = parser.parse_overrides(overrides=overrides)
(head, rank=0, pid=4232)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/core/override_parser/overrides_parser.py", line 99, in parse_overrides
(head, rank=0, pid=4232)     raise OverrideParseException(
(head, rank=0, pid=4232) hydra.errors.OverrideParseException: mismatched input '=' expecting <EOF>
(head, rank=0, pid=4232) See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
(head, rank=0, pid=4232) Node setup complete for rank 0.

Collaborator Author

This is caused by +trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY". It is a limitation of Hydra's override grammar, which rejects a bare = inside a command-line override value. Currently, users can only set environment variables in the YAML config file. We are looking for a workaround that lets users set them on the command line.

For now, I think we should just remove WANDB_API_KEY from the examples to keep them clear and runnable.
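
The parse error in the traceback can be traced to what the shell hands to the launcher. A minimal sketch, assuming a placeholder key value: the shell strips the double quotes in the first form, so the program receives a token with two bare = signs, while nested quoting (the shape suggested elsewhere in this review) keeps the double quotes inside the argument. Whether Hydra then accepts the second token is as reported by the reviewer and is not verified here.

```bash
#!/usr/bin/env bash
# Placeholder value for illustration only; not a real key.
WANDB_API_KEY="dummy-key-123"

# Form from the original example: the shell removes the double quotes,
# leaving a second unquoted '=' inside the override value.
plain=$(printf '%s' trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY")

# Nested quoting: single quotes protect the inner double quotes, so they
# survive and reach the program as part of the argument. The variable is
# double-quoted here (a small change from the review suggestion) so the
# shell does not word-split its value.
quoted=$(printf '%s' '+launcher.trainer_env_vars="WANDB_API_KEY='"$WANDB_API_KEY"'"')

echo "$plain"   # trainer_env_vars=WANDB_API_KEY=dummy-key-123
echo "$quoted"  # +launcher.trainer_env_vars="WANDB_API_KEY=dummy-key-123"
```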

--config examples/skypilot/gsm8k_grpo_ray.yaml \
experiment_name=<your experiment name> \
trial_name=<your trial name> \
trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY"

This is wrong and needs to be replaced with '+launcher.trainer_env_vars="WANDB_API_KEY='$WANDB_API_KEY'"',
otherwise it fails.

Collaborator Author

Same as above.

Comment on lines 123 to 127
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with
SkyPilot. For more options and details for SkyPilot, see the official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).

We need to link to this page on how to configure K8s to work with SkyPilot: https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html

Collaborator Author

Added reference in the Kubernetes setup section above.

resolve and distributed checkpointing. The following guideline shows how to use SkyPilot
volumes to setup a high-performance shared storage.

1. **Define the volume.** Create a YAML file describing the volume you want SkyPilot to

While using volumes is fine, this is not required. And using cloud buckets could be simpler: https://docs.skypilot.co/en/latest/reference/storage.html

Collaborator Author

Added cloud bucket usage in the example.

If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with

This file doesn't exist

Collaborator Author

Here we changed this link to https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md to ensure the link is available in our documentation pages after this PR is merged into main.

```bash
# Ensure your kubeconfig is at ~/.kube/config
mkdir -p ~/.kube
cp /path/to/kubeconfig ~/.kube/config
```

It's unclear where /path/to/kubeconfig comes from

Collaborator Author

Removed this and referred to the SkyPilot Kubernetes setup guide instead.

```yaml
resources:
  accelerators: H100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.3
```

It looks like this image is configured to use a custom PyPI index https://pypi.antfin-inc.com/simple.
It doesn't work for me. Here's what I see:

(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) Obtaining file:///root/sky_workdir
(setup pid=4496)   Installing build dependencies: started
(setup pid=3465, ip=10.170.27.38) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=3465, ip=10.170.27.38) Obtaining file:///root/sky_workdir
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: started
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: finished with status 'error'
(setup pid=4496)   error: subprocess-exited-with-error
(setup pid=4496)   
(setup pid=4496)   × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496)   │ exit code: 1
(setup pid=4496)   ╰─> [8 lines of output]
(setup pid=4496)       Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496)       WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80fda300>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca2d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca480>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca5d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca780>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       ERROR: Could not find a version that satisfies the requirement setuptools>=61.0 (from versions: none)
(setup pid=4496)       ERROR: No matching distribution found for setuptools>=61.0
(setup pid=4496)       [end of output]
(setup pid=4496)   
(setup pid=4496)   note: This error originates from a subprocess, and is likely not a problem with pip.
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496) 
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> See above for output.
(setup pid=4496) 
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Job 1's setup failed with return code list: [137, 1]
✓ Job finished (status: FAILED_SETUP).
command terminated with exit code 100

Collaborator Author

Running the experiment no longer requires any installation. However, our public image still uses the custom PyPI index. We have noted this and will fix it in our next image release.

echo "Starting Ray head node..."
ray start --head --port=6379
while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do

ray nodes command doesn't exist:

(head, rank=0, pid=4484) Usage: ray [OPTIONS] COMMAND [ARGS]...
(head, rank=0, pid=4484) Try 'ray --help' for help.
(head, rank=0, pid=4484) 
(head, rank=0, pid=4484) Error: No such command 'nodes'.

Collaborator Author

Fixed this by using ray status instead.
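
A wait loop built on ray status might look like the sketch below. It assumes that ray status prints one line containing node_ per active node, which depends on the Ray version and deployment and should be checked against the actual output. The names count_ray_nodes and wait_for_nodes are hypothetical, and the timeout is an addition so the head node does not wait forever if a worker never joins.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count active nodes by counting "node_" lines in
# `ray status` output. ASSUMPTION: the output format matches this pattern.
count_ray_nodes() {
  ray status 2>/dev/null | grep -c 'node_' || true
}

# Block until at least $1 nodes are reported, or give up after $2 seconds,
# polling every $3 seconds.
wait_for_nodes() {
  local expected=$1 timeout=${2:-600} interval=${3:-5} waited=0
  while [ "$(count_ray_nodes)" -lt "$expected" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out waiting for $expected Ray nodes." >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage on the head node, after `ray start --head --port=6379`:
# wait_for_nodes "$num_nodes" || exit 1
```

Keeping the counting logic in its own function means only count_ray_nodes needs to change if the output format differs.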

echo "Executing training script on head node..."
python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
--config examples/skypilot/gsm8k_grpo_ray.yaml \
experiment_name=<your experiment name> \

Use envs and secrets to set these as env vars:

envs: 
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

secrets:
  WANDB_API_KEY: null

and then:

            experiment_name=$EXPERIMENT_NAME \
            trial_name=$TRIAL_NAME \

Collaborator Author

Done.
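
With the names supplied through envs, the run section can fail fast when one is missing instead of passing an empty override to the launcher. A minimal sketch, assuming EXPERIMENT_NAME and TRIAL_NAME are exported as in the suggestion above; require_env is a hypothetical helper name, not something SkyPilot or AReaL provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_env() {
  local name value
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage in the run section (command taken from the example in this PR):
# require_env EXPERIMENT_NAME TRIAL_NAME || exit 1
# python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
#   --config examples/skypilot/gsm8k_grpo_ray.yaml \
#   experiment_name="$EXPERIMENT_NAME" \
#   trial_name="$TRIAL_NAME"
```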

@alex000kim alex000kim left a comment

I think this PR is a bit raw.
The training job doesn't run due to incorrect syntax in several places:

  • non-existent commands
  • incorrect parameters
  • etc.

Collaborator Author

nuzant commented Oct 13, 2025

I think this PR is a bit raw. The training job doesn't run due to incorrect syntax in several places:

  • non-existent commands
  • incorrect parameters
  • etc.

Thanks for your review! We have GCP access now and we will be able to test and debug this PR by ourselves. We will start fixing this PR right away.

@nuzant nuzant changed the title [Feature] Add SkyPilot launcher and examples [Feature] Add SkyPilot examples Oct 17, 2025
@nuzant nuzant marked this pull request as ready for review October 17, 2025 09:26
Collaborator Author

nuzant commented Oct 17, 2025

I have separated the SkyPilot examples and the launcher into 2 PRs for a clearer view: #464. The SkyPilot launcher is currently under testing and will hopefully be ready within the next week.

Currently the examples are tested on GCP, using two 1xA100 instances.
