48 changes: 48 additions & 0 deletions docs/tutorial/installation.md
@@ -79,6 +79,54 @@ However, `flash-attn==2.8.3` is not compatible with the Megatron training
backend. If you want to use the Megatron training backend, please compile and install
`flash-attn==2.8.1` in your custom environment, or use the Docker installation instead.

## (Optional) Install SkyPilot

SkyPilot helps you run AReaL easily on cloud or Kubernetes infrastructure. The steps
below show the minimal setup for SkyPilot on GCP or Kubernetes.

Reviewer: I'd add a link to SkyPilot docs + mention that it supports 17+ clouds.

### Install SkyPilot

```bash
# In your conda environment
# NOTE: SkyPilot requires 3.7 <= python <= 3.13
pip install -U "skypilot[gcp,kubernetes]"
```
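
To confirm the CLI is on your PATH (a quick sanity check after any pip install):

```bash
sky --version
```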

### GCP setup

```bash
# Install Google Cloud SDK
conda install -y -c conda-forge google-cloud-sdk

# Initialize gcloud and select your account/project
gcloud init

# (Optional) choose a project explicitly
gcloud config set project <PROJECT_ID>

# Create Application Default Credentials
gcloud auth application-default login
```

### Kubernetes setup

Check
[here](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html)
for a comprehensive guide on how to set up a Kubernetes cluster for SkyPilot.
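
If you already have access to a cluster, a quick sanity check might look like this
(a minimal sketch, assuming your kubeconfig already points at the cluster):

```bash
kubectl get nodes        # confirm the cluster is reachable
sky check kubernetes     # ask SkyPilot to validate Kubernetes access specifically
```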

### Verify

```bash
sky check
```

If `GCP: enabled` or `Kubernetes: enabled` is shown, you're ready to use SkyPilot with
AReaL. Check
[here](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md) for a
detailed example of running AReaL with SkyPilot. For more SkyPilot options and details,
see the official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).

## (Optional) Launch Ray Cluster for Distributed Training

On the first node, start the Ray Head:
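
```bash
# Default Ray port; this matches the command used in the SkyPilot example below.
ray start --head --port=6379
```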
158 changes: 158 additions & 0 deletions examples/skypilot/README.md
@@ -0,0 +1,158 @@
# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with SkyPilot.
Make sure you have SkyPilot properly installed following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running this example. Note that all command lines shown in this file are assumed
to be executed from the root of the AReaL repository.

## Running a Single Node Experiment

To run a single node experiment, you only need to setup the node with SkyPilot and
launch the experiment with AReaL local launcher. [The following file](local.yaml) shows
a SkyPilot yaml that could launch a simple GSM8K GRPO experiment in a single command
line. This example runs on GCP, but could be easily migrated to other cloud or K8S
cluster by changing `resource.infra` field in SkyPilot YAML file.
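
For instance, targeting a Kubernetes cluster might require nothing more than the change
below (an illustrative sketch; `k8s` is one of the values the SkyPilot `infra` field
accepts, and the exact value depends on your setup):

```yaml
resources:
  infra: k8s        # instead of gcp; e.g. k8s/<context-name> to pick a specific context
  accelerators: A100:2
```

The full SkyPilot YAML for this example: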

```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

workdir: .

envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=$EXPERIMENT_NAME \
    trial_name=$TRIAL_NAME \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```
To run the experiment, execute:
```bash
sky launch -c areal-test examples/skypilot/local.yaml
```
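
After the job finishes, a few standard SkyPilot commands are handy (the cluster name
matches the `-c` flag above):

```bash
sky logs areal-test    # stream or re-attach to the job's output
sky status             # list clusters and their state
sky down areal-test    # tear the cluster down when you're done
```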

## Running a Multi-Node Experiment

### Running AReaL with the Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use AReaL
to run GRPO on the GSM8K dataset on 2 nodes, each with 1 A100 GPU. This example runs on
GCP, but can easily be migrated to another cloud or a Kubernetes cluster by changing the
`resources.infra` field in the SkyPilot YAML file.

Reviewer: What's the point of using "2 nodes, each with 1 A100 GPU" instead of a single
node with several GPUs? This training will be much slower due to the slow interconnect
speed between nodes.

Reply: It should be fine as an MVP to show the distributed training :)

Specify the resources and image used to run the experiment.

```yaml
resources:
  infra: gcp
  accelerators: A100:1
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```
Designate shared storage. You can either use an existing cloud bucket or volume:
```yaml
file_mounts:
  /storage: gs://areal-default
```

Reviewer: Instead of hard-coding gs, can we use something like

```yaml
file_mounts:
  /my_data:
    source: s3://my-bucket/  # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT  # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
```

as per https://docs.skypilot.co/en/latest/reference/storage.html
or create a new bucket or volume with SkyPilot:
```yaml
file_mounts:
  /storage:
    name: areal-test
    store: gcs
```
For more information about shared storage with SkyPilot, check
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
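
If the `gs://areal-default` bucket from the example doesn't exist yet, you could create
it first (assuming the gcloud CLI from the GCP setup; any globally unique bucket name
works):

```bash
gcloud storage buckets create gs://areal-default
```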
Next, prepare the commands used to set up the Ray cluster and run the experiment.
```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP and the total number of nodes
  # (from environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379
    while [ $(ray status | grep node_ | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $num_nodes"
      sleep 5
    done
    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=$EXPERIMENT_NAME \
      trial_name=$TRIAL_NAME
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi
  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```

Reviewer (on the `num_nodes` line): replace with an env var.
### Launch the Ray Cluster and AReaL

Then you are ready to run AReaL from the command line:
```bash
sky launch -c areal-test examples/skypilot/ray_cluster.yaml
```

You should see AReaL running and producing training logs in your terminal.

Successfully launched 2 nodes on GCP and deployed a Ray cluster:
<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:
<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">

### Running AReaL with the SkyPilot Launcher

AReaL plans to support a native SkyPilot launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html), which is
currently under development.
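
In the meantime, a minimal sketch of launching the Ray-cluster YAML through the SkyPilot
Python SDK (illustrative only; this is plain SkyPilot SDK usage, not the planned AReaL
launcher API):

```python
import sky

# Build a task from the same YAML used with `sky launch` above.
task = sky.Task.from_yaml("examples/skypilot/ray_cluster.yaml")

# Launch it on a cluster named like the CLI example. Recent SkyPilot
# versions return a request ID that stream_and_get() waits on.
request_id = sky.launch(task, cluster_name="areal-test")
sky.stream_and_get(request_id)
```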
153 changes: 153 additions & 0 deletions examples/skypilot/gsm8k_grpo_ray.yaml
@@ -0,0 +1,153 @@
```yaml
experiment_name: gsm8k-grpo-on-ray
trial_name: trial0

seed: 1
total_train_epochs: 10
tokenizer_path: ${actor.path}
async_training: true

cluster:
  n_nodes: 2
  n_gpus_per_node: 1
  # Reviewer comments on the two lines above:
  # - Is n_nodes=2 correct? What should be the desired n_gpus_per_node?
  # - What's the goal of using a 2-node cluster with 1 GPU on each node instead of
  #   a single node with 2x, 4x or even 8x GPUs?
  # - I'd recommend using SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE instead.
  #   See https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html
  fileroot: /storage/experiments
  name_resolve:
    type: ray
    ray_actor_name: ray_kv_store

allocation_mode: sglang.d1+d1  # one SGLang inference GPU + one training GPU

rollout:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 256
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 2
  enable_rollout_tracing: false

gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 1024
  greedy: false
  temperature: 1.0

actor:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: Qwen/Qwen2.5-1.5B-Instruct
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: false
  dtype: bfloat16
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 1.70e-5
    weight_decay: 0.017
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  backend: fsdp
  group_size: ${gconfig.n_samples}
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.0
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: 5.0
  dynamic_sampling: false
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}

ref:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 10240
  optimizer: null
  backend: fsdp

# SGLang
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: true
  dtype: ${actor.dtype}
  max_running_requests: null
  context_length: 32768
  mem_fraction_static: 0.8

# datasets
train_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
  max_length: 1024

valid_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl

# Utilities
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: 3600

evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: disabled

launcher:
  inference_server_cpus_per_gpu: 4
  inference_server_mem_per_gpu: 32768
  trainer_cpus_per_gpu: 4
  trainer_mem_per_gpu: 32768
```
29 changes: 29 additions & 0 deletions examples/skypilot/local.yaml
@@ -0,0 +1,29 @@
```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: gs://areal-default
  # Reviewer: It might be worth pointing out that /storage is set in
  # examples/skypilot/gsm8k_grpo_ray.yaml.

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
  # Reviewer (on cluster.n_gpus_per_node=2): Move this to yaml, or specify both
  # n_nodes and n_gpus_per_node.
  # Reviewer: I'd recommend using SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE
  # instead. See https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html
```