# [Feature] Add SkyPilot examples #422
**`examples/skypilot/README.md`** (new file, +158 lines)
# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with
SkyPilot. Make sure you have SkyPilot properly installed following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running these examples. Note that all command lines shown in this file are
assumed to be executed from the root of the AReaL repository.
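To verify the setup first, the standard SkyPilot CLI provides a quick credential check (shown here for GCP; swap in whichever cloud you plan to use):

```bash
# Verify SkyPilot is installed and cloud credentials are configured.
sky check gcp
```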
## Running a Single-Node Experiment

To run a single-node experiment, you only need to set up the node with SkyPilot and
launch the experiment with the AReaL local launcher. [The following file](local.yaml)
shows a SkyPilot YAML that launches a simple GSM8K GRPO experiment in a single
command line. This example runs on GCP, but can easily be migrated to another cloud
or a Kubernetes cluster by changing the `resources.infra` field in the SkyPilot YAML
file.
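For instance, swapping the `resources.infra` field is all it takes to retarget the example (the cloud names below are illustrative; see the SkyPilot docs for the full list of supported infrastructures):

```yaml
# Illustrative alternatives to "infra: gcp" -- pick one:
resources:
  infra: aws        # run on AWS instead of GCP
  # infra: k8s      # or on an existing Kubernetes cluster
```

The full single-node example follows: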
```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

workdir: .

envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=$EXPERIMENT_NAME \
    trial_name=$TRIAL_NAME \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```
To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/local.yaml
```
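Once launched, a few standard SkyPilot commands are handy for managing the run (the cluster name `areal-test` follows the command above):

```bash
sky logs areal-test      # stream the experiment's logs
sky status               # check cluster state
sky down areal-test      # tear the cluster down when finished
```

The `envs` values in the YAML can also be overridden at launch time with SkyPilot's `--env` flag, e.g. `sky launch -c areal-test examples/skypilot/local.yaml --env EXPERIMENT_NAME=my-other-experiment`.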
## Running a Multi-Node Experiment

### Running AReaL with Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use
AReaL to run GRPO on the GSM8K dataset with 2 nodes, each with 1 A100 GPU. This
example runs on GCP, but can easily be migrated to another cloud or a Kubernetes
cluster by changing the `resources.infra` field in the SkyPilot YAML file.

> **Review discussion:** What's the point of using "2 nodes, each with 1 A100 GPU"
> instead of a single node with several GPUs?
>
> It should be fine as an MVP to show the distributed training :)
Specify the resources and image used to run the experiment.

```yaml
resources:
  infra: gcp
  accelerators: A100:1
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```
Designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: gs://areal-default
```

> **Review comment:** Instead of hard-coding gs, can we use something like …, as per
> https://docs.skypilot.co/en/latest/reference/storage.html
or create a new bucket or volume with SkyPilot:

```yaml
file_mounts:
  /storage:
    name: areal-test
    store: gcs
```

For more information about shared storage with SkyPilot, check
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
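If you let SkyPilot create the bucket for you, the managed storage objects can be listed with the standard CLI:

```bash
# List buckets/volumes managed by SkyPilot.
sky storage ls
```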
Next, prepare the commands used to set up the Ray cluster and run the experiment.
```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP and the total number of nodes
  # (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379
    while [ $(ray status | grep node_ | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $num_nodes"
      sleep 5
    done
    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=$EXPERIMENT_NAME \
      trial_name=$TRIAL_NAME
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi
  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```

> **Review comment** (on `num_nodes=$(...)`): replace with an env var
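Following the review suggestion above, a sketch of the same bookkeeping using SkyPilot's injected `SKYPILOT_NUM_NODES` variable (documented in the SkyPilot environment-variables reference) would look like:

```bash
# Same node bookkeeping, but reading the node count from an env var
# injected by SkyPilot instead of counting lines of $SKYPILOT_NODE_IPS.
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
num_nodes=$SKYPILOT_NUM_NODES
```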
### Launch the Ray Cluster and AReaL

Then you are ready to run AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.yaml
```
You should see AReaL running and producing training logs in your terminal.
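To double-check the Ray cluster itself, you can SSH in via the alias SkyPilot adds to your SSH config and inspect the nodes (standard SkyPilot/Ray usage):

```bash
ssh areal-test   # log into the head node
ray status       # should report 2 nodes once the worker has joined
```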
Successfully launched 2 nodes on GCP and deployed a Ray cluster:

<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:

<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">
### Running AReaL with SkyPilot Launcher

AReaL plans to support a SkyPilot-native launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html), which is
currently under development.
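As a rough illustration of what such a launcher could build on, the example above can already be driven through SkyPilot's public Python SDK. This is a sketch only, not the planned AReaL launcher, and the exact return values of `sky.launch()` vary across SkyPilot versions:

```python
# Sketch: launching the single-node example via the SkyPilot Python SDK.
# This is NOT the planned AReaL-native launcher, just the public SkyPilot API.
import sky

task = sky.Task.from_yaml("examples/skypilot/local.yaml")
sky.launch(task, cluster_name="areal-test")
```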
**`examples/skypilot/gsm8k_grpo_ray.yaml`** (new file, +153 lines)
```yaml
experiment_name: gsm8k-grpo-on-ray
trial_name: trial0

seed: 1
total_train_epochs: 10
tokenizer_path: ${actor.path}
async_training: true

cluster:
  n_nodes: 2
  n_gpus_per_node: 1
  fileroot: /storage/experiments
  name_resolve:
    type: ray
    ray_actor_name: ray_kv_store
```

> **Review discussion** (on lines +10 to +11, `cluster.n_nodes` / `cluster.n_gpus_per_node`):
> What's the goal of using a 2-node cluster with 1 GPU on each node instead of a single
> node with 2x, 4x or even 8x GPUs?
>
> I'd recommend using SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE instead. See
> https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html

```yaml
allocation_mode: sglang.d1+d1

rollout:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 256
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 2
  enable_rollout_tracing: false

gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 1024
  greedy: false
  temperature: 1.0

actor:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: Qwen/Qwen2.5-1.5B-Instruct
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: false
  dtype: bfloat16
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 1.70e-5
    weight_decay: 0.017
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  backend: fsdp
  group_size: ${gconfig.n_samples}
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.0
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: 5.0
  dynamic_sampling: false
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}

ref:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 10240
  optimizer: null
  backend: fsdp

# SGLang
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: true
  dtype: ${actor.dtype}
  max_running_requests: null
  context_length: 32768
  mem_fraction_static: 0.8

# datasets
train_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
  max_length: 1024

valid_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl

# Utilities
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: 3600

evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: disabled

launcher:
  inference_server_cpus_per_gpu: 4
  inference_server_mem_per_gpu: 32768
  trainer_cpus_per_gpu: 4
  trainer_mem_per_gpu: 32768
```
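Any field in this config can also be overridden on the launcher command line using the `key=value` pattern shown in the README above (the values below are purely illustrative):

```bash
python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
  --config examples/skypilot/gsm8k_grpo_ray.yaml \
  train_dataset.batch_size=8 \
  gconfig.max_new_tokens=512
```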
**`examples/skypilot/local.yaml`** (new file, +29 lines)
```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: gs://areal-default

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```

> **Review comment** (on `/storage: gs://areal-default`): It might be worth pointing out that …
>
> **Review discussion** (on `cluster.n_gpus_per_node=2`): Move this to yaml, or specify both …
>
> I'd recommend using …
> **Review comment:** I'd add a link to SkyPilot docs + mention that it supports 17+ clouds