
[fully_async, rollout] feat: enable online policy distillation in fully async training#6056

Open
xiefan46 wants to merge 5 commits into verl-project:main from xiefan46:async-opd

Conversation

@xiefan46
Contributor

@xiefan46 xiefan46 commented Apr 18, 2026

What does this PR do?

Extends PR #6051: enables Online Policy Distillation (OPD) in fully async training mode.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

  • Student: Qwen3-VL-2B-Instruct
  • Teachers: Qwen3-4B-Instruct-2507 (GSM8K, text-only) + Qwen3-VL-4B-Instruct (Geo3K, vision)
  • Datasets: GSM8K + Geometry3K
  • Algorithm: GRPO + k1 distillation loss with policy gradient
  • GPUs: 6×H100 (2 rollout + 2 training + 2 teachers)

full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd/runs/t7mqeuf1?nw=nwuserfxie46

Training curves (screenshots omitted): critic/score/mean, actor/loss, actor/grad_norm, actor/distillation/loss.

nvidia-smi

```
(base) root@7bd0f12300f6:~#  nvidia-smi
Tue Apr 21 17:36:18 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                  Off |
| N/A   50C    P0            568W /  700W |   67922MiB /  81559MiB |     90%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                  Off |
| N/A   55C    P0            599W /  700W |   59002MiB /  81559MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                  Off |
| N/A   41C    P0            335W /  700W |   58740MiB /  81559MiB |     26%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                  Off |
| N/A   38C    P0            262W /  700W |   57958MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9A:00.0 Off |                  Off |
| N/A   51C    P0            480W /  700W |   51457MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                  Off |
| N/A   55C    P0            458W /  700W |   51453MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           21700      C   ...WorkerDict.actor_update_actor      67914MiB |
|    1   N/A  N/A           21701      C   ...WorkerDict.actor_update_actor      58994MiB |
|    2   N/A  N/A           23573      C   VLLM::Worker                          58732MiB |
|    3   N/A  N/A           25364      C   VLLM::Worker                          57950MiB |
|    4   N/A  N/A           26680      C   ray::CheckpointEngineWorker            1784MiB |
|    4   N/A  N/A           27431      C   VLLM::Worker                          49660MiB |
|    5   N/A  N/A           26681      C   ray::CheckpointEngineWorker            1724MiB |
|    5   N/A  N/A           27447      C   VLLM::Worker                          49716MiB |
+-----------------------------------------------------------------------------------------+
```

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
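The usage section is left as a template placeholder. Based on the E2E script and GPU variables quoted later in this thread, a hypothetical invocation might look like the following (paths and variable names are taken from the PR's test-script hunk, not verified against the final code):

```shell
# Hypothetical sketch: run the E2E OPD test script added by this PR,
# overriding the GPU split via the environment variables it defines.
# Variable names come from the script hunk quoted in the review thread.
N_GPUS_ROLLOUT=2 \
N_GPUS_TRAINING=4 \
N_GPUS_TEACHER_TOTAL=2 \
TOTAL_ROLLOUT_STEPS=128 \
bash tests/special_e2e/run_fully_async_policy_opd.sh
```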

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.

Comment thread verl/experimental/fully_async_policy/fully_async_trainer.py
Comment on lines +61 to +65
```python
if self.resource_pool:
    world_size = self.resource_pool.world_size
else:
    world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size
```
Contributor


high

If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.

@xiefan46 xiefan46 changed the title Async opd [fully_async, rollout] feat: enable online policy distillation in fully async training Apr 20, 2026
@xiefan46 xiefan46 force-pushed the async-opd branch 8 times, most recently from 68ea39f to 0a27278 Compare April 21, 2026 10:30
@xiefan46 xiefan46 marked this pull request as ready for review April 22, 2026 07:41
@wuxibin89 wuxibin89 mentioned this pull request Apr 24, 2026
34 tasks
Comment thread verl/experimental/fully_async_policy/fully_async_rollouter.py Outdated
Comment thread verl/experimental/teacher_loop/teacher_model.py Outdated
Comment thread verl/experimental/fully_async_policy/fully_async_rollouter.py Outdated
@xiefan46 xiefan46 marked this pull request as draft April 26, 2026 09:13
Comment thread tests/special_e2e/run_fully_async_policy_opd.sh Outdated
@xiefan46 xiefan46 force-pushed the async-opd branch 8 times, most recently from aaacc39 to eac9de5 Compare April 27, 2026 10:36
@xiefan46 xiefan46 force-pushed the async-opd branch 3 times, most recently from 117ae25 to e97ffe1 Compare April 27, 2026 10:52
Collaborator

@ArronHZG ArronHZG left a comment


hold by #6129 and #6076

```python
assert rm_cfg.n_gpus_per_node > 0, "config.reward.reward_model.n_gpus_per_node must be greater than 0"
assert rm_cfg.nnodes > 0, "config.reward.reward_model.nnodes must be greater than 0"

# Teacher model resource pool (for distillation)
```
Collaborator


In future plans, the rollouter will not be responsible for resource allocation; that is, the teacher will need to use standalone mode.

Contributor Author


@ArronHZG got it. Let me wait for those changes then

@wuxibin89
Collaborator

hold by #6129 and #6076

@ArronHZG OPD teachers use a standalone resource pool, so it should not be blocked by #6076?

@wuxibin89 wuxibin89 marked this pull request as ready for review April 29, 2026 14:30
Enable multi-teacher online policy distillation in fully async mode:
- Student: Qwen3-VL-2B-Instruct, Teachers: Qwen3-4B + Qwen3-VL-4B
- Allocate shared resource pool for teacher models
- Add NCCL_CUMEM workaround for multi-rollout GPU SIGSEGV
- Enable CUDA graph for student rollout and teachers
- Increase rollout GPUs from 1 to 2 to reduce trainer idle ratio
- Add E2E test script (run_fully_async_policy_opd.sh)
```shell
N_GPUS_ROLLOUT=2
N_GPUS_TRAINING=4
N_GPUS_TEACHER_TOTAL=2 # 1 per teacher
TOTAL_ROLLOUT_STEPS=${TOTAL_ROLLOUT_STEPS:-128}
```
Collaborator


2 steps should be enough for e2e ci.

Contributor Author


@wuxibin89 I made some changes to the test script and now

training_steps = total_rollout_steps / (ppo_mini_batch_size × trigger_parameter_sync_step) = 128 / (16 × 4) = 2

Please let me know if more changes are needed
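The step-count arithmetic above can be verified with a trivial check (integer division assumed):

```python
# Check the quoted step count: 128 rollout steps with
# ppo_mini_batch_size=16 and trigger_parameter_sync_step=4.
total_rollout_steps = 128
ppo_mini_batch_size = 16
trigger_parameter_sync_step = 4
training_steps = total_rollout_steps // (ppo_mini_batch_size * trigger_parameter_sync_step)
print(training_steps)  # → 2
```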

- Test script: model paths default to ${HOME}/models/ (NFS cache),
  following the convention of other CI test scripts
- CI workflow: add --local_dataset_path for geo3k data preprocessing
  to use NFS-cached raw dataset instead of downloading
