
[fully_async, rollout] feat: enable online policy distillation in fully async training#6056

Open
xiefan46 wants to merge 5 commits into verl-project:main from xiefan46:async-opd

Conversation

@xiefan46
Contributor

@xiefan46 xiefan46 commented Apr 18, 2026

What does this PR do?

Extends PR #6051: enables Online Policy Distillation (OPD) in fully async training mode.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

  • Student: Qwen3-VL-2B-Instruct
  • Teachers: Qwen3-4B-Instruct-2507 (GSM8K, text-only) + Qwen3-VL-4B-Instruct (Geo3K, vision)
  • Datasets: GSM8K + Geometry3K
  • Algorithm: GRPO + k1 distillation loss with policy gradient
  • GPUs: 6×H100 (2 rollout + 2 training + 2 teachers)

full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd/runs/t7mqeuf1?nw=nwuserfxie46

Training curves (screenshots omitted): critic/score/mean, actor/loss, actor/grad_norm, actor/distillation/loss.

nvidia-smi

```
(base) root@7bd0f12300f6:~#  nvidia-smi
Tue Apr 21 17:36:18 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                  Off |
| N/A   50C    P0            568W /  700W |   67922MiB /  81559MiB |     90%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                  Off |
| N/A   55C    P0            599W /  700W |   59002MiB /  81559MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                  Off |
| N/A   41C    P0            335W /  700W |   58740MiB /  81559MiB |     26%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                  Off |
| N/A   38C    P0            262W /  700W |   57958MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9A:00.0 Off |                  Off |
| N/A   51C    P0            480W /  700W |   51457MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                  Off |
| N/A   55C    P0            458W /  700W |   51453MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           21700      C   ...WorkerDict.actor_update_actor      67914MiB |
|    1   N/A  N/A           21701      C   ...WorkerDict.actor_update_actor      58994MiB |
|    2   N/A  N/A           23573      C   VLLM::Worker                          58732MiB |
|    3   N/A  N/A           25364      C   VLLM::Worker                          57950MiB |
|    4   N/A  N/A           26680      C   ray::CheckpointEngineWorker            1784MiB |
|    4   N/A  N/A           27431      C   VLLM::Worker                          49660MiB |
|    5   N/A  N/A           26681      C   ray::CheckpointEngineWorker            1724MiB |
|    5   N/A  N/A           27447      C   VLLM::Worker                          49716MiB |
+-----------------------------------------------------------------------------------------+
```

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
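The usage section is left as a template placeholder. Based on the E2E script and GPU variables quoted later in this thread, a hypothetical invocation might look like the following (paths and variable names are taken from the PR's test-script hunk, not verified against the final code):

```shell
# Hypothetical sketch: run the E2E OPD test script added by this PR,
# overriding the GPU split via the environment variables it defines.
# Variable names come from the script hunk quoted in the review thread.
N_GPUS_ROLLOUT=2 \
N_GPUS_TRAINING=4 \
N_GPUS_TEACHER_TOTAL=2 \
TOTAL_ROLLOUT_STEPS=128 \
bash tests/special_e2e/run_fully_async_policy_opd.sh
```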

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.

Comment thread verl/experimental/fully_async_policy/fully_async_trainer.py
Comment on lines +61 to +65
```python
if self.resource_pool:
    world_size = self.resource_pool.world_size
else:
    world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size
```
Contributor


high

If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.

@xiefan46 xiefan46 changed the title Async opd [fully_async, rollout] feat: enable online policy distillation in fully async training Apr 20, 2026
@xiefan46 xiefan46 force-pushed the async-opd branch 8 times, most recently from 68ea39f to 0a27278 Compare April 21, 2026 10:30
@xiefan46 xiefan46 marked this pull request as ready for review April 22, 2026 07:41
@wuxibin89 wuxibin89 mentioned this pull request Apr 24, 2026
34 tasks
Comment thread verl/experimental/fully_async_policy/fully_async_rollouter.py Outdated
Comment thread verl/experimental/teacher_loop/teacher_model.py Outdated
Comment thread verl/experimental/fully_async_policy/fully_async_rollouter.py Outdated
@xiefan46 xiefan46 marked this pull request as draft April 26, 2026 09:13
Comment thread tests/special_e2e/run_fully_async_policy_opd.sh Outdated
@xiefan46 xiefan46 force-pushed the async-opd branch 8 times, most recently from aaacc39 to eac9de5 Compare April 27, 2026 10:36
@xiefan46 xiefan46 force-pushed the async-opd branch 3 times, most recently from 117ae25 to e97ffe1 Compare April 27, 2026 10:52
Collaborator

@ArronHZG ArronHZG left a comment


hold by #6129 and #6076

```python
assert rm_cfg.n_gpus_per_node > 0, "config.reward.reward_model.n_gpus_per_node must be greater than 0"
assert rm_cfg.nnodes > 0, "config.reward.reward_model.nnodes must be greater than 0"

# Teacher model resource pool (for distillation)
```
Collaborator


In future plans, the rollouter will not be responsible for resource allocation; that is, the teacher will need to use standalone mode.

Contributor Author


@ArronHZG got it. Let me wait for those changes then

@wuxibin89
Collaborator

hold by #6129 and #6076

@ArronHZG OPD teachers use a standalone resource pool, so it should not be blocked by #6076?

@wuxibin89 wuxibin89 marked this pull request as ready for review April 29, 2026 14:30
Enable multi-teacher online policy distillation in fully async mode:
- Student: Qwen3-VL-2B-Instruct, Teachers: Qwen3-4B + Qwen3-VL-4B
- Allocate shared resource pool for teacher models
- Add NCCL_CUMEM workaround for multi-rollout GPU SIGSEGV
- Enable CUDA graph for student rollout and teachers
- Increase rollout GPUs from 1 to 2 to reduce trainer idle ratio
- Add E2E test script (run_fully_async_policy_opd.sh)
```shell
N_GPUS_ROLLOUT=2
N_GPUS_TRAINING=4
N_GPUS_TEACHER_TOTAL=2 # 1 per teacher
TOTAL_ROLLOUT_STEPS=${TOTAL_ROLLOUT_STEPS:-128}
```
Collaborator


2 steps should be enough for e2e ci.

Contributor Author


@wuxibin89 I made some changes to the test script and now

training_steps = total_rollout_steps / (ppo_mini_batch_size × trigger_parameter_sync_step) = 128 / (16 × 4) = 2

Please let me know if more changes are needed
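The step-count arithmetic above can be verified with a trivial check (integer division assumed):

```python
# Check the quoted step count: 128 rollout steps with
# ppo_mini_batch_size=16 and trigger_parameter_sync_step=4.
total_rollout_steps = 128
ppo_mini_batch_size = 16
trigger_parameter_sync_step = 4
training_steps = total_rollout_steps // (ppo_mini_batch_size * trigger_parameter_sync_step)
print(training_steps)  # → 2
```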

- Test script: model paths default to ${HOME}/models/ (NFS cache),
  following the convention of other CI test scripts
- CI workflow: add --local_dataset_path for geo3k data preprocessing
  to use NFS-cached raw dataset instead of downloading
