[fully_async, rollout] feat: enable online policy distillation in fully async training #6056
xiefan46 wants to merge 5 commits into verl-project:main
Conversation
Code Review
This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.
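For readers unfamiliar with the thread-executor pattern mentioned above, here is a minimal sketch of the idea (names like `get_teacher_logprobs` and `teacher_client` are illustrative, not the PR's actual code): blocking teacher calls run on a dedicated executor so they never stall the agent loop's asyncio event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated executor: blocking teacher calls never run on the event loop thread.
_teacher_executor = ThreadPoolExecutor(max_workers=1)

async def get_teacher_logprobs(teacher_client, prompt_ids):
    loop = asyncio.get_running_loop()
    # teacher_client.compute_logprobs is a hypothetical blocking RPC;
    # run_in_executor offloads it and yields control back to the loop.
    return await loop.run_in_executor(
        _teacher_executor, teacher_client.compute_logprobs, prompt_ids
    )
```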
```python
if self.resource_pool:
    world_size = self.resource_pool.world_size
else:
    world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size
```
If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.
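A minimal sketch of the suggested guard, reusing the names from the snippet above (not the exact upstream code):

```python
if self.resource_pool:
    world_size = self.resource_pool.world_size
else:
    world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size
# Fail fast instead of building an empty rollout_replicas list and
# handing the load balancer zero server addresses.
if num_replicas <= 0:
    raise ValueError(
        f"Distillation is enabled but num_replicas={num_replicas} "
        f"(world_size={world_size}, teacher_world_size={teacher_world_size}); "
        "check n_gpus_per_node and nnodes in the teacher model config."
    )
```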
```python
assert rm_cfg.n_gpus_per_node > 0, "config.reward.reward_model.n_gpus_per_node must be greater than 0"
assert rm_cfg.nnodes > 0, "config.reward.reward_model.nnodes must be greater than 0"

# Teacher model resource pool (for distillation)
```
Per our future plans, the rollouter will no longer be responsible for resource allocation, so the teacher will need to use standalone mode.
@ArronHZG Got it. Let me wait for those changes, then.
Enable multi-teacher online policy distillation in fully async mode:
- Student: Qwen3-VL-2B-Instruct; Teachers: Qwen3-4B + Qwen3-VL-4B
- Allocate shared resource pool for teacher models
- Add NCCL_CUMEM workaround for multi-rollout GPU SIGSEGV
- Enable CUDA graph for student rollout and teachers
- Increase rollout GPUs from 1 to 2 to reduce trainer idle ratio
- Add E2E test script (run_fully_async_policy_opd.sh)
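The NCCL_CUMEM workaround mentioned above typically amounts to disabling NCCL's cuMem-based allocator; a hedged sketch, assuming `NCCL_CUMEM_ENABLE` is the variable the script sets:

```bash
# Disable NCCL cuMem allocation (workaround for SIGSEGV when multiple
# rollout engines share a GPU); the variable name is an assumption here.
export NCCL_CUMEM_ENABLE=0
```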
```bash
N_GPUS_ROLLOUT=2
N_GPUS_TRAINING=4
N_GPUS_TEACHER_TOTAL=2  # 1 per teacher
TOTAL_ROLLOUT_STEPS=${TOTAL_ROLLOUT_STEPS:-128}
```
2 steps should be enough for E2E CI.
@wuxibin89 I made some changes to the test script, and now

training_steps = total_rollout_steps / (ppo_mini_batch_size × trigger_parameter_sync_step) = 128 / (16 × 4) = 2

Please let me know if more changes are needed.
- Test script: model paths default to ${HOME}/models/ (NFS cache), following the convention of other CI test scripts
- CI workflow: add --local_dataset_path for geo3k data preprocessing to use NFS-cached raw dataset instead of downloading
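For context, the preprocessing change amounts to something like the following; the flag name comes from the note above, while the script path and dataset location are assumptions:

```bash
# Use the NFS-cached raw dataset instead of downloading; the
# examples/data_preprocess/geo3k.py path is an assumption.
python examples/data_preprocess/geo3k.py --local_dataset_path "${HOME}/data/geo3k"
```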
What does this PR do?
Extension of PR #6051. Enables Online Policy Distillation (OPD) in fully async training mode.
Checklist Before Starting

- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, add `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd/runs/t7mqeuf1?nw=nwuserfxie46
critic/score/mean

actor/loss

actor/grad_norm

actor/distillation/loss

nvidia-smi
API and Usage Example
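A hedged sketch of how to exercise the feature via the E2E script added in this PR (the `tests/special_e2e/` location is an assumption based on where similar verl CI scripts live):

```bash
# Run the fully-async OPD E2E test; TOTAL_ROLLOUT_STEPS defaults to 128
# in the script and can be overridden as discussed in the review thread above.
TOTAL_ROLLOUT_STEPS=128 bash tests/special_e2e/run_fully_async_policy_opd.sh
```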
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.