Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
Environment
- ms-swift version: 3.x (latest main)
- vLLM mode: colocate
- Training: GRPO with LoRA
- Model: Qwen2.5-Omni (multimodal)
Describe the bug
When using --vllm_enable_lora true in colocate mode and resuming GRPO LoRA training via --resume_from_checkpoint, the first rollout batch after resume generates correct outputs, but all subsequent batches produce completely malformed text (losing all format tags), causing pred_extract_failed for 100% of samples.
Root Cause Analysis
The issue is in rollout_mixin.py's weight synchronization logic:
# rollout_mixin.py line 382
if tuner_type == 'full' or (not self.base_sync_done or args.sleep_level == 2) or not self.rollout_enable_lora:
    self._move_full_model_to_vllm()   # first sync: merges LoRA into base, loads merged weights
else:
    self._move_adapter_to_vllm()      # subsequent syncs: loads LoRA as an adapter on top

Phase 1 (base_sync_done=False): _move_full_model_to_vllm() merges the LoRA adapter into the base model weights, loads the merged weights (W_base + LoRA) into vLLM, then unmerges. Sets base_sync_done = True.
Phase 2 (base_sync_done=True): _move_adapter_to_vllm() extracts the current LoRA parameters and adds them as a LoRA adapter to vLLM. vLLM now computes: (W_base + LoRA) + LoRA = W_base + 2×LoRA.
This double-counting of LoRA weights is catastrophic when resuming from a well-trained checkpoint (e.g., 500 steps), where LoRA weights are large. The 2× amplification completely destroys the model's generation capability.
Why this doesn't manifest when training from scratch: When starting fresh, the initial LoRA weights from an SFT adapter are very small, so the double-counting effect is negligible and the model quickly adapts.
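The arithmetic of the failure mode can be sketched with scalar stand-ins for the weight tensors (all values are invented for illustration; real LoRA deltas are matrices):

```python
# Toy scalar illustration of the double-counting described above.
# W_base and lora stand in for full weight tensors; values are made up.

W_base = 1.0   # a base-model weight
lora = 0.5     # a well-trained LoRA delta (large after many steps)

# Phase 1: _move_full_model_to_vllm() loads merged weights into vLLM.
vllm_weight = W_base + lora            # vLLM now holds W_base + LoRA

# Phase 2: _move_adapter_to_vllm() adds the same LoRA again as an adapter.
effective = vllm_weight + lora         # (W_base + LoRA) + LoRA

assert effective == W_base + 2 * lora  # double-counted
print(effective)                       # 2.0, not the intended 1.5
```

With a fresh, near-zero `lora` the same arithmetic gives an effective weight close to the intended one, which is why the bug only surfaces after resuming from a well-trained checkpoint.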
Steps to Reproduce
# Step 1: Start GRPO LoRA training with vllm_enable_lora
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
--adapters "path/to/sft_checkpoint" \
--output_dir "path/to/output" \
...
# Step 2: Interrupt training at step N (e.g., step 500)
# Step 3: Resume from checkpoint
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
--resume_from_checkpoint "path/to/output/checkpoint-500" \
--ref_adapters "path/to/sft_checkpoint" \
...
Observed behavior
| Batch | format_reward | pred_extract_failed | Notes |
|---|---|---|---|
| 1st (step 501) | ~1.0 | 0/8 | Correct (uses _move_full_model_to_vllm) |
| 2nd (step 502) | 0.0 | 8/8 | Broken (uses _move_adapter_to_vllm, LoRA double-counted) |
| 3rd+ | 0.0 | 8/8 | All subsequent batches broken |
Expected behavior
All batches after resume should produce outputs consistent with the checkpoint's quality level.
Workaround
Set --vllm_enable_lora false when using resume_from_checkpoint. This forces _move_full_model_to_vllm() every time, avoiding the double-counting. The trade-off is slower weight synchronization.
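Assuming the same placeholder paths as the reproduce commands above, the resume invocation with the workaround applied would look like:

```shell
# Resume with LoRA adapter loading disabled in vLLM.
# This forces a full (merged) weight sync on every rollout: slower, but correct.
swift rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --vllm_enable_lora false \
    --resume_from_checkpoint "path/to/output/checkpoint-500" \
    --ref_adapters "path/to/sft_checkpoint" \
    ...
```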
Suggested Fix
After _move_full_model_to_vllm() loads the merged weights into vLLM, the subsequent _move_adapter_to_vllm() loads the full LoRA on top of already-merged base weights. A correct implementation should either:
- Load unmerged base weights (without LoRA) into vLLM during the first full sync, so that subsequent adapter-only syncs correctly add LoRA once; or
- Track the "base LoRA snapshot" and only send the delta in subsequent adapter syncs.
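The second option could be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names (`DeltaAdapterSync`, `full_sync`, `adapter_sync`) are hypothetical and do not correspond to the actual ms-swift API, and weights are modeled as plain dicts of floats rather than tensors.

```python
# Hypothetical sketch of option 2: snapshot the LoRA state at the first full
# sync, then send only the delta in subsequent adapter-only syncs.
import copy


class DeltaAdapterSync:
    def __init__(self):
        self.snapshot = None  # LoRA state captured at the first full sync

    def full_sync(self, base_weights, lora_state):
        """First sync: compute merged weights for vLLM, remember the LoRA baseline."""
        merged = {k: base_weights[k] + lora_state.get(k, 0.0) for k in base_weights}
        self.snapshot = copy.deepcopy(lora_state)
        return merged  # what would be loaded into vLLM

    def adapter_sync(self, current_lora_state):
        """Subsequent syncs: send only the change relative to the merged baseline."""
        return {k: current_lora_state[k] - self.snapshot.get(k, 0.0)
                for k in current_lora_state}
```

With this scheme, vLLM's effective weight after an adapter sync is merged + delta = W_base + current LoRA, so the LoRA contribution is counted exactly once even when resuming from a large checkpoint.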
Related
- Introduced in PR [grpo] Optimize vLLM weight synchronization & update builtin accuracy reward #5773
- Documentation at docs/source/Instruction/GRPO/GetStarted/GRPO.md lists limitations for vllm_enable_lora (freeze_vit=false, MoE) but does not mention the resume_from_checkpoint incompatibility
How to Reproduce
(Same commands as in "Steps to Reproduce" above.)
Result: the first batch after resume is correct. All subsequent batches fail with 100% pred_extract_failed: the model generates malformed text without any of the expected format tags.
Additional Information
No response