
[Bug] vllm_enable_lora causes LoRA double-counting after resume_from_checkpoint, leading to catastrophic generation failure #8233

@xiaobaiv

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description / Bug 描述

Environment

  • ms-swift version: 3.x (latest main)
  • vLLM mode: colocate
  • Training: GRPO with LoRA
  • Model: Qwen2.5-Omni (multimodal)

Describe the bug

When using --vllm_enable_lora true in colocate mode and resuming GRPO LoRA training via --resume_from_checkpoint, the first rollout batch after resume generates correct outputs, but all subsequent batches produce completely malformed text (losing all format tags), causing pred_extract_failed for 100% of samples.

Root Cause Analysis

The issue is in rollout_mixin.py's weight synchronization logic:

# rollout_mixin.py line 382
if tuner_type == 'full' or (not self.base_sync_done or args.sleep_level == 2) or not self.rollout_enable_lora:
    self._move_full_model_to_vllm()  # First sync: merges LoRA into base, loads merged weights
else:
    self._move_adapter_to_vllm()     # Subsequent syncs: loads LoRA as adapter on top

Phase 1 (base_sync_done=False): _move_full_model_to_vllm() merges the LoRA adapter into the base model weights, loads the merged weights (W_base + LoRA) into vLLM, then unmerges. Sets base_sync_done = True.

Phase 2 (base_sync_done=True): _move_adapter_to_vllm() extracts the current LoRA parameters and adds them as a LoRA adapter to vLLM. vLLM now computes: (W_base + LoRA) + LoRA = W_base + 2×LoRA.

This double-counting of LoRA weights is catastrophic when resuming from a well-trained checkpoint (e.g., 500 steps), where LoRA weights are large. The 2× amplification completely destroys the model's generation capability.

Why this doesn't manifest when training from scratch: when starting fresh, the initial LoRA weights from an SFT adapter are very small, so the double-counting effect is negligible and the model quickly adapts.
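The two phases can be illustrated with a toy scalar model (hypothetical numbers, not the actual ms-swift tensor code):

```python
# Toy scalar model of the two sync phases (hypothetical values,
# not the actual ms-swift implementation).
w_base = 1.0       # base model weight
lora_delta = 0.5   # LoRA contribution after many training steps

# Phase 1 (base_sync_done=False): LoRA is merged into the base and the
# merged weights are loaded into vLLM.
engine_weight = w_base + lora_delta            # 1.5 -- correct

# Phase 2 (base_sync_done=True): the same LoRA is attached as an adapter
# on top of the already-merged weights.
effective_weight = engine_weight + lora_delta  # 2.0 == w_base + 2 * lora_delta

surplus = effective_weight - (w_base + lora_delta)
print(surplus)  # 0.5: one extra copy of the LoRA delta
```

With a freshly initialized adapter `lora_delta` is near zero and the surplus is negligible, which matches the from-scratch behavior described above.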

Steps to Reproduce

# Step 1: Start GRPO LoRA training with vllm_enable_lora
swift rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --vllm_enable_lora true \
    --adapters "path/to/sft_checkpoint" \
    --output_dir "path/to/output" \
    ...

# Step 2: Interrupt training at step N (e.g., step 500)

# Step 3: Resume from checkpoint
swift rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --vllm_enable_lora true \
    --resume_from_checkpoint "path/to/output/checkpoint-500" \
    --ref_adapters "path/to/sft_checkpoint" \
    ...

Observed behavior

| Batch | format_reward | pred_extract_failed | Notes |
|---|---|---|---|
| 1st (step 501) | ~1.0 | 0/8 | Correct (uses `_move_full_model_to_vllm`) |
| 2nd (step 502) | 0.0 | 8/8 | Broken (uses `_move_adapter_to_vllm`; LoRA double-counted) |
| 3rd+ | 0.0 | 8/8 | All subsequent batches broken |

Expected behavior

All batches after resume should produce outputs consistent with the checkpoint's quality level.

Workaround

Set --vllm_enable_lora false when using resume_from_checkpoint. This forces _move_full_model_to_vllm() every time, avoiding the double-counting. The trade-off is slower weight synchronization.
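A resume invocation with the workaround applied might look like this (paths are placeholders; other flags as in the reproduction steps):

```shell
# Workaround: disable vLLM LoRA adapter sync when resuming, so every sync
# goes through the full-model merge path (slower, but no double-counting).
swift rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --vllm_enable_lora false \
    --resume_from_checkpoint "path/to/output/checkpoint-500"
```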

Suggested Fix

After _move_full_model_to_vllm() loads the merged weights into vLLM, the subsequent _move_adapter_to_vllm() loads the full LoRA on top of already-merged base weights. A correct implementation should either:

  1. Load unmerged base weights (without LoRA) into vLLM during the first full sync, so that subsequent adapter-only syncs correctly add LoRA once; or
  2. Track the "base LoRA snapshot" and only send the delta in subsequent adapter syncs.
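Option 2 could be sketched as follows. All names here are hypothetical illustrations, not the ms-swift implementation:

```python
# Sketch of fix option 2: remember the LoRA state that was merged during the
# first full sync, and send only the delta on subsequent adapter syncs, so the
# engine ends up with W_base + L0 + (L_t - L0) = W_base + L_t.
# Hypothetical helper, not ms-swift code; scalars stand in for tensors.

class LoRADeltaTracker:
    def __init__(self):
        self.base_snapshot = None  # LoRA state merged during the full sync

    def record_full_sync(self, lora_state: dict) -> None:
        # Called after the full-model sync: snapshot what was merged (L0).
        self.base_snapshot = dict(lora_state)

    def adapter_delta(self, lora_state: dict) -> dict:
        # Called before each adapter sync: subtract the snapshot so the
        # engine only receives the change since the merge (L_t - L0).
        if self.base_snapshot is None:
            return dict(lora_state)
        return {k: v - self.base_snapshot[k] for k, v in lora_state.items()}

tracker = LoRADeltaTracker()
tracker.record_full_sync({"lora_A": 0.5})
print(tracker.adapter_delta({"lora_A": 0.75}))  # {'lora_A': 0.25}
```

Option 1 (loading unmerged base weights during the first full sync) is simpler but changes what the engine serves before the first adapter sync, so the delta approach may be the less invasive of the two.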

How to Reproduce

Result: the first batch after resume is correct; all subsequent batches fail with 100% pred_extract_failed, generating malformed text with none of the expected format tags.

