Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
Environment
- ms-swift version: 3.x (latest main)
- vLLM mode: colocate
- Training: GRPO with LoRA
- Model: Qwen2.5-Omni (multimodal)
Describe the bug
When using --vllm_enable_lora true in colocate mode and resuming GRPO LoRA training via --resume_from_checkpoint, the first rollout batch after resume generates correct outputs, but all subsequent batches produce completely malformed text (losing all format tags), causing pred_extract_failed for 100% of samples.
Root Cause Analysis
The issue is in rollout_mixin.py's weight synchronization logic:
# rollout_mixin.py line 382
if tuner_type == 'full' or (not self.base_sync_done or args.sleep_level == 2) or not self.rollout_enable_lora:
    self._move_full_model_to_vllm()   # first sync: merges LoRA into base, loads merged weights
else:
    self._move_adapter_to_vllm()      # subsequent syncs: loads LoRA as an adapter on top

Phase 1 (base_sync_done=False): _move_full_model_to_vllm() merges the LoRA adapter into the base model weights, loads the merged weights (W_base + LoRA) into vLLM, then unmerges. Sets base_sync_done = True.
Phase 2 (base_sync_done=True): _move_adapter_to_vllm() extracts the current LoRA parameters and adds them as a LoRA adapter to vLLM. vLLM now computes: (W_base + LoRA) + LoRA = W_base + 2×LoRA.
This double-counting of LoRA weights is catastrophic when resuming from a well-trained checkpoint (e.g., 500 steps), where LoRA weights are large. The 2× amplification completely destroys the model's generation capability.
Why this doesn't manifest when training from scratch: When starting fresh, the initial LoRA weights from an SFT adapter are very small, so the double-counting effect is negligible and the model quickly adapts.
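The arithmetic of the failure mode can be sketched with scalar stand-ins for the weight tensors (all values are invented for illustration; real LoRA deltas are matrices):

```python
# Toy scalar illustration of the double-counting described above.
# W_base and lora stand in for full weight tensors; values are made up.

W_base = 1.0   # a base-model weight
lora = 0.5     # a well-trained LoRA delta (large after many steps)

# Phase 1: _move_full_model_to_vllm() loads merged weights into vLLM.
vllm_weight = W_base + lora            # vLLM now holds W_base + LoRA

# Phase 2: _move_adapter_to_vllm() adds the same LoRA again as an adapter.
effective = vllm_weight + lora         # (W_base + LoRA) + LoRA

assert effective == W_base + 2 * lora  # double-counted
print(effective)                       # 2.0, not the intended 1.5
```

With a fresh, near-zero `lora` the same arithmetic gives an effective weight close to the intended one, which is why the bug only surfaces after resuming from a well-trained checkpoint.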
Steps to Reproduce
# Step 1: Start GRPO LoRA training with vllm_enable_lora
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
--adapters "path/to/sft_checkpoint" \
--output_dir "path/to/output" \
...
# Step 2: Interrupt training at step N (e.g., step 500)
# Step 3: Resume from checkpoint
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
--resume_from_checkpoint "path/to/output/checkpoint-500" \
--ref_adapters "path/to/sft_checkpoint" \
...
Observed behavior
| Batch | format_reward | pred_extract_failed | Notes |
|---|---|---|---|
| 1st (step 501) | ~1.0 | 0/8 | Correct (uses _move_full_model_to_vllm) |
| 2nd (step 502) | 0.0 | 8/8 | Broken (uses _move_adapter_to_vllm, LoRA double-counted) |
| 3rd+ | 0.0 | 8/8 | All subsequent batches broken |
Expected behavior
All batches after resume should produce outputs consistent with the checkpoint's quality level.
Workaround
Set --vllm_enable_lora false when using resume_from_checkpoint. This forces _move_full_model_to_vllm() every time, avoiding the double-counting. The trade-off is slower weight synchronization.
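Assuming the same placeholder paths as the reproduce commands above, the resume invocation with the workaround applied would look like:

```shell
# Resume with LoRA adapter loading disabled in vLLM.
# This forces a full (merged) weight sync on every rollout: slower, but correct.
swift rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --vllm_enable_lora false \
    --resume_from_checkpoint "path/to/output/checkpoint-500" \
    --ref_adapters "path/to/sft_checkpoint" \
    ...
```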
Suggested Fix
After _move_full_model_to_vllm() loads the merged weights into vLLM, the subsequent _move_adapter_to_vllm() loads the full LoRA on top of already-merged base weights. A correct implementation should either:
- Load unmerged base weights (without LoRA) into vLLM during the first full sync, so that subsequent adapter-only syncs correctly add LoRA once; or
- Track the "base LoRA snapshot" and only send the delta in subsequent adapter syncs.
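The second option could be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names (`DeltaAdapterSync`, `full_sync`, `adapter_sync`) are hypothetical and do not correspond to the actual ms-swift API, and weights are modeled as plain dicts of floats rather than tensors.

```python
# Hypothetical sketch of option 2: snapshot the LoRA state at the first full
# sync, then send only the delta in subsequent adapter-only syncs.
import copy


class DeltaAdapterSync:
    def __init__(self):
        self.snapshot = None  # LoRA state captured at the first full sync

    def full_sync(self, base_weights, lora_state):
        """First sync: compute merged weights for vLLM, remember the LoRA baseline."""
        merged = {k: base_weights[k] + lora_state.get(k, 0.0) for k in base_weights}
        self.snapshot = copy.deepcopy(lora_state)
        return merged  # what would be loaded into vLLM

    def adapter_sync(self, current_lora_state):
        """Subsequent syncs: send only the change relative to the merged baseline."""
        return {k: current_lora_state[k] - self.snapshot.get(k, 0.0)
                for k in current_lora_state}
```

With this scheme, vLLM's effective weight after an adapter sync is merged + delta = W_base + current LoRA, so the LoRA contribution is counted exactly once even when resuming from a large checkpoint.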
Related
- Introduced in PR [grpo] Optimize vLLM weight synchronization & update builtin accuracy reward #5773
- Documentation at docs/source/Instruction/GRPO/GetStarted/GRPO.md lists limitations for vllm_enable_lora (freeze_vit=false, MoE) but does not mention the resume_from_checkpoint incompatibility
How to Reproduce
(Same commands as in "Steps to Reproduce" above.)
Result: the first batch after resume is correct. All subsequent batches fail with 100% pred_extract_failed: the model generates malformed text without any of the expected format tags.
Additional Information
No response