
Commit 7537fa8

Merge branch 'main' into release/3.11
2 parents 31c9f29 + 57c294e commit 7537fa8

File tree

18 files changed (+84, -25 lines)


docs/source/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -617,6 +617,7 @@ Reward model parameters are used in PPO and GRPO.
- log_entropy: Logs entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics) for details.
- rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (correction disabled). See the [documentation](./GRPO/AdvancedResearch/training_inference_mismatch.md) for details.
- rollout_importance_sampling_threshold: Threshold for importance sampling weights, used to truncate or mask extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, the metrics are logged automatically. Default is False.

##### Reward function parameters
For built-in reward functions, see the [documentation](./GRPO/DeveloperGuide/reward_function.md).

docs/source/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md

Lines changed: 11 additions & 0 deletions
@@ -187,6 +187,16 @@ A larger ESS value (closer to 1) indicates a more uniform importance sampling weight distribution

## Usage

+ ### Logging Diagnostic Metrics Only

+ If you only want to monitor the degree of training-inference mismatch without enabling importance sampling correction, set:

+ ```
+ --log_rollout_offpolicy_metrics true
+ ```

+ This logs all of the diagnostic metrics above (KL, PPL, χ², etc.) without applying any correction to the loss function.

### Enabling Importance Sampling Correction

In GRPO training, enable the correction mechanism with the following parameters:
@@ -196,6 +206,7 @@ A larger ESS value (closer to 1) indicates a more uniform importance sampling weight distribution
--rollout_importance_sampling_threshold (default 2)
```

+ When `rollout_importance_sampling_mode` is set, the diagnostic metrics are logged automatically; there is no need to also set `log_rollout_offpolicy_metrics`.

References
docs/source/Instruction/GRPO/DeveloperGuide/reward_function.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
## Custom Reward Function
The reward function takes the model-generated completions, the other dataset columns, and the trainer state as arguments (kwargs) and produces a score. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) contains information such as the current training step.

+ > Megatron GRPO uses self._step to obtain the current training step.

Note: Columns related to the model input (such as query and response) are converted into the messages key, and the original assistant response in the dataset is discarded; use an extra column if you need to keep it.
See the [documentation](../../../Customization/Custom-dataset.md#query-response格式) for the relevant column names.

docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -258,14 +258,14 @@ swift rlhf \
If the `top_entropy_quantile` parameter is set to a value < 1.0, the entropy threshold value is also recorded:
- entropy/threshold: the entropy value at the quantile; tokens with entropy below this value are excluded from the loss calculation

- Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11)
+ Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11); require setting `log_rollout_offpolicy_metrics=true` or `rollout_importance_sampling_mode`:
- `kl` / `k3_kl`: KL divergence between the training policy and the rollout policy (direct estimator / K3 estimator)
- `training_ppl` / `rollout_ppl`: perplexity of the training policy and the rollout policy
- `log_ppl_diff`: log-PPL difference, reflecting the degree of distribution shift
- `ppl_ratio`: PPL ratio
- `chi2_token` / `chi2_seq`: token-level / sequence-level χ² divergence

- IS correction metrics (requires setting rollout_importance_sampling_mode)
+ IS correction metrics (requires setting `rollout_importance_sampling_mode`):
- `is_weight_mean`: mean importance sampling weight
- `ess`: Effective Sample Size
- `clipped_frac`: fraction of samples that were truncated or masked

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 3 additions & 0 deletions
@@ -386,6 +386,9 @@ Megatron training parameters inherit from the Megatron parameters and the basic parameters (**shared with ms-swift
- delta: Bilateral GRPO upper-bound clipping value from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None.
- importance_sampling_level: Controls the importance sampling ratio calculation. Options are `token` and `sequence`. In `token` mode, the original per-token log-probability ratios are kept; in `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://arxiv.org/abs/2507.18071) uses sequence-level calculation to stabilize training. Default is `token`.
- scale_rewards: Specifies the reward scaling strategy. Options include `group` (scale by within-group standard deviation), `batch` (scale by batch-wide standard deviation), and `none` (no scaling). In ms-swift < 3.10, this parameter is boolean, where `true` corresponds to `group` and `false` corresponds to `none`. The default is bound to `advantage_estimator`: `grpo` corresponds to `group`, `rloo` to `none`, and `reinforce_plus_plus` to `batch`.
+ - rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (correction disabled). See the [documentation](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md) for details.
+ - rollout_importance_sampling_threshold: Threshold for importance sampling weights, used to truncate or mask extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, the metrics are logged automatically. Default is False.

For built-in reward function parameters, see the [documentation](../Instruction/Command-line-parameters.md#奖励函数参数).

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -632,6 +632,7 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (disabled). For details, refer to the [documentation](./GRPO/AdvancedResearch/training_inference_mismatch.md).
- rollout_importance_sampling_threshold: Threshold for importance sampling weights, used for truncating or masking extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, metrics are always logged. Default is False.

##### Reward function parameters

docs/source_en/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md

Lines changed: 11 additions & 0 deletions
@@ -187,6 +187,16 @@ A larger ESS value (closer to 1) indicates more uniform importance sampling weig

## Usage

+ ### Logging Diagnostic Metrics Only

+ If you only want to monitor the degree of training-inference mismatch without enabling importance sampling correction, you can set:

+ ```
+ --log_rollout_offpolicy_metrics true
+ ```

+ This will log all diagnostic metrics (KL, PPL, χ², etc.) without modifying the loss function.

### Enabling Importance Sampling Correction

Enable the correction mechanism with the following parameters:
@@ -196,6 +206,7 @@ Enable the correction mechanism with the following parameters:
--rollout_importance_sampling_threshold (default 2)
```

+ When `rollout_importance_sampling_mode` is set, diagnostic metrics are automatically logged without needing to set `log_rollout_offpolicy_metrics`.

## References
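The four correction modes above differ only in whether the importance weight is computed per token or per sequence, and in whether extreme weights are clipped to the threshold or masked out. The following is a minimal sketch of that idea, not ms-swift's implementation; the function name, tensor shapes, and the use of the mean per-token log ratio as the sequence-level weight are assumptions for illustration.

```python
# Illustrative sketch only -- not ms-swift's implementation. Assumes `log_ratio`
# holds the per-token log-prob difference log(pi_train / pi_rollout) with shape
# (batch, seq_len), and that the sequence-level weight uses the mean per-token log ratio.
import torch


def apply_rollout_is_correction(log_ratio: torch.Tensor, mode: str,
                                threshold: float = 2.0) -> torch.Tensor:
    token_w = log_ratio.exp()                            # per-token importance weights
    if mode == "token_truncate":
        return token_w.clamp(max=threshold)              # cap extreme token weights
    if mode == "token_mask":
        return token_w * (token_w <= threshold)          # zero out extreme token weights
    seq_w = log_ratio.mean(dim=-1, keepdim=True).exp()   # one weight per sequence
    if mode == "sequence_truncate":
        return seq_w.clamp(max=threshold).expand_as(token_w)
    if mode == "sequence_mask":
        return (seq_w * (seq_w <= threshold)).expand_as(token_w)
    raise ValueError(f"unknown rollout_importance_sampling_mode: {mode}")
```

In actual training the returned weights would multiply the per-token policy-gradient loss; the sketch shows only the weighting step.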

docs/source_en/Instruction/GRPO/DeveloperGuide/reward_function.md

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
# Reward Function
## Custom Reward Function
- The reward function takes as arguments (via kwargs) the model-generated completions, other columns from the dataset, and the training state, and calculates a reward score. The [trainer state]() includes information such as the current training step.
+ The reward function takes as arguments (via kwargs) the model-generated completions, other columns from the dataset, and the training state, and calculates a reward score. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) includes information such as the current training step.

+ > If you are using the Megatron backend, use self._step to get the current training step.

Note: The columns related to model input (such as query and response) are converted to the messages key. The original assistant response in the dataset will be discarded, so please use extra columns if you wish to retain it.
The relevant column names for processing can be found in the [document](../../../Customization/Custom-dataset.md#Query-Response)
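To make the signature described in this diff concrete, here is a minimal sketch of a custom reward function; the `trainer_state` kwarg name and the length-based scoring rule are illustrative assumptions, not part of this commit.

```python
# Minimal sketch of a custom reward function, following the signature described
# above: completions plus extra dataset columns and the trainer state via kwargs.
# The `trainer_state` kwarg name and the scoring rule are illustrative assumptions.
from typing import List


def length_reward(completions: List[str], **kwargs) -> List[float]:
    trainer_state = kwargs.get("trainer_state")
    step = getattr(trainer_state, "global_step", 0)  # Megatron GRPO exposes self._step instead

    # Toy rule: reward completions that stay under a length budget, and
    # tighten the budget slightly as training progresses.
    budget = max(256, 1024 - step)
    return [1.0 if len(c) <= budget else 0.0 for c in completions]
```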

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -256,14 +256,14 @@ If the `log_entropy` parameter is set, additional entropy-related metrics will b
If `top_entropy_quantile` is set to a value smaller than 1.0, the entropy threshold value will also be recorded:
- entropy/threshold: Tokens with entropy below this value will be excluded from the loss calculation.

- Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11):
+ Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11), require setting `log_rollout_offpolicy_metrics=true` or `rollout_importance_sampling_mode`:
- `kl` / `k3_kl`: KL divergence between training policy and rollout policy (direct estimator / K3 estimator)
- `training_ppl` / `rollout_ppl`: Perplexity of training policy and rollout policy
- `log_ppl_diff`: Log PPL difference, reflects the degree of distribution shift
- `ppl_ratio`: PPL ratio
- `chi2_token` / `chi2_seq`: Token/Sequence-level χ² divergence

- IS correction metrics (requires setting rollout_importance_sampling_mode):
+ IS correction metrics (requires setting `rollout_importance_sampling_mode`):
- `is_weight_mean`: Average importance sampling weight
- `ess`: Effective Sample Size
- `clipped_frac`: Fraction of samples that were truncated or masked
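As a rough illustration of how these diagnostics relate to one another, the sketch below computes several of them from per-token log-probabilities; it is not ms-swift's code, and the KL direction, reduction order, and function name are assumptions for illustration only.

```python
# Rough sketch of the diagnostics listed above -- not ms-swift's code. Assumes
# samples come from the rollout policy and KL is measured as KL(rollout || train);
# both inputs are per-token log-probs over valid tokens, shape (num_tokens,).
import torch


def offpolicy_diagnostics(train_logps: torch.Tensor, rollout_logps: torch.Tensor) -> dict:
    log_r = train_logps - rollout_logps                 # log(pi_train / pi_rollout)
    kl = (-log_r).mean()                                # direct estimator of KL(rollout || train)
    k3_kl = (log_r.exp() - 1.0 - log_r).mean()          # K3 estimator (non-negative)
    training_ppl = (-train_logps.mean()).exp()
    rollout_ppl = (-rollout_logps.mean()).exp()
    return {
        "kl": kl.item(),
        "k3_kl": k3_kl.item(),
        "training_ppl": training_ppl.item(),
        "rollout_ppl": rollout_ppl.item(),
        "log_ppl_diff": (training_ppl.log() - rollout_ppl.log()).item(),
        "ppl_ratio": (training_ppl / rollout_ppl).item(),
        "chi2_token": (log_r.exp().pow(2).mean() - 1.0).item(),  # chi-square divergence estimate
    }
```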

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 3 additions & 0 deletions
@@ -410,6 +410,9 @@ In addition to inheriting the training parameters, the following parameters are
- delta: Bilateral GRPO upper bound clipping value from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None.
- importance_sampling_level: Controls importance sampling ratio calculation. Options are `token` and `sequence`. In `token` mode, the original log probability ratio for each token is preserved. In `sequence` mode, the log probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://arxiv.org/abs/2507.18071) uses sequence-level calculation to stabilize training. Default is `token`.
- scale_rewards: Specifies the reward scaling strategy. Options include `group` (scale by within-group standard deviation), `batch` (scale by batch-wide standard deviation), and `none` (no scaling). In ms-swift < 3.10, this parameter is boolean, where `true` corresponds to `group` and `false` corresponds to `none`. The default value is bound to `advantage_estimator`: `grpo` corresponds to `group`, `rloo` corresponds to `none`, and `reinforce_plus_plus` corresponds to `batch`.
+ - rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (disabled). For details, refer to the [documentation](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md).
+ - rollout_importance_sampling_threshold: Threshold for importance sampling weights, used for truncating or masking extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, metrics are always logged. Default is False.

For built-in reward function parameters, refer to the [documentation](../Instruction/Command-line-parameters.md#reward-function-parameters).
