
Commit 7537fa8

Merge branch 'main' into release/3.11
2 parents 31c9f29 + 57c294e commit 7537fa8

File tree

18 files changed (+84, -25 lines)


docs/source/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -617,6 +617,7 @@ Reward model parameters are used in PPO and GRPO.
- log_entropy: Logs entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics) for details.
- rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (correction disabled). See the [documentation](./GRPO/AdvancedResearch/training_inference_mismatch.md) for details.
- rollout_importance_sampling_threshold: Threshold for importance sampling weights, used to truncate or mask extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, the metrics are logged automatically. Default is False.

##### Reward function parameters
For built-in reward functions, see the [documentation](./GRPO/DeveloperGuide/reward_function.md).

docs/source/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md

Lines changed: 11 additions & 0 deletions
@@ -187,6 +187,16 @@ A larger ESS value (closer to 1) indicates a more uniform importance sampling weight distribution

## Usage

+ ### Logging Diagnostic Metrics Only

+ If you only want to monitor the degree of training-inference mismatch without enabling importance sampling correction, set:

+ ```
+ --log_rollout_offpolicy_metrics true
+ ```

+ This logs all of the diagnostic metrics above (KL, PPL, χ², etc.) without applying any correction to the loss function.

### Enabling Importance Sampling Correction

In GRPO training, enable the correction mechanism with the following parameters:
@@ -196,6 +206,7 @@ A larger ESS value (closer to 1) indicates a more uniform importance sampling weight distribution
--rollout_importance_sampling_threshold (default 2)
```

+ When `rollout_importance_sampling_mode` is set, the diagnostic metrics are logged automatically; there is no need to also set `log_rollout_offpolicy_metrics`.

References
docs/source/Instruction/GRPO/DeveloperGuide/reward_function.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
## Custom Reward Function
The reward function takes the model-generated completions, the other dataset columns, and the trainer state as arguments (kwargs) and produces a score. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) contains information such as the current training step.

+ > Megatron GRPO uses self._step to obtain the current training step.

Note: Columns related to the model input (such as query and response) are converted into the messages key, and the original assistant response in the dataset is discarded; use an extra column if you need to keep it.
See the [documentation](../../../Customization/Custom-dataset.md#query-response格式) for the relevant column names.

docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -258,14 +258,14 @@ swift rlhf \
If the `top_entropy_quantile` parameter is set to a value < 1.0, the entropy threshold value is also recorded:
- entropy/threshold: the entropy value at the quantile; tokens with entropy below this value are excluded from the loss calculation

- Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11)
+ Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11); require setting `log_rollout_offpolicy_metrics=true` or `rollout_importance_sampling_mode`:
- `kl` / `k3_kl`: KL divergence between the training policy and the rollout policy (direct estimator / K3 estimator)
- `training_ppl` / `rollout_ppl`: perplexity of the training policy and the rollout policy
- `log_ppl_diff`: log-PPL difference, reflecting the degree of distribution shift
- `ppl_ratio`: PPL ratio
- `chi2_token` / `chi2_seq`: token-level / sequence-level χ² divergence

- IS correction metrics (requires setting rollout_importance_sampling_mode)
+ IS correction metrics (requires setting `rollout_importance_sampling_mode`):
- `is_weight_mean`: mean importance sampling weight
- `ess`: Effective Sample Size
- `clipped_frac`: fraction of samples that were truncated or masked

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 3 additions & 0 deletions
@@ -386,6 +386,9 @@ Megatron training parameters inherit from the Megatron parameters and the basic parameters (**shared with ms-swift
- delta: Bilateral GRPO upper-bound clipping value from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None.
- importance_sampling_level: Controls the importance sampling ratio calculation. Options are `token` and `sequence`. In `token` mode, the original per-token log-probability ratios are kept; in `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://arxiv.org/abs/2507.18071) uses sequence-level calculation to stabilize training. Default is `token`.
- scale_rewards: Specifies the reward scaling strategy. Options include `group` (scale by within-group standard deviation), `batch` (scale by batch-wide standard deviation), and `none` (no scaling). In ms-swift < 3.10, this parameter is boolean, where `true` corresponds to `group` and `false` corresponds to `none`. The default is bound to `advantage_estimator`: `grpo` corresponds to `group`, `rloo` to `none`, and `reinforce_plus_plus` to `batch`.
+ - rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (correction disabled). See the [documentation](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md) for details.
+ - rollout_importance_sampling_threshold: Threshold for importance sampling weights, used to truncate or mask extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, the metrics are logged automatically. Default is False.

For built-in reward function parameters, see the [documentation](../Instruction/Command-line-parameters.md#奖励函数参数).

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -632,6 +632,7 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (disabled). For details, refer to the [documentation](./GRPO/AdvancedResearch/training_inference_mismatch.md).
- rollout_importance_sampling_threshold: Threshold for importance sampling weights, used for truncating or masking extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, metrics are always logged. Default is False.

##### Reward function parameters

docs/source_en/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md

Lines changed: 11 additions & 0 deletions
@@ -187,6 +187,16 @@ A larger ESS value (closer to 1) indicates more uniform importance sampling weig

## Usage

+ ### Logging Diagnostic Metrics Only

+ If you only want to monitor the degree of training-inference mismatch without enabling importance sampling correction, you can set:

+ ```
+ --log_rollout_offpolicy_metrics true
+ ```

+ This will log all diagnostic metrics (KL, PPL, χ², etc.) without modifying the loss function.

### Enabling Importance Sampling Correction

Enable the correction mechanism with the following parameters:
@@ -196,6 +206,7 @@ Enable the correction mechanism with the following parameters:
--rollout_importance_sampling_threshold (default 2)
```

+ When `rollout_importance_sampling_mode` is set, diagnostic metrics are automatically logged without needing to set `log_rollout_offpolicy_metrics`.

## References
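The four correction modes above differ only in whether the importance weight is computed per token or per sequence, and in whether extreme weights are clipped to the threshold or masked out. The following is a minimal sketch of that idea, not ms-swift's implementation; the function name, tensor shapes, and the use of the mean per-token log ratio as the sequence-level weight are assumptions for illustration.

```python
# Illustrative sketch only -- not ms-swift's implementation. Assumes `log_ratio`
# holds the per-token log-prob difference log(pi_train / pi_rollout) with shape
# (batch, seq_len), and that the sequence-level weight uses the mean per-token log ratio.
import torch


def apply_rollout_is_correction(log_ratio: torch.Tensor, mode: str,
                                threshold: float = 2.0) -> torch.Tensor:
    token_w = log_ratio.exp()                            # per-token importance weights
    if mode == "token_truncate":
        return token_w.clamp(max=threshold)              # cap extreme token weights
    if mode == "token_mask":
        return token_w * (token_w <= threshold)          # zero out extreme token weights
    seq_w = log_ratio.mean(dim=-1, keepdim=True).exp()   # one weight per sequence
    if mode == "sequence_truncate":
        return seq_w.clamp(max=threshold).expand_as(token_w)
    if mode == "sequence_mask":
        return (seq_w * (seq_w <= threshold)).expand_as(token_w)
    raise ValueError(f"unknown rollout_importance_sampling_mode: {mode}")
```

In actual training the returned weights would multiply the per-token policy-gradient loss; the sketch shows only the weighting step.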

docs/source_en/Instruction/GRPO/DeveloperGuide/reward_function.md

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
# Reward Function
## Custom Reward Function
- The reward function takes as arguments (via kwargs) the model-generated completions, other columns from the dataset, and the training state, and calculates a reward score. The [trainer state]() includes information such as the current training step.
+ The reward function takes as arguments (via kwargs) the model-generated completions, other columns from the dataset, and the training state, and calculates a reward score. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) includes information such as the current training step.

+ > If you are using the Megatron backend, use self._step to get the current training step.

Note: The columns related to model input (such as query and response) are converted to the messages key. The original assistant response in the dataset will be discarded, so please use extra columns if you wish to retain it.
The relevant column names for processing can be found in the [document](../../../Customization/Custom-dataset.md#Query-Response)
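To make the signature described in this diff concrete, here is a minimal sketch of a custom reward function; the `trainer_state` kwarg name and the length-based scoring rule are illustrative assumptions, not part of this commit.

```python
# Minimal sketch of a custom reward function, following the signature described
# above: completions plus extra dataset columns and the trainer state via kwargs.
# The `trainer_state` kwarg name and the scoring rule are illustrative assumptions.
from typing import List


def length_reward(completions: List[str], **kwargs) -> List[float]:
    trainer_state = kwargs.get("trainer_state")
    step = getattr(trainer_state, "global_step", 0)  # Megatron GRPO exposes self._step instead

    # Toy rule: reward completions that stay under a length budget, and
    # tighten the budget slightly as training progresses.
    budget = max(256, 1024 - step)
    return [1.0 if len(c) <= budget else 0.0 for c in completions]
```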

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -256,14 +256,14 @@ If the `log_entropy` parameter is set, additional entropy-related metrics will b
If `top_entropy_quantile` is set to a value smaller than 1.0, the entropy threshold value will also be recorded:
- entropy/threshold: Tokens with entropy below this value will be excluded from the loss calculation.

- Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11):
+ Training-inference consistency metrics, prefixed with rollout_correction (ms-swift>=3.11), require setting `log_rollout_offpolicy_metrics=true` or `rollout_importance_sampling_mode`:
- `kl` / `k3_kl`: KL divergence between training policy and rollout policy (direct estimator / K3 estimator)
- `training_ppl` / `rollout_ppl`: Perplexity of training policy and rollout policy
- `log_ppl_diff`: Log PPL difference, reflects the degree of distribution shift
- `ppl_ratio`: PPL ratio
- `chi2_token` / `chi2_seq`: Token/Sequence-level χ² divergence

- IS correction metrics (requires setting rollout_importance_sampling_mode):
+ IS correction metrics (requires setting `rollout_importance_sampling_mode`):
- `is_weight_mean`: Average importance sampling weight
- `ess`: Effective Sample Size
- `clipped_frac`: Fraction of samples that were truncated or masked
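As a rough illustration of how these diagnostics relate to one another, the sketch below computes several of them from per-token log-probabilities; it is not ms-swift's code, and the KL direction, reduction order, and function name are assumptions for illustration only.

```python
# Rough sketch of the diagnostics listed above -- not ms-swift's code. Assumes
# samples come from the rollout policy and KL is measured as KL(rollout || train);
# both inputs are per-token log-probs over valid tokens, shape (num_tokens,).
import torch


def offpolicy_diagnostics(train_logps: torch.Tensor, rollout_logps: torch.Tensor) -> dict:
    log_r = train_logps - rollout_logps                 # log(pi_train / pi_rollout)
    kl = (-log_r).mean()                                # direct estimator of KL(rollout || train)
    k3_kl = (log_r.exp() - 1.0 - log_r).mean()          # K3 estimator (non-negative)
    training_ppl = (-train_logps.mean()).exp()
    rollout_ppl = (-rollout_logps.mean()).exp()
    return {
        "kl": kl.item(),
        "k3_kl": k3_kl.item(),
        "training_ppl": training_ppl.item(),
        "rollout_ppl": rollout_ppl.item(),
        "log_ppl_diff": (training_ppl.log() - rollout_ppl.log()).item(),
        "ppl_ratio": (training_ppl / rollout_ppl).item(),
        "chi2_token": (log_r.exp().pow(2).mean() - 1.0).item(),  # chi-square divergence estimate
    }
```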

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 3 additions & 0 deletions
@@ -410,6 +410,9 @@ In addition to inheriting the training parameters, the following parameters are
- delta: Bilateral GRPO upper bound clipping value from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None.
- importance_sampling_level: Controls importance sampling ratio calculation. Options are `token` and `sequence`. In `token` mode, the original log probability ratio for each token is preserved. In `sequence` mode, the log probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://arxiv.org/abs/2507.18071) uses sequence-level calculation to stabilize training. Default is `token`.
- scale_rewards: Specifies the reward scaling strategy. Options include `group` (scale by within-group standard deviation), `batch` (scale by batch-wide standard deviation), and `none` (no scaling). In ms-swift < 3.10, this parameter is boolean, where `true` corresponds to `group` and `false` corresponds to `none`. The default value is bound to `advantage_estimator`: `grpo` corresponds to `group`, `rloo` corresponds to `none`, and `reinforce_plus_plus` corresponds to `batch`.
+ - rollout_importance_sampling_mode: Training-inference mismatch correction mode. Options are `token_truncate`, `token_mask`, `sequence_truncate`, `sequence_mask`. Default is None (disabled). For details, refer to the [documentation](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md).
+ - rollout_importance_sampling_threshold: Threshold for importance sampling weights, used for truncating or masking extreme weights. Default is 2.0.
+ - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, metrics are always logged. Default is False.

For built-in reward function parameters, refer to the [documentation](../Instruction/Command-line-parameters.md#reward-function-parameters).
