Cannot reproduce X-R1-3B accuracy on the Math500 dataset

Hi, when running full fine-tuning of Qwen2.5-3B on the X-R1-750 data, how were per_device_train_batch_size and num_generations set? Were any additional tricks used?

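With my training parameters (per_device_train_batch_size=1, num_generations=4, num_processes=4), the loss or KL divergence spikes late in training, and my results on Math500 stay at acc_0.218, format_0.414. By comparison, the open-source X-R1-3B model on Hugging Face reaches acc_0.346, format_0.886.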

![Image](https://github.com/user-attachments/assets/d31996fc-4b14-4d02-8c92-b4be9e14fb21)
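
For reference, here is a rough sketch of how I understand the batch arithmetic for these settings. It assumes that per_device_train_batch_size counts completions rather than prompts, and that each prompt contributes num_generations completions spread across the processes; this may not match what X-R1 or the underlying trainer actually does. The helper `grpo_batch_summary` and its `gradient_accumulation_steps` argument are hypothetical names used only for illustration.

```python
def grpo_batch_summary(per_device_train_batch_size, num_processes,
                       num_generations, gradient_accumulation_steps=1):
    """Estimate how many unique prompts are seen per step.

    Assumption (not confirmed by the X-R1 repo): the per-device batch size
    counts completions, and each prompt yields `num_generations` completions.
    """
    # Completions processed across all devices in one forward/backward pass.
    completions_per_step = per_device_train_batch_size * num_processes
    if completions_per_step % num_generations != 0:
        raise ValueError(
            "completions per step must be divisible by num_generations"
        )
    # Unique prompts behind those completions.
    prompts_per_step = completions_per_step // num_generations
    return {
        "completions_per_step": completions_per_step,
        "prompts_per_step": prompts_per_step,
        # Prompts contributing to a single optimizer update.
        "prompts_per_update": prompts_per_step * gradient_accumulation_steps,
    }


# The settings from this issue.
summary = grpo_batch_summary(
    per_device_train_batch_size=1, num_processes=4, num_generations=4
)
print(summary)
```

If that assumption holds, my settings give only one prompt (one group of four completions) per step unless gradient accumulation is raised, which may or may not be related to the late-training spikes.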