Fix resume bug #300

Merged
youliangtan merged 2 commits into NVIDIA:main from Choiyoonji:fix_resume on Aug 7, 2025

@Choiyoonji (Contributor)

Fixes #229

Changes proposed in this pull request:

  • This PR addresses an _pickle.UnpicklingError that occurs when resuming training using weights_only=True with checkpoints saved in PyTorch 2.1+.

  • Specifically, the following code was added right after import torch and import numpy as np in /gr00t/experiment/trainer.py:

    torch.serialization.add_safe_globals([
        np.core.multiarray._reconstruct,
        np.ndarray,
        np.dtype,
        np.dtypes.UInt32DType
    ])
  • This change allows PyTorch to safely unpickle RNG state and resume training by explicitly allowlisting required NumPy globals.

  • Example error message:

    _pickle.UnpicklingError: Weights only load failed...
    WeightsUnpickler error: Unsupported global: GLOBAL numpy.dtypes.UInt32DType was not an allowed global by default.
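
The allowlist above reaches into `np.core` and `np.dtypes`, and where those live depends on the NumPy release: `np.dtypes` exists only in newer versions, and NumPy 2.x moved `np.core` to `np._core`. The following is a minimal sketch of a more defensive lookup, not the code merged in this PR. The helper name `collect_numpy_safe_globals` is hypothetical, and it takes the NumPy module as a parameter so that missing attributes are skipped instead of raising.

```python
def collect_numpy_safe_globals(np_module, dtype_names=("UInt32DType",)):
    """Gather the NumPy objects a weights_only load may need to unpickle.

    Hypothetical helper (not part of the PR): attributes missing from the
    given NumPy build are skipped rather than raising AttributeError.
    """
    found = [np_module.ndarray, np_module.dtype]

    # NumPy 2.x exposes the array reconstructor under `_core`; older
    # releases use `core`. Try the newer location first.
    for core_name in ("_core", "core"):
        core = getattr(np_module, core_name, None)
        multiarray = getattr(core, "multiarray", None)
        reconstruct = getattr(multiarray, "_reconstruct", None)
        if reconstruct is not None:
            found.append(reconstruct)
            break

    # Concrete dtype classes live in `np.dtypes` where that module exists.
    dtypes = getattr(np_module, "dtypes", None)
    for name in dtype_names:
        dtype_cls = getattr(dtypes, name, None)
        if dtype_cls is not None:
            found.append(dtype_cls)
    return found
```

With the real modules imported, the result would be passed to `torch.serialization.add_safe_globals(...)` before the checkpoint is loaded, in the same place as the snippet in this PR.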
    

Error Reproduction & Test Case

  • How to reproduce:

    1. Start training and interrupt it manually (e.g., using Ctrl+C) to generate a checkpoint.

    2. Attempt to resume training using the --resume flag.

    3. You will encounter the following error:

      _pickle.UnpicklingError: Weights only load failed...
      
  • Fix validation:

    • After applying the changes in this PR, resuming training using the same checkpoint works correctly without errors.
    • If similar errors occur for other NumPy types (e.g., Float64DType, Int64DType), they can be added to the allowlist in the same way.
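
When a different type is rejected, the error message itself names it. As a rough sketch, the dotted name can be pulled out of the message and resolved to an object, which can then be reviewed and added to the allowlist by hand. The function names below are hypothetical, and the pattern assumes the message wording quoted in this PR, which may differ between PyTorch versions.

```python
import importlib
import re

# Matches the wording of the error quoted in this PR (an assumption;
# other PyTorch versions may phrase it differently).
_UNSUPPORTED = re.compile(
    r"Unsupported global: GLOBAL (\S+) was not an allowed global"
)


def unsupported_global_name(message):
    """Return the dotted name reported in a weights_only load error, or None."""
    match = _UNSUPPORTED.search(message)
    return match.group(1) if match else None


def resolve_dotted_name(dotted):
    """Import the longest importable module prefix, then walk the attributes."""
    parts = dotted.split(".")
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ImportError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {dotted!r}")
```

Resolving a name this way only identifies the object. Whether it is safe to allowlist is still a manual decision, since allowlisting arbitrary globals would defeat the purpose of `weights_only=True`.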

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

Signed-off-by: choiyj <cyj21c6352@gmail.com>
@youliangtan youliangtan merged commit ae7d46f into NVIDIA:main Aug 7, 2025
3 checks passed
@youliangtan (Member)

@Choiyoonji Thanks for the contribution!

ddebenedittis pushed a commit to Borg-Robotics/Isaac-GR00T that referenced this pull request Oct 7, 2025
* Add safe globals

Signed-off-by: choiyj <cyj21c6352@gmail.com>

* Move numpy allowlist to DualBrainTrainer __init__

---------

Signed-off-by: choiyj <cyj21c6352@gmail.com>

Development

Successfully merging this pull request may close these issues.

resume failed with flag set when try to train from latest checkpoint