Fix resume bug by Choiyoonji · Pull Request #300 · NVIDIA/Isaac-GR00T

Choiyoonji · 2025-08-06T11:06:11Z

Fixes #229

Changes proposed in this pull request:

This PR addresses a `_pickle.UnpicklingError` that occurs when resuming training with `weights_only=True` from checkpoints saved in PyTorch 2.1+.

Specifically, the following code was added right after `import torch` and `import numpy as np` in `/gr00t/experiment/trainer.py`:

```
torch.serialization.add_safe_globals([
    np.core.multiarray._reconstruct,
    np.ndarray,
    np.dtype,
    np.dtypes.UInt32DType
])
```

This change allows PyTorch to safely unpickle RNG state and resume training by explicitly allowlisting required NumPy globals.
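
The allowlisting idea can be illustrated with the standard library alone. The sketch below is not PyTorch's implementation; it is a minimal analogy built on the `find_class` hook of `pickle.Unpickler`, and the names `AllowlistUnpickler` and `restricted_loads` are hypothetical. It shows why a global that is not on the list fails to load, and why adding it to the list makes the same payload load.

```python
import io
import pickle
from fractions import Fraction


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only globals named in an explicit allowlist.

    Hypothetical helper for illustration; PyTorch's weights-only unpickler
    is a separate implementation with its own default allowlist.
    """

    def __init__(self, file, allowed):
        super().__init__(file)
        # Map "module.qualname" to the object it is allowed to resolve to.
        self._allowed = {
            f"{obj.__module__}.{obj.__qualname__}": obj for obj in allowed
        }

    def find_class(self, module, name):
        key = f"{module}.{name}"
        if key not in self._allowed:
            # Reject anything that was not explicitly allowlisted.
            raise pickle.UnpicklingError(f"Unsupported global: {key}")
        return self._allowed[key]


def restricted_loads(data, allowed=()):
    """Unpickle `data`, permitting only the globals listed in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(Fraction(1, 3))

try:
    restricted_loads(payload)  # Fraction is not allowlisted yet
except pickle.UnpicklingError as err:
    print("rejected:", err)

# Allowlisting the class lets the same payload load.
print(restricted_loads(payload, allowed=[Fraction]))
```

Plain containers and numbers need no global lookup, so they load even with an empty allowlist; only objects that reference a class or function by name go through the check.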

Example error message:

```
_pickle.UnpicklingError: Weights only load failed...
WeightsUnpickler error: Unsupported global: GLOBAL numpy.dtypes.UInt32DType was not an allowed global by default.
```

Error Reproduction & Test Case

How to reproduce:
1. Start training and interrupt it manually (e.g., using Ctrl+C) to generate a checkpoint.
2. Attempt to resume training using the --resume flag.
3. You will encounter the following error:
```
_pickle.UnpicklingError: Weights only load failed...
```
Fix validation:
- After applying the changes in this PR, resuming training using the same checkpoint works correctly without errors.
- If similar errors occur for other NumPy types (e.g., Float64DType, Int64DType), they can be added to the allowlist in the same way.
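
When a similar error appears for another type, the name to allowlist can be read straight from the message. The helper below is a hypothetical convenience, not part of this PR or of PyTorch. Its pattern is taken from the error text quoted in this description, so it is an assumption that other PyTorch versions word the message the same way.

```python
import re

# Pattern based on the error text quoted in this PR description.
# Assumption: the wording may differ in other PyTorch versions.
_UNSUPPORTED_GLOBAL = re.compile(
    r"Unsupported global: GLOBAL (\S+) was not an allowed global"
)


def missing_global(message):
    """Return the dotted name of the rejected global, or None if absent."""
    match = _UNSUPPORTED_GLOBAL.search(message)
    return match.group(1) if match else None


example = (
    "WeightsUnpickler error: Unsupported global: GLOBAL "
    "numpy.dtypes.UInt32DType was not an allowed global by default."
)
print(missing_global(example))  # prints numpy.dtypes.UInt32DType
```

The returned name corresponds to the entry that would be appended to the list passed to `torch.serialization.add_safe_globals`, for example `np.dtypes.UInt32DType` in the case fixed here.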

Reference

[PyTorch Documentation – torch.load (Security Section)](https://pytorch.org/docs/stable/generated/torch.load.html#security)

Before submitting

I've read and followed all steps in the Making a pull request section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

Signed-off-by: choiyj <cyj21c6352@gmail.com>

gr00t/experiment/trainer.py

youliangtan · 2025-08-07T07:24:29Z

@Choiyoonji Thanks for the contribution!

* Add safe globals

Signed-off-by: choiyj <cyj21c6352@gmail.com>

* Move numpy allowlist to DualBrainTrainer __init__

---------

Signed-off-by: choiyj <cyj21c6352@gmail.com>

Add safe globals

0c8e96e

Signed-off-by: choiyj <cyj21c6352@gmail.com>

youliangtan reviewed Aug 6, 2025

Move numpy allowlist to DualBrainTrainer __init__

a5b640c

youliangtan merged commit ae7d46f into NVIDIA:main Aug 7, 2025
3 checks passed

youliangtan merged 2 commits into NVIDIA:main from Choiyoonji:fix_resume
