[BugFix] Fix queue checkpoint resume with lazy-initialized scalar buf…#401

Open
mohsinm-dev wants to merge 1 commit into galilai-group:main from mohsinm-dev:fix/queue-resume-scalar-buffers

Conversation


@mohsinm-dev mohsinm-dev commented Mar 17, 2026

Description

  • When a queue (OrderedQueue/UnsortedQueue) is created with shape=None, the out buffer starts as (max_length, 1). After the first append() with scalar labels (B,), it becomes (max_length,). On checkpoint resume, a fresh queue still holds the (max_length, 1) placeholder, so load_state_dict fails with a shape mismatch.
  • The existing OrderedQueue.load_state_dict override was dead code during Lightning resume, because PyTorch's recursive loader calls _load_from_state_dict on child modules rather than their load_state_dict override.
  • Added _load_from_state_dict to both queue classes, and load_state_dict to UnsortedQueue for parity.
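A minimal sketch of the failure mode and the override-based fix. The Queue class below is a toy stand-in (the actual queue implementation is not shown in this PR description); only the buffer name `out`, the `(max_length, 1)` placeholder, and the lazy collapse to `(max_length,)` are taken from the description:

```python
import torch
from torch import nn


class Queue(nn.Module):
    """Toy stand-in for a queue whose buffer shape is inferred lazily."""

    def __init__(self, max_length: int):
        super().__init__()
        # shape=None placeholder: the buffer starts as (max_length, 1)
        self.register_buffer("out", torch.zeros(max_length, 1))

    def append(self, labels: torch.Tensor):
        # First append with scalar labels (B,) re-binds the buffer
        # to (max_length,), mimicking the lazy shape inference.
        if labels.dim() == 1:
            self.out = torch.zeros(self.out.shape[0])

    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Re-create the placeholder with the checkpointed shape before the
        # default copy, so resume does not fail on a shape mismatch.
        key = prefix + "out"
        if key in state_dict and state_dict[key].shape != self.out.shape:
            self.out = torch.zeros_like(state_dict[key])
        super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)


# Trained queue: the buffer has collapsed to (max_length,)
trained = Queue(8)
trained.append(torch.ones(4))
ckpt = trained.state_dict()

# A fresh queue on resume still holds the (max_length, 1) placeholder;
# without the override, load_state_dict would raise a size-mismatch error.
fresh = Queue(8)
fresh.load_state_dict(ckpt)
print(fresh.out.shape)  # torch.Size([8])
```

Hooking `_load_from_state_dict` (rather than `load_state_dict`) matters because it is the method PyTorch actually invokes per-module during a recursive load, as the next point explains.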

Fixes #400


Development

Successfully merging this pull request may close these issues.

OrderedQueue / OnlineQueue resume can fail on scalar-label buffers when shape is inferred lazily

1 participant