[BugFix] Fix queue checkpoint resume with lazy-initialized scalar buf…#401

Open
mohsinm-dev wants to merge 1 commit into galilai-group:main from mohsinm-dev:fix/queue-resume-scalar-buffers

Conversation


@mohsinm-dev mohsinm-dev commented Mar 17, 2026

Description

  • When a queue (OrderedQueue/UnsortedQueue) is created with shape=None, the out buffer starts as (max_length, 1). After the first append() with scalar labels (B,), it becomes (max_length,). On checkpoint resume, a fresh queue still holds the (max_length, 1) placeholder, so load_state_dict fails with a shape mismatch.
  • The existing OrderedQueue.load_state_dict override was dead code during Lightning resume, because PyTorch's recursive loader calls _load_from_state_dict on child modules rather than their load_state_dict override.
  • Added _load_from_state_dict to both queue classes, and load_state_dict to UnsortedQueue for parity.
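A minimal sketch of the failure mode and the override-based fix. The Queue class below is a toy stand-in (the actual queue implementation is not shown in this PR description); only the buffer name `out`, the `(max_length, 1)` placeholder, and the lazy collapse to `(max_length,)` are taken from the description:

```python
import torch
from torch import nn


class Queue(nn.Module):
    """Toy stand-in for a queue whose buffer shape is inferred lazily."""

    def __init__(self, max_length: int):
        super().__init__()
        # shape=None placeholder: the buffer starts as (max_length, 1)
        self.register_buffer("out", torch.zeros(max_length, 1))

    def append(self, labels: torch.Tensor):
        # First append with scalar labels (B,) re-binds the buffer
        # to (max_length,), mimicking the lazy shape inference.
        if labels.dim() == 1:
            self.out = torch.zeros(self.out.shape[0])

    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Re-create the placeholder with the checkpointed shape before the
        # default copy, so resume does not fail on a shape mismatch.
        key = prefix + "out"
        if key in state_dict and state_dict[key].shape != self.out.shape:
            self.out = torch.zeros_like(state_dict[key])
        super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)


# Trained queue: the buffer has collapsed to (max_length,)
trained = Queue(8)
trained.append(torch.ones(4))
ckpt = trained.state_dict()

# A fresh queue on resume still holds the (max_length, 1) placeholder;
# without the override, load_state_dict would raise a size-mismatch error.
fresh = Queue(8)
fresh.load_state_dict(ckpt)
print(fresh.out.shape)  # torch.Size([8])
```

Hooking `_load_from_state_dict` (rather than `load_state_dict`) matters because it is the method PyTorch actually invokes per-module during a recursive load, as the next point explains.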

Fixes #400


Development

Successfully merging this pull request may close these issues.

OrderedQueue / OnlineQueue resume can fail on scalar-label buffers when shape is inferred lazily

1 participant