Labels
callback: model checkpoint, checkpointing (Related to checkpointing), feature (Is an improvement or enhancement)
Description & Motivation
PyTorch Lightning’s current async checkpointing implementation predates PyTorch’s Distributed Checkpoint (DCP) API and duplicates functionality that is now maintained upstream.
This issue proposes evaluating and migrating Lightning’s async checkpoint logic to torch.distributed.checkpoint (DCP), specifically async_save, to:
- Align with upstream PyTorch checkpointing APIs
- Improve robustness and maintainability
- Better support distributed and sharded training setups
- Reduce custom logic that duplicates upstream functionality
Pitch
Alternatives
No response
Additional context
cc @lantiga