
Revamp Async Checkpointing to Use PyTorch Distributed Checkpoint (DCP) async_save #21557

@deependujha

Description & Motivation

PyTorch Lightning’s current async checkpointing implementation predates PyTorch’s Distributed Checkpoint (DCP) API and maintains custom logic that upstream PyTorch now provides.

This issue proposes evaluating and migrating Lightning’s async checkpoint logic to leverage torch.distributed.checkpoint (DCP), specifically async_save, to:

  • Align with upstream PyTorch checkpointing APIs
  • Improve robustness and maintainability
  • Better support distributed and sharded training setups
  • Reduce custom logic that duplicates upstream functionality

Pitch

Replace Lightning's custom async checkpoint writer with `torch.distributed.checkpoint.async_save`, which stages the state dict and performs the write on a background thread while training continues.

Alternatives

No response

Additional context

https://docs.pytorch.org/docs/stable/distributed.checkpoint.html#distributed-checkpoint-torch-distributed-checkpoint

cc @lantiga
