Commit 6d0e411

saumishr authored and facebook-github-bot committed
Rank local checkpointing in DCP without collectives (#991)
Summary:
X-link: pytorch/pytorch#147758

Context: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off for now is that deduplication and resharding are not supported; support for both is planned.

Reviewed By: meetv18

Differential Revision: D70112642
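The idea in the summary can be illustrated with a toy sketch. This is not DCP's implementation or API: the helper names `save_rank_local` and `load_rank_local` and the `rank_{rank}.json` file layout are hypothetical, chosen only to show that each rank can write and read its own state without communicating with other ranks.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_rank_local(state: Dict[str, Any], ckpt_dir: str, rank: int) -> str:
    """Write this rank's state to its own file; no other rank is consulted."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"rank_{rank}.json")  # hypothetical layout
    with open(path, "w") as f:
        json.dump(state, f)
    return path


def load_rank_local(ckpt_dir: str, rank: int) -> Dict[str, Any]:
    """Read back exactly the file this rank wrote earlier."""
    with open(os.path.join(ckpt_dir, f"rank_{rank}.json")) as f:
        return json.load(f)


# Simulate two ranks in one process; in a real job each rank runs separately.
with tempfile.TemporaryDirectory() as ckpt_dir:
    for rank in range(2):
        save_rank_local({"rank": rank, "weights": [rank, rank + 1]}, ckpt_dir, rank)
    restored = [load_rank_local(ckpt_dir, rank) for rank in range(2)]
```

Because every rank reads only its own file, this toy scheme requires the same rank layout at load time and stores any replicated data once per rank, which mirrors the resharding and dedupe trade-off named in the summary.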
1 parent b285385 commit 6d0e411

1 file changed: +6 -2 lines changed

tests/framework/callbacks/test_dcp_saver.py

Lines changed: 6 additions & 2 deletions
@@ -1012,13 +1012,17 @@ class DummyStorageWriter(FileSystemWriter):
     def __init__(self, path: str) -> None:
         super().__init__(path)
 
-    def set_up_storage_writer(self, is_coordinator: bool) -> None:
+    def set_up_storage_writer(
+        self, is_coordinator: bool, *args: Any, **kwargs: Any
+    ) -> None:
         pass
 
 
 class DummyStorageReader(FileSystemReader):
     def __init__(self, path: str) -> None:
         super().__init__(path)
 
-    def set_up_storage_writer(self, is_coordinator: bool) -> None:
+    def set_up_storage_writer(
+        self, is_coordinator: bool, *args: Any, **kwargs: Any
+    ) -> None:
         pass
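
The signature change above can be motivated with a small standalone sketch. It assumes, without confirming it from the upstream PR, that callers may start passing additional arguments to `set_up_storage_writer`. The classes `BaseWriter`, `StrictWriter`, and `TolerantWriter` and the keyword `use_collectives` are hypothetical stand-ins for this illustration, not the real DCP classes or parameters.

```python
from typing import Any


class BaseWriter:
    """Hypothetical stand-in for a storage-writer base class."""

    def set_up_storage_writer(
        self, is_coordinator: bool, *args: Any, **kwargs: Any
    ) -> None:
        pass


class StrictWriter(BaseWriter):
    # Old-style override: accepts only the single argument it knows about.
    def set_up_storage_writer(self, is_coordinator: bool) -> None:  # type: ignore[override]
        self.is_coordinator = is_coordinator


class TolerantWriter(BaseWriter):
    # New-style override, as in the diff: extra arguments are accepted and ignored.
    def set_up_storage_writer(
        self, is_coordinator: bool, *args: Any, **kwargs: Any
    ) -> None:
        self.is_coordinator = is_coordinator


def call_with_extra_option(writer: BaseWriter) -> bool:
    """Call the hook with an extra keyword; return True if the call succeeds."""
    try:
        # `use_collectives` is a made-up keyword used only for this example.
        writer.set_up_storage_writer(True, use_collectives=False)
        return True
    except TypeError:
        return False


strict_ok = call_with_extra_option(StrictWriter())
tolerant_ok = call_with_extra_option(TolerantWriter())
```

Under these assumptions the strict override raises `TypeError` on the extra keyword while the tolerant one keeps working, which is why test doubles like the dummy classes in the diff are commonly widened with `*args` and `**kwargs`.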
