Commit 76c5fcf
Rank local checkpointing in DCP without collectives (meta-pytorch#991)
Summary:
X-link: pytorch/pytorch#147758
Context:
DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads a checkpoint without any collectives. The trade-off for now is that deduplication and resharding are not supported; support for both is planned as a follow-up.
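A minimal sketch of the rank-local idea, for illustration only. It is not the DCP implementation or API: the helper names (`rank_local_path`, `save_rank_local`, `load_rank_local`), the one-JSON-file-per-rank layout, and the plain-dict state are all assumptions made for this example. The point it shows is that each rank reads and writes only its own file, so no cross-rank coordination is needed.

```python
import json
import os
import tempfile


def rank_local_path(checkpoint_dir: str, rank: int) -> str:
    # Hypothetical layout: one file per rank, named after the rank.
    return os.path.join(checkpoint_dir, f"rank_{rank}.json")


def save_rank_local(state: dict, checkpoint_dir: str, rank: int) -> str:
    """Write this rank's state to its own file, with no cross-rank communication."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = rank_local_path(checkpoint_dir, rank)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        # Metadata (here just the rank) lives next to the data,
        # so no global metadata has to be gathered across ranks.
        json.dump({"rank": rank, "state": state}, f)
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)
    return path


def load_rank_local(checkpoint_dir: str, rank: int) -> dict:
    """Read back only the file this rank wrote."""
    with open(rank_local_path(checkpoint_dir, rank)) as f:
        payload = json.load(f)
    if payload["rank"] != rank:
        raise ValueError(f"file belongs to rank {payload['rank']}, not {rank}")
    return payload["state"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        # Simulate four ranks in one process; in a real job each rank
        # would run this independently.
        for rank in range(4):
            save_rank_local({"weight_shard": [rank, rank + 1]}, ckpt_dir, rank)
        print(load_rank_local(ckpt_dir, 2))
```

The sketch also makes the stated trade-offs visible: a value replicated on every rank would be written once per rank (no dedupe), and a rank can only load the file written under its own rank index (no resharding to a different layout).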
Differential Revision: D701126421
parent f1ebb63 commit 76c5fcf
1 file changed, +6 -2 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 1012 | 1012 | |
| 1013 | 1013 | |
| 1014 | 1014 | |
| 1015 | | - |
| | 1015 | + |
| | 1016 | + |
| | 1017 | + |
| 1016 | 1018 | |
| 1017 | 1019 | |
| 1018 | 1020 | |
| 1019 | 1021 | |
| 1020 | 1022 | |
| 1021 | 1023 | |
| 1022 | 1024 | |
| 1023 | | - |
| | 1025 | + |
| | 1026 | + |
| | 1027 | + |
| 1024 | 1028 | |