Rank local checkpointing in DCP without collectives #991
base: master
Conversation
This pull request was exported from Phabricator. Differential Revision: D70112642
Summary: X-link: pytorch/pytorch#147758 Context: DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without issuing any collective. The trade-off, for now, is the loss of dedupe and re-sharding; support for these will be introduced soon. Differential Revision: D70112642
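The idea in the summary can be illustrated with a minimal sketch (this is not the PR's actual API: the file naming, rank handling, and pickle-based serialization here are all hypothetical stand-ins). Each rank writes and reads only its own shard file, so no cross-rank communication happens at save or load time; the flip side, as the summary notes, is that without aggregated global metadata the load cannot dedupe or re-shard.

```python
# Illustrative sketch of rank-local checkpointing (hypothetical layout, not
# the PR's implementation): one file per rank, no collectives, so the cost
# of a save/load stays flat as the number of ranks grows.
import pickle
from pathlib import Path


def save_rank_local(state_dict, ckpt_dir, rank):
    """Each rank serializes only its own shard; no collective is issued."""
    path = Path(ckpt_dir) / f"__{rank}_0.distcp"  # per-rank file name (assumed)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)


def load_rank_local(ckpt_dir, rank):
    """Each rank reads back exactly the shard it wrote.

    Because no global metadata was aggregated at save time, the load cannot
    re-shard or dedupe: the world size and sharding layout must match what
    was used at save time.
    """
    path = Path(ckpt_dir) / f"__{rank}_0.distcp"
    with open(path, "rb") as f:
        return pickle.load(f)
```

A collective-based checkpoint would instead gather per-rank metadata (e.g. an all-gather of shard layouts) before writing a global index; skipping that step is exactly what makes this variant scale-independent, and also what defers dedupe/re-sharding support.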
058b3d6 to d5af27e
d5af27e to 020eb87
020eb87 to 006e037
006e037 to f8ad352
f8ad352 to 76c5fcf
76c5fcf to b631684
b631684 to 3c6f8f9
0a24898 to 0229225
0229225 to c8ef529
c8ef529 to d78a189
Summary: X-link: pytorch/tnt#991 Context: DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without issuing any collective. The trade-off, for now, is the loss of dedupe and re-sharding; support for these will be introduced soon. Test Plan: E2E UTs. Save and load test with internal DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-lv5d7qcfmnqzkd Save and load test with OSS DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-z1vz46vkkgtcld https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-njvvbn07rv5ckd Reviewed By: meetv18 Differential Revision: D70112642
d78a189 to 7431191
7431191 to c0a2e1c
c0a2e1c to c67c638
c67c638 to f82b20d
f82b20d to 6d0e411