
Rank local checkpointing in DCP without collectives #991


Open · wants to merge 1 commit into master

Conversation

saumishr
Contributor

Summary:
X-link: pytorch/pytorch#147758

Context:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads checkpoints without any collectives. The current trade-off is the loss of dedupe and re-sharding support; these will be added in follow-up work.

Differential Revision: D70112642
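To illustrate the idea behind the change, here is a minimal, self-contained sketch of rank-local checkpointing. This is not the PR's actual implementation or the DCP API: the file layout, names (`shard_rank{N}.pkl`), and the use of `pickle` in place of DCP's storage layer are all illustrative assumptions. The point it demonstrates is that each rank reads and writes only its own shard, so neither save nor load needs a metadata collective, and consequently neither dedupe nor re-sharding is possible.

```python
# Conceptual sketch only (hypothetical helpers, not DCP's API): each rank
# serializes its own local shard to a per-rank file, so no communication
# with other ranks is required on save or load.
import os
import pickle
import tempfile


def rank_local_save(state_dict, checkpoint_dir, rank):
    """Write this rank's shard; no collective, no global metadata."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"shard_rank{rank}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)
    return path


def rank_local_load(checkpoint_dir, rank):
    """Read back only this rank's shard. Because no global metadata is
    consulted, this only works when the sharding layout is unchanged
    (i.e. no re-sharding between save and load)."""
    path = os.path.join(checkpoint_dir, f"shard_rank{rank}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    ckpt = tempfile.mkdtemp()
    # Simulate two ranks, each holding its own shard of a parameter.
    for rank, shard in enumerate([[1.0, 2.0], [3.0, 4.0]]):
        rank_local_save({"weight_shard": shard}, ckpt, rank)
    # Each rank restores independently, again with no collective.
    restored = [rank_local_load(ckpt, r)["weight_shard"] for r in range(2)]
    print(restored)  # [[1.0, 2.0], [3.0, 4.0]]
```

Because each rank is fully independent here, duplicated tensors cannot be deduplicated across ranks and a load onto a different world size would fail, which matches the trade-off the summary describes.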

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70112642

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 12, 2025
Summary:
X-link: pytorch/tnt#991



Context:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads checkpoints without any collectives. The current trade-off is the loss of dedupe and re-sharding support; these will be added in follow-up work.

Test Plan:
E2E UTs

Save and load test with internal DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-lv5d7qcfmnqzkd

Save and load test with OSS DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-z1vz46vkkgtcld
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-njvvbn07rv5ckd

Reviewed By: meetv18

Differential Revision: D70112642