
Update torchft.md #1596


Open · wants to merge 3 commits into main

Conversation

@H-Huang (Member) commented Aug 19, 2025

Add some basic documentation on how to use torchtitan with TorchFT for DiLoCo.

LMK if anything needs clarification @vishal9-team

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 19, 2025
@samsja (Contributor) left a comment:

(Adding a small comment if you don't mind, as I have been using DiLoCo with torchtitan for a bit.)

It could be useful to add a clarification that the `global_batch_size` is split across all the DiLoCo islands. This is clear for HSDP, but for DiLoCo there is a distinction between the global batch size (across all DiLoCo islands) and the inner global batch size (inside each DiLoCo island doing normal sharding).
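
To make that distinction concrete, here is a small arithmetic sketch (the numbers are illustrative only, not taken from this PR):

```bash
# Illustrative numbers: suppose 2 DiLoCo islands (torchft replica groups),
# each doing normal sharded data parallelism internally.
GLOBAL_BATCH_SIZE=256   # batch size across ALL islands per step
NUM_ISLANDS=2           # number of DiLoCo islands (replica groups)

# The inner global batch size is what each island sees between syncs:
INNER_GLOBAL_BATCH_SIZE=$((GLOBAL_BATCH_SIZE / NUM_ISLANDS))
echo "inner global batch size per island: ${INNER_GLOBAL_BATCH_SIZE}"  # 128
```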

@H-Huang (Member, Author) commented Aug 19, 2025

@samsja thanks :), added

@tianyu-l (Contributor) left a comment:

What's the relationship with https://github.com/pytorch/torchtitan/blob/main/docs/torchft.md?

Could we consolidate to just keep one of them?

@H-Huang (Member, Author) commented Aug 20, 2025

@tianyu-l, oops, I didn't realize. Vishal had asked if there were any docs, and I didn't know there was an existing one, so I just added something new. Let me see if there is anything to consolidate. If this is redundant, then I'll just close this PR.

H-Huang changed the title from "Add TorchFT README and instructions" to "Update torchft.md" on Aug 20, 2025
@tianyu-l (Contributor) left a comment:

LGTM.
Maybe @fegin for the final review.

@fegin (Contributor) left a comment:

LGTM, one minor comment

- `--fault_tolerance.sync_steps`: The number of training steps before synchronization.
- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco")

For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py).
@fegin commented on these lines, Aug 21, 2025:
Probably just run `NGPU=1 ./run_train.sh --help` to get the help message instead of having to understand job_config.py.
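
For context, a sketch of what a launch using the flags quoted above might look like. The `NGPU`, `--fault_tolerance.enable`, `--fault_tolerance.replica_id`, and `--fault_tolerance.group_size` pieces are assumptions based on torchtitan's existing torchft instructions, and should be verified against the `--help` output suggested above:

```bash
# Sketch of launching one of two DiLoCo islands; only sync_steps and
# semi_sync_method appear in the doc snippet above -- the remaining flags
# are assumptions, so check them with `NGPU=1 ./run_train.sh --help`.
NGPU=8 ./run_train.sh \
  --fault_tolerance.enable \
  --fault_tolerance.replica_id=0 \
  --fault_tolerance.group_size=2 \
  --fault_tolerance.semi_sync_method="diloco" \
  --fault_tolerance.sync_steps=20
# The second island would run the same command with
# --fault_tolerance.replica_id=1.
```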
