Update torchft.md #1596
Conversation
(adding a small comment if you don't mind, as I have been using DiLoCo with torchtitan for a bit)
It could be useful to add a clarification that the `global_batch_size` is split across all the DiLoCo islands. It's clear for HSDP, but for DiLoCo there is a distinction between the global batch size (across all DiLoCo islands) and the inner global batch size (inside each DiLoCo island doing normal sharding).
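For example, a rough sketch of the arithmetic (the numbers and variable names below are made up for illustration, not actual torchtitan flags):

```sh
# Hypothetical values, only to show how the outer global batch splits across islands.
NUM_ISLANDS=4                 # DiLoCo islands (replica groups)
GLOBAL_BATCH_SIZE=1024        # batch across all islands
# Inner global batch per island, which is then sharded normally (e.g. FSDP/HSDP) inside the island:
echo $((GLOBAL_BATCH_SIZE / NUM_ISLANDS))   # 256
```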
@samsja thanks :), added
What's the relationship with https://github.com/pytorch/torchtitan/blob/main/docs/torchft.md?
Could we consolidate and keep just one of them?
@tianyu-l, oops, I didn't realize. Vishal had asked if there were any docs and I didn't realize there was an existing one, so I just added something new. Let me see if there is anything to consolidate. If this is redundant, then I'll just close this PR.
LGTM.
Maybe @fegin for the final review.
LGTM, one minor comment
- `--fault_tolerance.sync_steps`: The number of training steps before synchronization.
- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco")

For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py).
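For illustration, a launch using these flags might look like the sketch below. The flag names come from the list above; the GPU count and flag values are placeholders, so check the help output for the authoritative options.

```sh
# Placeholder values; run NGPU=1 ./run_train.sh --help for the full option list.
NGPU=8 ./run_train.sh \
  --fault_tolerance.semi_sync_method="diloco" \
  --fault_tolerance.sync_steps=20
```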
Probably just run `NGPU=1 ./run_train.sh --help` to get the help message, instead of understanding `job_config.py`.
Add some basic documentation on how to use Titan with TorchFT for DiLoCo.
LMK if anything needs clarification @vishal9-team