Update torchft.md #1596
Conversation
(adding a small comment if you don't mind, as I have been using DiLoCo with torchtitan for a bit)
It could be useful to add a clarification that the `global_batch_size` is split across all the DiLoCo islands. It's clear for HSDP, but for DiLoCo there is a distinction between the global batch size (across all DiLoCo islands) and the inner global batch size (inside each DiLoCo island doing normal sharding).
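For example, a rough sketch of the arithmetic (the numbers and variable names below are made up for illustration, not actual torchtitan flags):

```sh
# Hypothetical values, only to show how the outer global batch splits across islands.
NUM_ISLANDS=4                 # DiLoCo islands (replica groups)
GLOBAL_BATCH_SIZE=1024        # batch across all islands
# Inner global batch per island, which is then sharded normally (e.g. FSDP/HSDP) inside the island:
echo $((GLOBAL_BATCH_SIZE / NUM_ISLANDS))   # 256
```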
@samsja thanks :), added
What's the relationship with https://github.com/pytorch/torchtitan/blob/main/docs/torchft.md?
Could we consolidate and keep just one of them?
@tianyu-l, oops, I didn't realize. Vishal had asked if there were any docs and I didn't realize there was an existing one, so I just added something new. Let me see if there is anything to consolidate. If this is redundant, then I'll just close this PR.
LGTM.
Maybe @fegin for the final review.
LGTM, one minor comment
- `--fault_tolerance.sync_steps`: The number of training steps before synchronization.
- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco")

For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py).
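For illustration, a launch using these flags might look like the sketch below. The flag names come from the list above; the GPU count and flag values are placeholders, so check the help output for the authoritative options.

```sh
# Placeholder values; run NGPU=1 ./run_train.sh --help for the full option list.
NGPU=8 ./run_train.sh \
  --fault_tolerance.semi_sync_method="diloco" \
  --fault_tolerance.sync_steps=20
```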
Probably just run `NGPU=1 ./run_train.sh --help` to get the help message, instead of understanding `job_config.py`.
Add some basic documentation on how to use Titan with TorchFT for DiLoCo.
LMK if anything needs clarification @vishal9-team