# Update torchft.md #1596

Open · wants to merge 3 commits into `main`

`docs/torchft.md`: 50 changes (48 additions, 2 deletions)

@@ -10,6 +10,8 @@

Before using TorchFT with TorchTitan, you need to install TorchFT by following the instructions in the [TorchFT README](https://github.com/pytorch/torchft/blob/main/README.md).

Alternatively, you can install TorchFT with `pip install torchft-nightly`.
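
For example, a minimal install-and-sanity-check sequence (the import check is just one reasonable way to verify the install, not a required step):

```bash
# Install the nightly TorchFT wheel, then confirm the package imports.
pip install torchft-nightly
python -c "import torchft"
```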

## Configuring TorchTitan for Using TorchFT

When using TorchFT with TorchTitan, you need to launch multiple replica groups, each of which is a separate TorchTitan instance. Each replica group is responsible for maintaining a copy of the model weights. In case of a failure, the other replica groups can continue training without losing weight information.

@@ -21,20 +23,23 @@

Let's consider an example where we want to run HSDP on a single machine with eight GPUs, with weights sharded across four GPUs and two replica groups (a 2 x 4 device mesh). Without TorchFT, you can launch such a training process by specifying `--parallelism.data_parallel_replica_degree=2 --parallelism.data_parallel_shard_degree=4`. However, in the event of a trainer failure (emulating a real-world machine failure), the entire training process would need to stop and recover from the last checkpoint. This can lead to significant downtime and wasted resources.
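
For reference, a non-fault-tolerant launch of this (2, 4) configuration might look like the sketch below; the `NGPU` and `CONFIG_FILE` values mirror the TorchFT examples later in this document and are assumptions here:

```bash
# Baseline HSDP launch without TorchFT: a single 8-GPU job with a (2, 4) device mesh.
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
  --parallelism.data_parallel_replica_degree=2 \
  --parallelism.data_parallel_shard_degree=4
```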

With TorchFT, we can tolerate one replica group failure, ensuring that the training process continues uninterrupted. To achieve this, we can launch two TorchTitan instances, each managing four GPUs and communicating with each other through TorchFT. This setup allows for seamless fault tolerance and minimizes the impact of individual trainer failures.

### Launching TorchFT with TorchTitan (Example 1)

To launch TorchFT with TorchTitan, you need to execute the following three commands in different shell sessions:

1. Launch TorchFT lighthouse:

```bash
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

2. Launch the first TorchTitan instance:

```bash
NGPU=4 CUDA_VISIBLE_DEVICES=0,1,2,3 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --fault_tolerance.enable --fault_tolerance.replica_id=0 --fault_tolerance.group_size=2 --parallelism.data_parallel_shard_degree=4
```
3. Launch the second TorchTitan instance:

```bash
NGPU=4 CUDA_VISIBLE_DEVICES=4,5,6,7 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --fault_tolerance.enable --fault_tolerance.replica_id=1 --fault_tolerance.group_size=2 --parallelism.data_parallel_shard_degree=4
```
@@ -48,3 +53,44 @@

* Note that the live replica group with the smallest replica ID will perform checkpoint saving.

In a real-world scenario, `torchft_lighthouse` would likely be on a different machine. The `TORCHFT_LIGHTHOUSE` environment variable is used to tell TorchFT how to communicate with `torchft_lighthouse`. The default value is `http://localhost:29510`.
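
For instance, a sketch of launching replica group 0 against a lighthouse on another machine might look like the following; the hostname is a placeholder, and the rest of the command reuses Example 1 above:

```bash
# Sketch: replica group 0 pointing at a remote lighthouse ("lighthouse-host" is a placeholder).
TORCHFT_LIGHTHOUSE="http://lighthouse-host:29510" \
NGPU=4 CUDA_VISIBLE_DEVICES=0,1,2,3 \
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
  --fault_tolerance.enable --fault_tolerance.replica_id=0 \
  --fault_tolerance.group_size=2 --parallelism.data_parallel_shard_degree=4
```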

### Using semi-synchronous training (Example 2)

TorchFT provides algorithms that do not require per-step synchronization; instead, the replica groups can synchronize weights every N steps.

**Note on Batch Sizes**: For DiLoCo, there's an important distinction in batch size terminology:

The `--training.global_batch_size` parameter refers to the global batch size that will be split across all replica groups.

- **Global batch size**: The total batch size across all DiLoCo islands/replica groups.
- **Inner global batch size**: The batch size within each individual DiLoCo island, determined by dividing the global batch size by the number of replica groups (see the worked example after this list).
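
As a quick worked example (the batch size of 32 is an arbitrary assumption, not a recommendation):

```bash
# With 2 replica groups (--fault_tolerance.group_size=2) and an assumed
# --training.global_batch_size of 32, each DiLoCo island sees 32 / 2 = 16 samples per step.
GLOBAL_BATCH_SIZE=32
NUM_REPLICA_GROUPS=2
echo $(( GLOBAL_BATCH_SIZE / NUM_REPLICA_GROUPS ))   # inner global batch size -> 16
```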

#### Replica Group 0
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --fault_tolerance.semi_sync_method="diloco" --experimental.custom_args_module=torchtitan.components.ft.config
```

#### Replica Group 1
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1 --fault_tolerance.semi_sync_method="diloco" --experimental.custom_args_module=torchtitan.components.ft.config
```

## Fault Tolerance Configuration Options

For complete configuration options, see [job_config.py](../../config/job_config.py).

[Optional] Only for semi-synchronous training:

- `--fault_tolerance.sync_steps`: The number of training steps before synchronization.
- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco"); see the sketch after this list for how these flags combine.
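
For instance, a sketch of adding a sync interval to the replica group 0 command from Example 2 (the value of 20 steps is an arbitrary assumption):

```bash
# Sketch: DiLoCo replica group 0 from Example 2, synchronizing weights every 20 steps.
CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh \
  --parallelism.data_parallel_shard_degree=4 \
  --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 \
  --fault_tolerance.semi_sync_method="diloco" --fault_tolerance.sync_steps=20 \
  --experimental.custom_args_module=torchtitan.components.ft.config
```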

For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py).
> **@fegin** (Contributor) commented on Aug 21, 2025:
>
> Probably just run `NGPU=1 ./run_train.sh --help` to get the help message instead of understanding `job_config.py`.


## Environment Variables

- `TORCHFT_LIGHTHOUSE`: URL of the lighthouse service
- `TORCHFT_MANAGER_PORT`: Port for the TorchFT manager
- `REPLICA_GROUP_ID`: Identifier for the replica group
- `RUST_LOGS`: Logging level for Rust components
- `RUST_BACKTRACE`: Enable backtrace for debugging
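
A hedged sketch of exporting these before launching a replica group; apart from the lighthouse default noted earlier in this document, every value below is a placeholder assumption:

```bash
# Placeholder values; adjust for your cluster.
export TORCHFT_LIGHTHOUSE="http://localhost:29510"  # lighthouse URL (default from this doc)
export TORCHFT_MANAGER_PORT=29512                   # assumed port for the TorchFT manager
export REPLICA_GROUP_ID=0                           # identifier for this replica group
export RUST_LOGS=info                               # Rust logging level (assumed value)
export RUST_BACKTRACE=1                             # enable backtraces for debugging
```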