| 1 | +# TorchFT - Fault Tolerance for TorchTitan |
| 2 | + |
| 3 | +TorchFT provides fault tolerance capabilities for distributed training in TorchTitan, enabling resilient training across multiple replica groups with automatic failure recovery. |
| 4 | + |
| 5 | +This component integrates with [TorchFT](https://github.com/pytorch/torchft), PyTorch's fault tolerance library. **Prerequisites from the TorchFT repository must be installed before using this functionality.** |
| 6 | + |
| 7 | +## Fault Tolerance Configuration Options |
| 8 | + |
| 9 | +The fault tolerance system can be configured using the following options: |
| 10 | + |
| 11 | +- `--fault_tolerance.enable`: Enable fault tolerance mode |
| 12 | +- `--fault_tolerance.group_size`: Number of replicas in the fault tolerance group |
| 13 | +- `--fault_tolerance.replica_id`: Unique identifier for this replica within the group |
| 14 | + |
| 15 | +For complete configuration options, see [job_config.py](../../config/job_config.py). |
| 16 | + |
| 17 | +[Optional] Only for semi-synchronous training: |
| 18 | + |
| 19 | +- `--fault_tolerance.sync_steps`: Number of training steps between synchronizations |
| 20 | +- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco") |
| 21 | + |
| 22 | +For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py). |
| 23 | + |
| 24 | +## Starting the Lighthouse Service |
| 25 | + |
| 26 | +The lighthouse service coordinates fault tolerance across replica groups. Start it with: |
| 27 | + |
| 28 | +```bash |
| 29 | +# Requires 2 replica groups to join |
| 30 | +RUST_LOGS=debug RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000 |
| 31 | + |
| 32 | +# For single replica group (development) |
| 33 | +RUST_LOGS=debug RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000 |
| 34 | +``` |
| 35 | + |
| 36 | +## Examples: Running with Two Replica Groups |
| 37 | + |
| 38 | +### Example 1: HSDP |
| 39 | + |
| 40 | +This example assumes a single host with 8 GPUs and uses `CUDA_VISIBLE_DEVICES` to split it into two simulated "hosts" for single-host development. Each replica group has 4 GPUs sharded with FSDP, and the two groups synchronize every step to form fault-tolerant HSDP. |
| 41 | + |
| 42 | +#### Replica Group 0 |
| 43 | +```bash |
| 44 | +RUST_LOGS=debug TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 |
| 45 | +``` |
| 46 | + |
| 47 | +#### Replica Group 1 |
| 48 | +```bash |
| 49 | +TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1 |
| 50 | +``` |
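
The two commands above differ only in the replica ID, the manager port, and the visible GPUs. The following is a minimal sketch of a helper that prints the launch command for a given replica group so that the pattern is explicit. The function name `replica_launch_cmd`, the variables `GPUS_PER_GROUP`, `BASE_MANAGER_PORT`, and `GROUP_SIZE`, and the port stride of 2 are illustrative assumptions inferred from the two examples; they are not part of TorchFT or TorchTitan.

```bash
#!/bin/sh
# Sketch only: prints (does not run) the launch command for one replica group.
# The values below mirror the two-group, 8-GPU example; they are assumptions,
# not TorchFT requirements.
GPUS_PER_GROUP=4
BASE_MANAGER_PORT=29520
GROUP_SIZE=2

# Usage: replica_launch_cmd <replica_id> <lighthouse_url>
# The body runs in a subshell so its variables do not leak into the caller.
replica_launch_cmd() (
  id="$1"
  lighthouse="$2"

  # Replica 0 gets GPUs 0-3, replica 1 gets GPUs 4-7, and so on.
  first=$(( id * GPUS_PER_GROUP ))
  last=$(( first + GPUS_PER_GROUP - 1 ))
  devices=""
  i="$first"
  while [ "$i" -le "$last" ]; do
    devices="${devices:+${devices},}${i}"
    i=$(( i + 1 ))
  done

  # Port stride of 2 matches the examples (29520, 29522); this is an assumption.
  port=$(( BASE_MANAGER_PORT + 2 * id ))

  echo "TORCHFT_LIGHTHOUSE=${lighthouse} TORCHFT_MANAGER_PORT=${port} REPLICA_GROUP_ID=${id} CUDA_VISIBLE_DEVICES=${devices} NGPU=${GPUS_PER_GROUP} ./run_train.sh --parallelism.data_parallel_shard_degree=${GPUS_PER_GROUP} --fault_tolerance.enable --fault_tolerance.group_size=${GROUP_SIZE} --fault_tolerance.replica_id=${id}"
)
```

Calling `replica_launch_cmd 1 "http://<hostname>:29510"` prints a command equivalent to the Replica Group 1 example, which can then be reviewed before it is run.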
| 51 | + |
| 52 | +### Example 2: With Semi-synchronous Training |
| 53 | + |
| 54 | +TorchFT provides algorithms that do not require per-step synchronization; |
| 55 | +instead, the replica groups synchronize weights every N steps. |
| 56 | + |
| 57 | +#### Replica Group 0 |
| 58 | +```bash |
| 59 | +RUST_LOGS=debug TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --fault_tolerance.semi_sync_method="diloco" |
| 60 | +``` |
| 61 | + |
| 62 | +#### Replica Group 1 |
| 63 | +```bash |
| 64 | +TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1 --fault_tolerance.semi_sync_method="diloco" |
| 65 | +``` |
| 66 | + |
| 67 | +## Environment Variables |
| 68 | + |
| 69 | +- `TORCHFT_LIGHTHOUSE`: URL of the lighthouse service |
| 70 | +- `TORCHFT_MANAGER_PORT`: Port for the TorchFT manager |
| 71 | +- `REPLICA_GROUP_ID`: Identifier for the replica group |
| 72 | +- `RUST_LOGS`: Logging level for Rust components |
| 73 | +- `RUST_BACKTRACE`: Enable backtrace for debugging |
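
Every example in this document sets `TORCHFT_LIGHTHOUSE`, `TORCHFT_MANAGER_PORT`, and `REPLICA_GROUP_ID`. As a sketch, a small pre-flight check could report which of these are unset before training is launched. The function name `check_torchft_env` is hypothetical, and treating these three variables as mandatory is an assumption based on the examples, not a documented requirement.

```bash
#!/bin/sh
# Sketch only: report which of the variables used by the examples are unset.
# Treating these three as mandatory is an assumption, not a TorchFT rule.
check_torchft_env() {
  missing=""
  for name in TORCHFT_LIGHTHOUSE TORCHFT_MANAGER_PORT REPLICA_GROUP_ID; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+${missing} }${name}"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing environment variables: ${missing}" >&2
    return 1
  fi
  return 0
}
```

Running `check_torchft_env && ./run_train.sh ...` would skip the launch when any of the three variables is missing.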