Commit c87fd44: Add TorchFT README and instructions (parent 7f1fa48)


torchtitan/components/ft/README.md (73 additions, 0 deletions)
# TorchFT - Fault Tolerance for TorchTitan

TorchFT provides fault tolerance capabilities for distributed training in TorchTitan, enabling resilient training across multiple replica groups with automatic failure recovery.

This component integrates with [TorchFT](https://github.com/pytorch/torchft), PyTorch's fault tolerance library. **Prerequisites from the TorchFT repository must be installed before using this functionality.**

## Fault Tolerance Configuration Options

The fault tolerance system can be configured with the following options:

- `--fault_tolerance.enable`: Enable fault tolerance mode
- `--fault_tolerance.group_size`: Number of replica groups participating in fault-tolerant training
- `--fault_tolerance.replica_id`: Unique identifier for this replica group

For complete configuration options, see [job_config.py](../../config/job_config.py).

[Optional] Only for semi-synchronous training:

- `--fault_tolerance.sync_steps`: The number of training steps between synchronizations
- `--fault_tolerance.semi_sync_method`: Synchronization method (e.g., "local_sgd", "diloco")

For more semi-synchronous configuration options, see [ft/config/job_config.py](config/job_config.py).

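As a concrete illustration, the semi-synchronous flags above can be assembled into a single launch command. This is only a sketch: the flag names come from this README, while the `sync_steps` value of 5 is an arbitrary placeholder rather than a tuned recommendation, and the command is printed rather than executed.

```bash
# Sketch: assemble a semi-synchronous launch from the flags above.
# Flag names come from this README; sync_steps=5 is a placeholder value.
semi_sync_method="local_sgd"   # or "diloco"
sync_steps=5                   # placeholder, not a tuned value

cmd=(./run_train.sh
  --fault_tolerance.enable
  --fault_tolerance.group_size=2
  --fault_tolerance.replica_id=0
  --fault_tolerance.sync_steps="${sync_steps}"
  --fault_tolerance.semi_sync_method="${semi_sync_method}")

# Print the command instead of running it, so it can be inspected first.
echo "${cmd[@]}"
```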
## Starting the Lighthouse Service

The lighthouse service coordinates fault tolerance across replica groups. Start it with:

```bash
# Requires 2 replica groups to join
RUST_LOGS=debug RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000

# For a single replica group (development)
RUST_LOGS=debug RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

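If replica groups are launched before the lighthouse is reachable, they may fail to connect at startup. A small helper can gate the launch; this is a sketch under the assumption that the lighthouse answers HTTP at the address held in `TORCHFT_LIGHTHOUSE` (as the URLs in the examples below suggest), and `wait_for_lighthouse` is a hypothetical helper, not part of TorchFT.

```bash
# Sketch: block until the lighthouse address accepts HTTP connections,
# so replica groups are only launched once the quorum service is up.
# wait_for_lighthouse is a hypothetical helper, not part of TorchFT.
wait_for_lighthouse() {
  local url="${1:-${TORCHFT_LIGHTHOUSE}}"
  local deadline=$((SECONDS + 30))   # give up after ~30 seconds
  until curl --silent --output /dev/null "${url}"; do
    if (( SECONDS >= deadline )); then
      echo "lighthouse not reachable at ${url}" >&2
      return 1
    fi
    sleep 1
  done
  echo "lighthouse reachable at ${url}"
}
```

For example, `wait_for_lighthouse http://<hostname>:29510` could be run before starting the replica groups below.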
## Examples: Running with Two Replica Groups

### Example 1: HSDP

Each replica group has 4 GPUs, which are sharded with FSDP (assuming this host has 8 GPUs in total, we use `CUDA_VISIBLE_DEVICES` to simulate two "hosts" for a single-host development experience). The two groups synchronize every step, giving fault-tolerant HSDP.

#### Replica Group 0

```bash
RUST_LOGS=debug TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0
```

#### Replica Group 1

```bash
TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1
```

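The two commands above differ only in values derived from the replica group id. A loop can make that relationship explicit. This is a sketch: it only prints the commands, and the port and GPU arithmetic is an assumption generalized from the two examples, not a TorchFT convention.

```bash
# Sketch: derive each group's manager port and GPU set from its replica id
# instead of writing every command by hand. The assignments mirror the two
# commands above; nothing is executed here, the commands are only printed.
for replica_id in 0 1; do
  manager_port=$((29520 + 2 * replica_id))   # 29520 for group 0, 29522 for group 1
  first_gpu=$((4 * replica_id))              # groups of 4 GPUs: 0-3, then 4-7
  gpus="${first_gpu},$((first_gpu + 1)),$((first_gpu + 2)),$((first_gpu + 3))"
  echo "TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=${manager_port}" \
       "REPLICA_GROUP_ID=${replica_id} CUDA_VISIBLE_DEVICES=${gpus} NGPU=4 ./run_train.sh" \
       "--parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable" \
       "--fault_tolerance.group_size=2 --fault_tolerance.replica_id=${replica_id}"
done
```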
### Example 2: With Semi-synchronous Training

TorchFT provides algorithms that do not require per-step synchronization; instead, the replica groups can synchronize weights every N steps.

#### Replica Group 0

```bash
RUST_LOGS=debug TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --fault_tolerance.semi_sync_method="diloco"
```

#### Replica Group 1

```bash
TORCHFT_LIGHTHOUSE=http://<hostname>:29510 TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1 --fault_tolerance.semi_sync_method="diloco"
```

## Environment Variables

- `TORCHFT_LIGHTHOUSE`: URL of the lighthouse service
- `TORCHFT_MANAGER_PORT`: Port for the TorchFT manager
- `REPLICA_GROUP_ID`: Identifier for the replica group
- `RUST_LOGS`: Log level for Rust components
- `RUST_BACKTRACE`: Enable backtraces for debugging
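Since every launch above repeats the same lighthouse and logging settings, those can be exported once per shell; a sketch, with `localhost` standing in for the `<hostname>` placeholder used in the examples:

```bash
# Sketch: export the shared settings once. Per-group variables
# (TORCHFT_MANAGER_PORT, REPLICA_GROUP_ID, CUDA_VISIBLE_DEVICES) still
# need to be set individually for each replica group.
export TORCHFT_LIGHTHOUSE="http://localhost:29510"  # lighthouse URL
export RUST_LOGS=debug                              # verbose Rust-side logging
export RUST_BACKTRACE=1                             # backtraces on failure
```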
