Commit 6819be1
feat(models): multibackend all_to_all wrapper (#95)
Small addition that provides a fallback for alltoall when using the Gloo backend of
torch.distributed. This PR is needed to be able to run the transformer
model on CPU. For the 99.9% of users running on GPUs with the NCCL
backend, this change should not affect them.
Gloo does not offer an alltoall primitive, as shown
[here](https://pytorch.org/docs/stable/distributed.html#backends).
This commit implements an all_to_all fallback for Gloo, using the 'Linear
Shift' algorithm from [Hofmann and Rünger,
2013](https://www.tu-chemnitz.de/informatik/PI/forschung/publikationen/download/HR_eurompi13.pdf).
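For readers unfamiliar with the schedule, here is a minimal sketch of a linear-shift all-to-all as it is commonly described: in step `i` (for `i = 1 .. p-1`), each rank `r` sends to rank `(r + i) % p` and receives from rank `(r - i) % p`. It is a plain-Python simulation, not the code added in this commit; the function name `linear_shift_all_to_all` and the nested-list buffers are assumptions made for illustration only.

```python
def linear_shift_all_to_all(send_buffers):
    """Simulate an all-to-all exchange among p ranks with a linear-shift schedule.

    send_buffers[r][d] is the block rank r wants to deliver to rank d.
    Returns recv_buffers, where recv_buffers[r][s] is the block rank r
    received from rank s. (Hypothetical helper, for illustration only.)
    """
    p = len(send_buffers)
    recv_buffers = [[None] * p for _ in range(p)]

    # Step 0: every rank keeps its own block with a local copy.
    for r in range(p):
        recv_buffers[r][r] = send_buffers[r][r]

    # Steps 1 .. p-1: one pairwise exchange per rank per step.
    for step in range(1, p):
        # Each rank posts exactly one send, addressed to (r + step) % p.
        in_flight = {}
        for r in range(p):
            dst = (r + step) % p
            in_flight[dst] = (r, send_buffers[r][dst])

        # Each rank completes exactly one receive, from (r - step) % p.
        for r in range(p):
            src = (r - step) % p
            sender, block = in_flight[r]
            assert sender == src  # the schedule pairs sends and receives
            recv_buffers[r][src] = block

    return recv_buffers


if __name__ == "__main__":
    world_size = 4
    send = [[f"{r}->{d}" for d in range(world_size)] for r in range(world_size)]
    for rank, row in enumerate(linear_shift_all_to_all(send)):
        print(rank, row)
```

In a real `torch.distributed` setting, each step would presumably map onto a paired point-to-point send and receive (for example via `isend`/`irecv`), with tensors in place of the string blocks used here.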
Because of syntax for `torch.dist` changing in torch 2.6, older versions
of torch are not supported.
---------
Co-authored-by: Harrison Cook <[email protected]>
1 parent 9fc5923 commit 6819be1
1 file changed, +38 -2 lines changed
Hunks: 36 lines added at new lines 21-56; one line replaced at original line 55 (new line 91); one line replaced at original line 82 (new line 118).