Commit f7f4bea
[train] Add try-except for pg.wait() (#60743)
## Description
`placement_group()` is async and returns a handle immediately. If the PG
is removed between the creation request and readiness, `pg.wait()` throws
a "not found" error during the wait, which in turn surfaces as a controller
error. We should treat this case as `WGStartupTimeoutError` instead.
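The diff body is not reproduced on this page, so the following is only a minimal sketch of the pattern described above, not the code from this commit. It assumes that `pg.wait(timeout_seconds)` returns a bool and may raise if the placement group no longer exists. `WGStartupTimeoutError` is defined locally as a stand-in for the error type named in the description, and the two fake placement group classes and the helper `wait_for_placement_group` are hypothetical names used so the example runs without Ray.

```python
class WGStartupTimeoutError(RuntimeError):
    """Local stand-in for the error type named in the commit description."""


class RemovedPlacementGroup:
    """Hypothetical fake: the group was removed before it became ready."""

    def wait(self, timeout_seconds):
        # Assumption: the real call raises some "not found" error here.
        raise ValueError("placement group not found")


class ReadyPlacementGroup:
    """Hypothetical fake: the group becomes ready within the timeout."""

    def wait(self, timeout_seconds):
        return True


def wait_for_placement_group(pg, timeout_seconds):
    """Wait for `pg`, mapping both failures and timeouts to one error type."""
    try:
        ready = pg.wait(timeout_seconds)
    except Exception as exc:
        # A removed group is reported like a startup timeout, so the caller
        # sees a startup failure rather than an unexpected controller error.
        raise WGStartupTimeoutError(
            f"Placement group was unavailable while waiting: {exc}"
        ) from exc
    if not ready:
        raise WGStartupTimeoutError(
            f"Placement group not ready after {timeout_seconds}s"
        )


if __name__ == "__main__":
    wait_for_placement_group(ReadyPlacementGroup(), timeout_seconds=1)
    try:
        wait_for_placement_group(RemovedPlacementGroup(), timeout_seconds=1)
    except WGStartupTimeoutError as err:
        print(f"startup failed: {err}")
```

Catching the broad `Exception` keeps the sketch independent of whichever exception type the real `pg.wait()` raises. Chaining with `from exc` preserves the original "not found" error for debugging.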
## Related issues
[#870](anyscale#870)
---------
Signed-off-by: Lehui Liu <lehui@anyscale.com>

1 parent 7bf368c commit f7f4bea
File tree
2 files changed: +29 -2 lines changed
- python/ray/train/v2
  - _internal/execution/worker_group
  - tests
Lines changed: 19 additions & 2 deletions
| Original file lines | Diff lines | Change |
|---|---|---|
| (none) | 1 | 1 line added |
| (none) | 11-12 | 2 lines added |
| 66 | 69-76 | 1 line removed, 8 lines added |
| 86 | 96-103 | 1 line removed, 8 lines added |
| Original file lines | Diff lines | Change |
|---|---|---|
| (none) | 59-68 | 10 lines added |