Revise support and worker state callbacks #17
jpsamaroo left a comment:
Great stuff! I have some questions about the semantics of these callbacks w.r.t error conditions, but other than that, this all seems solid.
One additional thought: should we have a way to detect a worker exit reason? If a worker exits due to a segfault or OOM, this might be important to know so that I can do some recovery actions or reporting to the user. It could also be something we add later, as its own set of callbacks, but I figure it's worth thinking about before merging this.
That is an excellent point 🤔 I think it should go with the current set of callbacks, e.g. what if the signature for the worker-exiting callbacks was
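One hypothetical shape for this (every name below is illustrative, not DistributedNext's actual API): a registry of worker-exited callbacks where each callback receives both the worker ID and an exit-reason value, so a segfault/OOM kill can be told apart from a clean shutdown:

```julia
# Sketch only: none of these names come from the PR itself.
@enum ExitReason graceful killed_by_us exterminated  # exterminated = segfault/OOM/external kill

const WORKER_EXITED_CALLBACKS = Dict{Symbol,Function}()

register_exited_callback(key::Symbol, f) = (WORKER_EXITED_CALLBACKS[key] = f; nothing)

function notify_worker_exited(pid::Int, reason::ExitReason)
    for f in values(WORKER_EXITED_CALLBACKS)
        f(pid, reason)  # callbacks see both the pid and why it exited
    end
end

# Usage: record which workers died and why, so recovery code can react.
dead = Tuple{Int,ExitReason}[]
register_exited_callback(:recorder, (pid, r) -> push!(dead, (pid, r)))
notify_worker_exited(4, exterminated)
```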
Btw, I think this is really cool stuff @JamesWrigley! @jpsamaroo and I were discussing this in the context of advanced scheduling (i.e. your compute might be growing or shrinking) -- or you're on shared nodes where the sysadmin might kill a worker which is leaking memory. In this context I've been building "nanny workers" that will re-add dead workers, etc. But this has been a pain in the ^%#@. I am happy with this PR as is, but if there are spare cycles, I want to propose additional improvements:
Thanks for taking a look :) About those things:
In general I find it harder to change an API after the fact ;) -- Anyway, the situation you're describing is one where a single worker is responsible for managing the overall workflow, as is common. However, this is not always the case at scale (e.g. what if we had >10k workers?). In those situations you often lay out your workers on a tree with several "management" nodes (e.g. think fat trees, but for workers rather than networks). In this situation you want to build callbacks that notify those manager nodes that the leaves have just changed (or are about to change).
Slurm can be configured to send a signal
That's neat! I love it -- it can also help with overall workflow tooling. All in all I am happy with this PR as is -- my comments are meant to make something that's great even better.
Alrighty, switched to
On second thoughts I'm unsure about promising logs because I don't know how the
Am I correct in thinking that worker statuses and the worker-exited callbacks are sufficient for the uses you're talking about? The way I think about it is that there are three possible scenarios for exits:
Force-pushed from 267cb18 to 468fcc0.
Some updates:
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
## master      #17      +/-   ##
==========================================
+ Coverage   88.05%   88.17%   +0.11%
==========================================
  Files          11       12       +1
  Lines        2118     2190      +72
==========================================
+ Hits         1865     1931      +66
- Misses        253      259       +6
```

☔ View full report in Codecov by Sentry.
Alrighty, implemented worker statuses in 64aba00. Now they'll be passed to the worker-exited callbacks. Apologies for how big this PR is becoming 😅 I've tried to keep the commits atomic so you can review them one-by-one.
Hmm, interestingly 0d5aaa3 seems to have almost entirely fixed #6. There are no more timeouts on Linux/OSX and I see only one on Windows. The common problem with these hangs seems to be lingering tasks blocking Julia from exiting. At some point we should probably audit Distributed[Next] to remove all of them.
(bump)
jpsamaroo left a comment:
Awesome changes, I think the newest implementation is really clear! I do have concerns about setstatus/getstatus, but otherwise everything looks good.
```julia
other_workers() = filter(!=(myid()), workers())
```
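As a quick self-contained check of what `other_workers()` computes (with `myid`/`workers` stubbed out here; in the real package they come from the cluster state):

```julia
# Stubs standing in for DistributedNext.myid / DistributedNext.workers,
# purely for illustration.
myid() = 2
workers() = [1, 2, 3, 4]

# Same one-liner as in the PR: every worker except the current one.
other_workers() = filter(!=(myid()), workers())

@show other_workers()  # → [1, 3, 4]
```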
I'm a little concerned about the composability of this mechanism. If a library wants to set a status for debugging how it interacts with workers, that would conflict with the user or another library trying to do the same thing. Maybe we can also pass a key and track multiple statuses? I've used the calling Module in the past as a key (usually automated with a macro) when a package-global key is needed for a system like this.
I know my proposal isn't as convenient for worker-exiting callbacks to handle, so I'm open to alternatives, but I do think we'll need to consider the usability of this mechanism before merging it.
Oooo yes that's a very good point 😬 I'm inclined to agree about using the calling module as a key.
Right now the status is passed in explicitly, but what if we leave that up to the user and let them call e.g. `@getstatus(2)`/`getstatus(2, Dagger)` to get the statuses for the modules they're interested in?
@JBlaschke any thoughts on this? If not then I think I'll implement per-module statuses.
This has languished for waaaay too long 😅 I'm gonna implement per-module statuses and then merge it, probably in the next few weeks.
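A minimal sketch of what per-module statuses could look like (all names here are hypothetical, not the merged API; the key idea is keying statuses on `(pid, Module)`, with a macro capturing the calling module via `__module__` so libraries don't clobber each other):

```julia
# Hypothetical per-module status store: (worker pid, calling module) => status.
const STATUSES = Dict{Tuple{Int,Module},Any}()

setstatus(pid::Int, mod::Module, status) = (STATUSES[(pid, mod)] = status; nothing)
getstatus(pid::Int, mod::Module) = get(STATUSES, (pid, mod), nothing)

# A macro can capture the calling module automatically via __module__,
# so each package's statuses are namespaced without extra ceremony.
macro setstatus(pid, status)
    return :(setstatus($(esc(pid)), $__module__, $(esc(status))))
end

@setstatus 2 "processing shard 7"   # keyed under the calling module
getstatus(2, @__MODULE__)           # reads back this module's status for worker 2
```

The Module-as-key trick means two libraries can both call `@setstatus` on the same worker without conflict, which addresses the composability concern above.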
Previously we were not filtering out the current worker when calling `deregister_worker()` on `workers()`.
The new `WorkerState_exterminated` state is for indicating that a worker was killed by something other than us.
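To illustrate the distinction such a state enables (a sketch only: `WorkerState_exterminated` is the name from this commit, but the other enum values and the helper are illustrative):

```julia
# Sketch of a worker-state enum separating exits we initiated from
# external kills (segfault, OOM killer, sysadmin).
@enum WorkerState begin
    WorkerState_running
    WorkerState_exiting        # we asked the worker to shut down
    WorkerState_exterminated   # killed by something other than us
end

# A worker-exited callback can branch on the state to decide whether
# recovery (e.g. a nanny re-adding the worker) is warranted:
recovery_action(s::WorkerState) =
    s == WorkerState_exterminated ? :readd_worker : :nothing_to_do

recovery_action(WorkerState_exterminated)  # → :readd_worker
```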
This should fix an exception seen in CI from the lingering timeout task:
```
Test Summary: | Pass Total Time
Deserialization error recovery and include() | 11 11 3.9s
From worker 4: Unhandled Task ERROR: EOFError: read end of file
From worker 4: Stacktrace:
From worker 4: [1] wait
From worker 4: @ .\asyncevent.jl:159 [inlined]
From worker 4: [2] sleep(sec::Float64)
From worker 4: @ Base .\asyncevent.jl:265
From worker 4: [3] (::DistributedNext.var"#34#37"{DistributedNext.Worker, Float64})()
From worker 4: @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:213
```
There are a few changes in here; I would recommend reviewing each commit individually. The big ones are:
Depends on timholy/Revise.jl#871.