Revise support and worker state callbacks #17
jpsamaroo left a comment:
Great stuff! I have some questions about the semantics of these callbacks w.r.t error conditions, but other than that, this all seems solid.
One additional thought: should we have a way to detect a worker exit reason? If a worker exits due to a segfault or OOM, this might be important to know so that I can do some recovery actions or reporting to the user. It could also be something we add later, as its own set of callbacks, but I figure it's worth thinking about before merging this.
That is an excellent point 🤔 I think it should go with the current set of callbacks, e.g. what if the signature for the worker-exiting callbacks was
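One hypothetical shape for this (every name below is illustrative, not DistributedNext's actual API): a registry of worker-exited callbacks where each callback receives both the worker ID and an exit-reason value, so a segfault/OOM kill can be told apart from a clean shutdown:

```julia
# Sketch only: none of these names come from the PR itself.
@enum ExitReason graceful killed_by_us exterminated  # exterminated = segfault/OOM/external kill

const WORKER_EXITED_CALLBACKS = Dict{Symbol,Function}()

register_exited_callback(key::Symbol, f) = (WORKER_EXITED_CALLBACKS[key] = f; nothing)

function notify_worker_exited(pid::Int, reason::ExitReason)
    for f in values(WORKER_EXITED_CALLBACKS)
        f(pid, reason)  # callbacks see both the pid and why it exited
    end
end

# Usage: record which workers died and why, so recovery code can react.
dead = Tuple{Int,ExitReason}[]
register_exited_callback(:recorder, (pid, r) -> push!(dead, (pid, r)))
notify_worker_exited(4, exterminated)
```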
Btw, I think this is really cool stuff @JamesWrigley! @jpsamaroo and I were discussing this in the context of advanced scheduling (i.e. your compute might be growing or shrinking) -- or you're on shared nodes where the sysadmin might kill a worker which is leaking memory. In this context I've been building "nanny workers" that will re-add dead workers, etc. But this has been a pain in the ^%#@. I am happy with this PR as is, but if there are spare cycles, I want to propose additional improvements:
Thanks for taking a look :) About those things:
In general I find it harder to change an API after the fact ;) -- Anyway, the situation you're describing is one where a single worker is responsible for managing the overall workflow, as is common. However, this is not always the case at scale (e.g. what if we had >10k workers?). In those situations you often lay out your workers on a tree with several "management" nodes (e.g. think fat trees, but for workers rather than networks). In this situation you want to build callbacks that notify those manager nodes that the leaves have just changed (or are about to change).
Slurm can be configured to send a signal
That's neat! I love it -- it can also help with overall workflow tooling. All in all I am happy with this PR as is -- my comments are meant to make something that's great even better.
Alrighty, switched to
On second thoughts I'm unsure about promising logs because I don't know how the
Am I correct in thinking that worker statuses and the worker-exited callbacks are sufficient for the uses you're talking about? The way I think about it is that there are three possible scenarios for exits:
Force-pushed from 267cb18 to 468fcc0.
Some updates:
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
## master      #17      +/-   ##
==========================================
+ Coverage   88.05%   88.17%   +0.11%
==========================================
  Files          11       12       +1
  Lines        2118     2190      +72
==========================================
+ Hits         1865     1931      +66
- Misses        253      259       +6
```

☔ View full report in Codecov by Sentry.
Alrighty, implemented worker statuses in 64aba00. Now they'll be passed to the worker-exited callbacks. Apologies for how big this PR is becoming 😅 I've tried to keep the commits atomic so you can review them one-by-one.
Hmm, interestingly 0d5aaa3 seems to have almost entirely fixed #6. There are no more timeouts on Linux/OSX and I see only one on Windows. The common problem with these hangs seems to be lingering tasks blocking Julia from exiting. At some point we should probably audit Distributed[Next] to remove all of them.
(bump)
jpsamaroo left a comment:
Awesome changes, I think the newest implementation is really clear! I do have concerns about setstatus/getstatus, but otherwise everything looks good.
```julia
other_workers() = filter(!=(myid()), workers())
```
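As a quick self-contained check of what `other_workers()` computes (with `myid`/`workers` stubbed out here; in the real package they come from the cluster state):

```julia
# Stubs standing in for DistributedNext.myid / DistributedNext.workers,
# purely for illustration.
myid() = 2
workers() = [1, 2, 3, 4]

# Same one-liner as in the PR: every worker except the current one.
other_workers() = filter(!=(myid()), workers())

@show other_workers()  # → [1, 3, 4]
```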
I'm a little concerned about the composability of this mechanism. If a library wants to set a status for debugging how it interacts with workers, that would conflict with the user or another library trying to do the same thing. Maybe we can also pass a key and track multiple statuses? I've used the calling Module in the past as a key (usually automated with a macro) when a package-global key is needed for a system like this.
I know my proposal isn't as convenient for worker-exiting callbacks to handle, so I'm open to alternatives, but I do think we'll need to consider the usability of this mechanism before merging it.
Oooo yes that's a very good point 😬 I'm inclined to agree about using the calling module as a key.
Right now the status is passed in explicitly, but what if we leave that up to the user and let them call e.g. `@getstatus(2)`/`getstatus(2, Dagger)` to get the statuses for the modules they're interested in?
@JBlaschke any thoughts on this? If not then I think I'll implement per-module statuses.
This has languished for waaaay too long 😅 I'm gonna implement per-module statuses and then merge it, probably in the next few weeks.
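A minimal sketch of what per-module statuses could look like (all names here are hypothetical, not the merged API; the key idea is keying statuses on `(pid, Module)`, with a macro capturing the calling module via `__module__` so libraries don't clobber each other):

```julia
# Hypothetical per-module status store: (worker pid, calling module) => status.
const STATUSES = Dict{Tuple{Int,Module},Any}()

setstatus(pid::Int, mod::Module, status) = (STATUSES[(pid, mod)] = status; nothing)
getstatus(pid::Int, mod::Module) = get(STATUSES, (pid, mod), nothing)

# A macro can capture the calling module automatically via __module__,
# so each package's statuses are namespaced without extra ceremony.
macro setstatus(pid, status)
    return :(setstatus($(esc(pid)), $__module__, $(esc(status))))
end

@setstatus 2 "processing shard 7"   # keyed under the calling module
getstatus(2, @__MODULE__)           # reads back this module's status for worker 2
```

The Module-as-key trick means two libraries can both call `@setstatus` on the same worker without conflict, which addresses the composability concern above.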
Previously we were not filtering out the current worker when calling `deregister_worker()` on `workers()`.
The new `WorkerState_exterminated` state is for indicating that a worker was killed by something other than us.
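To illustrate the distinction such a state enables (a sketch only: `WorkerState_exterminated` is the name from this commit, but the other enum values and the helper are illustrative):

```julia
# Sketch of a worker-state enum separating exits we initiated from
# external kills (segfault, OOM killer, sysadmin).
@enum WorkerState begin
    WorkerState_running
    WorkerState_exiting        # we asked the worker to shut down
    WorkerState_exterminated   # killed by something other than us
end

# A worker-exited callback can branch on the state to decide whether
# recovery (e.g. a nanny re-adding the worker) is warranted:
recovery_action(s::WorkerState) =
    s == WorkerState_exterminated ? :readd_worker : :nothing_to_do

recovery_action(WorkerState_exterminated)  # → :readd_worker
```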
This should fix an exception seen in CI from the lingering timeout task:
```
Test Summary: | Pass Total Time
Deserialization error recovery and include() | 11 11 3.9s
From worker 4: Unhandled Task ERROR: EOFError: read end of file
From worker 4: Stacktrace:
From worker 4: [1] wait
From worker 4: @ .\asyncevent.jl:159 [inlined]
From worker 4: [2] sleep(sec::Float64)
From worker 4: @ Base .\asyncevent.jl:265
From worker 4: [3] (::DistributedNext.var"#34#37"{DistributedNext.Worker, Float64})()
From worker 4: @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:213
```
There are a few changes in here; I would recommend reviewing each commit individually. The big ones are:
Depends on timholy/Revise.jl#871.