improve support for grouped workers in SpecCluster
#9121
Conversation
…nsistency

Establish four key concepts with precise terminology:

- Spec name: string key in the worker_spec dict (e.g. "0", "my-worker")
- Worker name: string the scheduler sees (e.g. "0-0", "0-1")
- Worker spec: dict with 'cls', 'options', 'group' (stored in worker_spec)
- Worker instance: actual Worker/Nanny object (stored in workers)

Changes:

- Rename _new_worker_name() → _new_spec_name() for clarity
- Add type hints: worker_spec: dict[str, dict], workers: dict[str, Worker | Nanny]
- Update the SpecCluster docstring with Terminology and Grouped Workers sections
- Fix examples to use string keys ("0", "1") and consistent variable names
- Add failing tests (TDD): test_new_spec_name_returns_string, test_worker_spec_keys_are_strings
- Ensure spec numbers are strings in the worker spec dictionary
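For concreteness, here is a minimal sketch of how the four concepts relate for one regular and one grouped spec; the dictionary contents and group suffixes are illustrative, not taken from the patch:

```python
from dask.distributed import Worker  # worker class referenced in the spec below

# worker_spec: keyed by *spec name*, each value is a *worker spec* dict.
worker_spec = {
    "0": {"cls": Worker, "options": {"nthreads": 1}},                         # regular spec
    "1": {"cls": Worker, "options": {"nthreads": 1}, "group": ["-0", "-1"]},  # grouped spec
}

# Worker names the scheduler sees are the spec name plus a group suffix:
#   spec "0" -> worker "0"
#   spec "1" -> workers "1-0" and "1-1"

# workers: keyed by spec name, values are the worker *instances*
# (a Worker/Nanny object, or a multi-worker wrapper for grouped specs).
```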
…ted tests to enable grouped worker support in SpecCluster

- `_spec_name_to_worker_names()`: maps a spec name → worker names
  - Regular workers: 1:1 mapping (spec "0" → worker "0")
  - Grouped workers: 1:many mapping (spec "0" → workers "0-0", "0-1", "0-2")
- `_worker_name_to_spec_name()`: maps a worker name → spec name
  - Handles both regular and grouped workers
  - Returns None if the worker name is not found
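The described semantics can be sketched as standalone functions (the method names above come from this PR; the bodies and standalone signatures below are an illustrative reimplementation of the described behaviour, not the actual patch):

```python
from __future__ import annotations


def worker_names_for_spec(spec_name: str, worker_spec: dict[str, dict]) -> list[str]:
    """Spec name -> worker names: 1:1 for regular specs, 1:many for grouped specs."""
    spec = worker_spec[spec_name]
    if "group" in spec:
        # Grouped spec: one worker name per group suffix, e.g. "0" -> ["0-0", "0-1", "0-2"]
        return [spec_name + suffix for suffix in spec["group"]]
    return [spec_name]


def spec_name_for_worker(worker_name: str, worker_spec: dict[str, dict]) -> str | None:
    """Worker name -> spec name, or None if the worker name is not found."""
    for spec_name in worker_spec:
        if worker_name in worker_names_for_spec(spec_name, worker_spec):
            return spec_name
    return None
```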
1. _update_worker_status(): when a grouped worker fails, removes the entire spec
2. _correct_state_internal(): maps spec names → worker names before retiring workers
3. scale(): correctly identifies launched specs by mapping worker names → spec names
4. scale_down(): simplified using the helper methods; handles both regular and grouped workers (see the sketch below)
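As a rough illustration of item 4 (a simplified stand-in reusing `spec_name_for_worker` from the sketch above, not the real `scale_down()`), the key step is translating scheduler-side worker names back to spec names before touching the spec dict:

```python
def remove_specs_for_workers(workers_to_close: list[str], worker_spec: dict[str, dict]) -> None:
    """Drop the specs backing the given worker names (illustrative only)."""
    spec_names = {
        spec_name
        for name in workers_to_close
        if (spec_name := spec_name_for_worker(name, worker_spec)) is not None
    }
    for spec_name in spec_names:
        # Removing one spec removes *all* workers in that group.
        worker_spec.pop(spec_name, None)
```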
Changes scale() behavior for grouped workers to use conservative (floor) rounding instead of ceiling, ensuring resource limits are never exceeded:

- scale(n): rounds DOWN to complete specs (e.g. scale(5) with 2-worker specs → 4 workers, not 6)
- scale(memory/cores): uses floor division for grouped workers to stay under limits
- Special case: when scaling from 0 workers, creates at least 1 spec to prevent deadlock (e.g. scale(1) with 2-worker specs → 2 workers)

This makes all scaling parameters consistent ("at most X") and prevents resource overcommitment, OOM, and CPU oversubscription. Updated the scale() docstring with comprehensive documentation of the conservative behavior and updated tests to reflect the new semantics.
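The arithmetic behind the conservative rounding can be seen in a small sketch (this assumes every spec produces the same number of workers, `workers_per_spec`; the real `scale()` also handles memory/cores targets and existing specs):

```python
def target_spec_count(requested_workers: int, workers_per_spec: int, current_workers: int = 0) -> int:
    """Conservative (floor) rounding: never exceed the requested worker count."""
    specs = requested_workers // workers_per_spec  # round DOWN to complete specs
    # Special case: when scaling up from zero workers, create at least one spec
    # so the cluster doesn't deadlock waiting for workers that never appear.
    if specs == 0 and requested_workers > 0 and current_workers == 0:
        specs = 1
    return specs


assert target_spec_count(5, 2) == 2  # scale(5) with 2-worker specs -> 4 workers, not 6
assert target_spec_count(1, 2) == 1  # scale(1) from zero workers -> 2 workers, avoids deadlock
```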
The `_update_worker_status()` method was checking worker names (e.g. "2-0") against `self.workers`, which is keyed by spec names (e.g. "2"). This caused the grouped worker spec removal code to never execute.

The bug manifested when:
1. Spec "0" is removed (leaving spec "1")
2. Adaptive scaling creates new spec "2"
3. Attempting to remove spec "2" fails silently

Fix:
- Map the worker name → spec name first using `_worker_name_to_spec_name()`
- Use the spec name to access `self.workers[spec_name]` for grouped workers
- Properly close and remove MultiWorker instances before removing specs

Testing:
- Added `test_grouped_worker_spec_removal_multiple_rounds` to validate that spec removal works across multiple rounds with different spec names
- Also checked that all usages of self.workers use the correct keys (spec names)
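A self-contained illustration of the key mismatch and the shape of the fix (reusing `spec_name_for_worker` from the earlier sketch; the dicts stand in for `self.worker_spec` and `self.workers`):

```python
worker_spec = {"2": {"cls": object, "options": {}, "group": ["-0", "-1"]}}
workers = {"2": "<MultiWorker instance>"}  # keyed by *spec* name

failed_worker = "2-0"                      # worker name reported by the scheduler

# Buggy check: the worker name is used directly against a spec-name-keyed dict,
# so the grouped-spec removal branch never runs.
assert failed_worker not in workers

# Fixed check: translate the worker name to its spec name first.
spec_name = spec_name_for_worker(failed_worker, worker_spec)
assert spec_name == "2" and spec_name in workers  # now the spec can be closed and removed
```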
Unit Test Results: see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 27 files ±0, 27 suites ±0, 9h 33m 26s ⏱️ (−3m 11s). For more details on these failures, see this check. Results for commit 6901bc1, compared against base commit c9e7aca. This pull request removes 1 test and adds 13 tests. Note that renamed tests count towards both.
Thanks for taking the time to pull things apart and figure this stuff out! As you can imagine, with a big PR like this it's going to take a little while to review. I particularly want to open up some draft PRs in downstream projects like dask-kubernetes and dask-jobqueue to check that CI is happy there. In the meantime, here are a few high-level thoughts:
I like the terminology around differentiating spec name from worker name
I want to challenge this expectation. I wonder if we want to pick a position on whether scaling grouped workers should round up or down here.
This logic sounds fine to me, but we should check that things are removed gracefully enough that we don't lose too much in one go. Imagine each spec was a large node with many workers; we may need to gradually remove them.
This is great. We like type hints, but there is so much code here without them that the best we can do is aim for eventual consistency.
@jacobtomlinson thanks for taking a look so quickly, glad the work is appreciated! Your points are all valid and understood, I think. I'll incorporate your suggestions, and there's no rush from my side to get this in; let's do it right 🙂 One thing I'm not 100% on is your comment about removing the whole spec when one worker in a group disappears - could you be more specific with your suggestion around graceful closure? I was building against the assumption that the reason for unexpected worker loss was preemption of a whole allocation in an HPC system, but the use case you present makes sense... happy to add delays/more grace to the spec removal process or to rethink this. Looking forward to hearing how it goes with dask-kubernetes; I'm running some tests against dask-jobqueue with this and will report any issues that crop up there too.
Good news: so far this PR has supported workers with multiple processes via SLURMCluster in dask-jobqueue without issue - the previous problems with this setup have not cropped up (yet?).
That's great! I've opened a few draft PRs in downstream projects that install the changes from this PR to see if CI is happy. It looks like there are some scaling failures in dask/dask-jobqueue#694 that might be related to the changes here. |
This PR supersedes #9104 and implements missing support for grouped workers in SpecCluster.

As described in #9104, support for grouped workers was originally implemented in #3013 and is still documented in the SpecCluster docstring, but it appears to have been broken for a while when used from dask-jobqueue (dask/dask-jobqueue#498 and dask/dask-jobqueue#691).

Key Concepts Disambiguated
Added clear terminology and documentation distinguishing:

- Spec name: string key in the worker_spec dict (e.g. "0", "my-worker")
- Worker name: string the scheduler sees (e.g. "0-0", "0-1")
- Worker spec: dict with 'cls', 'options', 'group' (stored in worker_spec)
- Worker instance: actual Worker/Nanny object (stored in workers)

This disambiguation was critical: it fixed a bug where worker names were incorrectly used as keys to access self.workers.
Semantic Decisions

- scale() uses conservative (floor) rounding for grouped workers so resource limits are never exceeded, with a special case when scaling up from zero workers
- When one worker in a grouped spec fails unexpectedly, the entire spec is removed
New Tests
Helper method tests: coverage for `_spec_name_to_worker_names()` and `_worker_name_to_spec_name()` across regular and grouped workers
Grouped worker functionality: including `test_grouped_worker_spec_removal_multiple_rounds` for spec removal across multiple scaling rounds
String spec name tests: `test_new_spec_name_returns_string`, `test_worker_spec_keys_are_strings`
Implementation Notes
This implementation uses type hints for new methods and parameters. This is not consistent with the rest of the codebase, and I am happy to remove these if that's the maintainers' preference.