
feat!: proposal for dynamic partition selection #321


Open
wants to merge 6 commits into
base: main

Conversation

cademirch

@cademirch cademirch commented Jun 6, 2025

Based on some of the work in https://github.com/harvardinformatics/snakemake-executor-plugin-cannon, I've put together this PR to add dynamic partition selection based on job resource requirements. Relevant discussion in #106

This works by having the user provide a YAML file, via the --slurm-partition-config option, that specifies their cluster's partitions and the resource limits of each. The expected structure of the file is simple:

partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...

I realize adding another config file isn't ideal given that workflow-specific configs, workflow profiles, and global profiles already exist. But this approach avoids the complexity of having to determine cluster configuration through SLURM commands, since those configurations can vary wildly between clusters (as discussed in #106). It also sidesteps the need for complex expressions in set-resources to determine partitions, which can get unwieldy for big workflows with lots of rules. Ideally, users can craft a partitions.yml once and set it in their global profile.
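For example, assuming the file is saved as partitions.yml, the invocation could look like this (all other flags elided):

snakemake --executor slurm --slurm-partition-config partitions.yml

Equivalently, the option can live in a global profile so it does not have to be passed on every run.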

I'm not super familiar with SLURM and partition resource limits, so I came up with a list based on the Snakemake standard resources and SLURM resources:

The following limits can be defined for each partition:

Parameter | Type | Description | Default
max_runtime | int | Maximum walltime in minutes | unlimited
max_mem_mb | int | Maximum total memory in MB | unlimited
max_mem_mb_per_cpu | int | Maximum memory per CPU in MB | unlimited
max_cpus_per_task | int | Maximum CPUs per task | unlimited
max_nodes | int | Maximum number of nodes | unlimited
max_tasks | int | Maximum number of tasks | unlimited
max_tasks_per_node | int | Maximum tasks per node | unlimited
max_gpu | int | Maximum number of GPUs | 0
available_gpu_models | list[str] | List of available GPU models | none
max_cpus_per_gpu | int | Maximum CPUs per GPU | unlimited
supports_mpi | bool | Whether MPI jobs are supported | true
max_mpi_tasks | int | Maximum MPI tasks | unlimited
available_constraints | list[str] | List of available node constraints | none
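
For illustration, a hypothetical partitions.yml combining several of these limits might look like the following (partition names and values are made up):

partitions:
  short:
    max_runtime: 120
    max_cpus_per_task: 32
    max_mem_mb: 256000
  gpu:
    max_runtime: 2880
    max_gpu: 4
    available_gpu_models:
      - a100
      - v100
    supports_mpi: false
  bigmem:
    max_mem_mb: 2000000
    max_mem_mb_per_cpu: 16000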

It would also be possible to support arbitrary resources by pattern matching on "max_{resource}", though I've not implemented that in this PR.

To pick the "best" partition for a job, I went with a naive scoring approach that sums the ratios of requested resources to partition limits. Higher scores indicate better resource utilization; for example, a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125). Partitions that cannot satisfy a job's requirements are not considered.
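
A minimal, illustrative sketch of this ratio-based scoring (not the PR's actual code; the function and dictionary layout here are hypothetical):

import math

def score(requested: dict, limits: dict) -> float | None:
    """Sum of requested/limit ratios; None if any limit is exceeded.

    Both dicts map resource names (e.g. "cpus", "mem_mb") to numbers.
    A limit of math.inf means "unlimited" and adds nothing to the score.
    """
    total = 0.0
    for name, value in requested.items():
        limit = limits.get(name, math.inf)
        if value > limit:
            return None  # partition cannot satisfy this job
        if not math.isinf(limit):
            total += value / limit
    return total

# A job asking for 8 CPUs scores 0.5 on a 16-CPU partition and 0.125 on a
# 64-CPU partition, so the 16-CPU partition is preferred.
assert score({"cpus": 8}, {"cpus": 16}) == 0.5
assert score({"cpus": 8}, {"cpus": 64}) == 0.125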

This feature is opt-in and respects the slurm_partition job resource, as well as the existing fallback partition logic.

@cmeesters, @johanneskoester let me know what you think of this approach! I'm not particularly experienced with SLURM, so I've made the decisions here (limits, partition specs, etc.) based on my limited experience and the available docs, and feedback is much appreciated.

Summary by CodeRabbit

  • New Features

    • Introduced automatic SLURM partition selection based on job resource requirements and a user-provided YAML configuration file.
    • Added support for specifying partition configuration via a new command line option.
  • Documentation

    • Updated documentation to describe the new automatic partition selection feature, including configuration examples and usage details.
  • Tests

    • Added comprehensive tests to validate partition configuration parsing and job-to-partition matching logic.

@cademirch cademirch requested a review from johanneskoester June 6, 2025 22:23
Contributor

coderabbitai bot commented Jun 6, 2025

Walkthrough

This update introduces an automatic SLURM partition selection feature for the executor plugin, enabling dynamic partition assignment based on job resource requirements and a user-supplied YAML configuration. The implementation includes new logic for partition scoring and selection, updates to the executor's control flow, and comprehensive tests validating partition parsing and matching.

Changes

File(s) | Change Summary
docs/further.md | Added documentation for the new automatic SLURM partition selection feature, including configuration details and usage explanation.
snakemake_executor_plugin_slurm/__init__.py | Integrated partition config loading, deferred partition argument construction, and added automatic partition selection logic.
snakemake_executor_plugin_slurm/partitions.py | New module: defines partition/resource data classes, YAML config parsing, job resource parsing, and partition scoring/selection.
snakemake_executor_plugin_slurm/submit_string.py | Removed partition parameter from the SLURM submit command construction logic.
tests/tests.py | Added comprehensive tests for partition config reading and automatic partition selection logic, including edge cases.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Executor
    participant PartitionsModule
    participant SLURM

    User->>Executor: Submit job with resource requirements
    Executor->>PartitionsModule: Load partition config (if provided)
    Executor->>PartitionsModule: get_best_partition(job)
    PartitionsModule->>PartitionsModule: Score job fit for each partition
    PartitionsModule-->>Executor: Return best partition (or None)
    Executor->>SLURM: Submit job with selected partition (if any)
    SLURM-->>Executor: Job scheduled
    Executor-->>User: Job submission result

Suggested reviewers

  • dlaehnemann

Poem

In the warren of compute, where jobs all compete,
A clever new feature makes partitioning neat!
With YAML in paw and resources in mind,
The plugin now chooses the partition aligned.
So hop on and run, let your workflows take flight—
The best fit is chosen, your jobs run just right!
🐇✨


@cademirch cademirch requested a review from cmeesters June 6, 2025 22:23
@cademirch cademirch changed the title from "feat proposal for automatic partition selection" to "feat: proposal for automatic partition selection" on Jun 6, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
snakemake_executor_plugin_slurm/partitions.py (3)

44-53: Use appropriate logging levels.

The function uses logger.warning for informational messages about partition selection. These are normal operational messages, not warnings.

-        logger.warning(
+        logger.info(
             f"Auto-selected partition '{partition}' for job {job.name} "
             f"with score {best_score:.3f}"
         )
         return partition
     else:
         logger.warning(
             f"No suitable partition found for job {job.name} based on "
             f"resource requirements. Falling back to default behavior."
         )

248-250: Simplify nested if statements.

As suggested by static analysis, these nested if statements can be combined for better readability.

-            if gpu_model and self.limits.available_gpu_models:
-                if gpu_model not in self.limits.available_gpu_models:
-                    return None
+            if gpu_model and self.limits.available_gpu_models and gpu_model not in self.limits.available_gpu_models:
+                return None
🧰 Tools
🪛 Ruff (0.11.9)

248-249: Use a single if statement instead of nested if statements

(SIM102)


184-223: Consider documenting and refining the scoring algorithm.

The current scoring algorithm simply sums the ratios of requested resources to limits, giving equal weight to all resources. This may not reflect real-world partition selection priorities.

Consider:

  1. Adding weights to different resources based on their importance
  2. Documenting the scoring rationale in the method docstring
  3. Making the scoring configurable through the partition configuration

For example, memory and GPU availability might be more critical than other factors in partition selection.
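
For instance, a weighted variant of the ratio scoring could look like this (purely illustrative; the weight values and dictionary layout are hypothetical, not part of the PR):

import math

def weighted_score(requested: dict, limits: dict, weights: dict) -> float | None:
    """Ratio sum as above, but each resource's contribution is scaled by a weight."""
    total = 0.0
    for name, value in requested.items():
        limit = limits.get(name, math.inf)
        if value > limit:
            return None  # partition cannot run the job at all
        if not math.isinf(limit):
            total += weights.get(name, 1.0) * (value / limit)
    return total

# e.g. treat memory fit as twice as important as CPU fit
weights = {"mem_mb": 2.0, "cpus": 1.0}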

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec82a70 and f5511ff.

📒 Files selected for processing (5)
  • docs/further.md (1 hunks)
  • snakemake_executor_plugin_slurm/__init__.py (6 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • snakemake_executor_plugin_slurm/submit_string.py (0 hunks)
  • tests/tests.py (2 hunks)
💤 Files with no reviewable changes (1)
  • snakemake_executor_plugin_slurm/submit_string.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
snakemake_executor_plugin_slurm/__init__.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-27)
  • get_best_partition (30-54)
tests/tests.py (4)
snakemake_executor_plugin_slurm/__init__.py (1)
  • ExecutorSettings (36-119)
snakemake_executor_plugin_slurm/utils.py (1)
  • set_gres_string (51-123)
snakemake_executor_plugin_slurm/submit_string.py (1)
  • get_submit_command (5-69)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-27)
  • get_best_partition (30-54)
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 485-485: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 78-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 115-144: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 93-93: Too many branches (14/12)

(R0912)


[refactor] 150-150: Too many instance attributes (13/7)

(R0902)


[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

🪛 Ruff (0.11.9)
snakemake_executor_plugin_slurm/partitions.py

248-249: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (4)
docs/further.md (1)

67-139: Well-documented feature with clear examples!

The documentation for the automatic partition selection feature is comprehensive and well-structured. The table of configurable limits, example YAML configuration, and explanation of the scoring algorithm provide users with all the necessary information to effectively use this feature.

tests/tests.py (1)

485-974: Excellent test coverage for partition selection feature!

The test suite comprehensively covers all aspects of the partition selection feature:

  • Various partition configuration scenarios (minimal, comprehensive, invalid)
  • Edge cases (empty partitions, missing keys, invalid YAML)
  • Different job resource requirements (CPU, GPU, MPI, constraints)
  • Proper error handling and logging verification

The use of pytest fixtures and parametrization makes the tests maintainable and easy to extend.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 485-485: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/__init__.py (1)

639-652: Well-implemented partition selection priority logic!

The enhanced get_partition_arg method correctly implements the partition selection priority:

  1. Explicit partition specification (via slurm_partition resource)
  2. Automatic selection based on loaded partition config
  3. Fallback to default partition behavior

This maintains backward compatibility while adding the new automatic selection feature.
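
A rough sketch of that priority order (illustrative only; the attribute names for the loaded config and the fallback partition, as well as the get_best_partition call signature, are guesses rather than the plugin's actual code):

def get_partition_arg(self, job) -> str:
    # 1. Explicit partition requested via the slurm_partition resource
    explicit = job.resources.get("slurm_partition")
    if explicit:
        return f" -p {explicit}"
    # 2. Automatic selection from the loaded partition config, if provided
    if self._partitions:
        best = get_best_partition(job, self._partitions)
        if best:
            return f" -p {best}"
    # 3. Fall back to the pre-existing default-partition behavior
    if self._fallback_partition:
        return f" -p {self._fallback_partition}"
    return ""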

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 649-652: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

snakemake_executor_plugin_slurm/partitions.py (1)

149-175: Well-designed dataclass for partition limits.

The PartitionLimits dataclass appropriately models SLURM partition resource constraints with sensible defaults (infinity for unlimited, 0 for GPUs).
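
An abridged, hypothetical sketch of what such a dataclass could look like (field names follow the limits table in the PR description; this is not the module's exact code):

from dataclasses import dataclass
from math import inf
from typing import List, Optional

@dataclass
class PartitionLimits:
    # Standard resources; inf means "unlimited"
    max_runtime: float = inf
    max_mem_mb: float = inf
    max_mem_mb_per_cpu: float = inf
    max_cpus_per_task: float = inf
    # GPU resources; partitions without GPUs default to 0
    max_gpu: int = 0
    available_gpu_models: Optional[List[str]] = None
    max_cpus_per_gpu: float = inf
    # MPI support
    supports_mpi: bool = True
    max_mpi_tasks: float = inf
    # Node constraints
    available_constraints: Optional[List[str]] = None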

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 150-150: Too many instance attributes (13/7)

(R0902)

Comment on lines 110 to 119
    partition_config: Optional[Path] = field(
        default=None,
        metadata={
            "help": "Path to YAML file defining partition limits for automatic partition selection. "
            "When provided, jobs will be automatically assigned to the best-fitting partition based on "
            "See documentation for complete list of available limits.",
            "env_var": False,
            "required": False,
        },
    )
Contributor


⚠️ Potential issue

Fix incomplete help text for partition_config.

The help text appears to be incomplete. It currently reads:

"When provided, jobs will be automatically assigned to the best-fitting partition based on "

This sentence is missing its conclusion.

-            "help": "Path to YAML file defining partition limits for automatic partition selection. "
-            "When provided, jobs will be automatically assigned to the best-fitting partition based on "
-            "See documentation for complete list of available limits.",
+            "help": "Path to YAML file defining partition limits for automatic partition selection. "
+            "When provided, jobs will be automatically assigned to the best-fitting partition based on "
+            "their resource requirements. See documentation for complete list of available limits.",

Comment on lines +125 to +126
        if cpus_per_task < 0:
            return (0, "none")
Contributor


🛠️ Refactor suggestion

Consider raising an error for negative CPU values.

When cpus_per_task is negative, the function returns (0, "none"). This silently ignores an invalid configuration that likely indicates a user error.

         if cpus_per_task < 0:
-            return (0, "none")
+            raise WorkflowError(f"Invalid negative value for cpus_per_task: {cpus_per_task}")

Comment on lines +81 to +89
        # Parse gres string format: gpu:<number> or gpu:<model>:<number>
        gpu_parts = [part for part in gres.split(",") if part.strip().startswith("gpu")]
        if gpu_parts:
            gpu_spec = gpu_parts[0].strip().split(":")
            if len(gpu_spec) == 2:  # gpu:<number>
                return int(gpu_spec[1]), None
            elif len(gpu_spec) == 3:  # gpu:<model>:<number>
                return int(gpu_spec[2]), gpu_spec[1]

Contributor


⚠️ Potential issue

Add error handling for gres GPU parsing.

The gres parsing logic has potential issues:

  1. Integer conversion could fail if the number part isn't numeric
  2. Only the first GPU specification is processed from comma-separated values
  3. No validation for unexpected formats
         # Parse gres string format: gpu:<number> or gpu:<model>:<number>
         gpu_parts = [part for part in gres.split(",") if part.strip().startswith("gpu")]
         if gpu_parts:
             gpu_spec = gpu_parts[0].strip().split(":")
             if len(gpu_spec) == 2:  # gpu:<number>
-                return int(gpu_spec[1]), None
+                try:
+                    return int(gpu_spec[1]), None
+                except ValueError:
+                    raise WorkflowError(f"Invalid GPU count in gres: {gpu_spec[1]}")
             elif len(gpu_spec) == 3:  # gpu:<model>:<number>
-                return int(gpu_spec[2]), gpu_spec[1]
+                try:
+                    return int(gpu_spec[2]), gpu_spec[1]
+                except ValueError:
+                    raise WorkflowError(f"Invalid GPU count in gres: {gpu_spec[2]}")
+            else:
+                raise WorkflowError(f"Invalid gres GPU format: {gpu_parts[0]}")

Comment on lines +13 to +27
def read_partition_file(partition_file: Path) -> List["Partition"]:
    with open(partition_file, "r") as f:
        out = []
        partitions_dict = yaml.safe_load(f)["partitions"]
        for partition_name, partition_config in partitions_dict.items():
            if not partition_name or not partition_name.strip():
                raise KeyError("Partition name cannot be empty")

            out.append(
                Partition(
                    name=partition_name,
                    limits=PartitionLimits(**partition_config),
                )
            )
        return out
Contributor


⚠️ Potential issue

Add error handling for file operations and YAML parsing.

The function lacks error handling for several potential failures:

  • File not found or permission errors
  • Invalid YAML syntax
  • Missing "partitions" key
  • Invalid partition configuration structure

Consider adding comprehensive error handling:

 def read_partition_file(partition_file: Path) -> List["Partition"]:
-    with open(partition_file, "r") as f:
-        out = []
-        partitions_dict = yaml.safe_load(f)["partitions"]
+    try:
+        with open(partition_file, "r") as f:
+            config = yaml.safe_load(f)
+    except FileNotFoundError:
+        raise WorkflowError(f"Partition configuration file not found: {partition_file}")
+    except yaml.YAMLError as e:
+        raise WorkflowError(f"Invalid YAML in partition configuration: {e}")
+    except Exception as e:
+        raise WorkflowError(f"Error reading partition configuration: {e}")
+    
+    if not isinstance(config, dict) or "partitions" not in config:
+        raise WorkflowError("Partition configuration must contain a 'partitions' key")
+    
+    partitions_dict = config["partitions"]
+    if not isinstance(partitions_dict, dict):
+        raise WorkflowError("'partitions' must be a dictionary")
+    
+    out = []
         for partition_name, partition_config in partitions_dict.items():
             if not partition_name or not partition_name.strip():
                 raise KeyError("Partition name cannot be empty")
-
+            
+            if not isinstance(partition_config, dict):
+                raise WorkflowError(f"Configuration for partition '{partition_name}' must be a dictionary")
+            
             out.append(
                 Partition(
                     name=partition_name,
                     limits=PartitionLimits(**partition_config),
                 )
             )
         return out

@cademirch cademirch changed the title from "feat: proposal for automatic partition selection" to "feat: proposal for dynamic partition selection" on Jun 7, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
snakemake_executor_plugin_slurm/partitions.py (3)

13-27: Add error handling for file operations and YAML parsing.

The function lacks error handling for several potential failures as previously noted.


81-89: Add error handling for gres GPU parsing.

The gres parsing logic needs error handling for integer conversion failures and format validation as previously noted.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


125-126: Consider raising an error for negative CPU values.

When cpus_per_task is negative, raising an error would better signal the configuration issue.

🧹 Nitpick comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

248-250: Simplify nested if statements.

Combine the nested conditions for better readability.

-            if gpu_model and self.limits.available_gpu_models:
-                if gpu_model not in self.limits.available_gpu_models:
-                    return None
+            if gpu_model and self.limits.available_gpu_models and gpu_model not in self.limits.available_gpu_models:
+                return None
🧰 Tools
🪛 Ruff (0.11.9)

248-249: Use a single if statement instead of nested if statements

(SIM102)


190-223: The resource ratio scoring approach promotes efficient partition utilization.

The scoring algorithm effectively:

  • Favors partitions that closely match job requirements
  • Prevents small jobs from monopolizing large partitions
  • Handles unlimited resources gracefully with isinf checks

This should lead to better overall cluster utilization.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5511ff and 01a94c6.

📒 Files selected for processing (3)
  • snakemake_executor_plugin_slurm/__init__.py (7 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • tests/tests.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake_executor_plugin_slurm/__init__.py
🧰 Additional context used
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 486-486: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 78-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 115-144: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 93-93: Too many branches (14/12)

(R0912)


[refactor] 150-150: Too many instance attributes (13/7)

(R0902)


[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

🪛 Ruff (0.11.9)
snakemake_executor_plugin_slurm/partitions.py

248-249: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (4)
tests/tests.py (2)

6-9: LGTM! Appropriate imports for partition testing functionality.

The new imports support the comprehensive test suite for the partition selection feature.

Also applies to: 13-16


486-976: Excellent test coverage for the partition selection feature!

The test suite comprehensively covers:

  • Various partition configurations (basic, minimal, comprehensive)
  • Different job types (CPU-only, GPU, MPI)
  • Edge cases and error conditions
  • Partition scoring and selection logic

The tests ensure robust validation of the new automatic partition selection functionality.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 486-486: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py (2)

30-55: Well-implemented partition selection logic!

The function correctly:

  • Filters out incompatible partitions
  • Selects the best-scoring partition
  • Provides informative logging for both success and failure cases
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


149-175: Well-structured partition limits dataclass!

The dataclass appropriately:

  • Uses inf for unlimited numerical resources
  • Defaults max_gpu to 0 for non-GPU partitions
  • Organizes resources by type (standard, SLURM-specific, GPU, MPI, constraints)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 150-150: Too many instance attributes (13/7)

(R0902)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/tests.py (1)

547-550: Improve test case for missing required fields.

The current test uses an empty partition name "" but doesn't truly test missing required fields. Consider testing with a partition that completely lacks the name field.

 @pytest.fixture
 def missing_name_config(self):
     """Configuration with missing name field."""
-    return {"partitions": {"": {}}}  # Empty partition name
+    return {"partitions": {None: {}}}  # None as key to test missing name handling
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01a94c6 and bb8c471.

📒 Files selected for processing (1)
  • tests/tests.py (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 486-486: Too many public methods (28/20)

(R0904)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (6)
tests/tests.py (6)

6-8: LGTM! Necessary imports for partition testing.

The added imports for tempfile, yaml, and Path are appropriate for creating temporary configuration files and testing the YAML parsing functionality.


13-16: LGTM! Proper imports for partition functionality.

The imports for read_partition_file and get_best_partition are correctly importing the new partition selection functions being tested.


486-486: Comprehensive test coverage is appropriate despite static analysis warning.

The pylint warning about too many public methods (28/20) is a false positive for test classes. Comprehensive test suites naturally require many test methods to cover all scenarios, and this level of coverage is beneficial for the new partition selection feature.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 486-486: Too many public methods (28/20)

(R0904)


557-567: LGTM! Well-designed helper fixture for temporary files.

The temp_yaml_file fixture provides a clean way to create temporary YAML files for testing. The implementation properly handles file creation and returns the path for cleanup.


569-709: Excellent coverage of file reading scenarios.

The tests comprehensively cover various file reading scenarios including:

  • Valid partition configurations with different complexity levels
  • Invalid YAML syntax
  • Missing files and required keys
  • Proper cleanup of temporary files

The error handling validation ensures the partition parsing is robust.


734-970: Comprehensive partition selection test coverage.

The partition selection tests effectively validate the core functionality including:

  • CPU and GPU job matching
  • MPI support validation
  • Constraint compatibility checking
  • Resource limit enforcement
  • Edge cases and error scenarios

The tests properly verify both successful partition selection and appropriate handling when no suitable partition exists.

@cmeesters
Member

@cademirch Hoi Cade,

this is most impressive and looks like a major step forward! Alas, I will be at a conference for the next week, and this PR really needs to be thoroughly tested. So please allow me a bit more time than usual.

On the plus side: I will add a '!' to the title, as this might be worth a major release.

@cmeesters cmeesters changed the title from "feat: proposal for dynamic partition selection" to "feat!: proposal for dynamic partition selection" on Jun 8, 2025
@cademirch
Author

cademirch commented Jun 8, 2025

@cademirch Hoi Cade,

this is most impressive and looks like a major step forward! Alas, I will be at a conference for the next week, and this PR really needs to be thoroughly tested. So please allow me a bit more time than usual.

On the plus side: I will add a '!' to the title, as this might be worth a major release.

Hi @cmeesters!

Of course. I implemented various tests (albeit with LLM assistance, though I looked them over carefully and they mostly make sense). I expect there are likely use cases and SLURM configurations I've overlooked, on which your feedback will be very helpful.

Enjoy your conference!

@cademirch
Author

@cmeesters Gentle bump here, would be great to discuss this.

@cmeesters
Member

cmeesters commented Jun 29, 2025

@cademirch I'm sorry, I forgot to notify you: I didn't have a cluster for the last week (cooling issues). Hopefully we will be open again by next week. Still, I do not know how much time I will find: holidays are approaching. I really want to test this. It would be a wonderful addition!

update: still no working cluster for me. As I will be on holiday until mid-July, there will be little progress, I'm afraid. I want to attempt a first test once the cluster is up again, but whether that will leave me time to do something substantial, I do not know.
