feat!: proposal for dynamic partition selection #321

Open · wants to merge 6 commits into `main`
73 changes: 73 additions & 0 deletions docs/further.md
@@ -64,6 +64,79 @@ See the [snakemake documentation on profiles](https://snakemake.readthedocs.io/e
You can, for example, scale configurations with input file size, or increase the runtime with every `attempt` of running a job (if [`--retries` is greater than `0`](https://snakemake.readthedocs.io/en/stable/executing/cli.html#snakemake.cli-get_argument_parser-behavior)).
[There are detailed examples for these in the snakemake documentation.](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-resources)

#### Automatic Partition Selection

The SLURM executor plugin supports automatic partition selection based on job resource requirements, via the command line option `--slurm-partition-config`. This feature allows the plugin to choose the most appropriate partition for each job, without the need to manually specify partitions for different job types. This also enables variable partition selection as a job's resource requirements change based on [dynamic resources](#dynamic-resource-specification), ensuring that jobs are always scheduled to an appropriate partition.

*Jobs that explicitly specify a `slurm_partition` resource will bypass automatic selection and use the specified partition directly.*
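
For example, assuming the limits file described below is saved as `partitions.yaml` (the file name is arbitrary), the option can be passed directly on the command line (`--slurm-partition-config partitions.yaml`) or set in a workflow profile. A minimal profile sketch, assuming the usual convention that profile keys mirror the CLI options without the leading dashes:

```yaml
# Hypothetical workflow profile (config.yaml); `partitions.yaml` is a
# placeholder name for the limits file defined in the next section.
executor: slurm
slurm-partition-config: partitions.yaml
```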

##### Partition Limits Specification

To enable automatic partition selection, create a YAML configuration file that defines the available partitions and their resource limits. This file should be structured as follows:

```yaml
partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...
```

Here, `some_partition` and `another_partition` are the names of partitions on your cluster, as reported by `sinfo`.

The following limits can be defined for each partition:

| Parameter | Type | Description | Default |
| ----------------------- | --------- | ---------------------------------- | --------- |
| `max_runtime` | int | Maximum walltime in minutes | unlimited |
| `max_mem_mb` | int | Maximum total memory in MB | unlimited |
| `max_mem_mb_per_cpu` | int | Maximum memory per CPU in MB | unlimited |
| `max_cpus_per_task` | int | Maximum CPUs per task | unlimited |
| `max_nodes` | int | Maximum number of nodes | unlimited |
| `max_tasks` | int | Maximum number of tasks | unlimited |
| `max_tasks_per_node` | int | Maximum tasks per node | unlimited |
| `max_gpu` | int | Maximum number of GPUs | 0 |
| `available_gpu_models` | list[str] | List of available GPU models | none |
| `max_cpus_per_gpu` | int | Maximum CPUs per GPU | unlimited |
| `supports_mpi` | bool | Whether MPI jobs are supported | true |
| `max_mpi_tasks` | int | Maximum MPI tasks | unlimited |
| `available_constraints` | list[str] | List of available node constraints | none |

##### Example Partition Configuration

```yaml
partitions:
  standard:
    max_runtime: 720       # 12 hours
    max_mem_mb: 64000      # 64 GB
    max_cpus_per_task: 24
    max_nodes: 1

  highmem:
    max_runtime: 1440      # 24 hours
    max_mem_mb: 512000     # 512 GB
    max_mem_mb_per_cpu: 16000
    max_cpus_per_task: 48
    max_nodes: 1

  gpu:
    max_runtime: 2880      # 48 hours
    max_mem_mb: 128000     # 128 GB
    max_cpus_per_task: 32
    max_gpu: 8
    available_gpu_models: ["a100", "v100", "rtx3090"]
    max_cpus_per_gpu: 8
```
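
Under this example configuration, a rule that scales its memory with every retry (a hypothetical sketch using standard Snakemake resource names; `--retries` must be greater than `0` for a second attempt to happen) starts on `standard` and is automatically re-evaluated against the partition limits when the request grows:

```python
rule sort_large_file:
    input:
        "data/{sample}.txt",
    output:
        "sorted/{sample}.txt",
    resources:
        cpus_per_task=8,
        runtime=600,  # minutes; within every max_runtime above
        # attempt 1: 48 GB -> fits `standard` (max_mem_mb: 64000)
        # attempt 2: 96 GB -> exceeds `standard`, so the job is scored
        # against the remaining compatible partitions instead
        mem_mb=lambda wildcards, attempt: 48000 * attempt,
    shell:
        "sort {input} > {output}"
```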

##### How Partition Selection Works

When automatic partition selection is enabled, the plugin evaluates each job's resource requirements against the defined partition limits to ensure the job is placed on a partition that can accommodate all of its requirements. When multiple partitions are compatible, the plugin uses a scoring algorithm that favors partitions with limits closer to the job's needs, preventing jobs from being assigned to partitions with excessively high resource limits.

The scoring algorithm calculates a score by summing the ratios of requested resources to partition limits (e.g., if a job requests 8 CPUs and a partition allows 16, this contributes 0.5 to the score). Higher scores indicate better resource utilization, so a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125).
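
A minimal sketch of that scoring idea (illustrative only, not the plugin's actual code; it assumes `requested` and `limits` are plain dicts keyed by resource name and that incompatible partitions have already been filtered out):

```python
def partition_score(requested: dict[str, float], limits: dict[str, float]) -> float:
    """Sum of requested/limit ratios; higher means a tighter fit.

    Resources without a configured limit ("unlimited") contribute
    nothing to the score.
    """
    return sum(
        amount / limits[name]
        for name, amount in requested.items()
        if limits.get(name)  # skip unlimited / unspecified limits
    )

# A job requesting 8 CPUs prefers the tighter 16-CPU partition:
partition_score({"cpus": 8}, {"cpus": 16})  # -> 0.5
partition_score({"cpus": 8}, {"cpus": 64})  # -> 0.125
```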

##### Fallback Behavior

If no configured partition can accommodate a job's resource requirements, the plugin falls back to the default SLURM behavior, i.e. submitting the job to the cluster's default partition.


#### Standard Resources

30 changes: 27 additions & 3 deletions snakemake_executor_plugin_slurm/__init__.py
@@ -29,6 +29,7 @@

from .utils import delete_slurm_environment, delete_empty_dirs, set_gres_string
from .submit_string import get_submit_command
+from .partitions import read_partition_file, get_best_partition


@dataclass
@@ -106,6 +107,17 @@ class ExecutorSettings(ExecutorSettingsBase):
"required": False,
},
)
partition_config: Optional[Path] = field(
default=None,
metadata={
"help": "Path to YAML file defining partition limits for dynamic "
"partition selection. When provided, jobs will be dynamically "
"assigned to the best-fitting partition based on "
"See documentation for complete list of available limits.",
"env_var": False,
"required": False,
},
)


# Required:
@@ -149,6 +161,11 @@ def __post_init__(self, test_mode: bool = False):
            if self.workflow.executor_settings.logdir
            else Path(".snakemake/slurm_logs").resolve()
        )
+        self._partitions = (
+            read_partition_file(self.workflow.executor_settings.partition_config)
+            if self.workflow.executor_settings.partition_config
+            else None
+        )
        atexit.register(self.clean_old_logs)

    def clean_old_logs(self) -> None:
@@ -231,12 +248,13 @@ def run_job(self, job: JobExecutorInterface):
        if job.resources.get("slurm_extra"):
            self.check_slurm_extra(job)

+        # NOTE: partition is removed from job_params below, such that partition
+        # selection can benefit from resource checking as the call is built up.
        job_params = {
            "run_uuid": self.run_uuid,
            "slurm_logfile": slurm_logfile,
            "comment_str": comment_str,
            "account": self.get_account_arg(job),
-            "partition": self.get_partition_arg(job),
            "workdir": self.workflow.workdir_init,
        }
@@ -279,6 +297,8 @@
                "Probably not what you want."
            )

+        call += self.get_partition_arg(job)
+
        exec_job = self.format_job_exec(job)

        # and finally the job to execute with all the snakemake parameters
@@ -378,7 +398,7 @@ async def check_active_jobs(

        # We use this sacct syntax for argument 'starttime' to keep it compatible
        # with slurm < 20.11
-        sacct_starttime = f"{datetime.now() - timedelta(days = 2):%Y-%m-%dT%H:00}"
+        sacct_starttime = f"{datetime.now() - timedelta(days=2):%Y-%m-%dT%H:00}"
        # previously we had
        # f"--starttime now-2days --endtime now --name {self.run_uuid}"
        # in line 218 - once v20.11 is definitively not in use any more,
@@ -618,9 +638,13 @@ def get_partition_arg(self, job: JobExecutorInterface):
        returns a default partition, if applicable
        else raises an error - implicitly.
        """
+        partition = None
        if job.resources.get("slurm_partition"):
            partition = job.resources.slurm_partition
-        else:
+        elif self._partitions:
+            partition = get_best_partition(self._partitions, job, self.logger)
+        # we didn't get a partition yet, so try the fallback
+        if not partition:
            if self._fallback_partition is None:
                self._fallback_partition = self.get_default_partition(job)
            partition = self._fallback_partition