
Commit 70b57e2

Lukas Fehring authored

1226 parallelisation (#1234)

* Play with parallelisation
* Update parallelism doc
* Update parallelisation example
* Update docs
* Update parallelism docs
* Update parallelism
* Adapt CHANGELOG.md

Co-authored-by: Lukas Fehring <[email protected]>
1 parent 99e4454 commit 70b57e2

File tree

4 files changed, +97 -98 lines changed

.gitignore

Lines changed: 6 additions & 1 deletion
@@ -133,9 +133,13 @@ dmypy.json
 # Pyre type checker
 .pyre/
 
+# SMAC Logs
 *smac3-output_*
 *smac3_output*
 
+# Dask Logs
+tmp/smac_dask_slurm
+
 # macOS files
 .DS_Store
 
@@ -150,4 +154,5 @@ src
 .vscode
 
 projects
-_api
+_api
+branin.pkl

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@
 
 ## Examples
 - Add target function with additional arguments (#1134)
+- Adapt parallelization example (#1214)
 
 ## Improvements
 - Submit trials to runners in SMBO instead of running configs directly (#937)
@@ -23,6 +24,7 @@
 
 ## Documentation
 - Ask and tell without initial design and warmstarting
+- Add a description of parallelization in the documentation (#1226)
 
 ## Bugfixes
 - Ask and tell without initial design may no longer return a config from the initial design - if it is not "removed".
Lines changed: 55 additions & 20 deletions
@@ -1,41 +1,76 @@
 # Parallelism
 
-SMAC supports multiple workers natively via Dask. Just specify ``n_workers`` in the scenario and you are ready to go.
-
+To facilitate parallel execution, SMAC supports executing multiple workers simultaneously via [Dask](https://www.dask.org/). Using this functionality splits SMAC into a main process and Dask workers, which handle the execution.
+The main job handles the optimization process and coordinates the executor jobs. The executors receive the target function and hyperparameter configurations, execute them, and return their results. The executors remain open between executions.
 
 !!! note
 
     Please keep in mind that additional workers are only used to evaluate trials. The main thread still orchestrates the
     optimization process, including training the surrogate model.
 
-
 !!! warning
 
-    Using high number of workers when the target function evaluation is fast might be counterproductive due to the
-    overhead of communcation. Consider using only one worker in this case.
-
+    When using multiple workers, SMAC is not reproducible anymore.
 
-!!! warning
 
-    When using multiple workers, SMAC is not reproducible anymore.
+## Parallelizing Locally
 
+To utilize parallelism locally, i.e. running workers on the same machine as the main job, specify the ``n_workers`` keyword when creating the scenario.
+```python
+Scenario(model.configspace, n_workers=5)
+```
 
-## Running on a Cluster
 
-You can also pass a custom dask client, e.g. to run on a slurm cluster.
-See our [parallelism example](../examples/1%20Basics/7_parallelization_cluster.md).
+## Parallelizing on SLURM
 
-!!! warning
+To utilize this split of main and execution jobs on a [SLURM cluster](https://slurm.schedmd.com/), SMAC supports manually specifying a [Dask](https://www.dask.org/) client.
+This allows executing the target function on dedicated SLURM jobs, each configured with the same hardware requirements.
 
-    On some clusters you cannot spawn new jobs when running a SLURMCluster inside a
-    job instead of on the login node. No obvious errors might be raised but it can hang silently.
+!!! note
 
-!!! warning
+    While most SLURM clusters behave similarly, the example Dask client might not work for every cluster. For example, some clusters only allow spawning new jobs
+    from the login node.
 
-    Sometimes you need to modify your launch command which can be done with
-    `SLURMCluster.job_class.submit_command`.
+To configure SMAC properly for each cluster, you need to know the ports which allow communication between the main and worker jobs. The Dask client is then created as follows:
 
 ```python
-cluster.job_cls.submit_command = submit_command
-cluster.job_cls.cancel_command = cancel_command
-```
+...
+from smac import BlackBoxFacade, Scenario
+from dask.distributed import Client
+from dask_jobqueue import SLURMCluster
+
+cluster = SLURMCluster(
+    queue="partition_name",  # Name of the partition
+    cores=4,  # CPU cores requested
+    memory="4 GB",  # RAM requested
+    walltime="00:10:00",  # Walltime limit for a runner job.
+    processes=1,  # Number of processes per worker
+    log_directory="tmp/smac_dask_slurm",  # Logging directory
+    nanny=False,  # False unless you want to use pynisher
+    worker_extra_args=[
+        "--worker-port",  # Worker port range
+        "60010:60100"],  # Worker port range
+    scheduler_options={
+        "port": 60001,  # Main Job Port
+    },
+)
+cluster.scale(jobs=n_workers)
+
+# Dask creates n_workers jobs on the cluster which stay open.
+client = Client(
+    address=cluster,
+)
+
+# Dask waits for n_workers workers to be created
+client.wait_for_workers(n_workers)
+
+# Now we use SMAC to find the best hyperparameters
+smac = BlackBoxFacade(
+    scenario,  # Pass scenario
+    model.train,  # Pass target function
+    overwrite=True,  # Overrides any previous result
+    dask_client=client,  # Pass dask_client
+)
+incumbent = smac.optimize()
+```
+
+The full example of this code is given in the [parallelism example](../examples/1%20Basics/7_parallelization_cluster.md).
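
The "Parallelizing Locally" section added above only shows the `Scenario(...)` line. As a complement, here is a minimal, self-contained sketch of what a full local run could look like. It is an editor's illustration rather than part of the commit: the inlined Branin target function and variable names are illustrative, while `Scenario`, `BlackBoxFacade`, `ConfigurationSpace`, and `Float` are the same APIs used in the commit's example file.

```python
# Editor's sketch (not part of the commit): a complete local run using n_workers.
import numpy as np
from ConfigSpace import Configuration, ConfigurationSpace, Float

from smac import BlackBoxFacade, Scenario


def branin(config: Configuration, seed: int = 0) -> float:
    # Same synthetic 2d Branin function as in the cluster example.
    x0, x1 = config["x0"], config["x1"]
    b = 5.1 / (4.0 * np.pi**2)
    c = 5.0 / np.pi
    t = 1.0 / (8.0 * np.pi)
    return (x1 - b * x0**2 + c * x0 - 6.0) ** 2 + 10.0 * (1 - t) * np.cos(x0) + 10.0


if __name__ == "__main__":
    cs = ConfigurationSpace(seed=0)
    cs.add([Float("x0", (-5, 10), default=-5), Float("x1", (0, 15), default=2)])

    # n_workers=5 lets SMAC spawn five local Dask workers itself;
    # no dask_client is passed in this case.
    scenario = Scenario(cs, deterministic=True, n_trials=100, n_workers=5)
    smac = BlackBoxFacade(scenario, branin, overwrite=True)
    print(smac.optimize())
```

Because no `dask_client` is supplied here, `scenario.n_workers` is what determines the number of local workers; as the example file notes, that setting is ignored once a custom cluster/client is passed instead.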
Lines changed: 34 additions & 77 deletions
@@ -1,66 +1,31 @@
 """Parallelization on Cluster
 
-An example of applying SMAC to optimize Branin using parallelization via Dask client on a 
+An example of applying SMAC to optimize Branin using parallelization via Dask client on a
 SLURM cluster. If you do not want to use a cluster but your local machine, set dask_client
 to `None` and pass `n_workers` to the `Scenario`.
-
-Sometimes, the submitted jobs by the slurm client might be cancelled once it starts. In that
-case, you could try to start your job from a computing node
-
-:warning: On some clusters you cannot spawn new jobs when running a SLURMCluster inside a
-job instead of on the login node. No obvious errors might be raised but it can hang silently.
-
-Sometimes you need to modify your launch command which can be done with
-`SLURMCluster.job_class.submit_command`.
-
-```python
-cluster.job_cls.submit_command = submit_command
-cluster.job_cls.cancel_command = cancel_command
-```
-
-Here we optimize the synthetic 2d function Branin.
-We use the black-box facade because it is designed for black-box function optimization.
-The black-box facade uses a [Gaussian Process][GP] as its surrogate model.
-The facade works best on a numerical hyperparameter configuration space and should not
-be applied to problems with large evaluation budgets (up to 1000 evaluations).
 """
 
-import numpy as np
-from ConfigSpace import Configuration, ConfigurationSpace, Float
 from dask.distributed import Client
 from dask_jobqueue import SLURMCluster
-
 from smac import BlackBoxFacade, Scenario
 
 __copyright__ = "Copyright 2025, Leibniz University Hanover, Institute of AI"
 __license__ = "3-clause BSD"
 
+import numpy as np
+from ConfigSpace import ConfigurationSpace, Float
+from ConfigSpace import Configuration  # for type hints
 
-class Branin(object):
-    @property
-    def configspace(self) -> ConfigurationSpace:
-        cs = ConfigurationSpace(seed=0)
+class Branin:
+    def __init__(self, seed: int = 0):
+        cs = ConfigurationSpace(seed=seed)
         x0 = Float("x0", (-5, 10), default=-5, log=False)
         x1 = Float("x1", (0, 15), default=2, log=False)
         cs.add([x0, x1])
 
-        return cs
+        self.cs = cs
 
     def train(self, config: Configuration, seed: int = 0) -> float:
-        """Branin function
-
-        Parameters
-        ----------
-        config : Configuration
-            Contains two continuous hyperparameters, x0 and x1
-        seed : int, optional
-            Not used, by default 0
-
-        Returns
-        -------
-        float
-            Branin function value
-        """
         x0 = config["x0"]
        x1 = config["x1"]
         a = 1.0
@@ -69,39 +34,40 @@ def train(self, config: Configuration, seed: int = 0) -> float:
         r = 6.0
         s = 10.0
         t = 1.0 / (8.0 * np.pi)
-        ret = a * (x1 - b * x0**2 + c * x0 - r) ** 2 + s * (1 - t) * np.cos(x0) + s
-
-        return ret
-
+        return a * (x1 - b * x0**2 + c * x0 - r) ** 2 \
+            + s * (1 - t) * np.cos(x0) + s
 
 
 if __name__ == "__main__":
     model = Branin()
 
     # Scenario object specifying the optimization "environment"
-    scenario = Scenario(model.configspace, deterministic=True, n_trials=100, trial_walltime_limit=100)
+    scenario = Scenario(
+        model.cs,
+        deterministic=True,
+        n_trials=100,
+        trial_walltime_limit=100,
+        n_workers=5,
+    )
 
-    # Create cluster
-    n_workers = 4  # Use 4 workers on the cluster
+    n_workers = 5  # Use 5 workers on the cluster
     # Please note that the number of workers is directly set in the
     # cluster / client. `scenario.n_workers` is ignored in this case.
 
     cluster = SLURMCluster(
-        # This is the partition of our slurm cluster.
-        queue="cpu_short",
-        # Your account name
-        # account="myaccount",
-        cores=1,
-        memory="1 GB",
-        # Walltime limit for each worker. Ensure that your function evaluations
-        # do not exceed this limit.
-        # More tips on this here: https://jobqueue.dask.org/en/latest/advanced-tips-and-tricks.html#how-to-handle-job-queueing-system-walltime-killing-workers
-        walltime="00:10:00",
-        processes=1,
-        log_directory="tmp/smac_dask_slurm",
-        # if you would like to limit the resources consumption of each function evaluation with pynisher, you need to
-        # set nanny as False
-        # Otherwise, an error `daemonic processes are not allowed to have children` will raise!
-        nanny=False,  # if you do not use pynisher to limit the memory/time usage, feel free to set this one as True
+        queue="partition_name",  # Name of the partition
+        cores=4,  # CPU cores requested
+        memory="4 GB",  # RAM requested
+        walltime="00:10:00",  # Walltime limit for a runner job.
+        processes=1,  # Number of processes per worker
+        log_directory="tmp/smac_dask_slurm",  # Logging directory
+        nanny=False,  # False unless you want to use pynisher
+        worker_extra_args=[
+            "--worker-port",  # Worker port range
+            "60010:60100"],  # Worker port range
+        scheduler_options={
+            "port": 60001,  # Main Job Port
+        },
+        # account="myaccount",  # Account name on the cluster (optional)
     )
     cluster.scale(jobs=n_workers)
 
@@ -110,8 +76,7 @@ def train(self, config: Configuration, seed: int = 0) -> float:
     client = Client(
         address=cluster,
     )
-    # Instead, you can also do
-    # client = cluster.get_client()
+    client.wait_for_workers(n_workers)
 
     # Now we use SMAC to find the best hyperparameters
     smac = BlackBoxFacade(
@@ -120,13 +85,5 @@ def train(self, config: Configuration, seed: int = 0) -> float:
         overwrite=True,  # Overrides any previous results that are found that are inconsistent with the meta-data
         dask_client=client,
     )
-
     incumbent = smac.optimize()
-
-    # Get cost of default configuration
-    default_cost = smac.validate(model.configspace.get_default_configuration())
-    print(f"Default cost: {default_cost}")
-
-    # Let's calculate the cost of the incumbent
-    incumbent_cost = smac.validate(incumbent)
-    print(f"Incumbent cost: {incumbent_cost}")
+    print(f"Best configuration found: {incumbent}")
