
Commit 2d61741

sjpb and kbendl authored
Enable use of custom Slurm builds (#163)
* remove drain and resume functionality
* allow install and runtime taskbooks to be used directly
* fix linter complaints
* fix slurmctld state
* move common tasks to pre.yml
* remove unused openhpc_slurm_service
* fix ini_file use for some community.general versions
* fix var precedence in molecule test13
* fix var precedence in all molecule tests
* fix slurmd always starting on control node
* move install to install-ohpc.yml
* remove unused ohpc_slurm_services var
* add install-generic for binary-only install
* distinguish between system and user slurm binaries for generic install
* remove support for CentOS7 / OpenHPC
* remove post-configure, not needed as of slurm v20.02
* add openmpi/IMB-MPI1 by default for generic install
* allow removal of slurm.conf options
* update README
* enable openhpc_extra_repos for both generic and ohpc installs
* README tweak
* add openhpc_config_files parameter
* change library_dir to lib_dir
* fix perms
* fix/silence linter warnings
* remove packages only required for hpctests
* document openhpc_config_files restart behaviour
* bugfix missing newline in slurm.conf
* make path for slurm.conf configurable
* make slurm.conf template src configurable
* symlink slurm user tools so monitoring works
* fix slurm directories
* fix slurmdbd path for non-default slurm.conf paths
* default gres.conf to correct directory
* document <absent> for openhpc_config
* minor merge diff fixes
* Fix EPEL not getting installed
* build RL9.3 container images with systemd
* allow use on image containing slurm binaries
* prepend slurm binaries to PATH instead of symlinking
* ensure cgroup.conf is always next to slurm.conf and allow overriding template
* Add group.node_params to partitions/groups. (#182) (#185)
* Add group.node_params to partitions/groups. (#182) Allows Features, etc., to be added to partitions.
* update SelectType from legacy to current default (#167)

---

Co-authored-by: Kurt Bendl <[email protected]>

* update readme
* fixup mode parameters
* tidy slurmd restart line

---

Co-authored-by: Kurt Bendl <[email protected]>
1 parent 190f8ca commit 2d61741

File tree

10 files changed: +221 −54 lines changed


README.md

Lines changed: 33 additions & 15 deletions
@@ -2,35 +2,35 @@
 # stackhpc.openhpc
 
-This Ansible role installs packages and performs configuration to provide an OpenHPC v2.x Slurm cluster.
+This Ansible role installs packages and performs configuration to provide a Slurm cluster. By default this uses packages from [OpenHPC](https://openhpc.community/) but it can also use user-provided Slurm binaries.
 
 As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.
 
 The minimal image for nodes is a RockyLinux 8 GenericCloud image.
 
+## Task files
+This role provides four task files which can be selected by using the `tasks_from` parameter of Ansible's `import_role` or `include_role` modules:
+- `main.yml`: Runs `install-ohpc.yml` and `runtime.yml`. Default if no `tasks_from` parameter is used.
+- `install-ohpc.yml`: Installs repos and packages for OpenHPC.
+- `install-generic.yml`: Installs systemd units etc. for user-provided binaries.
+- `runtime.yml`: Slurm/service configuration.
+
 ## Role Variables
 
+Variables only relevant for the `install-ohpc.yml` or `install-generic.yml` task files are marked as such below.
+
 `openhpc_extra_repos`: Optional list. Extra Yum repository definitions to configure, following the format of the Ansible
-[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module. Respected keys for
-each list element:
-* `name`: Required
-* `description`: Optional
-* `file`: Required
-* `baseurl`: Optional
-* `metalink`: Optional
-* `mirrorlist`: Optional
-* `gpgcheck`: Optional
-* `gpgkey`: Optional
-
-`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).
+[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module.
+
+`openhpc_slurm_service_enabled`: Optional boolean, whether to enable the appropriate slurm service (slurmd/slurmctld). Default `true`.
 
 `openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.
 
 `openhpc_slurm_control_host`: Required string. Ansible inventory hostname (and short hostname) of the controller e.g. `"{{ groups['cluster_control'] | first }}"`.
 
 `openhpc_slurm_control_host_address`: Optional string. IP address or name to use for the `openhpc_slurm_control_host`, e.g. to use a different interface than is resolved from `openhpc_slurm_control_host`.
 
-`openhpc_packages`: additional OpenHPC packages to install.
+`openhpc_packages`: Optional list. Additional OpenHPC packages to install (`install-ohpc.yml` only).
 
 `openhpc_enable`:
 * `control`: whether to enable control host
@@ -44,7 +44,15 @@
 `openhpc_login_only_nodes`: Optional. If using "configless" mode specify the name of an ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.
 
-`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one.
+`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one (`install-ohpc.yml` only).
+
+`openhpc_generic_packages`: Optional. List of system packages to install, see `defaults/main.yml` for details (`install-generic.yml` only).
+
+`openhpc_sbin_dir`: Optional. Path to Slurm daemon binaries such as `slurmctld`, default `/usr/sbin` (`install-generic.yml` only).
+
+`openhpc_bin_dir`: Optional. Path to Slurm user binaries such as `sinfo`, default `/usr/bin` (`install-generic.yml` only).
+
+`openhpc_lib_dir`: Optional. Path to Slurm libraries, default `/usr/lib64/slurm` (`install-generic.yml` only).
 
 ### slurm.conf
 
@@ -122,6 +130,16 @@
 `openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))
 
+`openhpc_slurmd_spool_dir`: Optional. Absolute path for slurmd state (`slurm.conf` parameter [SlurmdSpoolDir](https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir))
+
+`openhpc_slurm_conf_template`: Optional. Path of Jinja template for the `slurm.conf` configuration file. Default is the `slurm.conf.j2` template in the role. **NB:** The required templating is complex; if just setting specific parameters use `openhpc_config` instead.
+
+`openhpc_slurm_conf_path`: Optional. Path to template the `slurm.conf` configuration file to. Default `/etc/slurm/slurm.conf`.
+
+`openhpc_gres_template`: Optional. Path of Jinja template for the `gres.conf` configuration file. Default is the `gres.conf.j2` template in the role.
+
+`openhpc_cgroup_template`: Optional. Path of Jinja template for the `cgroup.conf` configuration file. Default is the `cgroup.conf.j2` template in the role.
+
 #### Accounting
 
 By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting)). To enable accounting:
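The task files described in this README can be selected from a playbook. The following is a minimal sketch only; the group names `cluster` and `cluster_control`, the cluster name and all values shown are assumptions, not fixed by the role:

```yaml
# Hypothetical playbook sketch: run the role with its default main.yml
# (install-ohpc.yml + runtime.yml). Group names and values are illustrative.
- hosts: cluster
  become: yes
  tasks:
    - import_role:
        name: stackhpc.openhpc
      vars:
        openhpc_cluster_name: demo
        openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
        openhpc_enable:
          control: "{{ inventory_hostname in groups['cluster_control'] }}"
          batch: "{{ inventory_hostname not in groups['cluster_control'] }}"
          database: false
          runtime: true
```

Passing `tasks_from: install-generic.yml` (or another task file) to `import_role` would select one of the other entry points listed above.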

defaults/main.yml

Lines changed: 14 additions & 4 deletions
@@ -49,8 +49,12 @@ openhpc_cgroup_default_config:
 openhpc_config: {}
 openhpc_cgroup_config: {}
 openhpc_gres_template: gres.conf.j2
+openhpc_cgroup_template: cgroup.conf.j2
 
 openhpc_state_save_location: /var/spool/slurm
+openhpc_slurmd_spool_dir: /var/spool/slurm
+openhpc_slurm_conf_path: /etc/slurm/slurm.conf
+openhpc_slurm_conf_template: slurm.conf.j2
 
 # Accounting
 openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}"
@@ -80,6 +84,15 @@ openhpc_enable:
   database: false
   runtime: false
 
+# Only used for install-generic.yml:
+openhpc_generic_packages:
+  - munge
+  - mariadb-connector-c # only required on slurmdbd
+  - hwloc-libs # only required on slurmd
+openhpc_sbin_dir: /usr/sbin # path to slurm daemon binaries (e.g. slurmctld)
+openhpc_bin_dir: /usr/bin # path to slurm user binaries (e.g. sinfo)
+openhpc_lib_dir: /usr/lib64/slurm # path to slurm libraries
+
 # Repository configuration
 openhpc_extra_repos: []
 
@@ -127,12 +140,9 @@ ohpc_default_extra_repos:
     gpgcheck: true
     gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8"
 
-# Concatenate all repo definitions here
-ohpc_repos: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] + ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
-
 openhpc_munge_key_b64:
 openhpc_login_only_nodes: ''
-openhpc_module_system_install: true
+openhpc_module_system_install: true # only works for install-ohpc.yml/main.yml
 
 # Auto detection
 openhpc_ram_multiplier: 0.95
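Since each `openhpc_extra_repos` element is now passed whole to the `yum_repository` module, an entry can use any parameter that module accepts. A hedged sketch (the repo name and URL are illustrative only):

```yaml
# Hypothetical openhpc_extra_repos entry; each dict is passed as-is to
# ansible.builtin.yum_repository, so any of its parameters may be used.
openhpc_extra_repos:
  - name: my-slurm-deps
    file: my-slurm-deps
    description: Extra packages for a custom Slurm build
    baseurl: https://repo.example.com/el8/
    gpgcheck: false
```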

tasks/install-generic.yml

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+- include_tasks: pre.yml
+
+- name: Create a list of slurm daemons
+  set_fact:
+    _ohpc_daemons: "{{ _ohpc_daemon_map | dict2items | selectattr('value') | items2dict | list }}"
+  vars:
+    _ohpc_daemon_map:
+      slurmctld: "{{ openhpc_enable.control }}"
+      slurmd: "{{ openhpc_enable.batch }}"
+      slurmdbd: "{{ openhpc_enable.database }}"
+
+- name: Ensure extra repos
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ openhpc_extra_repos }}"
+  loop_control:
+    label: "{{ item.name }}"
+
+- name: Install system packages
+  dnf:
+    name: "{{ openhpc_generic_packages }}"
+
+- name: Create Slurm user
+  user:
+    name: slurm
+    comment: SLURM resource manager
+    home: /etc/slurm
+    shell: /sbin/nologin
+
+- name: Create Slurm unit files
+  template:
+    src: "{{ item }}.service.j2"
+    dest: /lib/systemd/system/{{ item }}.service
+    owner: root
+    group: root
+    mode: ug=rw,o=r
+  loop: "{{ _ohpc_daemons }}"
+  register: _slurm_systemd_units
+
+- name: Get current library locations
+  shell:
+    cmd: "ldconfig -v | grep -v ^$'\t'" # noqa: no-tabs risky-shell-pipe
+  register: _slurm_ldconfig
+  changed_when: false
+
+- name: Add library locations to ldd search path
+  copy:
+    dest: /etc/ld.so.conf.d/slurm.conf
+    content: "{{ openhpc_lib_dir }}"
+    owner: root
+    group: root
+    mode: ug=rw,o=r
+  when: openhpc_lib_dir not in _ldd_paths
+  vars:
+    _ldd_paths: "{{ _slurm_ldconfig.stdout_lines | map('split', ':') | map('first') }}"
+
+- name: Reload Slurm unit files
+  # Can't do just this from systemd module
+  command: systemctl daemon-reload # noqa: command-instead-of-module no-changed-when no-handler
+  when: _slurm_systemd_units.changed
+
+- name: Prepend $PATH with slurm user binary location
+  lineinfile:
+    path: /etc/environment
+    line: "{{ new_path }}"
+    regexp: "^{{ new_path | regex_escape }}"
+    owner: root
+    group: root
+    mode: u=rw,go=r
+  vars:
+    new_path: PATH="{{ openhpc_bin_dir }}:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
+
+- meta: reset_connection # to get new environment
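A sketch of how `install-generic.yml` might be invoked for a custom Slurm build. The `/opt/slurm` layout is an assumption for illustration; any image providing Slurm binaries at known paths would work the same way:

```yaml
# Hypothetical usage sketch: select the generic (binary-only) install and
# point the role at a custom Slurm build unpacked under /opt/slurm.
- hosts: cluster
  become: yes
  tasks:
    - include_role:
        name: stackhpc.openhpc
        tasks_from: install-generic.yml
      vars:
        openhpc_sbin_dir: /opt/slurm/sbin       # slurmctld, slurmd, slurmdbd
        openhpc_bin_dir: /opt/slurm/bin         # sinfo, squeue, etc.
        openhpc_lib_dir: /opt/slurm/lib64/slurm # plugin libraries
```

With these set, the tasks above template the systemd units against `openhpc_sbin_dir`, register `openhpc_lib_dir` with the loader, and prepend `openhpc_bin_dir` to `$PATH`.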

tasks/install.yml renamed to tasks/install-ohpc.yml

Lines changed: 8 additions & 10 deletions
@@ -3,16 +3,14 @@
 - include_tasks: pre.yml
 
 - name: Ensure OpenHPC repos
-  ansible.builtin.yum_repository:
-    name: "{{ item.name }}"
-    description: "{{ item.description | default(omit) }}"
-    file: "{{ item.file }}"
-    baseurl: "{{ item.baseurl | default(omit) }}"
-    metalink: "{{ item.metalink | default(omit) }}"
-    mirrorlist: "{{ item.mirrorlist | default(omit) }}"
-    gpgcheck: "{{ item.gpgcheck | default(omit) }}"
-    gpgkey: "{{ item.gpgkey | default(omit) }}"
-  loop: "{{ ohpc_repos }}"
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] }}"
+  loop_control:
+    label: "{{ item.name }}"
+
+- name: Ensure extra repos
+  ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
+  loop: "{{ ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
   loop_control:
     label: "{{ item.name }}"

tasks/main.yml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 
 - name: Install packages
   block:
-    - include_tasks: install.yml
+    - include_tasks: install-ohpc.yml
   when: openhpc_enable.runtime | default(false) | bool
   tags: install

tasks/runtime.yml

Lines changed: 23 additions & 23 deletions
@@ -11,12 +11,19 @@
 
 - name: Ensure Slurm directories exist
   file:
-    path: "{{ openhpc_state_save_location }}"
+    path: "{{ item.path }}"
     owner: slurm
     group: slurm
-    mode: 0755
+    mode: '0755'
     state: directory
-  when: inventory_hostname == openhpc_slurm_control_host
+  loop:
+    - path: "{{ openhpc_state_save_location }}" # StateSaveLocation
+      enable: control
+    - path: "{{ openhpc_slurm_conf_path | dirname }}"
+      enable: control
+    - path: "{{ openhpc_slurmd_spool_dir }}" # SlurmdSpoolDir
+      enable: batch
+  when: "openhpc_enable[item.enable] | default(false) | bool"
 
 - name: Retrieve Munge key from control host
   # package install generates a node-unique one
@@ -32,7 +39,7 @@
     dest: "/etc/munge/munge.key"
     owner: munge
     group: munge
-    mode: 0400
+    mode: '0400'
   register: _openhpc_munge_key_copy
 
 - name: Ensure JobComp logfile exists
@@ -41,15 +48,15 @@
     state: touch
     owner: slurm
     group: slurm
-    mode: 0644
+    mode: '0644'
     access_time: preserve
     modification_time: preserve
   when: openhpc_slurm_job_comp_type == 'jobcomp/filetxt'
 
 - name: Template slurmdbd.conf
   template:
     src: slurmdbd.conf.j2
-    dest: /etc/slurm/slurmdbd.conf
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/slurmdbd.conf"
     mode: "0600"
     owner: slurm
     group: slurm
@@ -58,11 +65,11 @@
 
 - name: Template slurm.conf
   template:
-    src: slurm.conf.j2
-    dest: /etc/slurm/slurm.conf
+    src: "{{ openhpc_slurm_conf_template }}"
+    dest: "{{ openhpc_slurm_conf_path }}"
     owner: root
     group: root
-    mode: 0644
+    mode: '0644'
   when: openhpc_enable.control | default(false)
   notify:
     - Restart slurmctld service
@@ -72,7 +79,7 @@
 - name: Create gres.conf
   template:
     src: "{{ openhpc_gres_template }}"
-    dest: /etc/slurm/gres.conf
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf"
     mode: "0600"
     owner: slurm
     group: slurm
@@ -85,8 +92,8 @@
 - name: Template cgroup.conf
   # appears to be required even with NO cgroup plugins: https://slurm.schedmd.com/cgroups.html#cgroup_design
   template:
-    src: cgroup.conf.j2
-    dest: /etc/slurm/cgroup.conf
+    src: "{{ openhpc_cgroup_template }}"
+    dest: "{{ openhpc_slurm_conf_path | dirname }}/cgroup.conf"
     mode: "0644" # perms/ownership based off src from ohpc package
     owner: root
     group: root
@@ -96,15 +103,6 @@
   register: ohpc_cgroup_conf
   # NB uses restart rather than reload as this is needed in some cases
 
-- name: Remove local tempfile for slurm.conf templating
-  ansible.builtin.file:
-    path: "{{ _slurm_conf_tmpfile.path }}"
-    state: absent
-  when: _slurm_conf_tmpfile.path is defined
-  delegate_to: localhost
-  changed_when: false # so molecule doesn't fail
-  become: no
-
 - name: Ensure Munge service is running
   service:
     name: munge
@@ -129,7 +127,9 @@
   changed_when: true
   when:
     - openhpc_slurm_control_host in ansible_play_hosts
-    - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
+    - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
   notify:
     - Restart slurmd service
 
@@ -143,7 +143,7 @@
     create: yes
     owner: root
     group: root
-    mode: 0644
+    mode: '0644'
   when:
    - openhpc_enable.batch | default(false)
  notify:
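The runtime changes above derive the locations of `slurmdbd.conf`, `gres.conf` and `cgroup.conf` from the directory of `openhpc_slurm_conf_path`, so relocating `slurm.conf` moves the whole configuration directory. A hedged group_vars sketch (the `/opt/slurm/etc` path is an assumption for illustration):

```yaml
# Hypothetical group_vars: relocate slurm.conf; gres.conf, cgroup.conf and
# slurmdbd.conf are then templated into the same directory via `dirname`.
openhpc_slurm_conf_path: /opt/slurm/etc/slurm.conf
```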

templates/slurm.conf.j2

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ ClusterName={{ openhpc_cluster_name }}
 
 # PARAMETERS
 {% for k, v in openhpc_default_config | combine(openhpc_config) | items %}
-{% if v != "omit" %}{# allow removing items using setting key: null #}
+{% if v != "omit" %}{# allow removing items by setting key: omit #}
 {% if k != 'SlurmctldParameters' %}{# handled separately due to configless mode #}
 {{ k }}={{ v | join(',') if (v is sequence and v is not string) else v }}
 {% endif %}
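The template change above means any key whose value is the literal string `"omit"` is dropped from the rendered `slurm.conf`, allowing a role default to be removed. A sketch (the parameter name is illustrative; any key present in `openhpc_default_config` could be removed this way):

```yaml
# Hypothetical: remove a default slurm.conf parameter entirely.
# Note this is the string "omit", not the Ansible `omit` special variable.
openhpc_config:
  PropagateResourceLimitsExcept: omit
```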

templates/slurmctld.service.j2

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+[Unit]
+Description=Slurm controller daemon
+After=network-online.target munge.service
+Wants=network-online.target
+ConditionPathExists={{ openhpc_slurm_conf_path }}
+
+[Service]
+Type=simple
+EnvironmentFile=-/etc/sysconfig/slurmctld
+EnvironmentFile=-/etc/default/slurmctld
+ExecStart={{ openhpc_sbin_dir }}/slurmctld -D -s -f {{ openhpc_slurm_conf_path }} $SLURMCTLD_OPTIONS
+ExecReload=/bin/kill -HUP $MAINPID
+LimitNOFILE=65536
+TasksMax=infinity
+
+# Uncomment the following lines to disable logging through journald.
+# NOTE: It may be preferable to set these through an override file instead.
+#StandardOutput=null
+#StandardError=null
+
+[Install]
+WantedBy=multi-user.target
