Merged
55 commits
582e15d remove drain and resume functionality (sjpb, Apr 21, 2023)
b5af186 allow install and runtime taskbooks to be used directly (sjpb, Apr 25, 2023)
a53ba13 Merge branch 'master' into installonly (sjpb, Apr 25, 2023)
8d3bac8 Merge branch 'master' into installonly (sjpb, May 12, 2023)
47b2fd1 fix linter complaints (sjpb, May 12, 2023)
fe139b2 fix slurmctld state (sjpb, May 12, 2023)
28baf23 Merge branch 'master' into installonly (sjpb, Sep 13, 2023)
080cf97 move common tasks to pre.yml (sjpb, Sep 19, 2023)
f83e334 remove unused openhpc_slurm_service (sjpb, Sep 19, 2023)
77a628f fix ini_file use for some community.general versions (sjpb, Sep 19, 2023)
5d88ca5 fix var precedence in molecule test13 (sjpb, Sep 19, 2023)
33ad0e2 fix var precedence in all molecule tests (sjpb, Sep 19, 2023)
9683401 fix slurmd always starting on control node (sjpb, Sep 19, 2023)
d4163bc move install to install-ohpc.yml (sjpb, Sep 19, 2023)
d4c5621 remove unused ohpc_slurm_services var (sjpb, Sep 19, 2023)
09cb57a Merge branch 'installonly' into feat/no-ohpc (sjpb, Sep 19, 2023)
5090860 add install-generic for binary-only install (sjpb, Sep 19, 2023)
253f2b1 distinguish between system and user slurm binaries for generic install (sjpb, Sep 19, 2023)
1b92b5e remove support for CentOS7 / OpenHPC (sjpb, Sep 19, 2023)
985dd3d remove post-configure, not needed as of slurm v20.02 (sjpb, Sep 19, 2023)
bb0ad77 add openmpi/IMB-MPI1 by default for generic install (sjpb, Sep 19, 2023)
caebc4f allow removal of slurm.conf options (sjpb, Sep 19, 2023)
7e71087 update README (sjpb, Sep 20, 2023)
f658f4b Merge branch 'installonly' into feat/no-ohpc (sjpb, Sep 20, 2023)
336ba63 enable openhpc_extra_repos for both generic and ohpc installs (sjpb, Sep 20, 2023)
050e449 README tweak (sjpb, Sep 20, 2023)
b096101 add openhpc_config_files parameter (sjpb, Sep 20, 2023)
d0d7dbf change library_dir to lib_dir (sjpb, Sep 20, 2023)
10cb71a fix perms (sjpb, Sep 20, 2023)
6168d45 Merge branch 'master' into feat/no-ohpc (sjpb, Sep 20, 2023)
cb6edfc fix/silence linter warnings (sjpb, Sep 20, 2023)
0871414 remove packages only required for hpctests (sjpb, Sep 20, 2023)
58526d5 document openhpc_config_files restart behaviour (sjpb, Sep 22, 2023)
0fcaf69 bugfix missing newline in slurm.conf (sjpb, Sep 26, 2023)
5b9b106 make path for slurm.conf configurable (sjpb, Sep 26, 2023)
95c4df8 make slurm.conf template src configurable (sjpb, Sep 26, 2023)
2b8b8c5 symlink slurm user tools so monitoring works (sjpb, Sep 27, 2023)
edcfb00 fix slurm directories (sjpb, Oct 6, 2023)
1f14dbd fix slurmdbd path for non-default slurm.conf paths (sjpb, Oct 10, 2023)
295f943 Merge branch 'master' into feat/no-ohpc (sjpb, Jan 24, 2024)
a5d106f default gres.conf to correct directory (sjpb, Feb 16, 2024)
5b73b8a document <absent> for openhpc_config (sjpb, Feb 20, 2024)
8412606 Merge branch 'master' into feat/no-ohpc (sjpb, Feb 27, 2024)
69e25ac minor merge diff fixes (sjpb, Feb 27, 2024)
23ddc82 Fix EPEL not getting installed (sjpb, Feb 27, 2024)
59ee7cc build RL9.3 container images with systemd (sjpb, Mar 19, 2024)
2aaa605 Merge branch 'master' into feat/no-ohpc (sjpb, Mar 20, 2024)
513516c allow use on image containing slurm binaries (sjpb, Jul 23, 2024)
a34dace prepend slurm binaries to PATH instead of symlinking (sjpb, Jul 24, 2024)
51b5031 ensure cgroup.conf is always next to slurm.conf and allow overriding … (sjpb, Feb 25, 2025)
d2d4d3f Add group.node_params to partitions/groups. (#182) (#185) (sjpb, May 9, 2025)
7f5941c Merge branch 'master' into feat/no-ohpc (sjpb, Sep 4, 2025)
a0ef4ee update readme (sjpb, Sep 4, 2025)
4926ebb fixup mode parameters (sjpb, Sep 4, 2025)
84e8c0f tidy slurmd restart line (sjpb, Sep 4, 2025)
48 changes: 33 additions & 15 deletions README.md
@@ -2,35 +2,35 @@

# stackhpc.openhpc

This Ansible role installs packages and performs configuration to provide an OpenHPC v2.x Slurm cluster.
This Ansible role installs packages and performs configuration to provide a Slurm cluster. By default this uses packages from [OpenHPC](https://openhpc.community/), but it can also use user-provided Slurm binaries.

As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular, with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.

The minimal image for nodes is a RockyLinux 8 GenericCloud image.

## Task files
This role provides four task files which can be selected by using the `tasks_from` parameter of Ansible's `import_role` or `include_role` modules:
- `main.yml`: Runs `install-ohpc.yml` and `runtime.yml`. Default if no `tasks_from` parameter is used.
- `install-ohpc.yml`: Installs repos and packages for OpenHPC.
- `install-generic.yml`: Installs systemd units etc. for user-provided binaries.
- `runtime.yml`: Slurm/service configuration.
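
As a minimal sketch, a playbook running only the generic install might look like this (the hosts group name is illustrative, not part of the role):

```yaml
- hosts: openhpc_cluster  # illustrative group name
  become: yes
  tasks:
    - ansible.builtin.include_role:
        name: stackhpc.openhpc
        tasks_from: install-generic.yml
```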

## Role Variables

Variables only relevant for `install-ohpc.yml` or `install-generic.yml` task files are marked as such below.

`openhpc_extra_repos`: Optional list. Extra Yum repository definitions to configure, following the format of the Ansible
[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module. Respected keys for
each list element:
* `name`: Required
* `description`: Optional
* `file`: Required
* `baseurl`: Optional
* `metalink`: Optional
* `mirrorlist`: Optional
* `gpgcheck`: Optional
* `gpgkey`: Optional

`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).
[yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module.
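
As a sketch, a single extra repo definition might look like the following (the repo name and URL are illustrative assumptions):

```yaml
openhpc_extra_repos:
  - name: example-extras            # illustrative name
    file: example-extras            # .repo file to create
    baseurl: https://mirror.example.org/el8/extras/
    gpgcheck: false
```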

`openhpc_slurm_service_enabled`: Optional boolean, whether to enable the appropriate slurm service (slurmd/slurmctld). Default `true`.

`openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.

`openhpc_slurm_control_host`: Required string. Ansible inventory hostname (and short hostname) of the controller e.g. `"{{ groups['cluster_control'] | first }}"`.

`openhpc_slurm_control_host_address`: Optional string. IP address or name to use for the `openhpc_slurm_control_host`, e.g. to use a different interface than is resolved from `openhpc_slurm_control_host`.

`openhpc_packages`: additional OpenHPC packages to install.
`openhpc_packages`: Optional list. Additional OpenHPC packages to install (`install-ohpc.yml` only).

`openhpc_enable`:
* `control`: whether to enable control host
@@ -44,7 +44,15 @@ each list element:

`openhpc_login_only_nodes`: Optional. If using "configless" mode, specify the name of an Ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.

`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one.
`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one (`install-ohpc.yml` only).

`openhpc_generic_packages`: Optional. List of system packages to install; see `defaults/main.yml` for details (`install-generic.yml` only).

`openhpc_sbin_dir`: Optional. Path to slurm daemon binaries such as `slurmctld`, default `/usr/sbin` (`install-generic.yml` only).

`openhpc_bin_dir`: Optional. Path to Slurm user binaries such as `sinfo`, default `/usr/bin` (`install-generic.yml` only).

`openhpc_lib_dir`: Optional. Path to Slurm libraries, default `/usr/lib64/slurm` (`install-generic.yml` only).
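
For example, if Slurm binaries were unpacked under `/opt/slurm` (an assumed layout, not a role default), these might be set as:

```yaml
openhpc_sbin_dir: /opt/slurm/sbin
openhpc_bin_dir: /opt/slurm/bin
openhpc_lib_dir: /opt/slurm/lib64/slurm
```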

### slurm.conf

@@ -122,6 +130,16 @@ that this is *not the same* as the Ansible `omit` [special variable](https://doc

`openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))

`openhpc_slurmd_spool_dir`: Optional. Absolute path for slurmd state (`slurm.conf` parameter [SlurmdSpoolDir](https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir))

`openhpc_slurm_conf_template`: Optional. Path of Jinja template for the `slurm.conf` configuration file. Default is the `slurm.conf.j2` template in the role. **NB:** The required templating is complex; if just setting specific parameters, use `openhpc_config` instead.

`openhpc_slurm_conf_path`: Optional. Path which the `slurm.conf` configuration file is templated to. Default `/etc/slurm/slurm.conf`.

`openhpc_gres_template`: Optional. Path of Jinja template for `gres.conf` configuration file. Default is `gres.conf.j2` template in role.

`openhpc_cgroup_template`: Optional. Path of Jinja template for `cgroup.conf` configuration file. Default is `cgroup.conf.j2` template in role.
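
As a sketch, relocating the whole Slurm configuration to a non-default directory (the path here is an assumption) only requires setting the `slurm.conf` path, since `gres.conf` and `cgroup.conf` are templated into the same directory:

```yaml
openhpc_slurm_conf_path: /opt/slurm/etc/slurm.conf
```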

#### Accounting

By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting)). To enable accounting:
18 changes: 14 additions & 4 deletions defaults/main.yml
@@ -49,8 +49,12 @@ openhpc_cgroup_default_config:
openhpc_config: {}
openhpc_cgroup_config: {}
openhpc_gres_template: gres.conf.j2
openhpc_cgroup_template: cgroup.conf.j2

openhpc_state_save_location: /var/spool/slurm
openhpc_slurmd_spool_dir: /var/spool/slurm
openhpc_slurm_conf_path: /etc/slurm/slurm.conf
openhpc_slurm_conf_template: slurm.conf.j2

# Accounting
openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}"
@@ -80,6 +84,15 @@ openhpc_enable:
database: false
runtime: false

# Only used for install-generic.yml:
openhpc_generic_packages:
- munge
- mariadb-connector-c # only required on slurmdbd
- hwloc-libs # only required on slurmd
openhpc_sbin_dir: /usr/sbin # path to slurm daemon binaries (e.g. slurmctld)
openhpc_bin_dir: /usr/bin # path to slurm user binaries (e.g. sinfo)
openhpc_lib_dir: /usr/lib64/slurm # path to slurm libraries

# Repository configuration
openhpc_extra_repos: []

@@ -127,12 +140,9 @@ ohpc_default_extra_repos:
gpgcheck: true
gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8"

# Concatenate all repo definitions here
ohpc_repos: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] + ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"

openhpc_munge_key_b64:
openhpc_login_only_nodes: ''
openhpc_module_system_install: true
openhpc_module_system_install: true # only works for install-ohpc.yml/main.yml

# Auto detection
openhpc_ram_multiplier: 0.95
72 changes: 72 additions & 0 deletions tasks/install-generic.yml
@@ -0,0 +1,72 @@
- include_tasks: pre.yml

- name: Create a list of slurm daemons
set_fact:
_ohpc_daemons: "{{ _ohpc_daemon_map | dict2items | selectattr('value') | items2dict | list }}"
vars:
_ohpc_daemon_map:
slurmctld: "{{ openhpc_enable.control }}"
slurmd: "{{ openhpc_enable.batch }}"
slurmdbd: "{{ openhpc_enable.database }}"
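# Sketch of the above: with control=true, batch=false, database=false, the
# dict2items | selectattr('value') | items2dict | list chain keeps only the
# truthy entries, so _ohpc_daemons evaluates to ['slurmctld'].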

- name: Ensure extra repos
ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
loop: "{{ openhpc_extra_repos }}"
loop_control:
label: "{{ item.name }}"

- name: Install system packages
dnf:
name: "{{ openhpc_generic_packages }}"

- name: Create Slurm user
user:
name: slurm
comment: SLURM resource manager
home: /etc/slurm
shell: /sbin/nologin

- name: Create Slurm unit files
template:
src: "{{ item }}.service.j2"
dest: /lib/systemd/system/{{ item }}.service
owner: root
group: root
mode: ug=rw,o=r
loop: "{{ _ohpc_daemons }}"
register: _slurm_systemd_units

- name: Get current library locations
shell:
cmd: "ldconfig -v | grep -v ^$'\t'" # noqa: no-tabs risky-shell-pipe
register: _slurm_ldconfig
changed_when: false

- name: Add library locations to ldd search path
copy:
dest: /etc/ld.so.conf.d/slurm.conf
content: "{{ openhpc_lib_dir }}"
owner: root
group: root
mode: ug=rw,o=r
when: openhpc_lib_dir not in _ldd_paths
vars:
_ldd_paths: "{{ _slurm_ldconfig.stdout_lines | map('split', ':') | map('first') }}"
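# Sketch of the above: "ldconfig -v" lists search directories on unindented
# lines like "/usr/lib64:" with libraries tab-indented beneath them; the grep
# drops the tab-indented lines and splitting on ":" leaves just the paths.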

- name: Reload Slurm unit files
# Can't do just this from systemd module
command: systemctl daemon-reload # noqa: command-instead-of-module no-changed-when no-handler
when: _slurm_systemd_units.changed

- name: Prepend $PATH with slurm user binary location
lineinfile:
path: /etc/environment
line: "{{ new_path }}"
regexp: "^{{ new_path | regex_escape }}"
owner: root
group: root
mode: u=rw,go=r
vars:
new_path: PATH="{{ openhpc_bin_dir }}:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"

- meta: reset_connection # to get new environment
18 changes: 8 additions & 10 deletions tasks/install.yml → tasks/install-ohpc.yml
@@ -3,16 +3,14 @@
- include_tasks: pre.yml

- name: Ensure OpenHPC repos
ansible.builtin.yum_repository:
name: "{{ item.name }}"
description: "{{ item.description | default(omit) }}"
file: "{{ item.file }}"
baseurl: "{{ item.baseurl | default(omit) }}"
metalink: "{{ item.metalink | default(omit) }}"
mirrorlist: "{{ item.mirrorlist | default(omit) }}"
gpgcheck: "{{ item.gpgcheck | default(omit) }}"
gpgkey: "{{ item.gpgkey | default(omit) }}"
loop: "{{ ohpc_repos }}"
ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
loop: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] }}"
loop_control:
label: "{{ item.name }}"

- name: Ensure extra repos
ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module]
loop: "{{ ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
loop_control:
label: "{{ item.name }}"

2 changes: 1 addition & 1 deletion tasks/main.yml
@@ -8,7 +8,7 @@

- name: Install packages
block:
- include_tasks: install.yml
- include_tasks: install-ohpc.yml
when: openhpc_enable.runtime | default(false) | bool
tags: install

46 changes: 23 additions & 23 deletions tasks/runtime.yml
@@ -11,12 +11,19 @@

- name: Ensure Slurm directories exist
file:
path: "{{ openhpc_state_save_location }}"
path: "{{ item.path }}"
owner: slurm
group: slurm
mode: 0755
mode: '0755'
state: directory
when: inventory_hostname == openhpc_slurm_control_host
loop:
- path: "{{ openhpc_state_save_location }}" # StateSaveLocation
enable: control
- path: "{{ openhpc_slurm_conf_path | dirname }}"
enable: control
- path: "{{ openhpc_slurmd_spool_dir }}" # SlurmdSpoolDir
enable: batch
when: "openhpc_enable[item.enable] | default(false) | bool"

- name: Retrieve Munge key from control host
# package install generates a node-unique one
@@ -32,7 +39,7 @@
dest: "/etc/munge/munge.key"
owner: munge
group: munge
mode: 0400
mode: '0400'
register: _openhpc_munge_key_copy

- name: Ensure JobComp logfile exists
Expand All @@ -41,15 +48,15 @@
state: touch
owner: slurm
group: slurm
mode: 0644
mode: '0644'
access_time: preserve
modification_time: preserve
when: openhpc_slurm_job_comp_type == 'jobcomp/filetxt'

- name: Template slurmdbd.conf
template:
src: slurmdbd.conf.j2
dest: /etc/slurm/slurmdbd.conf
dest: "{{ openhpc_slurm_conf_path | dirname }}/slurmdbd.conf"
mode: "0600"
owner: slurm
group: slurm
@@ -58,11 +65,11 @@

- name: Template slurm.conf
template:
src: slurm.conf.j2
dest: /etc/slurm/slurm.conf
src: "{{ openhpc_slurm_conf_template }}"
dest: "{{ openhpc_slurm_conf_path }}"
owner: root
group: root
mode: 0644
mode: '0644'
when: openhpc_enable.control | default(false)
notify:
- Restart slurmctld service
@@ -72,7 +79,7 @@
- name: Create gres.conf
template:
src: "{{ openhpc_gres_template }}"
dest: /etc/slurm/gres.conf
dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf"
mode: "0600"
owner: slurm
group: slurm
@@ -85,8 +92,8 @@
- name: Template cgroup.conf
# appears to be required even with NO cgroup plugins: https://slurm.schedmd.com/cgroups.html#cgroup_design
template:
src: cgroup.conf.j2
dest: /etc/slurm/cgroup.conf
src: "{{ openhpc_cgroup_template }}"
dest: "{{ openhpc_slurm_conf_path | dirname }}/cgroup.conf"
mode: "0644" # perms/ownership based off src from ohpc package
owner: root
group: root
@@ -96,15 +103,6 @@
register: ohpc_cgroup_conf
# NB uses restart rather than reload as this is needed in some cases

- name: Remove local tempfile for slurm.conf templating
ansible.builtin.file:
path: "{{ _slurm_conf_tmpfile.path }}"
state: absent
when: _slurm_conf_tmpfile.path is defined
delegate_to: localhost
changed_when: false # so molecule doesn't fail
become: no

- name: Ensure Munge service is running
service:
name: munge
@@ -129,7 +127,9 @@
changed_when: true
when:
- openhpc_slurm_control_host in ansible_play_hosts
- hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
- hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or
hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or
hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
notify:
- Restart slurmd service

@@ -143,7 +143,7 @@
create: yes
owner: root
group: root
mode: 0644
mode: '0644'
when:
- openhpc_enable.batch | default(false)
notify:
2 changes: 1 addition & 1 deletion templates/slurm.conf.j2
@@ -2,7 +2,7 @@ ClusterName={{ openhpc_cluster_name }}

# PARAMETERS
{% for k, v in openhpc_default_config | combine(openhpc_config) | items %}
{% if v != "omit" %}{# allow removing items using setting key: null #}
{% if v != "omit" %}{# allow removing items using setting key: omit #}
{% if k != 'SlurmctldParameters' %}{# handled separately due to configless mode #}
{{ k }}={{ v | join(',') if (v is sequence and v is not string) else v }}
{% endif %}
22 changes: 22 additions & 0 deletions templates/slurmctld.service.j2
@@ -0,0 +1,22 @@
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service
Wants=network-online.target
ConditionPathExists={{ openhpc_slurm_conf_path }}

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
ExecStart={{ openhpc_sbin_dir }}/slurmctld -D -s -f {{ openhpc_slurm_conf_path }} $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target