Commit bfe3cdc

Merge remote-tracking branch 'origin/main' into production-end-to-end-deployment-docs
2 parents 040790e + 13fa5c2 commit bfe3cdc

36 files changed: +533 -199 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ venv
 *.pyc
 packer/openhpc2
 .vscode
+requirements.yml.last

README.md

Lines changed: 0 additions & 2 deletions
@@ -25,8 +25,6 @@ The default configuration in this repository may be used to create a cluster to
 - Persistent state backed by an OpenStack volume.
 - NFS-based shared file system backed by another OpenStack volume.
 
-Note that the Open OnDemand portal and its remote apps are not usable with this default configuration.
-
 It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud.
 
 Before starting ensure that:

ansible/bootstrap.yml

Lines changed: 1 addition & 1 deletion
@@ -143,7 +143,7 @@
 - appliances_mode == 'configure'
 - not (dnf_repos_allow_insecure_creds | default(false)) # useful for development
 
-- hosts: cacerts:!builder
+- hosts: cacerts
 tags: cacerts
 gather_facts: false
 tasks:

ansible/roles/cacerts/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Configure CA certificates and trusts.
 
 ## Role variables
 
-- `ca-certificates`: Optional str. Path to directory containing certificates
+- `cacerts_cert_dir`: Optional str. Path to directory containing certificates
 in PEM or DER format. Any files here will be added to the list of CAs trusted
 by the system.

ansible/roles/cuda/README.md

Lines changed: 1 addition & 5 deletions
@@ -2,14 +2,10 @@
 
 Install NVIDIA drivers and optionally CUDA packages. CUDA binaries are added to the `$PATH` for all users, and the [NVIDIA persistence daemon](https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon) is enabled.
 
-## Prerequisites
-
-Requires OFED to be installed to provide required kernel-* packages.
-
 ## Role Variables
 
 - `cuda_repo_url`: Optional. URL of `.repo` file. Default is upstream for appropriate OS/architecture.
 - `cuda_nvidia_driver_stream`: Optional. Version of `nvidia-driver` stream to enable. This controls whether the open or proprietary drivers are installed and the major version. Changing this once the drivers are installed does not change the version.
-- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-8']`.
+- `cuda_packages`: Optional. Default provides CUDA Toolkit and GPUDirect Storage (GDS).
 - `cuda_package_version`: Optional. Default `latest` which will install the latest packages if not installed but won't upgrade already-installed packages. Use `'none'` to skip installing CUDA.
 - `cuda_persistenced_state`: Optional. State of systemd `nvidia-persistenced` service. Values as [ansible.builtin.systemd:state](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html#parameter-state). Default `started`.

ansible/roles/cuda/defaults/main.yml

Lines changed: 5 additions & 5 deletions
@@ -1,12 +1,12 @@
 cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos/rhel{{ ansible_distribution_major_version }}/{{ ansible_architecture }}/cuda-rhel{{ ansible_distribution_major_version }}.repo"
-cuda_nvidia_driver_stream: '570-open'
-cuda_package_version: '12.8.1-1'
-cuda_version_short: '12.8'
+cuda_nvidia_driver_stream: '575-open'
+cuda_nvidia_driver_pkg: "nvidia-open-3:575.57.08-1.el{{ ansible_distribution_major_version }}"
+cuda_package_version: '12.9.1-1'
+cuda_version_short: "{{ (cuda_package_version | split('.'))[0:2] | join('.') }}" # major.minor
 cuda_packages:
-- "cuda{{ ('-' + cuda_package_version) if cuda_package_version != 'latest' else '' }}"
+- "cuda-toolkit-{{ cuda_package_version }}"
 - nvidia-gds
 - cmake
-- cuda-toolkit-12-8
 cuda_samples_release_url: "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v{{ cuda_version_short }}.tar.gz"
 cuda_samples_path: "/var/lib/{{ ansible_user }}/cuda_samples"
cuda_samples_programs:
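
Note on the derived `cuda_version_short` in this hunk: the new Jinja expression splits `cuda_package_version` on `.` and rejoins the first two fields. The following is a minimal Python sketch of the same string logic, for illustration only. The function name `version_short` is hypothetical and is not part of the role; it merely mirrors the filter chain.

```python
def version_short(package_version: str) -> str:
    """Mirror of the Jinja chain: (cuda_package_version | split('.'))[0:2] | join('.')."""
    # Split on dots, keep at most the first two fields, and rejoin them.
    return ".".join(package_version.split(".")[0:2])


# The pinned default from this commit reduces to major.minor.
print(version_short("12.9.1-1"))  # prints 12.9

# A value with no dots has a single field, so it passes through unchanged.
print(version_short("latest"))  # prints latest
```

A value without dots, such as `latest`, passes through unchanged, so anything templated from `cuda_version_short` (for example `cuda_samples_release_url`) would then contain `latest`.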

ansible/roles/cuda/tasks/install.yml

Lines changed: 5 additions & 21 deletions
@@ -1,16 +1,5 @@
 
-# Based on https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#redhat8-installation
-
-- name: Check for OFED/DOCA
-command:
-cmd: dnf list --installed rdma-core
-register: _dnf_rdma_core
-changed_when: false
-
-- name: Assert OFED installed
-assert:
-that: "'mlnx' in _dnf_rdma_core.stdout"
-fail_msg: "Did not find 'mlnx' in installed rdma-core package, is OFED/DOCA installed?"
+# Based on https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
 
 - name: Install cuda repo
 get_url:
@@ -29,23 +18,18 @@
 when: "'No matching Modules to list' in _cuda_driver_module_enabled.stderr"
 changed_when: "'Nothing to do' not in _cuda_driver_module_enable.stdout"
 
-- name: Check if nvidia driver module is installed
-ansible.builtin.command: dnf module list --installed nvidia-driver
-changed_when: false
-failed_when: false
-register: _cuda_driver_module_installed
-
 - name: Install nvidia drivers
-ansible.builtin.command: dnf module install -y nvidia-driver
+ansible.builtin.dnf:
+name: "{{ cuda_nvidia_driver_pkg }}"
 register: _cuda_driver_install
-when: "'No matching Modules to list' in _cuda_driver_module_installed.stderr"
-changed_when: "'Nothing to do' not in _cuda_driver_install.stdout"
 
 - name: Check kernel has not been modified
 assert:
 that: "'kernel ' not in _cuda_driver_install.stdout | default('')" # space ensures we don't flag e.g. kernel-devel-matched
 fail_msg: "{{ _cuda_driver_install.stdout_lines | default([]) | select('search', 'kernel ') }}"
 
+# Based on https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
+
 - name: Install cuda packages
 ansible.builtin.dnf:
 name: "{{ cuda_packages }}"

ansible/roles/cuda/tasks/samples.yml

Lines changed: 0 additions & 33 deletions
@@ -25,36 +25,3 @@
 cmd: . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
 chdir: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build"
 creates: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-
-- name: Run CUDA deviceQuery
-command:
-cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-register: _cuda_devicequery
-
-- name: Set fact for CUDA devices
-set_fact:
-cuda_devices: "{{ _cuda_devicequery.stdout | regex_findall('Device (\\d+):') }}"
-
-- name: Run CUDA bandwidth test
-command:
-cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/bandwidthTest/bandwidthTest --device={{ item }}"
-register: _cuda_bandwidthtest
-loop: "{{ cuda_devices }}"
-loop_control:
-label: "Device {{ item }}" # e.g '0'
-
-- name: Summarise bandwidth test output
-debug:
-msg: |
-{{ _parts[1].splitlines()[0] | trim }}
-Bandwidths: (Gb/s)
-Host to Device: {{ _parts[2].split()[-1] }}
-Device to Host: {{ _parts[3].split()[-1] }}
-Device to Device: {{ _parts[4].split()[-1] }}
-{{ ': '.join(_parts[5].split('=') | map('trim')) }}
-{{ _parts[6] }}
-loop: "{{ _cuda_bandwidthtest.results }}"
-vars:
-_parts: "{{ item.stdout.split('\n\n') }}"
-loop_control:
-label: "Device {{ item.item }}" # e.g '0'

ansible/roles/lustre/README.md

Lines changed: 6 additions & 3 deletions
@@ -7,8 +7,7 @@ Install and configure a Lustre client. This builds RPM packages from source.
 **NB:** Currently this only supports RockyLinux 9.
 
 ## Role Variables
-
-- `lustre_version`: Optional str. Version of lustre to build, default `2.15.6` which is the first version with EL9.5 support
+The following variables control configuration of Lustre clients.
 - `lustre_lnet_label`: Optional str. The "lnet label" part of the host's NID, e.g. `tcp0`. Only the `tcp` protocol type is currently supported. Default `tcp`.
 - `lustre_mgs_nid`: Required str. The NID(s) for the MGS, e.g. `192.168.227.11@tcp1` (separate mutiple MGS NIDs using `:`).
 - `lustre_mounts`: Required list. Define Lustre filesystems and mountpoints as a list of dicts with keys:
@@ -19,7 +18,11 @@ Install and configure a Lustre client. This builds RPM packages from source.
 - `lustre_mount_state`. Optional default mount state for all mounts, as for [ansible.posix.mount](https://docs.ansible.com/ansible/latest/collections/ansible/posix/mount_module.html#parameter-state). Default is `mounted`.
 - `lustre_mount_options`. Optional default mount options. Default values are systemd defaults from [Lustre client docs](http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes).
 
-The following variables control the package build and and install and should not generally be required:
+The following variables control the package build and and install:
+- `lustre_version`: Optional str. Version of lustre to build, default `2.15.6/lu-18085`
+which is the first version with EL9.5 support, plus a fix for https://jira.whamcloud.com/browse/LU-18085.
+- `lustre_repo`: Optional str. URL for Lustre repo. Default is a StackHPC repo
+incorporating the above fix.
 - `lustre_build_packages`: Optional list. Prerequisite packages required to build Lustre. See `defaults/main.yml`.
 - `lustre_build_dir`: Optional str. Path to build lustre at, default `/tmp/lustre-release`.
 - `lustre_configure_opts`: Optional list. Options to `./configure` command. Default builds client rpms supporting Mellanox OFED, without support for GSS keys.

ansible/roles/lustre/defaults/main.yml

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
-lustre_version: '2.15.6' # https://www.lustre.org/lustre-2-15-6-released/
+lustre_repo: https://github.com/stackhpc/lustre-release.git
+lustre_version: '2.15.6/lu-18085' # Fixes https://jira.whamcloud.com/browse/LU-18085
 lustre_lnet_label: tcp
 #lustre_mgs_nid:
 lustre_mounts: []
