From b27da4066a6c9f2cce610db085fcf224cdd84897 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 14:03:27 +0100 Subject: [PATCH 1/8] First draft of production end-to-end docs --- docs/production.md | 312 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 250 insertions(+), 62 deletions(-) diff --git a/docs/production.md b/docs/production.md index 7fcff1d7e..b990609c8 100644 --- a/docs/production.md +++ b/docs/production.md @@ -1,74 +1,251 @@ # Production Deployments -This page contains some brief notes about differences between the default/demo -configuration (as described in the main [README.md](../README.md)) and -production-ready deployments. +This page will guide you on how to create production-ready deployments. While +you can start right away with this guide, you may find it useful to try with a +demo deployment first, as described in the [main README](../README.md). -- Get it agreed up front what the cluster names will be. Changing this later - requires instance deletion/recreation. +## Prerequisites -- At least three environments should be created: - - `site`: site-specific base environment - - `production`: production environment - - `staging`: staging environment +Before starting ensure that: - A `dev` environment should also be created if considered required, or this - can be left until later. + - You have root access on the deploy host. - These can all be produced using the cookicutter instructions, but the - `production` and `staging` environments will need their - `environments/$ENV/ansible.cfg` file modifying so that they point to the - `site` environment: + - You can create instances from the [latest Slurm appliance + image](https://github.com/stackhpc/ansible-slurm-appliance/releases), + which already contains the required packages. This is built and tested in + StackHPC's CI. - ```ini - inventory = ../common/inventory,../site/inventory,inventory - ``` + - You have an SSH keypair defined in OpenStack, with the private key + available on the deploy host. + + - Created instances have access to internet (note proxies can be setup + through the appliance if necessary). + + - Created instances have accurate/synchronised time (for VM instances this is + usually provided by the hypervisor; if not or for bare metal instances it + may be necessary to configure a time service via the appliance). + + - Three security groups are present: ``default`` allowing intra-cluster + communication, ``SSH`` allowing external access via SSH and ``HTTPS`` + allowing access for Open OnDemand. + +### Setup deploy host + +The following operating systems are supported for the deploy host: + + - Rocky Linux 9 + + - Rocky Linux 8 + +These instructions assume the deployment host is running Rocky Linux 8: + +```bash +sudo yum install -y git python38 +git clone https://github.com/stackhpc/ansible-slurm-appliance +cd ansible-slurm-appliance +git checkout ${latest-release-tag} +./dev/setup-env.sh +``` + +You will also need to install +[OpenTofu](https://opentofu.org/docs/intro/install/rpm/). + +## Version control + +A production deployment should be set up under version control, so you should +create a fork of this repo. + +To start, you should use the [latest tagged +release](https://github.com/stackhpc/ansible-slurm-appliance/releases). v1.161 +has been used as an example here, make sure to channge this. Do not use the +default main branch, as this may have features that are still works in +progress. The steps below show how to create a site-specific branch. 
+ + ```bash + git clone https://github.com/your-fork/ansible-slurm-appliance + git checkout v1.161 + git checkout -b site/main + git push -u origin site/main + ``` + +## Environment setup + +Get it agreed up front what the cluster names will be. Changing this later +requires instance deletion/recreation. + +### Cookiecutter instructions + +- Run the following from the repository root to activate the venv: + + ```bash + . venv/bin/activate + ``` + +- Use the `cookiecutter` template to create a new environment to hold your + configuration: + + ```bash + cd environments + cookiecutter skeleton + ``` + + and follow the prompts to complete the environment name and description. + + **NB:** In subsequent sections this new environment is referred to as `$ENV`. + +- Go back to the root folder and activate the new environment: + + ```bash + cd .. + . environments/$ENV/activate + ``` -- To avoid divergence of configuration all possible overrides for group/role + And generate secrets for it: + + ```bash + ansible-playbook ansible/adhoc/generate-passwords.yml + ``` + +### Environments structure + +At least three environments will be created: + + - `site`: site-specific base environment + + - `production`: production environment + + - `staging`: staging environment + +A `dev` environment should also be created if considered required, or this can +be left until later. + +These will all be produced using the cookicutter instructions, but the +`production` and `staging` environments will need their +`environments/$ENV/ansible.cfg` file modifying so that they point to the `site` +environment: + + ```ini + inventory = ../common/inventory,../site/inventory,inventory + ``` + +To avoid divergence of configuration all possible overrides for group/role vars should be placed in `environments/site/inventory/group_vars/all/*.yml` unless the value really is environment-specific (e.g. DNS names for `openondemand_servername`). -- Where possible hooks should also be placed in `environments/site/hooks/` +Where possible hooks should also be placed in `environments/site/hooks/` and referenced from the `site` and `production` environments, e.g.: - ```yaml - # environments/production/hooks/pre.yml: - - name: Import parent hook - import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml" - ``` + ```yaml + # environments/production/hooks/pre.yml: + - name: Import parent hook + import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml" + ``` + +OpenTofu configurations should be defined in the `site` environment and used +as a module from the other environments. This can be done with the +cookie-cutter generated configurations: -- OpenTofu configurations should be defined in the `site` environment and used - as a module from the other environments. This can be done with the - cookie-cutter generated configurations: - Delete the *contents* of the cookie-cutter generated `tofu/` directories from the `production` and `staging` environments. + - Create a `main.tf` in those directories which uses `site/tofu/` as a [module](https://opentofu.org/docs/language/modules/), e.g. : - ``` - ... - module "cluster" { - source = "../../site/tofu/" + ``` + ... + module "cluster" { + source = "../../site/tofu/" + cluster_name = "foo" + ... + } + ``` - cluster_name = "foo" - ... - } - ``` +Note that: + + - Environment-specific variables (`cluster_name`) should be hardcoded into + the module block. 
- Note that: - - Environment-specific variables (`cluster_name`) should be hardcoded - into the module block. - - Environment-independent variables (e.g. maybe `cluster_net` if the - same is used for staging and production) should be set as *defaults* - in `environments/site/tofu/variables.tf`, and then don't need to - be passed in to the module. + - Environment-independent variables (e.g. maybe `cluster_net` if the same + is used for staging and production) should be set as *defaults* in + `environments/site/tofu/variables.tf`, and then don't need to be passed + in to the module. + +## Define and deploy infrastructure + +Create an OpenTofu variables file to define the required infrastructure, e.g.: + + ``` + # environments/$ENV/tofu/terraform.tfvars + cluster_name = "mycluster" + cluster_networks = [ + { + network = "some_network" # * + subnet = "some_subnet" # * + } + ] + key_pair = "my_key" # * + control_node_flavor = "some_flavor_name" + login = { + # Arbitrary group name for these login nodes + interactive = { + nodes: ["login-0"] + flavor: "login_flavor_name" # * + } + } + cluster_image_id = "rocky_linux_9_image_uuid" + compute = { + # Group name used for compute node partition definition + general = { + nodes: ["compute-0", "compute-1"] + flavor: "compute_flavor_name" # * + } + } + ``` + +Variables marked `*` refer to OpenStack resources which must already exist. + +The above is a minimal configuration - for all variables and descriptions see +`environments/$ENV/tofu/variables.tf`. + +To deploy this infrastructure, ensure the venv and the environment are +[activated](#cookiecutter-instructions) and run: + + ```bash + export OS_CLOUD=openstack + cd environments/$ENV/tofu/ + tofu init + tofu apply + ``` + +and follow the prompts. Note the OS_CLOUD environment variable assumes that +OpenStack credentials are defined using a +[clouds.yaml](https://docs.openstack.org/python-openstackclient/latest/configuration/index.html#clouds-yaml) +file in a default location with the default cloud name of `openstack`. + +### Configure appliance + +To configure the appliance, ensure the venv and the environment are +[activated](#create-a-new-environment) and run: + + ```bash + ansible-playbook ansible/site.yml + ``` + +Once it completes you can log in to the cluster using: + + ```bash + ./dev/ansible-ssh login + ``` + +## Production further configuration - Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`. To ensure staging environments are a good model for production this should generally be moved into the `site` environment. It should be encrypted - using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) + using [Ansible + vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) and then committed to the repository. - Ensure created instances have accurate/synchronised time. For VM instances @@ -76,13 +253,13 @@ and referenced from the `site` and `production` environments, e.g.: instances) it may be necessary to configure or proxy `chronyd` via an environment hook. -- The cookiecutter provided OpenTofu configurations define resources for home and - state volumes. The former may not be required if the cluster's `/home` is +- The cookiecutter provided OpenTofu configurations define resources for home + and state volumes. The former may not be required if the cluster's `/home` is provided from an external filesystem (or Manila). 
In any case, in at least the production environment, and probably also in the staging environment, the volumes should be manually created and the resources changed to [data - resources](https://opentofu.org/docs/language/data-sources/). This ensures that even if the cluster is deleted via tofu, the - volumes will persist. + resources](https://opentofu.org/docs/language/data-sources/). This ensures + that even if the cluster is deleted via tofu, the volumes will persist. For a development environment, having volumes under tofu control via volume resources is usually appropriate as there may be many instantiations @@ -98,9 +275,12 @@ and referenced from the `site` and `production` environments, e.g.: - Configure Open OnDemand - see [specific documentation](openondemand.md). -- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml` +- Remove the `demo_user` user from + `environments/$ENV/inventory/group_vars/all/basic_users.yml` -- Consider whether having (read-only) access to Grafana without login is OK. If not, remove `grafana_auth_anonymous` in `environments/$ENV/inventory/group_vars/all/grafana.yml` +- Consider whether having (read-only) access to Grafana without login is OK. If + not, remove `grafana_auth_anonymous` in + `environments/$ENV/inventory/group_vars/all/grafana.yml` - Modify `environments/site/tofu/nodes.tf` to provide fixed IPs for at least the control node, and (if not using FIPs) the login node(s): @@ -114,13 +294,15 @@ and referenced from the `site` and `production` environments, e.g.: } } ``` - + Note the variable `control_ip_address` is new. - Using fixed IPs will require either using admin credentials or policy changes. + Using fixed IPs will require either using admin credentials or policy + changes. -- If floating IPs are required for login nodes, modify the OpenTofu configurations - appropriately. +- If floating IPs are required for login nodes, modify the OpenTofu + configurations appropriately. + TODO add example - Consider whether mapping of baremetal nodes to ironic nodes is required. See [PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485). @@ -131,9 +313,10 @@ and referenced from the `site` and `production` environments, e.g.: - See the [hpctests docs](../ansible/roles/hpctests/README.md) for advice on raising `hpctests_hpl_mem_frac` during tests. -- By default, OpenTofu (and Terraform) [limits](https://opentofu.org/docs/cli/commands/apply/#apply-options) - the number of concurrent operations to 10. This means that for example only - 10 ports or 10 instances can be deployed at once. This should be raised by +- By default, OpenTofu (and Terraform) + [limits](https://opentofu.org/docs/cli/commands/apply/#apply-options) the + number of concurrent operations to 10. This means that for example only 10 + ports or 10 instances can be deployed at once. 
This should be raised by modifying `environments/$ENV/activate` to add a line like: export TF_CLI_ARGS_apply="-parallelism=25" @@ -142,12 +325,17 @@ and referenced from the `site` and `production` environments, e.g.: Note that any time spent blocked due to this parallelism limit does not count against the (un-overridable) internal OpenTofu timeout of 30 minutes -- By default, OpenStack Nova also [limits](https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_concurrent_builds) +- By default, OpenStack Nova also + [limits](https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_concurrent_builds) the number of concurrent instance builds to 10. This is per Nova controller, - so 10x virtual machines per hypervisor. For baremetal nodes it is 10 per cloud - if the OpenStack version is earlier than Caracel, else this limit can be - raised using [shards](https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ironic-shards.html). + so 10x virtual machines per hypervisor. For baremetal nodes it is 10 per + cloud if the OpenStack version is earlier than Caracel, else this limit can + be raised using + [shards](https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ironic-shards.html). In general it should be possible to raise this value to 50-100 if the cloud is properly tuned, again, demonstrated through testing. -- Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md). +- Enable alertmanager if Slack is available - see + [docs/alerting.md](./alerting.md). + +For further information see the [docs](docs/) directory. From 674be6202d9678679d6a9dea39fca6878b298299 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 14:08:28 +0100 Subject: [PATCH 2/8] Ubuntu Jammy is also supported --- docs/production.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/production.md b/docs/production.md index b990609c8..c506289ea 100644 --- a/docs/production.md +++ b/docs/production.md @@ -33,6 +33,8 @@ Before starting ensure that: The following operating systems are supported for the deploy host: + - Ubuntu Jammy 22.04 + - Rocky Linux 9 - Rocky Linux 8 From 37830bb230ffd5c6b0b7671aea9c643cf1d25881 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 14:12:56 +0100 Subject: [PATCH 3/8] Add TODOs --- docs/production.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/production.md b/docs/production.md index c506289ea..783b83cda 100644 --- a/docs/production.md +++ b/docs/production.md @@ -59,7 +59,7 @@ create a fork of this repo. To start, you should use the [latest tagged release](https://github.com/stackhpc/ansible-slurm-appliance/releases). v1.161 -has been used as an example here, make sure to channge this. Do not use the +has been used as an example here, make sure to change this. Do not use the default main branch, as this may have features that are still works in progress. The steps below show how to create a site-specific branch. @@ -304,7 +304,7 @@ Once it completes you can log in to the cluster using: - If floating IPs are required for login nodes, modify the OpenTofu configurations appropriately. - TODO add example + **TODO: add example** - Consider whether mapping of baremetal nodes to ironic nodes is required. See [PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485). 
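
As a purely illustrative sketch of the floating IP point above: a login group
using the optional `fip_addresses` parameter might look something like the
following. The address is a placeholder, the floating IPs must be created in
the project beforehand, and `environments/$ENV/tofu/variables.tf` should be
checked for the exact type and placement of this parameter:

  ```
  login = {
      interactive = {
          nodes: ["login-0"]
          flavor: "login_flavor_name"
          # Placeholder: a pre-created floating IP from your project
          fip_addresses: ["203.0.113.10"]
      }
  }
  ```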
@@ -340,4 +340,8 @@ Once it completes you can log in to the cluster using: - Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md). + **TODO: custom image builds, when/why and how** + + **TODO: any further docs to link to. cuda, lustre, filesystems when written** + For further information see the [docs](docs/) directory. From 91c62811e82bf700c89c9b6721a0977f7e5ec4a2 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 15:28:03 +0100 Subject: [PATCH 4/8] Accomplish TODOs --- docs/production.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/production.md b/docs/production.md index 783b83cda..78e65c83d 100644 --- a/docs/production.md +++ b/docs/production.md @@ -302,9 +302,9 @@ Once it completes you can log in to the cluster using: Using fixed IPs will require either using admin credentials or policy changes. -- If floating IPs are required for login nodes, modify the OpenTofu - configurations appropriately. - **TODO: add example** +- If floating IPs are required for login nodes, these can be set using the + optional parameter `fip_addresses`. These need to be created in your project + beforehand. - Consider whether mapping of baremetal nodes to ironic nodes is required. See [PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485). @@ -340,8 +340,10 @@ Once it completes you can log in to the cluster using: - Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md). - **TODO: custom image builds, when/why and how** +- For some features, such as installing [DOCA-OFED](../roles/doca/README.md) or + [CUDA](../roles/cuda/README.md), you will need to build a custom image. It is + recommended that you build this on top of the latest existing openhpc image. + See the [image-build docs](image-build.md) for details. - **TODO: any further docs to link to. cuda, lustre, filesystems when written** - -For further information see the [docs](docs/) directory. +For further information, including additional configuration guides and +operations instructions, see the [docs](README.md) directory. From aa457d3d3cd04fbbedb478cc843dd1a5e0500977 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 15:37:09 +0100 Subject: [PATCH 5/8] Mention networks docs --- docs/production.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/production.md b/docs/production.md index 78e65c83d..8bca452b4 100644 --- a/docs/production.md +++ b/docs/production.md @@ -306,6 +306,9 @@ Once it completes you can log in to the cluster using: optional parameter `fip_addresses`. These need to be created in your project beforehand. +- A production deployment may have a more complex networking requirements than + just a simple network. See the [networks docs](networks.md) for details. + - Consider whether mapping of baremetal nodes to ironic nodes is required. See [PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485). From 0e93f4898e659cf9f548396d0573c1c313cb2bef Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 15:45:47 +0100 Subject: [PATCH 6/8] NFS --- docs/production.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/docs/production.md b/docs/production.md index 8bca452b4..7c1380d1c 100644 --- a/docs/production.md +++ b/docs/production.md @@ -343,6 +343,16 @@ Once it completes you can log in to the cluster using: - Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md). 
+- By default, the appliance uses a built-in NFS share backed by an OpenStack + volume for the cluster home directories. You may find that you want to change + this. The following alternatives are supported: + + - External NFS + + - CephFS via OpenStack Manila + + - [Lustre](../roles/lustre/README.md) + - For some features, such as installing [DOCA-OFED](../roles/doca/README.md) or [CUDA](../roles/cuda/README.md), you will need to build a custom image. It is recommended that you build this on top of the latest existing openhpc image. From 040790e970380f60275aa77865721546475c6379 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Thu, 22 May 2025 15:54:59 +0100 Subject: [PATCH 7/8] Clarify image --- docs/production.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/production.md b/docs/production.md index 7c1380d1c..c3fd9f1bb 100644 --- a/docs/production.md +++ b/docs/production.md @@ -210,6 +210,10 @@ Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/tofu/variables.tf`. +The cluster image used should match the release which you are deploying with. +Images are published alongside the release tags +[here](https://github.com/stackhpc/ansible-slurm-appliance/releases). + To deploy this infrastructure, ensure the venv and the environment are [activated](#cookiecutter-instructions) and run: From 5aed4a0437489f7a5cffda4b25002b699612a2f9 Mon Sep 17 00:00:00 2001 From: Matt Crees Date: Tue, 19 Aug 2025 15:48:29 +0100 Subject: [PATCH 8/8] Formatting changes --- docs/production.md | 65 +++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 32 deletions(-) diff --git a/docs/production.md b/docs/production.md index 748e066cd..a37d4e3bb 100644 --- a/docs/production.md +++ b/docs/production.md @@ -108,16 +108,17 @@ requires instance deletion/recreation. ### Environments structure -At least two environments should be created using cookiecutter, which will derive from the `site` base environment: +At least two environments should be created using cookiecutter, which will +derive from the `site` base environment: - `production`: production environment - `staging`: staging environment A `dev` environment should also be created if considered required, or this can be left until later. -In general only the `inventory/groups` file in the `site` environment is needed - -it can be modified as required to -enable features for all environments at the site. +In general only the `inventory/groups` file in the `site` environment is needed +- it can be modified as required to enable features for all environments at the +site. To avoid divergence of configuration all possible overrides for group/role vars should be placed in `environments/site/inventory/group_vars/all/*.yml` @@ -133,14 +134,6 @@ and referenced from the `site` and `production` environments, e.g.: import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml" ``` -When setting OpenTofu configurations: - - Environment-specific variables (`cluster_name`) should be hardcoded - as arguments into the cluster module block at `environments/$ENV/tofu/main.tf`. - - Environment-independent variables (e.g. maybe `cluster_net` if the - same is used for staging and production) should be set as *defaults* - in `environments/site/tofu/variables.tf`, and then don't need to - be passed in to the module. 
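
As a purely illustrative sketch of setting such a site-wide default (the
variable name simply mirrors the `cluster_net` example above and the value is
a placeholder, so check the variables actually defined in
`environments/site/tofu/variables.tf`):

  ```
  # environments/site/tofu/variables.tf
  variable "cluster_net" {
      type    = string
      default = "some_network"   # network shared by staging and production
  }
  ```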
- OpenTofu configurations should be defined in the `site` environment and used as a module from the other environments. This can be done with the cookie-cutter generated configurations: @@ -278,13 +271,13 @@ Once it completes you can log in to the cluster using: either for a specific environment within the cluster module block in `environments/$ENV/tofu/main.tf`, or as the site default by changing the default in `environments/site/tofu/variables.tf`. - + For a development environment allowing OpenTofu to manage the volumes using the default value of `"manage"` for those varibles is usually appropriate, as it allows for multiple clusters to be created with this environment. - - If no home volume at all is required because the home directories are provided - by a parallel filesystem (e.g. manila) set + + If no home volume at all is required because the home directories are + provided by a parallel filesystem (e.g. manila) set home_volume_provisioning = "none" @@ -302,21 +295,23 @@ Once it completes you can log in to the cluster using: - Consider whether Prometheus storage configuration is required. By default: - A 200GB state volume is provisioned (but see above) - - The common environment [sets](../environments/common/inventory/group_vars/all/prometheus.yml) - a maximum retention of 100 GB and 31 days + - The common environment + [sets](../environments/common/inventory/group_vars/all/prometheus.yml) a + maximum retention of 100 GB and 31 days. These may or may not be appropriate depending on the number of nodes, the scrape interval, and other uses of the state volume (primarily the `slurmctld` - state and the `slurmdbd` database). See [docs/monitoring-and-logging](./monitoring-and-logging.md) - for more options. + state and the `slurmdbd` database). See + [docs/monitoring-and-logging](./monitoring-and-logging.md) for more options. - Configure Open OnDemand - see [specific documentation](openondemand.md) which notes specific variables required. - Configure Open OnDemand - see [specific documentation](openondemand.md). -- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`. - Replace the `hpctests_user` in `environments/$ENV/inventory/group_vars/all/hpctests.yml` with - an appropriately configured user. +- Remove the `demo_user` user from + `environments/$ENV/inventory/group_vars/all/basic_users.yml`. Replace the + `hpctests_user` in `environments/$ENV/inventory/group_vars/all/hpctests.yml` + with an appropriately configured user. - Consider whether having (read-only) access to Grafana without login is OK. If not, remove `grafana_auth_anonymous` in @@ -325,15 +320,21 @@ Once it completes you can log in to the cluster using: - A production deployment may have a more complex networking requirements than just a simple network. See the [networks docs](networks.md) for details. -- If floating IPs are required for login nodes, create these in OpenStack and add the IPs into - the OpenTofu `login` definition. - -- Consider enabling topology aware scheduling. This is currently only supported if your cluster does not include any baremetal nodes. This can be enabled by: - 1. Creating Availability Zones in your OpenStack project for each physical rack - 2. Setting the `availability_zone` fields of compute groups in your OpenTofu configuration - 3. Adding the `compute` group as a child of `topology` in `environments/$ENV/inventory/groups` - 4. 
(Optional) If you are aware of the physical topology of switches above
       the rack-level, override `topology_above_rack_topology` in your group
       vars (see [topology docs](../ansible/roles/topology/README.md) for more
       detail)

- Consider whether mapping of baremetal nodes to ironic nodes is required. See
  [PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485).
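
As a purely illustrative sketch of the topology-aware scheduling steps above
(group, flavor and availability zone names are placeholders, and exactly how
the `topology` group is declared depends on what is already present in the
`groups` file):

  ```
  # environments/$ENV/tofu/terraform.tfvars
  compute = {
      rack1 = {
          nodes: ["compute-0", "compute-1"]
          flavor: "compute_flavor_name"
          availability_zone: "rack1-az"   # AZ created for this physical rack
      }
  }
  ```

  ```ini
  # environments/$ENV/inventory/groups
  [topology:children]
  compute
  ```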