
Use a hook from a parent environment

Hooks from parent environments don't get run by default, because site.yml relies on APPLIANCES_ENVIRONMENT_ROOT to find them. However you can run them explicitly using:

# environments/child/hooks/pre.yml
- name: Import parent hook
  vars:
    appliances_environment_root: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}"
  import_playbook: "{{ appliances_environment_root }}/../parent/hooks/pre.yml"

where child and parent are the environment names.

Create a cluster with a combined compute/control/login node

  • Create a group control containing just the first compute node, and add that group into hpctests, e.g. in environments/<myenv>/inventory/groups:

    [control]
    combined-compute-0
    
    [hpctests:children]
    control
  • Do NOT create a group login (that is for login-only nodes)

  • Create a post-hook like this:

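    # e.g. environments/<myenv>/hooks/post.yml (path assumed from the hooks layout shown above)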
    - hosts: control
      become: true
      tasks:
        - name: Prevent ansible_user's processes being killed on compute nodes at job completion
          replace:
            path: /etc/slurm/slurm.epilog.clean
            regexp: 'if \[ \$SLURM_UID -lt 100 \] ; then'
            replace: "if [[ $SLURM_UID -lt 100 || $SLURM_JOB_USER -eq {{ ansible_user }} ]] ; then"

Rerun slurm configuration fast(ish)

You can rerun the Slurm setup more quickly (e.g. for partition/node changes, or Slurm template debugging) using:

$ ansible-playbook ansible/slurm.yml --tags openhpc --skip-tags install
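
To preview what that run would change before applying it, the same invocation can be combined with check/diff mode (a sketch using standard ansible-playbook options; some tasks may not fully support check mode):

$ ansible-playbook ansible/slurm.yml --tags openhpc --skip-tags install --check --diff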

Run CI and merge a PR from a fork

GitHub won't (by default) inject secrets into workflows from forked repos, so the OpenStack-based CI won't run: the runner won't have the credentials needed to reach our OpenStack. In addition, the repo is configured to require approval for workflows from forked repos (approval should therefore be denied, because those workflows can't do anything useful).

The proposed approach is therefore as follows:

  • Review the PR for correctness.
  • Review the PR for safety, i.e. no changes which could leak the repository secrets, provide access to or leak information about our infrastructure
  • Get changes made until happy.
  • Create a new branch and change the merge target for that PR to that new branch (see the sketch after this list).
  • Merge the PR into the new branch - this will run CI.
  • Make tweaks as necessary.
  • Go through normal (internal) review to merge new branch into main.
  • Merge new branch into main.
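
A minimal sketch of the branch/retarget step, assuming a hypothetical branch name ci-fork-pr (the PR's base branch is changed via the Edit button next to its title in the GitHub UI):

$ git checkout main && git pull
$ git checkout -b ci-fork-pr              # hypothetical branch name
$ git push -u origin ci-fork-pr
# then change the PR's base branch to ci-fork-pr in the GitHub UI and merge it there to run CI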

Set up an autoscaling cluster (DRAFT)

TODO: add some notes about the ordering.

  1. Create an application credential (by default read from ~/.config/openstack/clouds.yaml; alternatively set the variables autoscale_clouds and openhpc_rebuild_clouds to point at the correct file, for the roles stackhpc.slurm_openstack_tools.autoscale and stackhpc.slurm_openstack_tools.rebuild respectively). See the clouds.yaml sketch after this list.
  2. Create an initial cluster with e.g. 2x nodes and check functionality (e.g. run hpctests, check OOD, monitoring etc.) to confirm the config is OK.
  3. Run the Packer build (for compute and login images) - see packer/packer-manifest.json for image IDs (this can be done in parallel with 1.).
  4. Optionally, reimage login node.
  5. Optionally, reimage the compute nodes and check the cluster is still up, to confirm the image is OK.
  6. Leave partition(s) defined, but empty them by deleting compute nodes and removing nodes from group definition(s) <openhpc_cluster_name>_<partition_name>.
  7. Add cloud_ definitions to partition definitions, e.g.:
    openhpc_slurm_partitions:
    - name: small
      cloud_nodes: autoscale-small-[0-1]
      cloud_instances:
        flavor: general.v1.tiny
        image: 34e88d94-9b36-4d73-abfb-df98acea5513
        keypair: slurm-app-ci
        network: stackhpc-ci-geneve
    
  8. Set any autoscaling parameters: see stackhpc.slurm_openstack_tools.autoscale/README.md#role-variables. NB: RAM and CPU information will need setting.
  9. Rerun ansible/slurm.yml to push this info into the Slurm configuration.
  10. Optionally, log in to the cluster and check that sinfo shows the above cloud node names in the powered-down (~) state.
  11. Optionally, run hpctests again to check nodes respond.
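
For step 1, a minimal clouds.yaml sketch for an application credential (the cloud name openstack and all values below are placeholders):

clouds:
  openstack:
    auth:
      auth_url: https://keystone.example.com:5000   # placeholder
      application_credential_id: "REPLACE_ME"       # placeholder
      application_credential_secret: "REPLACE_ME"   # placeholder
    region_name: RegionOne                          # placeholder
    interface: public
    identity_api_version: 3
    auth_type: v3applicationcredential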

Note: to recover failed autoscaling nodes, check whether the state is shown as powering up (#) in sinfo. If it is, wait for it to change to DOWN (i.e. the node did not "resume" within ResumeTimeout), then run scontrol update state=resume nodename=.... It should change back to the idle~ state.
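
A minimal sketch of that recovery, reusing a node name from the example partition above:

$ sinfo -N -l                                               # '#' suffix on the state means powering up
$ scontrol update state=resume nodename=autoscale-small-0   # example node name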

Configure for Open OnDemand

The CI uses basic auth with a predefined user (which is not rocky, as $HOME for rocky is not on the shared NFS):
