@sjpb sjpb commented Feb 24, 2021

Currently the node definitions are constructed using Ansible facts. In at least some situations the result does not appear to fully satisfy Slurm, e.g. slurmd -C shows ... Boards=1 ... and nodes are being set DOWN.

This PR runs slurmd -C on all compute nodes, then uses the values from the first node in the play for each partition (in accordance with the existing logic) to provide the node definitions.
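As a rough illustration of the approach described above (not the code in this PR, which is Ansible-based), the Python sketch below parses the first line of `slurmd -C` output and builds one node definition per partition from the first node listed. The function names, the `compute-0`/`compute-1` hostnames and the sample output are all hypothetical, and the sketch assumes the output's first line is a space-separated list of `Key=Value` pairs.

```python
def parse_slurmd_c(output: str) -> dict[str, str]:
    """Parse the first line of `slurmd -C` output into a dict.

    Assumes that line is space-separated Key=Value pairs; any later
    lines (e.g. UpTime=...) are ignored.
    """
    first_line = output.strip().splitlines()[0]
    return dict(token.split("=", 1) for token in first_line.split())


def node_definition(nodes: list[tuple[str, str]]) -> str:
    """Build one slurm.conf node line for a partition.

    `nodes` is a list of (hostname, slurmd -C output) in play order.
    Only the first node's values are used (Trust On First Use).
    """
    hostnames = [name for name, _ in nodes]
    params = parse_slurmd_c(nodes[0][1])
    params.pop("NodeName", None)  # replaced by the full host list below
    fields = " ".join(f"{key}={value}" for key, value in params.items())
    return f"NodeName={','.join(hostnames)} {fields}"


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "NodeName=compute-0 CPUs=4 Boards=1 SocketsPerBoard=1 "
    "CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7821\n"
    "UpTime=0-01:02:03\n"
)

if __name__ == "__main__":
    print(node_definition([("compute-0", SAMPLE), ("compute-1", SAMPLE)]))
```

Because only the first node's output is consulted, a later node with different hardware would get the same definition, which is the trade-off the Trust On First Use description refers to.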

This is a form of Trust On First Use: it assumes that the node configuration reported at that point is in fact correct.

An alternative is to specify only NodeName and not the expected CPU parameters at all:

Only the NodeName must be supplied in the configuration file. All other node configuration information is optional.

This would have two disadvantages:

  • Slurm cannot detect node misconfiguration
  • Scheduling is slower:

    Establishing baseline configurations will also speed Slurm's scheduling process by permitting it to compare job requirements against these (relatively few) configuration parameters and possibly avoid having to check job requirements against every individual node's configuration. The resources checked at node registration time are: CPUs, RealMemory and TmpDisk.

Quotes from https://slurm.schedmd.com/slurm.conf.html.
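To make the first disadvantage concrete, here is a simplified Python sketch of the kind of comparison a baseline definition makes possible. It is not Slurm's implementation: the function name is hypothetical, and it assumes the check is "the node reports less than the configured value" for the three resources named in the quote above.

```python
# Resources the quoted documentation says are checked at registration.
CHECKED = ("CPUs", "RealMemory", "TmpDisk")


def under_provisioned(configured: dict[str, int],
                      reported: dict[str, int]) -> list[str]:
    """Return checked resources where the node reports less than configured.

    A resource missing from `configured` cannot be checked, which is
    why a NodeName-only definition cannot reveal misconfiguration.
    """
    return [
        key for key in CHECKED
        if key in configured and reported.get(key, 0) < configured[key]
    ]


if __name__ == "__main__":
    reported = {"CPUs": 4, "RealMemory": 16000, "TmpDisk": 0}
    # With a baseline, the CPU shortfall is detected.
    print(under_provisioned({"CPUs": 8, "RealMemory": 16000}, reported))
    # With a NodeName-only definition there is nothing to compare against.
    print(under_provisioned({}, reported))
```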

@sjpb sjpb marked this pull request as draft August 25, 2021 15:42