From 6c6a1903716677ff2753e2f107be8bf7525065f0 Mon Sep 17 00:00:00 2001
From: Nick Hagerty
Date: Mon, 28 Jul 2025 11:47:48 -0400
Subject: [PATCH 1/5] Adding documentation of Slurm plugins

---
 software/software-news.rst      | 10 ++++++++++
 systems/frontier_user_guide.rst | 25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/software/software-news.rst b/software/software-news.rst
index 2db3af20..f6437d45 100644
--- a/software/software-news.rst
+++ b/software/software-news.rst
@@ -8,6 +8,16 @@ most recent changes are listed first.
 
 ----
 
+Frontier: hardware counter daemon (July 29, 2025)
+------------------------------------------------
+
+On July 29, 2025, a ROCm Profiler (``rocprofiler``) based daemon that automatically samples GPU hardware counters from a subset of compute nodes will be enabled by default for all jobs > 1800 nodes.
+This can be explicitly enabled/disabled via the ``--gpu-counters`` flag to ``sbatch``.
+To explicitly enable the rocprofiler-based counter collection daemon, set ``--gpu-counters=1``.
+To explicitly disable the daemon, set ``--gpu-counters=0``.
+See :ref:`frontier-slurm-plugins` for more information.
+
+
 Frontier: Core Module (March 18, 2025)
 ------------------------------------------------
 
diff --git a/systems/frontier_user_guide.rst b/systems/frontier_user_guide.rst
index fb2b6ed3..a496c7b7 100644
--- a/systems/frontier_user_guide.rst
+++ b/systems/frontier_user_guide.rst
@@ -1239,6 +1239,31 @@ The table below summarizes options for submitted jobs. Unless otherwise noted, t
 | ``-q``                 | ``#SBATCH -q debug``                       | Request a "Quality of Service" (QOS) for the job. (default is ``normal``)            |
 +------------------------+--------------------------------------------+--------------------------------------------------------------------------------------+
 
+.. _frontier_slurm_plugins:
+
+OLCF Custom Slurm Plugins
+-------------------------
+
+In addition to the common Slurm flags above, OLCF maintains several plugins to Slurm that provide additional options to the user.
+The following options are available as command-line parameters to ``sbatch`` or as ``#SBATCH`` pragmas in a job script:
+
+.. table::
+    :widths: 15 28 57
+
+    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
+    | Option                 | Example Usage                              | Description                                                                           |
+    +========================+============================================+=======================================================================================+
+    | ``--gpu-srange``       | ``#SBATCH --gpu-srange=800-1700``          | Sets the GPU sclk range in MHz (default: 500-1700)                                    |
+    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
+    | ``--gpu-power-cap``    | ``#SBATCH --gpu-power-cap=500``            | Sets the GPU power cap in Watts (default: 560)                                        |
+    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
+    | ``--gpu-counters``     | ``#SBATCH --gpu-counters=0``               | When set to 1 (default), enables a rocprofiler-based daemon that automatically samples|
+    |                        |                                            | GPU hardware counters from a subset of nodes in a compute job, when the job size is   |
+    |                        |                                            | greater than 1800 nodes. The resulting profiling data may be made available upon      |
+    |                        |                                            | request. Please provide the requested job ID to help@olcf.ornl.gov.                   |
+    |                        |                                            | Setting ``gpu-counters=0`` disables this feature.                                     |
+    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
+
 Slurm Environment Variables
 ---------------------------
 
From 5a5b7e0f36232e087ce4c21dda1a8b4c08d50a8f Mon Sep 17 00:00:00 2001
From: Nick Hagerty
Date: Mon, 28 Jul 2025 11:56:31 -0400
Subject: [PATCH 2/5] Fix typo in ref name

---
 systems/frontier_user_guide.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/systems/frontier_user_guide.rst b/systems/frontier_user_guide.rst
index a496c7b7..93a13a81 100644
--- a/systems/frontier_user_guide.rst
+++ b/systems/frontier_user_guide.rst
@@ -1239,7 +1239,7 @@ The table below summarizes options for submitted jobs. Unless otherwise noted, t
 | ``-q``                 | ``#SBATCH -q debug``                       | Request a "Quality of Service" (QOS) for the job. (default is ``normal``)            |
 +------------------------+--------------------------------------------+--------------------------------------------------------------------------------------+
 
-.. _frontier_slurm_plugins:
+.. _frontier-slurm-plugins:
 
 OLCF Custom Slurm Plugins
 -------------------------
From 79218e18d54212d2c57fd02b082a7e4877e6bfd1 Mon Sep 17 00:00:00 2001
From: Nick Hagerty
Date: Tue, 29 Jul 2025 16:14:37 -0400
Subject: [PATCH 3/5] Added blurb for gpu-counters and SHS parameter change

---
 systems/frontier_user_guide.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/systems/frontier_user_guide.rst b/systems/frontier_user_guide.rst
index 93a13a81..ba2a9b8a 100644
--- a/systems/frontier_user_guide.rst
+++ b/systems/frontier_user_guide.rst
@@ -3888,6 +3888,12 @@ If it is necessary to have bit-wise reproducible results from these libraries, i
 System Updates
 ==============
 
+2025-07-29
+----------
+On Tuesday, July 29, 2025, Frontier's Slingshot Host Software 12.0.1 was patched to adjust a parameter known to contribute to a recent regression in the performance of ``MPI_Alltoall`` at full-system scale (>8K nodes).
+Additionally, the ``--gpu-counters`` flag to ``sbatch`` was enabled by default for all jobs >1800 nodes.
+Please see `Software News `_ for further information about this feature.
+
 2025-06-17
 ----------
 On Tuesday, June 17, 2025, Frontier's system software was upgraded.
From 5bed39073c2bc2101c20980982e28fad2e224f0c Mon Sep 17 00:00:00 2001
From: Nick Hagerty
Date: Wed, 30 Jul 2025 08:59:39 -0400
Subject: [PATCH 4/5] Revert "Added blurb for gpu-counters and SHS parameter change"

This reverts commit 79218e18d54212d2c57fd02b082a7e4877e6bfd1.
---
 systems/frontier_user_guide.rst | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/systems/frontier_user_guide.rst b/systems/frontier_user_guide.rst
index ba2a9b8a..93a13a81 100644
--- a/systems/frontier_user_guide.rst
+++ b/systems/frontier_user_guide.rst
@@ -3888,12 +3888,6 @@ If it is necessary to have bit-wise reproducible results from these libraries, i
 System Updates
 ==============
 
-2025-07-29
-----------
-On Tuesday, July 29, 2025, Frontier's Slingshot Host Software 12.0.1 was patched to adjust a parameter known to contribute to a recent regression in the performance of ``MPI_Alltoall`` at full-system scale (>8K nodes).
-Additionally, the ``--gpu-counters`` flag to ``sbatch`` was enabled by default for all jobs >1800 nodes.
-Please see `Software News `_ for further information about this feature.
-
 2025-06-17
 ----------
 On Tuesday, June 17, 2025, Frontier's system software was upgraded.
From 67c406c2608e941d44ddb54cab9758184a2d64f1 Mon Sep 17 00:00:00 2001
From: Nick Hagerty
Date: Wed, 30 Jul 2025 09:31:06 -0400
Subject: [PATCH 5/5] Updating language to reflect current implementation focusing on leadership-class jobs

---
 systems/frontier_user_guide.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/systems/frontier_user_guide.rst b/systems/frontier_user_guide.rst
index 93a13a81..7f3a1ba8 100644
--- a/systems/frontier_user_guide.rst
+++ b/systems/frontier_user_guide.rst
@@ -1259,9 +1259,10 @@ The following options are available as command-line parameters to ``sbatch`` or
     +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
     | ``--gpu-counters``     | ``#SBATCH --gpu-counters=0``               | When set to 1 (default), enables a rocprofiler-based daemon that automatically samples|
     |                        |                                            | GPU hardware counters from a subset of nodes in a compute job, when the job size is   |
-    |                        |                                            | greater than 1800 nodes. The resulting profiling data may be made available upon      |
+    |                        |                                            | greater than 1882 nodes. The resulting profiling data may be made available upon      |
     |                        |                                            | request. Please provide the requested job ID to help@olcf.ornl.gov.                   |
     |                        |                                            | Setting ``gpu-counters=0`` disables this feature.                                     |
+    |                        |                                            | This feature is not available for jobs <= 1882 nodes at this time.                    |
     +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+