10 changes: 10 additions & 0 deletions software/software-news.rst
@@ -8,6 +8,16 @@ most recent changes are listed first.

----

Frontier: hardware counter daemon (July 29, 2025)
------------------------------------------------

On July 29, 2025, a ROCm Profiler (``rocprofiler``) based daemon that automatically samples GPU hardware counters from a subset of compute nodes will be enabled by default for all jobs larger than 1800 nodes.
The daemon can be explicitly enabled or disabled with the ``--gpu-counters`` flag to ``sbatch``.
To explicitly enable the rocprofiler-based counter collection daemon, set ``--gpu-counters=1``.
To explicitly disable the daemon, set ``--gpu-counters=0``.
See :ref:`frontier-slurm-plugins` for more information.


Frontier: Core Module (March 18, 2025)
------------------------------------------------

26 changes: 26 additions & 0 deletions systems/frontier_user_guide.rst
@@ -1239,6 +1239,32 @@ The table below summarizes options for submitted jobs. Unless otherwise noted, t
| ``-q`` | ``#SBATCH -q debug`` | Request a "Quality of Service" (QOS) for the job. (default is ``normal``) |
+------------------------+--------------------------------------------+--------------------------------------------------------------------------------------+

.. _frontier-slurm-plugins:

OLCF Custom Slurm Plugins
-------------------------

In addition to the common Slurm flags above, OLCF maintains several plugins to Slurm that provide additional options to the user.
The following options are available as command-line parameters to ``sbatch`` or as ``#SBATCH`` pragmas in a job script:

.. table::
    :widths: 15 28 57

    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
    | Option                 | Example Usage                              | Description                                                                           |
    +========================+============================================+=======================================================================================+
    | ``--gpu-srange``       | ``#SBATCH --gpu-srange=800-1700``          | Sets the GPU sclk range in MHz (default: 500-1700)                                    |
    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
    | ``--gpu-power-cap``    | ``#SBATCH --gpu-power-cap=500``            | Sets the GPU power cap in watts (default: 560)                                        |
    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
    | ``--gpu-counters``     | ``#SBATCH --gpu-counters=0``               | When set to 1 (default), enables a rocprofiler-based daemon that automatically        |
    |                        |                                            | samples GPU hardware counters from a subset of nodes in a compute job, when the job   |
    |                        |                                            | size is greater than 1882 nodes. The resulting profiling data may be made available   |
    |                        |                                            | upon request; provide the job ID to OLCF user support.                                |
    |                        |                                            | Setting ``--gpu-counters=0`` disables this feature.                                   |
    |                        |                                            | This feature is not available for jobs of 1882 nodes or fewer at this time.           |
    +------------------------+--------------------------------------------+---------------------------------------------------------------------------------------+
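
As a minimal sketch of how one of these options might be used, the batch script below opts a large job out of the counter-sampling daemon.
The project ID ``abc123``, the node count, the time limit, the script name ``job.sh``, and the executable ``./my_app`` are placeholders chosen for illustration, not recommended values:

.. code-block:: bash

    #!/bin/bash
    #SBATCH -A abc123              # placeholder project ID
    #SBATCH -N 2000                # placeholder node count
    #SBATCH -t 00:30:00            # placeholder time limit
    #SBATCH --gpu-counters=0       # disable the rocprofiler-based counter daemon

    srun ./my_app                  # placeholder executable

The same option can instead be passed on the command line at submission time, for example ``sbatch --gpu-counters=0 job.sh``.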


Slurm Environment Variables
---------------------------