####################
Parallel Inference
####################

.. contents:: Table of Contents
   :local:
   :depth: 2

If the memory requirements of your model are too large to fit within a
single device, you can run Anemoi-Inference in parallel across multiple
devices. The parallel runner distributes the model across devices and
coordinates inference execution.

***************
Prerequisites
***************

Parallel inference requires a certain minimum version of Anemoi-models
>= v0.4.2. If this breaks your checkpoints, you could cherry-pick `the
relevant PR <https://github.com/ecmwf/anemoi-core/pull/77>`_ into your
old version of Anemoi-Models.
Parallel inference requires:

- Anemoi-Models >= v0.4.2 (for model parallelism support)
- Multiple devices available on your system or cluster

.. note::

   If updating to Anemoi-Models v0.4.2 breaks your existing checkpoints,
   you can cherry-pick `the relevant PR
   <https://github.com/ecmwf/anemoi-core/pull/77>`_ into your old
   version of Anemoi-Models.

***************
Configuration
***************

To run in parallel, add ``runner: parallel`` to your inference config
file. The parallel runner will automatically detect your cluster
environment (Slurm, MPI, torchrun, etc.) and configure itself
accordingly.

Basic Configuration
===================

For environments with automatic cluster detection (Slurm, MPI,
torchrun), a minimal configuration is sufficient:

.. code:: yaml

   checkpoint: /path/to/inference-last.ckpt
   lead_time: 60
   runner: parallel

   input:
      grib: /path/to/input.grib
   output:
      grib: /path/to/output.grib

Supported Cluster Types
=======================

The following cluster types are automatically detected:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   - - Cluster Type
     - Detection Method
     - Environment Variables Used

   - - **Slurm**
     - Presence of ``SLURM_NTASKS`` and ``SLURM_JOB_NAME``
     - ``SLURM_PROCID``, ``SLURM_LOCALID``, ``SLURM_NTASKS``,
       ``SLURM_NODELIST``

   - - **MPI**
     - Presence of ``OMPI_COMM_WORLD_SIZE`` or ``PMI_SIZE``
     - ``OMPI_COMM_WORLD_RANK``, ``OMPI_COMM_WORLD_LOCAL_RANK``,
       ``OMPI_COMM_WORLD_SIZE``

   - - **Distributed (torchrun)**
     - Presence of ``RANK`` and ``LOCAL_RANK``
     - ``RANK``, ``LOCAL_RANK``, ``WORLD_SIZE``, ``MASTER_ADDR``,
       ``MASTER_PORT``

Manual Cluster Configuration
============================

If you are running in an environment without automatic detection, use
the manual cluster
(:class:`anemoi.inference.clusters.manual.ManualCluster`) by specifying
the cluster as ``manual`` and the ``world_size`` (number of devices):

.. code:: yaml

   checkpoint: /path/to/inference-last.ckpt
   lead_time: 60
   runner:
      parallel:
         cluster:
            manual: 4 # Use 4 devices

   input:
      grib: /path/to/input.grib
   output:
      grib: /path/to/output.grib

.. warning::

   The ``world_size`` cannot exceed the number of available devices on
   your system.

Custom Cluster Mapping
======================

Additionally, if you have a custom cluster environment, you can specify
your own environment variable mapping:

.. code:: yaml

   checkpoint: /path/to/inference-last.ckpt
   lead_time: 60
   runner:
      parallel:
         cluster:
            custom:
               mapping:
                  local_rank: LOCAL_RANK_ENV_VAR
                  global_rank: GLOBAL_RANK_ENV_VAR
                  world_size: WORLD_SIZE_ENV_VAR
                  master_addr: MASTER_ADDR_ENV_VAR
                  master_port: MASTER_PORT_ENV_VAR
               init_method: env://

   input:
      grib: /path/to/input.grib
   output:
      grib: /path/to/output.grib

Base Runner
===========

By default, the `parallel` runner inherits from the `default` runner
(:class:`anemoi.inference.runners.default.DefaultRunner`). If you want
to run a different runner in parallel, you can pass the ``base_runner``
option:
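For example (a sketch; the accepted values depend on the runners
registered in your installation):

.. code:: yaml

   runner:
      parallel:
         base_runner: default
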
Any additional options passed to the `parallel` runner will be forwarded
to the ``base_runner``.

*******************************
Running Inference in Parallel
*******************************

Once you have configured ``runner: parallel`` in your config file, you
can launch parallel inference by calling ``anemoi-inference run
config.yaml`` as normal.

If you are using a cluster manager like Slurm or MPI, you must launch
your job using the appropriate launcher (``srun``, ``mpirun``, etc). See
the examples below.

Parallel with Slurm
===================

Below is an example SLURM batch script to launch a parallel inference
job across 4 GPUs.

.. code:: bash

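   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --gpus-per-node=4
   #SBATCH --cpus-per-task=8
   #SBATCH --time=0:05:00
   #SBATCH --output=outputs/parallel_inf.%j.out

   # Sketch of a typical launch; adjust paths and resource requests to
   # your system.
   source /path/to/venv/bin/activate

   srun anemoi-inference run parallel.yaml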

.. warning::

   If you specify ``runner: parallel`` but don't launch with ``srun``,
   your anemoi-inference job may hang as only 1 process will be
   launched.

.. note::

   By default, anemoi-inference will determine your system's master
   address and port automatically. If this fails (e.g., when running
   inside a container), you can set these values manually via
   environment variables in your SLURM batch script:

   .. code:: bash

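      # Derive the master address from the Slurm node list (a sketch,
      # mirroring the MPI example below)
      export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)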
      export MASTER_PORT=$((10000 + RANDOM % 10000))

      srun anemoi-inference run parallel.yaml

Parallel with MPI
=================

To run parallel inference with MPI, use ``mpirun`` or ``mpiexec`` to
launch your job:

.. code:: bash

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --gpus-per-node=4
   #SBATCH --cpus-per-task=8
   #SBATCH --time=0:05:00
   #SBATCH --output=outputs/parallel_inf_mpi.%j.out

   source /path/to/venv/bin/activate

   # Set master address and port for communication
   MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
   export MASTER_ADDR=$(nslookup $MASTER_ADDR | grep -oP '(?<=Address: ).*')
   export MASTER_PORT=29500

   mpirun -np 4 anemoi-inference run parallel.yaml

.. note::

   If your PyTorch build supports it (PyTorch must be compiled from
   source with MPI support to use the MPI backend for
   ``torch.distributed``), you can use the ``mpi`` torch backend by
   configuring:

   .. code:: yaml

      runner:
         parallel:
            cluster:
               mpi:
                  use_mpi_backend: true

Parallel with torchrun
======================

For environments without a cluster manager, you can use PyTorch's
``torchrun`` utility:

.. code:: bash

   #!/bin/bash

   source /path/to/venv/bin/activate

   torchrun --nproc_per_node=4 \
            --nnodes=1 \
            --node_rank=0 \
            --master_addr=localhost \
            --master_port=29500 \
            $(which anemoi-inference) run parallel.yaml

.. note::

   When using ``torchrun``, the distributed environment variables
   (``RANK``, ``LOCAL_RANK``, ``WORLD_SIZE``, etc.) are automatically
   set by torchrun.

***********************
Environment Variables
***********************

The following environment variables can be used to customise parallel
inference:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   - - Environment Variable
     - Description

   - - ``ANEMOI_BASE_SEED``
     - Base seed for reproducible inference. Will be broadcast from
       rank 0 to all ranks. Values < 1000 are automatically multiplied
       by 1000.
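
For example, to fix the base seed for a parallel run under Slurm (the
value ``42`` is arbitrary and purely illustrative):

.. code:: bash

   export ANEMOI_BASE_SEED=42
   srun anemoi-inference run parallel.yaml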

*****************
Troubleshooting
*****************

Common Issues
=============

.. list-table::
   :header-rows: 1
   :widths: 30 70

   - - Issue
     - Solution

   - - **Job hangs indefinitely**
     - Ensure you're launching with the appropriate launcher
       (``srun``, ``mpirun``, ``torchrun``). Check that the number of
       processes matches your configuration.

   - - **"No suitable cluster found" error**
     - Add explicit cluster configuration using ``cluster: manual`` or
       verify your environment variables are set correctly.

   - - **Version compatibility error**
     - Upgrade to Anemoi-Models >= v0.4.2 or cherry-pick the `parallel
       inference PR <https://github.com/ecmwf/anemoi-core/pull/77>`_.

   - - **CUDA out of memory**
     - Increase the number of devices (``world_size``) to distribute
       the model across more devices, or increase the chunking with
       ``ANEMOI_INFERENCE_NUM_CHUNKS`` (see the example after this
       table).

   - - **Port already in use**
     - Set ``MASTER_PORT`` to a different port number, or let Slurm
       auto-generate one.

   - - **Communication timeout**
     - Check firewall settings and ensure all nodes can communicate.
       Verify ``MASTER_ADDR`` is accessible from all ranks.
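
For example, a sketch of raising the chunking before launching (the
value ``16`` is arbitrary; tune it to your memory budget):

.. code:: bash

   export ANEMOI_INFERENCE_NUM_CHUNKS=16
   srun anemoi-inference run parallel.yaml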

Verification Checklist
======================

Before running parallel inference, verify:

#. ✓ Anemoi-Models version >= v0.4.2
#. ✓ Multiple devices available (``nvidia-smi`` or equivalent)
#. ✓ Configuration includes ``runner: parallel``
#. ✓ Using appropriate launcher (``srun``, ``mpirun``, or ``torchrun``)
#. ✓ Number of processes matches available devices
#. ✓ Network connectivity between nodes (multi-node only)
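
A quick way to run through some of these checks from a shell (a
sketch; adapt the commands to your system):

.. code:: bash

   nvidia-smi -L                           # devices visible on this node
   pip show anemoi-models | grep Version   # should be >= 0.4.2
   grep -n "runner" config.yaml            # confirm "runner: parallel"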

Expected Output
===============

When parallel inference runs successfully, you should see log messages
indicating:

- Cluster type detected (e.g., "Using compute client: SlurmCluster")
- Rank information (e.g., "rank00", "rank01", etc.)
- Model loading on each rank
- Inference progress from rank 0 (master)

Only rank 0 produces output files; other ranks assist with computation.