---
author: [Richard]
date: 2025-09-10
slug: EESSI-on-Cray-Slingshot
---

# EESSI on Cray system with Slingshot 11

High-performance computing environments are constantly evolving, and keeping pace with the latest interconnect technologies is crucial for maximizing application performance. HPE Cray Slingshot 11 with CXI (Cassini eXascale Interconnect) represents a significant advancement in HPC networking, offering improved bandwidth, lower latency, and better scalability for exascale computing workloads.

In this blog post, we walk through building OpenMPI 5.x with Slingshot 11 CXI support on a Cray system and integrating it with EESSI via `host_injections`. This lets us override EESSI's default MPI library with an ABI-compatible, Slingshot-optimized build while leaving the rest of the software stack untouched. The post concludes with test results validating this setup.

<!-- more -->

## The Challenge

EESSI provides a comprehensive software stack, but specialized interconnect support such as Slingshot 11 CXI requires custom-built libraries that aren't yet available in the standard EESSI distribution. Our goals are to:

1. Build OpenMPI 5.x with native Slingshot 11 CXI support
2. Create ABI-compatible replacements for EESSI's OpenMPI libraries (most importantly `libmpi.so.40`)
3. Support both x86_64 AMD CPU partitions and NVIDIA Grace CPU partitions with Hopper accelerators
4. Avoid dependency on system packages where possible

The main technical challenge is building the complete dependency chain on top of EESSI, since many of the low-level libraries required for CXI support do not exist in the current EESSI stack.

## System Architecture

Our target system consists of two distinct partitions:

- **Partition 1**: x86_64 AMD CPUs without accelerators
- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators

For the Grace/Hopper partition we additionally needed to enable CUDA support in libfabric, the library that provides the Slingshot CXI provider, so that MPI communication can work directly on GPU memory.
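
Later on, a quick way to check whether a given libfabric build actually exposes the CXI provider (and, on the GPU partition, CUDA support) is libfabric's `fi_info` utility; the exact output depends on the libfabric version and system configuration:

```
# Check that the CXI provider is available in the libfabric build in use
fi_info -p cxi

# On the Grace/Hopper partition, CUDA (FI_HMEM) support should also show up
fi_info -p cxi -v | grep -i hmem
```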

## Building the Dependency Chain

### Building Strategy

Rather than relying on Cray-provided system packages, we opted to build all dependencies from source on top of EESSI. This approach provides several advantages:

- **Consistency**: All libraries built with the same compiler toolchain
- **Compatibility**: Ensures ABI compatibility with EESSI libraries
- **Control**: Full control over build configurations and optimizations
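
Concretely, "building on top of EESSI" means initializing the EESSI environment and compiling everything with the compiler toolchain that EESSI itself provides, so the injected libraries match what EESSI's own OpenMPI was built with. A minimal sketch (module version shown for the x86_64 partition; the install prefix is site-specific):

```
# Initialize EESSI and load the toolchain used by EESSI's OpenMPI
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load GCC/12.3.0

# All dependencies are then configured, built and installed with this
# compiler under a site-local prefix, e.g. /cluster/installations/eessi
```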

### Required Dependencies

To build OpenMPI 5.x with CXI support, we needed the following dependencies that are missing from EESSI (a rough build sketch follows the list):

1. **libuv** - Asynchronous I/O library
2. **libnl** - Netlink library for network configuration
3. **libconfig** - Library designed for processing structured configuration files
4. **libfuse** - Filesystem in Userspace library
5. **libpdap** - Performance Data Access Protocol library
6. **shs-libcxi** - Slingshot CXI library
7. **lm-sensors** - Monitoring tools and drivers
8. **libfabric 2.x** - OpenFabrics Interfaces library with CXI provider
9. **OpenMPI 5.x** - The final MPI implementation
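
The exact build recipes are site-specific, but the two key steps are building libfabric 2.x with the CXI provider enabled (plus CUDA support on the GPU partition) and then building OpenMPI 5.x against that libfabric. A rough sketch of the relevant configure options; the prefixes, versions and paths below are placeholders, not the exact commands we used:

```
# libfabric 2.x with the CXI provider (point it at the shs-libcxi install);
# --with-cuda is only needed on the Grace/Hopper partition
./configure --prefix=$PREFIX/libfabric \
            --enable-cxi=$PREFIX/shs-libcxi \
            --with-cuda=$EBROOTCUDA
make -j16 && make install

# OpenMPI 5.x built against that libfabric, reusing EESSI's PMIx and hwloc
./configure --prefix=$PREFIX/OpenMPI \
            --with-ofi=$PREFIX/libfabric \
            --with-pmix=$EBROOTPMIX \
            --with-hwloc=$EBROOTHWLOC
make -j16 && make install
```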

## EESSI Integration via `host_injections`

EESSI's [host_injections](../../../../site_specific_config/host_injections.md) mechanism allows us to override EESSI's MPI library with an ABI-compatible host MPI. Because EESSI software is linked with RPATH, the override works through a dedicated `rpath_overrides` directory under `host_injections` that is searched before EESSI's own OpenMPI, so the rest of the software stack keeps working unchanged.
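
How the custom libraries end up in that location is up to the site: `host_injections` is typically a symlink or bind mount to a writable local directory, into which the relevant shared libraries are linked. A hedged sketch (the target path matches the Grace partition used below; the source paths are placeholders):

```
# rpath_overrides location searched before EESSI's own OpenMPI libraries
OVERRIDE=/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib
mkdir -p "$OVERRIDE"

# Expose the ABI-compatible OpenMPI 5.x and the CXI-enabled libfabric
ln -s /path/to/local/OpenMPI/5.x/lib/libmpi.so.40 "$OVERRIDE"/
ln -s /path/to/local/libfabric/lib/libfabric.so.1 "$OVERRIDE"/
```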

*Validating that `libmpi.so.40` in `host_injections` on the Grace (aarch64) nodes resolves against the CXI-enabled libfabric and EESSI-provided dependencies*:
```
ldd /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib/libmpi.so.40
linux-vdso.so.1 (0x0000fffcfd1d0000)
libucc.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCC/1.2.0-GCCcore-12.3.0/lib64/libucc.so.1 (0x0000fffcfce50000)
libucs.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCX/1.14.1-GCCcore-12.3.0/lib64/libucs.so.0 (0x0000fffcfcde0000)
libnuma.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/numactl/2.0.16-GCCcore-12.3.0/lib64/libnuma.so.1 (0x0000fffcfcdb0000)
libucm.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCX/1.14.1-GCCcore-12.3.0/lib64/libucm.so.0 (0x0000fffcfcd70000)
libopen-pal.so.80 => /cluster/installations/eessi/default/aarch64/software/OpenMPI/5.0.7-GCC-12.3.0/lib/libopen-pal.so.80 (0x0000fffcfcc40000)
libfabric.so.1 => /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib/libfabric.so.1 (0x0000fffcfca50000)
librdmacm.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/librdmacm.so.1 (0x0000fffcfca10000)
libefa.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libefa.so.1 (0x0000fffcfc9e0000)
libibverbs.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libibverbs.so.1 (0x0000fffcfc9a0000)
libcxi.so.1 => /cluster/installations/eessi/default/aarch64/software/shs-libcxi/1.7.0-GCCcore-12.3.0/lib64/libcxi.so.1 (0x0000fffcfc960000)
libcurl.so.4 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libcurl.so.4 (0x0000fffcfc8a0000)
libjson-c.so.5 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/json-c/0.16-GCCcore-12.3.0/lib64/libjson-c.so.5 (0x0000fffcfc870000)
libatomic.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib64/libatomic.so.1 (0x0000fffcfc840000)
libcudart.so.12 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software/CUDA/12.1.1/lib64/libcudart.so.12 (0x0000fffcfc780000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x0000fffcf97d0000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x0000fffcf8980000)
libnl-route-3.so.200 => /cluster/installations/eessi/default/aarch64/software/libnl/3.11.0-GCCcore-12.3.0/lib64/libnl-route-3.so.200 (0x0000fffcf88d0000)
libnl-3.so.200 => /cluster/installations/eessi/default/aarch64/software/libnl/3.11.0-GCCcore-12.3.0/lib64/libnl-3.so.200 (0x0000fffcf8890000)
libpmix.so.2 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/PMIx/4.2.4-GCCcore-12.3.0/lib64/libpmix.so.2 (0x0000fffcf8690000)
libevent_core-2.1.so.7 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libevent/2.1.12-GCCcore-12.3.0/lib64/libevent_core-2.1.so.7 (0x0000fffcf8630000)
libevent_pthreads-2.1.so.7 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libevent/2.1.12-GCCcore-12.3.0/lib64/libevent_pthreads-2.1.so.7 (0x0000fffcf8600000)
libhwloc.so.15 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/hwloc/2.9.1-GCCcore-12.3.0/lib64/libhwloc.so.15 (0x0000fffcf8580000)
libpciaccess.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libpciaccess/0.17-GCCcore-12.3.0/lib64/libpciaccess.so.0 (0x0000fffcf8550000)
libxml2.so.2 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libxml2/2.11.4-GCCcore-12.3.0/lib64/libxml2.so.2 (0x0000fffcf83e0000)
libz.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libz.so.1 (0x0000fffcf83a0000)
liblzma.so.5 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/liblzma.so.5 (0x0000fffcf8330000)
libm.so.6 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libm.so.6 (0x0000fffcf8280000)
libc.so.6 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6 (0x0000fffcf80e0000)
/lib/ld-linux-aarch64.so.1 (0x0000fffcfd1e0000)
libcares.so.2 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libcares.so.2 (0x0000fffcf80a0000)
libnghttp2.so.14 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libnghttp2.so.14 (0x0000fffcf8050000)
libssl.so.1.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/OpenSSL/1.1/lib64/libssl.so.1.1 (0x0000fffcf7fb0000)
libcrypto.so.1.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/OpenSSL/1.1/lib64/libcrypto.so.1.1 (0x0000fffcf7d10000)
libdl.so.2 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libdl.so.2 (0x0000fffcf7ce0000)
libpthread.so.0 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libpthread.so.0 (0x0000fffcf7cb0000)
librt.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/librt.so.1 (0x0000fffcf7c80000)
```
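
To double-check which MPI libraries an application actually picks up at run time, `lddtree` (available in the EESSI compatibility layer) can be run on a binary, for example one of the OSU benchmarks used below:

```
lddtree $(which osu_latency)
```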

### Testing
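
The benchmarks below were run across two nodes of each partition. A typical launch looks roughly like the following sketch (assuming a Slurm-based system; the exact launcher flags are site-specific), with the GPU test measuring device-to-device bandwidth via `osu_bibw -d cuda D D`:

```
# CPU partition: bi-directional bandwidth between two nodes
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a
srun -N 2 -n 2 --ntasks-per-node=1 osu_bibw

# Grace/Hopper partition: device-to-device bandwidth between two GPUs
module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
srun -N 2 -n 2 --ntasks-per-node=1 osu_bibw -d cuda D D
```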

**Test 1: OSU-Micro-Benchmarks bi-directional bandwidth (`osu_bibw`) across 2 nodes (x86_64 AMD CPUs)**:
```
Environment set up to use EESSI (2023.06), have fun!
hostname:
x1001c6s2b0n1
x1001c6s3b0n0

CPU info:
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9745 128-Core Processor
Virtualization: AMD-V

Currently Loaded Modules:
1) GCCcore/12.3.0
2) GCC/12.3.0
3) numactl/2.0.16-GCCcore-12.3.0
4) libxml2/2.11.4-GCCcore-12.3.0
5) libpciaccess/0.17-GCCcore-12.3.0
6) hwloc/2.9.1-GCCcore-12.3.0
7) OpenSSL/1.1
8) libevent/2.1.12-GCCcore-12.3.0
9) UCX/1.14.1-GCCcore-12.3.0
10) libfabric/1.18.0-GCCcore-12.3.0
11) PMIx/4.2.4-GCCcore-12.3.0
12) UCC/1.2.0-GCCcore-12.3.0
13) OpenMPI/4.1.5-GCC-12.3.0
14) gompi/2023a
15) OSU-Micro-Benchmarks/7.1-1-gompi-2023a

# OSU MPI Bi-Directional Bandwidth Test v7.1
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 2.87
2 5.77
4 11.55
8 23.18
16 46.27
32 92.64
64 185.21
128 369.03
256 743.08
512 1487.21
1024 2975.75
2048 5928.14
4096 11809.66
8192 23097.44
16384 31009.54
32768 36493.20
65536 40164.63
131072 43150.62
262144 45075.57
524288 45918.07
1048576 46313.37
2097152 46507.25
4194304 46609.10
```

**Test 2: OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 device-to-device bandwidth (`osu_bibw -d cuda D D`) across 2 nodes (Grace CPUs with Hopper GPUs)**:
```
Environment set up to use EESSI (2023.06), have fun!

hostname:
x1000c4s4b1n0
x1000c5s3b0n0

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
1) GCCcore/13.2.0
2) GCC/13.2.0
3) numactl/2.0.16-GCCcore-13.2.0
4) libxml2/2.11.5-GCCcore-13.2.0
5) libpciaccess/0.17-GCCcore-13.2.0
6) hwloc/2.9.2-GCCcore-13.2.0
7) OpenSSL/1.1
8) libevent/2.1.12-GCCcore-13.2.0
9) UCX/1.15.0-GCCcore-13.2.0
10) libfabric/1.19.0-GCCcore-13.2.0
11) PMIx/4.2.6-GCCcore-13.2.0
12) UCC/1.2.0-GCCcore-13.2.0
13) OpenMPI/4.1.6-GCC-13.2.0
14) gompi/2023b
15) GDRCopy/2.4-GCCcore-13.2.0
16) UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0 (g)
17) NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0 (g)
18) UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0 (g)
19) OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 (g)

Where:
g: built for GPU

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.18
2 0.37
4 0.75
8 1.49
16 2.99
32 5.93
64 11.88
128 23.76
256 72.78
512 145.45
1024 282.03
2048 535.46
4096 1020.24
8192 16477.70
16384 25982.96
32768 30728.30
65536 37637.46
131072 41808.92
262144 44316.19
524288 43693.89
1048576 43759.66
2097152 43593.38
4194304 43436.60
```

## Conclusion

This approach demonstrates EESSI's flexibility in accommodating specialized interconnect hardware such as Slingshot 11 while preserving the benefits of a standardized software stack!