---
author: [Richard]
date: 2025-09-10
slug: EESSI-on-Cray-Slingshot
---

# EESSI on a Cray system with Slingshot 11

High-performance computing environments are constantly evolving, and keeping pace with the latest interconnect technologies is crucial for maximizing application performance. HPE Cray Slingshot 11 with CXI (Cassini eXascale Interconnect) represents a significant advancement in HPC networking, offering improved bandwidth, lower latency, and better scalability for exascale computing workloads.

In this blog post, we present the requirements for building OpenMPI 5.x with Slingshot 11 CXI support on a Cray system, and how we integrated it with EESSI using `host_injections`. This approach enables overriding EESSI's default MPI library with an ABI-compatible, Slingshot-optimized version. The post concludes with test results that validate the setup.

<!-- more -->

## The Challenge

EESSI provides a comprehensive software stack, but specialized interconnect support like Slingshot 11 CXI requires custom-built libraries that aren't yet available in the standard EESSI distribution. Our goal is to:

1. Build OpenMPI 5.x with native Slingshot 11 CXI support
2. Create ABI-compatible replacements for EESSI's MPI libraries
3. Support both the x86_64 AMD CPU partition and the NVIDIA Grace CPU partition with Hopper accelerators
4. Avoid dependency on system packages where possible

The main technical challenge is building the complete dependency chain on top of EESSI, as many of the libraries required for CXI support don't exist in the current EESSI stack.

## System Architecture

Our target system consists of two distinct partitions:

- **Partition 1**: x86_64 AMD CPUs without accelerators
- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators

For the Grace/Hopper partition we needed to enable CUDA support in libfabric.

## Building the Dependency Chain

### Building Strategy

Rather than relying on Cray-provided system packages, we opted to build all dependencies from source on top of EESSI. This approach provides several advantages:

- **Consistency**: All libraries built with the same compiler toolchain
- **Compatibility**: Ensures ABI compatibility with EESSI libraries
- **Control**: Full control over build configurations and optimizations

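
As a minimal sketch of what "building on top of EESSI" means in practice, each dependency could be configured in an environment like the one below. The initialisation path is EESSI's standard one, and `GCC/12.3.0` matches the toolchain of the modules shown later in this post; the exact build mechanism (manual configure or EasyBuild) is not essential to the approach.

```
# Initialise the EESSI software stack (requires the CernVM-FS client and the
# software.eessi.io repository to be available on the build node)
source /cvmfs/software.eessi.io/versions/2023.06/init/bash

# Build everything with the toolchain used by the EESSI 2023a generation,
# so the resulting libraries stay ABI-compatible with EESSI's own builds
module load GCC/12.3.0
```
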
### Required Dependencies

To build OpenMPI 5.x with CXI support, we needed the following missing dependencies (a build sketch for the final two steps follows the list):

1. **libuv** - Asynchronous I/O library
2. **libnl** - Netlink library for network configuration
3. **libconfig** - Library for processing structured configuration files
4. **libfuse** - Filesystem in Userspace library
5. **libpdap** - Performance Data Access Protocol library
6. **shs-libcxi** - Slingshot CXI library
7. **lm-sensors** - Hardware monitoring tools and drivers
8. **libfabric 2.x** - OpenFabrics Interfaces library with the CXI provider
9. **OpenMPI 5.x** - The final MPI implementation

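
To make the last two steps concrete, here is a rough sketch of how a CXI-enabled libfabric and an OpenMPI 5.x built against it might be configured. The install prefix (`PREFIX`), `CUDA_HOME`, and the way the CXI provider locates libcxi are placeholders/assumptions, the shs-libcxi path mirrors the one visible in the `ldd` output further below, and provider options can differ between libfabric releases, so treat this as an illustration rather than a recipe.

```
# libfabric 2.x with the CXI provider enabled (plus CUDA support for Grace/Hopper);
# point the build at our shs-libcxi installation via CPPFLAGS/LDFLAGS
LIBCXI_PREFIX=/cluster/installations/eessi/default/aarch64/software/shs-libcxi/1.7.0-GCCcore-12.3.0
./configure --prefix="${PREFIX}/libfabric" \
            --enable-cxi \
            --with-cuda="${CUDA_HOME}" \
            CPPFLAGS="-I${LIBCXI_PREFIX}/include" \
            LDFLAGS="-L${LIBCXI_PREFIX}/lib64"
make -j && make install

# OpenMPI 5.x built against that libfabric, with CUDA support on the GPU partition
./configure --prefix="${PREFIX}/OpenMPI" \
            --with-ofi="${PREFIX}/libfabric" \
            --with-cuda="${CUDA_HOME}"
make -j && make install
```
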
## EESSI Integration via `host_injections`

Because EESSI installations are RPATH-linked, we cannot simply place a different OpenMPI on the library search path. Instead, EESSI's [host_injections](../../../../site_specific_config/host_injections.md) mechanism allows us to override EESSI's MPI library with an ABI-compatible host MPI while maintaining compatibility with the rest of the software stack.

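
As a sketch of what the injection itself looks like: `host_injections` is a CernVM-FS variant symlink pointing to writable local storage, and EESSI looks for MPI overrides under a fixed `rpath_overrides` path per CPU target. Placing (or symlinking) the CXI-enabled libraries there is enough. The OpenMPI prefix below is the one visible in the `ldd` output that follows; the libfabric prefix is a placeholder.

```
# Directory that EESSI checks for an injected, ABI-compatible MPI
# (host_injections resolves to local, writable storage via a variant symlink)
OVERRIDE_LIB=/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib
mkdir -p "${OVERRIDE_LIB}"

# Inject the Slingshot-enabled OpenMPI 5.x and libfabric 2.x libraries
ln -sf /cluster/installations/eessi/default/aarch64/software/OpenMPI/5.0.7-GCC-12.3.0/lib/libmpi.so.40 "${OVERRIDE_LIB}/"
ln -sf "<libfabric-2.x-prefix>/lib/libfabric.so.1" "${OVERRIDE_LIB}/"
```
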
*Validating `libmpi.so.40` in `host_injections` on the ARM nodes*:
```
ldd /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib/libmpi.so.40

linux-vdso.so.1 (0x0000fffcfd1d0000)
libucc.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCC/1.2.0-GCCcore-12.3.0/lib64/libucc.so.1 (0x0000fffcfce50000)
libucs.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCX/1.14.1-GCCcore-12.3.0/lib64/libucs.so.0 (0x0000fffcfcde0000)
libnuma.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/numactl/2.0.16-GCCcore-12.3.0/lib64/libnuma.so.1 (0x0000fffcfcdb0000)
libucm.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCX/1.14.1-GCCcore-12.3.0/lib64/libucm.so.0 (0x0000fffcfcd70000)
libopen-pal.so.80 => /cluster/installations/eessi/default/aarch64/software/OpenMPI/5.0.7-GCC-12.3.0/lib/libopen-pal.so.80 (0x0000fffcfcc40000)
libfabric.so.1 => /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib/libfabric.so.1 (0x0000fffcfca50000)
librdmacm.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/librdmacm.so.1 (0x0000fffcfca10000)
libefa.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libefa.so.1 (0x0000fffcfc9e0000)
libibverbs.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libibverbs.so.1 (0x0000fffcfc9a0000)
libcxi.so.1 => /cluster/installations/eessi/default/aarch64/software/shs-libcxi/1.7.0-GCCcore-12.3.0/lib64/libcxi.so.1 (0x0000fffcfc960000)
libcurl.so.4 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libcurl.so.4 (0x0000fffcfc8a0000)
libjson-c.so.5 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/json-c/0.16-GCCcore-12.3.0/lib64/libjson-c.so.5 (0x0000fffcfc870000)
libatomic.so.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib64/libatomic.so.1 (0x0000fffcfc840000)
libcudart.so.12 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software/CUDA/12.1.1/lib64/libcudart.so.12 (0x0000fffcfc780000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x0000fffcf97d0000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x0000fffcf8980000)
libnl-route-3.so.200 => /cluster/installations/eessi/default/aarch64/software/libnl/3.11.0-GCCcore-12.3.0/lib64/libnl-route-3.so.200 (0x0000fffcf88d0000)
libnl-3.so.200 => /cluster/installations/eessi/default/aarch64/software/libnl/3.11.0-GCCcore-12.3.0/lib64/libnl-3.so.200 (0x0000fffcf8890000)
libpmix.so.2 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/PMIx/4.2.4-GCCcore-12.3.0/lib64/libpmix.so.2 (0x0000fffcf8690000)
libevent_core-2.1.so.7 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libevent/2.1.12-GCCcore-12.3.0/lib64/libevent_core-2.1.so.7 (0x0000fffcf8630000)
libevent_pthreads-2.1.so.7 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libevent/2.1.12-GCCcore-12.3.0/lib64/libevent_pthreads-2.1.so.7 (0x0000fffcf8600000)
libhwloc.so.15 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/hwloc/2.9.1-GCCcore-12.3.0/lib64/libhwloc.so.15 (0x0000fffcf8580000)
libpciaccess.so.0 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libpciaccess/0.17-GCCcore-12.3.0/lib64/libpciaccess.so.0 (0x0000fffcf8550000)
libxml2.so.2 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libxml2/2.11.4-GCCcore-12.3.0/lib64/libxml2.so.2 (0x0000fffcf83e0000)
libz.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libz.so.1 (0x0000fffcf83a0000)
liblzma.so.5 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/liblzma.so.5 (0x0000fffcf8330000)
libm.so.6 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libm.so.6 (0x0000fffcf8280000)
libc.so.6 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6 (0x0000fffcf80e0000)
/lib/ld-linux-aarch64.so.1 (0x0000fffcfd1e0000)
libcares.so.2 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libcares.so.2 (0x0000fffcf80a0000)
libnghttp2.so.14 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/usr/lib/../lib64/libnghttp2.so.14 (0x0000fffcf8050000)
libssl.so.1.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/OpenSSL/1.1/lib64/libssl.so.1.1 (0x0000fffcf7fb0000)
libcrypto.so.1.1 => /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/OpenSSL/1.1/lib64/libcrypto.so.1.1 (0x0000fffcf7d10000)
libdl.so.2 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libdl.so.2 (0x0000fffcf7ce0000)
libpthread.so.0 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libpthread.so.0 (0x0000fffcf7cb0000)
librt.so.1 => /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/librt.so.1 (0x0000fffcf7c80000)
```
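
Beyond `ldd`, a quick sanity check (not part of the original validation above) is to confirm that the injected libfabric actually exposes the CXI provider, using the `fi_info` utility that ships with libfabric; the prefix here is again a placeholder for wherever the CXI-enabled libfabric was installed.

```
# List what the CXI provider reports; expect entries for the cxi domains
# (e.g. cxi0) corresponding to the node's Slingshot NICs
"<libfabric-2.x-prefix>/bin/fi_info" -p cxi
```
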

### Testing

To validate the setup, we ran the OSU Micro-Benchmarks bi-directional bandwidth test across two nodes on each partition.

**1. Test using OSU-Micro-Benchmarks/7.1-1-gompi-2023a on two nodes (x86_64 AMD CPUs)**:
```
Environment set up to use EESSI (2023.06), have fun!
hostname:
x1001c6s2b0n1
x1001c6s3b0n0

CPU info:
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9745 128-Core Processor
Virtualization: AMD-V

Currently Loaded Modules:
1) GCCcore/12.3.0
2) GCC/12.3.0
3) numactl/2.0.16-GCCcore-12.3.0
4) libxml2/2.11.4-GCCcore-12.3.0
5) libpciaccess/0.17-GCCcore-12.3.0
6) hwloc/2.9.1-GCCcore-12.3.0
7) OpenSSL/1.1
8) libevent/2.1.12-GCCcore-12.3.0
9) UCX/1.14.1-GCCcore-12.3.0
10) libfabric/1.18.0-GCCcore-12.3.0
11) PMIx/4.2.4-GCCcore-12.3.0
12) UCC/1.2.0-GCCcore-12.3.0
13) OpenMPI/4.1.5-GCC-12.3.0
14) gompi/2023a
15) OSU-Micro-Benchmarks/7.1-1-gompi-2023a

# OSU MPI Bi-Directional Bandwidth Test v7.1
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 2.87
2 5.77
4 11.55
8 23.18
16 46.27
32 92.64
64 185.21
128 369.03
256 743.08
512 1487.21
1024 2975.75
2048 5928.14
4096 11809.66
8192 23097.44
16384 31009.54
32768 36493.20
65536 40164.63
131072 43150.62
262144 45075.57
524288 45918.07
1048576 46313.37
2097152 46507.25
4194304 46609.10
```
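
The output above does not show how the benchmark was launched; purely as an illustration, a two-node run with one rank per node could look like the sketch below. Launcher and scheduler integration (e.g. Slurm with PMIx) is site-specific, and `osu_bibw` is the OSU bi-directional bandwidth binary that the module is expected to put on the `PATH`.

```
# Inside a two-node allocation: with the rpath_overrides injection in place,
# the EESSI module's libmpi.so.40 resolves to the CXI-enabled build at runtime
module load OSU-Micro-Benchmarks/7.1-1-gompi-2023a

# Bi-directional bandwidth between two nodes, one MPI rank per node
mpirun -np 2 --map-by node osu_bibw
```
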

**2. Test using OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 on two nodes (Grace CPUs with Hopper GPUs)**:
```
Environment set up to use EESSI (2023.06), have fun!

hostname:
x1000c4s4b1n0
x1000c5s3b0n0

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
1) GCCcore/13.2.0
2) GCC/13.2.0
3) numactl/2.0.16-GCCcore-13.2.0
4) libxml2/2.11.5-GCCcore-13.2.0
5) libpciaccess/0.17-GCCcore-13.2.0
6) hwloc/2.9.2-GCCcore-13.2.0
7) OpenSSL/1.1
8) libevent/2.1.12-GCCcore-13.2.0
9) UCX/1.15.0-GCCcore-13.2.0
10) libfabric/1.19.0-GCCcore-13.2.0
11) PMIx/4.2.6-GCCcore-13.2.0
12) UCC/1.2.0-GCCcore-13.2.0
13) OpenMPI/4.1.6-GCC-13.2.0
14) gompi/2023b
15) GDRCopy/2.4-GCCcore-13.2.0
16) UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0 (g)
17) NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0 (g)
18) UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0 (g)
19) OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 (g)

Where:
g: built for GPU

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.18
2 0.37
4 0.75
8 1.49
16 2.99
32 5.93
64 11.88
128 23.76
256 72.78
512 145.45
1024 282.03
2048 535.46
4096 1020.24
8192 16477.70
16384 25982.96
32768 30728.30
65536 37637.46
131072 41808.92
262144 44316.19
524288 43693.89
1048576 43759.66
2097152 43593.38
4194304 43436.60
```
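
The CUDA-aware OSU benchmarks also accept buffer-placement arguments, so a device-to-device measurement on the Grace/Hopper partition could be sketched as follows. This is an illustration only: the output above does not state which buffer placement was used, and the launcher options are again site-specific.

```
module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0

# '-d cuda D D' requests CUDA device buffers on both the sending and receiving rank;
# use 'H H' for host buffers to compare against the CPU-only numbers
mpirun -np 2 --map-by node osu_bibw -d cuda D D
```
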
## Conclusion

By building the full CXI dependency chain on top of EESSI and injecting the resulting ABI-compatible OpenMPI and libfabric libraries via `host_injections`, applications from the EESSI software stack can use Slingshot 11 natively on both the x86_64 AMD partition and the Grace/Hopper partition. This approach demonstrates EESSI's flexibility in accommodating specialized hardware requirements while preserving the benefits of a standardized software stack.