
Conversation

@hzhou (Contributor) commented Dec 17, 2025

Pull Request Description

MPI_Get falls back to the MPIDIG active message layer when either the window buffer is a device buffer (and libfabric HMEM is not enabled or supported) or the origin buffer is a device buffer. The active message transport in ofi uses RDMA for data larger than 16KB. With gpu memory for both the origin buffer and the window base buffer, that involves an mr registration per message and extra allocation of staging buffers on both the origin and the target side. In addition, both staging copies are currently synchronous.
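
For reference, here is a minimal C sketch of this fallback condition; `is_device_ptr` and the capability flag are hypothetical stand-ins, not the actual MPICH API:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device-pointer query (the real code consults
 * the GPU runtime / pointer attributes); always-host here so the sketch compiles. */
static bool is_device_ptr(const void *ptr)
{
    (void) ptr;
    return false;
}

/* The fallback condition described above: the AM path is taken when the window
 * base is a device buffer without provider HMEM support, or when the origin
 * buffer is a device buffer. */
static bool get_uses_am_fallback(const void *origin_addr, const void *win_base,
                                 bool provider_supports_hmem)
{
    return (is_device_ptr(win_base) && !provider_supports_hmem) ||
        is_device_ptr(origin_addr);
}
```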

This PR proposes an alternative fallback algorithm; a brief sketch follows the list below:

  1. allocate a host mirror buffer at window creation time
  2. on the target, upon receiving the GET_REQ active message, do a localcopy from the device base buffer to the mirror buffer and send an ack to the origin
  3. on the origin, perform an RDMA read from the mirror buffer
  4. use the asynchronous pipelined read algorithm (ch4/ofi: add new pipeline implementation #7529) to improve throughput

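A minimal, host-only sketch of steps 2 and 3; the types and helper names are hypothetical stand-ins, not the actual MPICH code:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the window state described above. */
typedef struct {
    char *device_base;   /* window base buffer (device memory in the real case) */
    char *host_mirror;   /* host mirror allocated at window creation (step 1) */
    size_t win_size;
} win_mirror_t;

/* Step 2: on GET_REQ, localcopy device base -> host mirror, then ack the
 * origin so it can issue an RDMA read against the mirror (step 3).
 * device_to_host_copy and send_ack are placeholders for the real device
 * copy and active-message ack. */
static void handle_get_req(win_mirror_t *win, size_t offset, size_t len,
                           void (*device_to_host_copy)(void *, const void *, size_t),
                           void (*send_ack)(size_t, size_t))
{
    device_to_host_copy(win->host_mirror + offset, win->device_base + offset, len);
    send_ack(offset, len);
}

/* Host-only stand-ins so the sketch runs as-is. */
static void plain_copy(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void print_ack(size_t offset, size_t len)
{
    printf("ack: mirror ready at offset %zu, %zu bytes\n", offset, len);
}

int main(void)
{
    char base[64] = "window contents";
    char mirror[64] = {0};
    win_mirror_t win = { base, mirror, sizeof(base) };

    handle_get_req(&win, 0, 16, plain_copy, print_ack);
    printf("mirror: %s\n", mirror);
    return 0;
}
```

In the real path the copy is a device-to-host localcopy and the ack lets the origin RDMA-read from the registered mirror, feeding the pipelined read in step 4.
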
[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou force-pushed the 2512_ofi_rma branch 3 times, most recently from 3b3d43c to 64dd174 on December 17, 2025 at 20:41

When the base is a gpu buffer and native RMA is not supported, we currently
allocate a pack buffer and perform both a gpu registration and a nic mr
registration for every put or get. This is inefficient. In addition, the
provider may not support too many concurrent mr keys.

Add MPIR_CVAR_OFI_ENABLE_WIN_MIRROR; when enabled, allocate a mirror buffer
at window creation so that puts and gets do not need separate pack buffers
and registrations.

Add an ofi rma get fallback using active messages and the mirror buffer.
The implementation is similar to the mpidig am get, but it does not use
MPIDIG_REQUEST and it uses the async framework for the local asynchronous
steps. Refactor the asynchronous pipelined RDMA read operation so it can be
reused by other parts of the code.
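
Roughly, the window-creation hook could look like the following sketch; only the CVAR name comes from the commit message, the structure and function names are assumptions:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical window state; illustrative only. */
typedef struct {
    void *base;      /* user window base, possibly device memory */
    size_t size;
    void *mirror;    /* host mirror, NULL when the feature is off */
    /* a single NIC mr handle for the mirror would also live here */
} ofi_win_t;

static int win_mirror_init(ofi_win_t *win, bool cvar_enable_win_mirror,
                           bool base_is_device)
{
    win->mirror = NULL;
    if (!cvar_enable_win_mirror || !base_is_device)
        return 0;

    /* One host allocation and one registration for the lifetime of the
     * window, instead of a pack buffer plus gpu/nic registration per message. */
    win->mirror = malloc(win->size);
    if (win->mirror == NULL)
        return -1;
    /* register win->mirror with the NIC here once (e.g. a single fi_mr_reg) */
    return 0;
}
```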

It is a compromise for now to use an MPIR_Request for the async context; this
minimizes the amount of code changes to ofi_rndv_read.c, even though the RDMA
read operation on its own does not need a full MPIR_Request.

A direct fi_read will not work when the origin buffer is a device buffer or
uses a non-contiguous datatype. The async operation MPIDI_OFI_rdmaread_poll
handles the pipelined read and supports both device buffers and
non-contiguous datatypes.
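
A simplified sketch of the poll pattern; the names and signatures are assumptions and do not match the real MPIDI_OFI_rdmaread_poll:

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-operation async context; a stand-in for the state the commit says is
 * carried in an MPIR_Request for now. */
typedef struct {
    size_t total;       /* total bytes to read */
    size_t issued;      /* bytes for which a chunk read has been posted */
    size_t completed;   /* bytes whose chunk has completed and been unpacked */
    size_t chunk_sz;    /* pipeline chunk size */
} rdmaread_ctx_t;

/* Placeholders: post one chunk read (returns 0 if no pipeline slot or staging
 * buffer is available) and retire finished chunks from the completion queue. */
static size_t post_chunk_read(rdmaread_ctx_t *ctx, size_t offset, size_t len);
static size_t drain_completions(rdmaread_ctx_t *ctx);

/* Poll step in the progress loop: keep the pipeline full, retire completed
 * chunks (copying/unpacking them into the device or non-contiguous origin
 * buffer), and report whether the whole read is done. */
static bool rdmaread_poll(rdmaread_ctx_t *ctx)
{
    while (ctx->issued < ctx->total) {
        size_t len = ctx->total - ctx->issued;
        if (len > ctx->chunk_sz)
            len = ctx->chunk_sz;
        if (post_chunk_read(ctx, ctx->issued, len) == 0)
            break;              /* pipeline is full; try again on the next poll */
        ctx->issued += len;
    }
    ctx->completed += drain_completions(ctx);
    return ctx->completed >= ctx->total;
}

/* Trivial stubs so the sketch compiles; the real code posts fi_read against a
 * staging chunk and reaps libfabric CQ entries. */
static size_t post_chunk_read(rdmaread_ctx_t *ctx, size_t offset, size_t len)
{
    (void) ctx; (void) offset;
    return len;
}

static size_t drain_completions(rdmaread_ctx_t *ctx)
{
    (void) ctx;
    return 0;
}
```
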
@hzhou (Contributor, Author) commented Dec 18, 2025

test:mpich/custom/pipeline
label: gpu
netmod: ch4:ofi
config: nolocal
env: MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000
env: CUDA_PATH=/usr/local/cuda

test:mpich/ch4/ofi
