
Conversation

@hzhou (Contributor) commented Dec 17, 2025

Pull Request Description

MPI_Get falls back to the MPIDIG active message layer when either the window buffer is a device buffer (and libfabric HMEM is not enabled or supported) or the origin buffer is a device buffer. The active message transport in ofi uses RDMA for data larger than 16KB. With gpu memory for both the origin buffer and the window base buffer, that involves an mr registration per message and extra allocation of staging buffers on both the origin and the target side. In addition, both staging copies are currently synchronous.
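
For reference, here is a minimal C sketch of this fallback condition; `is_device_ptr` and the capability flag are hypothetical stand-ins, not the actual MPICH API:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device-pointer query (the real code consults
 * the GPU runtime / pointer attributes); always-host here so the sketch compiles. */
static bool is_device_ptr(const void *ptr)
{
    (void) ptr;
    return false;
}

/* The fallback condition described above: the AM path is taken when the window
 * base is a device buffer without provider HMEM support, or when the origin
 * buffer is a device buffer. */
static bool get_uses_am_fallback(const void *origin_addr, const void *win_base,
                                 bool provider_supports_hmem)
{
    return (is_device_ptr(win_base) && !provider_supports_hmem) ||
        is_device_ptr(origin_addr);
}
```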

This PR proposes an alternative fallback algorithm; a brief sketch follows the list below:

  1. allocate a host mirror buffer at window creation time
  2. on the target, upon receiving the GET_REQ active message, do a localcopy from the device base buffer to the mirror buffer and send an ack to the origin
  3. on the origin, perform an RDMA read from the mirror buffer
  4. use the asynchronous pipelined read algorithm (ch4/ofi: add new pipeline implementation #7529) to improve throughput

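A minimal, host-only sketch of steps 2 and 3; the types and helper names are hypothetical stand-ins, not the actual MPICH code:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the window state described above. */
typedef struct {
    char *device_base;   /* window base buffer (device memory in the real case) */
    char *host_mirror;   /* host mirror allocated at window creation (step 1) */
    size_t win_size;
} win_mirror_t;

/* Step 2: on GET_REQ, localcopy device base -> host mirror, then ack the
 * origin so it can issue an RDMA read against the mirror (step 3).
 * device_to_host_copy and send_ack are placeholders for the real device
 * copy and active-message ack. */
static void handle_get_req(win_mirror_t *win, size_t offset, size_t len,
                           void (*device_to_host_copy)(void *, const void *, size_t),
                           void (*send_ack)(size_t, size_t))
{
    device_to_host_copy(win->host_mirror + offset, win->device_base + offset, len);
    send_ack(offset, len);
}

/* Host-only stand-ins so the sketch runs as-is. */
static void plain_copy(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void print_ack(size_t offset, size_t len)
{
    printf("ack: mirror ready at offset %zu, %zu bytes\n", offset, len);
}

int main(void)
{
    char base[64] = "window contents";
    char mirror[64] = {0};
    win_mirror_t win = { base, mirror, sizeof(base) };

    handle_get_req(&win, 0, 16, plain_copy, print_ack);
    printf("mirror: %s\n", mirror);
    return 0;
}
```

In the real path the copy is a device-to-host localcopy and the ack lets the origin RDMA-read from the registered mirror, feeding the pipelined read in step 4.
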
[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou force-pushed the 2512_ofi_rma branch 3 times, most recently from 3b3d43c to 64dd174 on December 17, 2025 at 20:41

When the base is a gpu buffer and native RMA is not supported, we currently
allocate a pack buffer and perform both a gpu registration and a nic mr
registration for every put or get. This is inefficient. In addition, the
provider may not support too many concurrent mr keys.

Add MPIR_CVAR_OFI_ENABLE_WIN_MIRROR; when enabled, allocate a mirror buffer
at window creation so that puts and gets do not need separate pack buffers
and registrations.

Add an ofi rma get fallback using active messages and the mirror buffer.
The implementation is similar to the mpidig am get, but it does not use
MPIDIG_REQUEST and it uses the async framework for the local asynchronous
steps. Refactor the asynchronous pipelined RDMA read operation so it can be
reused by other parts of the code.
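
Roughly, the window-creation hook could look like the following sketch; only the CVAR name comes from the commit message, the structure and function names are assumptions:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical window state; illustrative only. */
typedef struct {
    void *base;      /* user window base, possibly device memory */
    size_t size;
    void *mirror;    /* host mirror, NULL when the feature is off */
    /* a single NIC mr handle for the mirror would also live here */
} ofi_win_t;

static int win_mirror_init(ofi_win_t *win, bool cvar_enable_win_mirror,
                           bool base_is_device)
{
    win->mirror = NULL;
    if (!cvar_enable_win_mirror || !base_is_device)
        return 0;

    /* One host allocation and one registration for the lifetime of the
     * window, instead of a pack buffer plus gpu/nic registration per message. */
    win->mirror = malloc(win->size);
    if (win->mirror == NULL)
        return -1;
    /* register win->mirror with the NIC here once (e.g. a single fi_mr_reg) */
    return 0;
}
```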

It is a compromise for now to use an MPIR_Request for the async context; this
minimizes the amount of code changes to ofi_rndv_read.c, even though the RDMA
read operation on its own does not need a full MPIR_Request.

A direct fi_read will not work when the origin buffer is a device buffer or
uses a non-contiguous datatype. The async operation MPIDI_OFI_rdmaread_poll
handles the pipelined read and supports both device buffers and
non-contiguous datatypes.
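
A simplified sketch of the poll pattern; the names and signatures are assumptions and do not match the real MPIDI_OFI_rdmaread_poll:

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-operation async context; a stand-in for the state the commit says is
 * carried in an MPIR_Request for now. */
typedef struct {
    size_t total;       /* total bytes to read */
    size_t issued;      /* bytes for which a chunk read has been posted */
    size_t completed;   /* bytes whose chunk has completed and been unpacked */
    size_t chunk_sz;    /* pipeline chunk size */
} rdmaread_ctx_t;

/* Placeholders: post one chunk read (returns 0 if no pipeline slot or staging
 * buffer is available) and retire finished chunks from the completion queue. */
static size_t post_chunk_read(rdmaread_ctx_t *ctx, size_t offset, size_t len);
static size_t drain_completions(rdmaread_ctx_t *ctx);

/* Poll step in the progress loop: keep the pipeline full, retire completed
 * chunks (copying/unpacking them into the device or non-contiguous origin
 * buffer), and report whether the whole read is done. */
static bool rdmaread_poll(rdmaread_ctx_t *ctx)
{
    while (ctx->issued < ctx->total) {
        size_t len = ctx->total - ctx->issued;
        if (len > ctx->chunk_sz)
            len = ctx->chunk_sz;
        if (post_chunk_read(ctx, ctx->issued, len) == 0)
            break;              /* pipeline is full; try again on the next poll */
        ctx->issued += len;
    }
    ctx->completed += drain_completions(ctx);
    return ctx->completed >= ctx->total;
}

/* Trivial stubs so the sketch compiles; the real code posts fi_read against a
 * staging chunk and reaps libfabric CQ entries. */
static size_t post_chunk_read(rdmaread_ctx_t *ctx, size_t offset, size_t len)
{
    (void) ctx; (void) offset;
    return len;
}

static size_t drain_completions(rdmaread_ctx_t *ctx)
{
    (void) ctx;
    return 0;
}
```
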
@hzhou (Contributor, Author) commented Dec 18, 2025

test:mpich/custom/pipeline
label: gpu
netmod: ch4:ofi
config: nolocal
env: MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000
env: CUDA_PATH=/usr/local/cuda

test:mpich/ch4/ofi
