
Commit 59caae0

Readme (#714)

dstaay-fb authored and facebook-github-bot committed

Summary: Pull Request resolved: #714 readme

Reviewed By: allenwang28

Differential Revision: D79394652

fbshipit-source-id: 582c29e6e8f74ebce29134b06c6675da1456e2e9

1 parent 50d083f · commit 59caae0


rdmaxcel-sys/README.md

Lines changed: 180 additions & 0 deletions
# RDMaxcel

RDMaxcel (RDMA Acceleration) is a library that provides a unified interface for RDMA operations from both CPU and GPU, enabling seamless RDMA communication regardless of whether the caller is host code or device hardware.

## Overview

RDMaxcel bridges the gap between traditional CPU-based RDMA programming and GPU-accelerated RDMA operations by providing a consistent API that works identically in both environments. This enables applications to leverage the power of GPUs for RDMA operations without having to maintain separate code paths.

## Key Features

- **Unified Host/Device API**: The same functions can be called from both CPU and GPU code
- **Zero-copy Operations**: Direct GPU access to RDMA resources without CPU involvement
- **Hardware-agnostic Interface**: Abstracts away differences between CPU and GPU execution
- **Low-latency Communication**: Direct GPU-initiated RDMA operations
## How RDMaxcel Works

### Work Queue Elements (WQEs)

Work Queue Elements (WQEs) are the fundamental building blocks of RDMA operations. They represent commands to the RDMA hardware that describe data transfer operations.

#### WQE Structure

In RDMaxcel, WQEs are composed of multiple segments:

1. **Control Segment**: Contains operation type, QP number, and other control information
2. **Remote Address Segment**: Specifies the remote memory address and key for RDMA operations
3. **Data Segment**: Describes the local memory buffer, including address, length, and memory key

The `wqe_params_t` structure encapsulates all parameters needed to create and post a WQE:
```c
typedef struct {
  uintptr_t laddr;   // Local memory address
  uint32_t lkey;     // Local memory key
  size_t length;     // Length of data to transfer
  uint64_t wr_id;    // Work request ID
  bool signaled;     // Whether completion should be signaled
  uint32_t op_type;  // Operation type (e.g., MLX5_OPCODE_RDMA_WRITE)
  uintptr_t raddr;   // Remote memory address
  uint32_t rkey;     // Remote memory key
  uint32_t qp_num;   // Queue pair number
  uint8_t* buf;      // WQE buffer address
  uint32_t* dbrec;   // Doorbell record
  uint32_t wqe_cnt;  // WQE count
} wqe_params_t;
```
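For orientation, the three segments listed above correspond to the standard mlx5 segment layouts exposed by `infiniband/mlx5dv.h`. The following is only an illustrative sketch of how the fields of `wqe_params_t` could be mapped onto those segments for an RDMA write; it is not RDMaxcel's actual implementation, and details such as WQE index handling and the segment count are assumptions here.

```c
#include <endian.h>
#include <infiniband/mlx5dv.h>
#include <string.h>

// Hypothetical helper (not an RDMaxcel symbol): lay out control,
// remote-address, and data segments at `wqe` from the parameters above.
static void fill_send_wqe_sketch(void* wqe, const wqe_params_t* p, uint16_t idx) {
  struct mlx5_wqe_ctrl_seg* ctrl = (struct mlx5_wqe_ctrl_seg*)wqe;
  struct mlx5_wqe_raddr_seg* rem = (struct mlx5_wqe_raddr_seg*)(ctrl + 1);
  struct mlx5_wqe_data_seg* data = (struct mlx5_wqe_data_seg*)(rem + 1);

  memset(ctrl, 0, sizeof(*ctrl));
  // Control segment: opcode, WQE index, QP number, segment count (3 x 16 bytes).
  ctrl->opmod_idx_opcode = htobe32(((uint32_t)idx << 8) | p->op_type);
  ctrl->qpn_ds = htobe32((p->qp_num << 8) | 3);
  ctrl->fm_ce_se = p->signaled ? MLX5_WQE_CTRL_CQ_UPDATE : 0;

  // Remote address segment: where the data lands on the peer.
  rem->raddr = htobe64(p->raddr);
  rem->rkey = htobe32(p->rkey);
  rem->reserved = 0;

  // Data segment: the local buffer being transferred.
  data->byte_count = htobe32((uint32_t)p->length);
  data->lkey = htobe32(p->lkey);
  data->addr = htobe64(p->laddr);
}
```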
#### Posting WQEs

RDMaxcel provides unified functions for posting send and receive WQEs:

- `send_wqe()`: Posts a send WQE (RDMA write, read, send, etc.)
- `recv_wqe()`: Posts a receive WQE

These functions can be called directly from both host and device code, with identical behavior. This is achieved through CUDA's `__host__ __device__` function attributes.
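To make the host/device symmetry concrete, here is a hedged sketch of what calling the same entry point from both sides could look like; `post_from_kernel` and `post_from_host` are hypothetical callers, not part of RDMaxcel.

```c
// Device-side caller: a GPU thread posts the WQE itself.
__global__ void post_from_kernel(wqe_params_t params) {
  send_wqe(params);  // same function, compiled for the device
}

// Host-side caller: the CPU posts the WQE, or hands it to the GPU.
void post_from_host(wqe_params_t params) {
  send_wqe(params);                    // same function, compiled for the host
  post_from_kernel<<<1, 1>>>(params);  // or launch the device path
}
```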
### Doorbells

Doorbells are the mechanism used to notify the RDMA hardware that new work has been queued and is ready for processing.

#### How Doorbells Work

1. **WQE Creation**: The application creates a WQE in memory
2. **Doorbell Ring**: The application "rings the doorbell" by writing to a special memory-mapped register
3. **Hardware Notification**: This write operation notifies the RDMA hardware that new work is available
4. **Work Processing**: The hardware reads the WQE from memory and executes the requested operation
#### Doorbell Implementation

RDMaxcel implements doorbell operations through the `db_ring()` function:

```c
__host__ __device__ void db_ring(void* dst, void* src);
```

This function copies 64 bytes (8 64-bit values) from the source buffer to the destination doorbell register. The unified `__host__ __device__` implementation ensures that this operation works identically whether called from CPU or GPU code.
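Since the copy is described as 64 bytes moved in 64-bit chunks, a minimal sketch of the operation might look like the following; the real `db_ring()` may differ, for example in how it orders or fences the stores.

```c
#include <stdint.h>

// Hedged sketch only: copy eight 64-bit words (64 bytes) from the staged
// doorbell value to the memory-mapped doorbell register.
__host__ __device__ inline void db_ring_sketch(void* dst, void* src) {
  volatile uint64_t* d = (volatile uint64_t*)dst;
  const uint64_t* s = (const uint64_t*)src;
  for (int i = 0; i < 8; ++i) {
    d[i] = s[i];
  }
}
```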
### Same Code Path for Device and Host

One of the key innovations in RDMaxcel is the use of the same code path for both device (GPU) and host (CPU) operations. This is achieved through several techniques:

#### Unified Function Implementations

Core functions are implemented with CUDA's `__host__ __device__` attributes, allowing them to be compiled for both CPU and GPU execution:

```c
__host__ __device__ void send_wqe(wqe_params_t params);
__host__ __device__ void recv_wqe(wqe_params_t params);
__host__ __device__ void db_ring(void* dst, void* src);
__host__ __device__ void cqe_poll(int32_t* result, cqe_poll_params_t params);
```
#### Memory Registration

For GPU access to RDMA resources, RDMaxcel registers the necessary memory regions with CUDA:

```c
cudaError_t register_cuda_memory(
    struct mlx5dv_qp* dv_qp,
    struct mlx5dv_cq* dv_recv_cq,
    struct mlx5dv_cq* dv_send_cq);
```

This function registers queue pair buffers, completion queue buffers, and doorbell registers with CUDA, making them accessible from GPU code.
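As a rough illustration of what such registration could involve (a sketch under assumptions, not RDMaxcel's actual code), the buffers exposed by `mlx5dv_qp`/`mlx5dv_cq` can be pinned and mapped for the GPU with the CUDA runtime; alignment requirements and the memory-mapped doorbell register (which would typically need `cudaHostRegisterIoMemory`) are glossed over here.

```c
#include <cuda_runtime.h>
#include <infiniband/mlx5dv.h>
#include <stdint.h>

// Hypothetical sketch of registering RDMA driver buffers with CUDA so that
// device code can read and write them directly.
static cudaError_t register_cuda_memory_sketch(struct mlx5dv_qp* dv_qp,
                                               struct mlx5dv_cq* dv_recv_cq,
                                               struct mlx5dv_cq* dv_send_cq) {
  cudaError_t err;

  // Send queue buffer: this is where WQEs are written.
  err = cudaHostRegister(dv_qp->sq.buf,
                         (size_t)dv_qp->sq.wqe_cnt * dv_qp->sq.stride,
                         cudaHostRegisterDefault);
  if (err != cudaSuccess) return err;

  // Completion queue buffers: polled for CQEs.
  err = cudaHostRegister(dv_send_cq->buf,
                         (size_t)dv_send_cq->cqe_cnt * dv_send_cq->cqe_size,
                         cudaHostRegisterDefault);
  if (err != cudaSuccess) return err;
  err = cudaHostRegister(dv_recv_cq->buf,
                         (size_t)dv_recv_cq->cqe_cnt * dv_recv_cq->cqe_size,
                         cudaHostRegisterDefault);
  if (err != cudaSuccess) return err;

  // Doorbell record: software-owned counters consulted by the hardware.
  return cudaHostRegister(dv_qp->dbrec, 2 * sizeof(uint32_t),
                          cudaHostRegisterDefault);
}
```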
#### Kernel Wrappers

For GPU execution, RDMaxcel provides kernel wrapper functions that launch the core functions on the GPU:

```c
__global__ void cu_send_wqe(wqe_params_t params);
__global__ void cu_recv_wqe(wqe_params_t params);
__global__ void cu_db_ring(void* dst, void* src);
__global__ void cu_cqe_poll(int32_t* result, cqe_poll_params_t params);
```

And corresponding launch functions:

```c
void launch_send_wqe(wqe_params_t params);
void launch_recv_wqe(wqe_params_t params);
void launch_db_ring(void* dst, void* src);
cqe_poll_result_t launch_cqe_poll(void* mlx5dv_cq, int32_t cqe_idx);
```
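A plausible shape for one of these launch wrappers is sketched below; the grid/block configuration and the synchronization policy are assumptions, not details taken from the RDMaxcel sources.

```c
// Hypothetical wrapper: run the device-side post on a single GPU thread
// and wait for it to finish before returning.
void launch_send_wqe_sketch(wqe_params_t params) {
  cu_send_wqe<<<1, 1>>>(params);
  cudaDeviceSynchronize();
}
```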
### Caller Agnosticism

The design of RDMaxcel makes it agnostic to whether the caller is CPU or GPU hardware:

1. **Identical Function Signatures**: The same parameters are used for both CPU and GPU calls
2. **Consistent Memory Layout**: WQEs and CQEs have the same memory layout in both environments
3. **Unified Endianness Handling**: Byte swapping functions work identically on both platforms (see the sketch after this list)
4. **Transparent Memory Access**: Memory registration ensures the GPU can access all required resources
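The endianness point reflects the fact that the hardware's WQE and CQE fields are big-endian; a minimal sketch of a dual-target byte-swap helper is shown below (`bswap32_sketch` is a hypothetical name, not an RDMaxcel symbol).

```c
#include <stdint.h>

// Compiled for both the host and the device, so byte order is handled the
// same way regardless of where the WQE or CQE is being built or parsed.
__host__ __device__ inline uint32_t bswap32_sketch(uint32_t x) {
  return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
         ((x & 0x00FF0000u) >> 8) | ((x & 0xFF000000u) >> 24);
}
```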
## Benefits

- **Code Reuse**: Write RDMA code once, run it on both CPU and GPU
- **Simplified Development**: No need to maintain separate code paths
- **Performance**: Direct GPU-initiated RDMA operations without CPU involvement
- **Flexibility**: Choose the best execution environment for each workload
## Usage Example

```c
// Create and initialize RDMA resources
struct ibv_qp* qp = create_qp(...);
struct mlx5dv_qp* dv_qp = create_mlx5dv_qp(qp);
struct mlx5dv_cq* dv_send_cq = create_mlx5dv_send_cq(qp);
struct mlx5dv_cq* dv_recv_cq = create_mlx5dv_recv_cq(qp);

// Register memory with CUDA
register_cuda_memory(dv_qp, dv_recv_cq, dv_send_cq);

// Create WQE parameters
wqe_params_t params = {
    .laddr = local_buffer_addr,
    .lkey = local_memory_key,
    .length = transfer_size,
    .wr_id = work_id,
    .signaled = true,
    .op_type = MLX5_OPCODE_RDMA_WRITE,
    .raddr = remote_buffer_addr,
    .rkey = remote_memory_key,
    .qp_num = qp->qp_num,
    .buf = dv_qp->sq.buf,
    .dbrec = dv_qp->dbrec,
    .wqe_cnt = dv_qp->sq.wqe_cnt
};

// CPU execution
send_wqe(params);

// Or GPU execution
launch_send_wqe(params);
```
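Because the write above is posted with `.signaled = true`, a natural follow-up (hedged, since this README does not show the fields of `cqe_poll_result_t`) is to wait for its completion via the polling entry point declared earlier:

```c
// Check the first CQE slot of the send CQ for the completion of the
// signaled write; index 0 assumes a freshly created queue pair.
cqe_poll_result_t completion = launch_cqe_poll(dv_send_cq, 0);
```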
## Conclusion

RDMaxcel provides a powerful abstraction for RDMA operations that works seamlessly across CPU and GPU environments. By using the same code path for both device and host operations, it simplifies development and enables new possibilities for GPU-accelerated networking applications.
