Overview and Motivation
Tensor parallel (TP) inference is a key optimization and is widely adopted for LLM inference on CPUs. Most LLM benchmark/deployment use cases focus on a single host node (intra-node), where multiple CPU sockets/NUMA nodes/sub-NUMA clusters are available.
Following the practices of well-known libraries such as DeepSpeed, SGLang, vLLM and TGI, shared memory (SHM) optimizations bring significant benefit to communication operators such as allreduce, allgather and allgather_into_tensor in intra-node TP cases.
In this RFC, we propose a low-latency SHM-based optimization for Gloo, covering the key communication operators most used in LLM inference. Since Gloo is the default CPU communication backend in PyTorch, this SHM optimization can bring significant improvement when combined with the PyTorch tensor parallel solution.
Design:
- Intra-node check and fallback
# Check if this is an intra-node case
if local_size >= 0 and local_size == world_size:
    -> SHM_Impl
# Fallback
else:
    -> Gloo Ring_Impl
- OP register and dispatch

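As a starting point, here is a minimal sketch of one way the per-op registration and dispatch could be organized: a small registry keyed by op name holds an SHM implementation plus the existing fallback, and the dispatcher picks the SHM path only for intra-node communicators. All names here (CollectiveFn, registerOp, dispatchCollective, ...) are illustrative placeholders, not existing Gloo APIs.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative sketch only; none of these names are real Gloo APIs.
using CollectiveFn = std::function<void(void* data, size_t count)>;

struct OpEntry {
  CollectiveFn shmImpl;   // SHM-based implementation (intra-node)
  CollectiveFn fallback;  // existing Gloo implementation (e.g. ring)
};

// Registry keyed by op name ("allreduce", "allgather", ...).
static std::unordered_map<std::string, OpEntry>& opRegistry() {
  static std::unordered_map<std::string, OpEntry> registry;
  return registry;
}

void registerOp(const std::string& name, CollectiveFn shmImpl, CollectiveFn fallback) {
  opRegistry()[name] = OpEntry{std::move(shmImpl), std::move(fallback)};
}

// Pick the SHM path only when every rank of the communicator is on the same host.
void dispatchCollective(const std::string& name, bool intraNode, void* data, size_t count) {
  const OpEntry& entry = opRegistry().at(name);
  if (intraNode && entry.shmImpl) {
    entry.shmImpl(data, count);
  } else {
    entry.fallback(data, count);
  }
}
```

Each op would be registered once at backend initialization, e.g. registerOp("allreduce", shmAllreduce, ringAllreduce) with hypothetical shmAllreduce/ringAllreduce functions, and the same pattern would repeat for allgather and allgather_into_tensor.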
- dtype scope:
For the PyTorch integration, Gloo uses the data types defined in PyTorch, including c10::Half, c10::Float and c10::BFloat16. Thus, our SHM implementation also targets c10::Half, c10::Float, and c10::BFloat16.
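To cover these three dtypes without duplicating kernels, the SHM reduction could be written as a template instantiated per scalar type. A minimal sketch, assuming the c10::Half / c10::BFloat16 headers from PyTorch are on the include path and (as an assumption of this sketch, not the RFC) accumulating reduced-precision inputs in float:

```cpp
#include <cstddef>
#include <c10/util/BFloat16.h>
#include <c10/util/Half.h>

// Sketch of a dtype-generic sum kernel over the SHM reduce buffer.
// Half/BFloat16 inputs are accumulated in float to limit rounding error;
// this accumulation choice is an assumption of the sketch.
template <typename T>
void sumInto(T* dst, const T* src, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    dst[i] = static_cast<T>(static_cast<float>(dst[i]) + static_cast<float>(src[i]));
  }
}

// Explicit instantiations for the dtypes covered by this RFC.
template void sumInto<float>(float*, const float*, size_t);
template void sumInto<c10::Half>(c10::Half*, const c10::Half*, size_t);
template void sumInto<c10::BFloat16>(c10::BFloat16*, const c10::BFloat16*, size_t);
```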
- Shared memory design
In general, SHM optimization involves three steps (taking SHM allreduce as an example):
a. Copy each rank's input into the shared memory buffer.
b. Do the sum operation on the reduce buffer (or a gather for allgather).
c. Copy the result back to each rank.
Compared to ring allreduce, this approach uses one unified shared memory buffer for the computation in the intra-node case, and SHM allreduce does not go through point-to-point primitives such as send/receive the way ring allreduce does. This makes it faster and more efficient; a sketch of the flow is given below.
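Here is a minimal, single-node sketch of those three steps for float and sum, assuming one POSIX shared-memory segment mapped by every rank and a process-shared pthread barrier stored in that segment. The segment layout, the names, and the fact that rank 0 performs the whole reduction are simplifications of this sketch, not decisions of the RFC; a real implementation would partition the reduction across ranks and handle startup synchronization and error checking.

```cpp
// Sketch only. Build on Linux with: g++ -O2 shm_allreduce_sketch.cpp -pthread -lrt
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Shared segment layout (hypothetical): a process-shared barrier followed
// by one slot of maxCount floats per rank.
struct ShmHeader {
  pthread_barrier_t barrier;
};

static float* slotPtr(ShmHeader* seg, int rank, size_t maxCount) {
  return reinterpret_cast<float*>(seg + 1) + static_cast<size_t>(rank) * maxCount;
}

// Map (and, on rank 0, initialize) the shared segment. Error handling and the
// synchronization needed before non-zero ranks may use the barrier are omitted.
ShmHeader* mapSegment(const char* name, int rank, int worldSize, size_t maxCount) {
  const size_t bytes = sizeof(ShmHeader) + worldSize * maxCount * sizeof(float);
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  (void)ftruncate(fd, static_cast<off_t>(bytes));
  auto* seg = static_cast<ShmHeader*>(
      mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  close(fd);
  if (rank == 0) {
    pthread_barrierattr_t attr;
    pthread_barrierattr_init(&attr);
    pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_barrier_init(&seg->barrier, &attr, worldSize);
  }
  return seg;
}

// Steps a-c from the design above (float, sum).
void shmAllreduce(ShmHeader* seg, int rank, int worldSize,
                  float* data, size_t count, size_t maxCount) {
  // (a) copy this rank's input into its slot of the shared buffer
  std::memcpy(slotPtr(seg, rank, maxCount), data, count * sizeof(float));
  pthread_barrier_wait(&seg->barrier);  // all inputs are now visible

  // (b) sum all slots into slot 0, the reduce buffer (done by rank 0 here;
  //     a real implementation would split this work across ranks)
  if (rank == 0) {
    float* reduceBuf = slotPtr(seg, 0, maxCount);
    for (int r = 1; r < worldSize; ++r) {
      const float* slot = slotPtr(seg, r, maxCount);
      for (size_t i = 0; i < count; ++i) {
        reduceBuf[i] += slot[i];
      }
    }
  }
  pthread_barrier_wait(&seg->barrier);  // reduction finished

  // (c) every rank copies the result back from the shared buffer
  std::memcpy(data, slotPtr(seg, 0, maxCount), count * sizeof(float));
  pthread_barrier_wait(&seg->barrier);  // safe to reuse the buffer afterwards
}
```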

Implementation (Work in progress):
- allreduce: Intra-node shared memory (SHM) optimizations for CPU primitives #458 (under review)
- allgather.
- allgather_into_tensor.