Overview and Motivation
Tensor parallel (TP) inference is a key optimization and is widely adopted for LLM inference on CPUs. Most LLM benchmark/deployment use cases focus on a single host node (intra-node), where multiple CPU sockets/NUMA nodes/sub-NUMA clusters are available.
Following the practices of well-known libraries such as DeepSpeed, SGLang, vLLM and TGI, shared memory (SHM) optimizations bring significant benefit to communication operators such as allreduce, allgather and allgather_into_tensor in intra-node TP cases.
In this RFC, we propose a low-latency SHM-based optimization for Gloo, covering the key communication operators most used in LLM inference. Since Gloo is the default CPU communication backend in PyTorch, this SHM optimization can bring significant improvement when combined with the PyTorch tensor parallel solution.
Design:
- Intra-node check and fallback
# Check if this is an intra-node case
if local_size >= 0 and local_size == world_size:
    -> SHM_Impl
# Fallback
else:
    -> Gloo Ring_Impl
- OP register and dispatch

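As a starting point, here is a minimal sketch of one way the per-op registration and dispatch could be organized: a small registry keyed by op name holds an SHM implementation plus the existing fallback, and the dispatcher picks the SHM path only for intra-node communicators. All names here (CollectiveFn, registerOp, dispatchCollective, ...) are illustrative placeholders, not existing Gloo APIs.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative sketch only; none of these names are real Gloo APIs.
using CollectiveFn = std::function<void(void* data, size_t count)>;

struct OpEntry {
  CollectiveFn shmImpl;   // SHM-based implementation (intra-node)
  CollectiveFn fallback;  // existing Gloo implementation (e.g. ring)
};

// Registry keyed by op name ("allreduce", "allgather", ...).
static std::unordered_map<std::string, OpEntry>& opRegistry() {
  static std::unordered_map<std::string, OpEntry> registry;
  return registry;
}

void registerOp(const std::string& name, CollectiveFn shmImpl, CollectiveFn fallback) {
  opRegistry()[name] = OpEntry{std::move(shmImpl), std::move(fallback)};
}

// Pick the SHM path only when every rank of the communicator is on the same host.
void dispatchCollective(const std::string& name, bool intraNode, void* data, size_t count) {
  const OpEntry& entry = opRegistry().at(name);
  if (intraNode && entry.shmImpl) {
    entry.shmImpl(data, count);
  } else {
    entry.fallback(data, count);
  }
}
```

Each op would be registered once at backend initialization, e.g. registerOp("allreduce", shmAllreduce, ringAllreduce) with hypothetical shmAllreduce/ringAllreduce functions, and the same pattern would repeat for allgather and allgather_into_tensor.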
- dtype scope:
For the PyTorch integration, Gloo uses the data types defined in PyTorch, including c10::Half, c10::Float and c10::BFloat16. Thus, our SHM implementation also targets c10::Half, c10::Float, and c10::BFloat16.
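To cover these three dtypes without duplicating kernels, the SHM reduction could be written as a template instantiated per scalar type. A minimal sketch, assuming the c10::Half / c10::BFloat16 headers from PyTorch are on the include path and (as an assumption of this sketch, not the RFC) accumulating reduced-precision inputs in float:

```cpp
#include <cstddef>
#include <c10/util/BFloat16.h>
#include <c10/util/Half.h>

// Sketch of a dtype-generic sum kernel over the SHM reduce buffer.
// Half/BFloat16 inputs are accumulated in float to limit rounding error;
// this accumulation choice is an assumption of the sketch.
template <typename T>
void sumInto(T* dst, const T* src, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    dst[i] = static_cast<T>(static_cast<float>(dst[i]) + static_cast<float>(src[i]));
  }
}

// Explicit instantiations for the dtypes covered by this RFC.
template void sumInto<float>(float*, const float*, size_t);
template void sumInto<c10::Half>(c10::Half*, const c10::Half*, size_t);
template void sumInto<c10::BFloat16>(c10::BFloat16*, const c10::BFloat16*, size_t);
```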
- Shared memory design
In general, SHM optimization involves three steps (taking SHM allreduce as an example):
a. Copy each rank's input into the shared memory buffer.
b. Do the sum operation on the reduce buffer (or a gather for allgather).
c. Copy the result back to each rank.
Compared to ring allreduce, this approach uses one unified shared memory buffer for the computation in the intra-node case, and SHM allreduce does not go through point-to-point primitives such as send/receive the way ring allreduce does. This makes it faster and more efficient; a sketch of the flow is given below.
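Here is a minimal, single-node sketch of those three steps for float and sum, assuming one POSIX shared-memory segment mapped by every rank and a process-shared pthread barrier stored in that segment. The segment layout, the names, and the fact that rank 0 performs the whole reduction are simplifications of this sketch, not decisions of the RFC; a real implementation would partition the reduction across ranks and handle startup synchronization and error checking.

```cpp
// Sketch only. Build on Linux with: g++ -O2 shm_allreduce_sketch.cpp -pthread -lrt
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Shared segment layout (hypothetical): a process-shared barrier followed
// by one slot of maxCount floats per rank.
struct ShmHeader {
  pthread_barrier_t barrier;
};

static float* slotPtr(ShmHeader* seg, int rank, size_t maxCount) {
  return reinterpret_cast<float*>(seg + 1) + static_cast<size_t>(rank) * maxCount;
}

// Map (and, on rank 0, initialize) the shared segment. Error handling and the
// synchronization needed before non-zero ranks may use the barrier are omitted.
ShmHeader* mapSegment(const char* name, int rank, int worldSize, size_t maxCount) {
  const size_t bytes = sizeof(ShmHeader) + worldSize * maxCount * sizeof(float);
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  (void)ftruncate(fd, static_cast<off_t>(bytes));
  auto* seg = static_cast<ShmHeader*>(
      mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  close(fd);
  if (rank == 0) {
    pthread_barrierattr_t attr;
    pthread_barrierattr_init(&attr);
    pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_barrier_init(&seg->barrier, &attr, worldSize);
  }
  return seg;
}

// Steps a-c from the design above (float, sum).
void shmAllreduce(ShmHeader* seg, int rank, int worldSize,
                  float* data, size_t count, size_t maxCount) {
  // (a) copy this rank's input into its slot of the shared buffer
  std::memcpy(slotPtr(seg, rank, maxCount), data, count * sizeof(float));
  pthread_barrier_wait(&seg->barrier);  // all inputs are now visible

  // (b) sum all slots into slot 0, the reduce buffer (done by rank 0 here;
  //     a real implementation would split this work across ranks)
  if (rank == 0) {
    float* reduceBuf = slotPtr(seg, 0, maxCount);
    for (int r = 1; r < worldSize; ++r) {
      const float* slot = slotPtr(seg, r, maxCount);
      for (size_t i = 0; i < count; ++i) {
        reduceBuf[i] += slot[i];
      }
    }
  }
  pthread_barrier_wait(&seg->barrier);  // reduction finished

  // (c) every rank copies the result back from the shared buffer
  std::memcpy(data, slotPtr(seg, 0, maxCount), count * sizeof(float));
  pthread_barrier_wait(&seg->barrier);  // safe to reuse the buffer afterwards
}
```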

Implementation (Work in progress):
- allreduce: Intra-node shared memory (SHM) optimizations for CPU primitives #458 (under review)
- allgather.
- allgather_into_tensor.