diff --git a/docs/source/1.0/features/parallel-strategy.md b/docs/source/1.0/features/parallel-strategy.md
index 78c5e6d396f..10e556ebc6e 100644
--- a/docs/source/1.0/features/parallel-strategy.md
+++ b/docs/source/1.0/features/parallel-strategy.md
@@ -1,60 +1,107 @@
-
 # Parallelism in TensorRT-LLM
-Parallelism across multiple GPUs is required when either
+Parallelism across multiple GPUs becomes necessary when either
 * the model cannot fit in a single GPU’s memory, or
 * a single GPU cannot deliver the desired performance.
----
+
+TensorRT-LLM supports multiple parallelism strategies for deployment on both single and multiple nodes:
+
+* **Tensor Parallel (TP)** - Shards model weights across GPUs
+* **Pipeline Parallel (PP)** - Distributes model layers across GPUs
+* **Data Parallel (DP)** - Replicates the model across GPUs for different requests
+* **Expert Parallel (EP)** - Distributes experts across GPUs for MoE models
+* **Context Parallel (CP)** - Distributes context processing across GPUs
+* **Wide Expert Parallel (Wide-EP)** - Advanced EP with load balancing for large-scale MoE models
+
+## Overview of Parallelism Strategies
+
+### Tensor Parallelism (TP)
+Tensor parallelism splits the model weights across multiple GPUs. Each GPU holds a portion of the weights and processes the same input tokens, with the results combined through communication.
+
+**Best for:** Small batch sizes, memory-constrained scenarios
+
+### Pipeline Parallelism (PP)
+Pipeline parallelism distributes different layers of the model across multiple GPUs. Each GPU processes a subset of layers, with activations passed between GPUs.
+
+**Best for:** Large models that do not fit in a single GPU's memory
+
+### Data Parallelism (DP)
+Data parallelism replicates the entire model across multiple GPUs. Each GPU processes different requests independently.
+
+**Best for:** Large batch sizes, high-throughput scenarios
+
+### Expert Parallelism (EP)
+Expert parallelism is specifically designed for Mixture of Experts (MoE) models, where different experts are distributed across GPUs.
+
-## Attention Module
+**Best for:** MoE models with a high expert count
+
-TensorRT-LLM currently supports two strategies:
+### Context Parallelism (CP)
+Context parallelism distributes the processing of long sequences across multiple GPUs.
+
-**Tensor Parallelism (TP)** — best for small batch sizes
-**Data Parallelism (DP)** — best for large batch sizes
+**Best for:** Long-context scenarios
+
-### Tensor Parallelism
+### Wide Expert Parallelism (Wide-EP)
+Wide-EP is an advanced form of expert parallelism that addresses the inherent workload imbalance in large-scale MoE models through intelligent load balancing and expert replication.
+
+**Best for:** Large-scale MoE models such as DeepSeek-V3/R1
+
+## Module-Level Parallelism Guide
+
+### Attention Module
+
+TensorRT-LLM supports two strategies for attention modules:
+
+- **Tensor Parallelism (TP)** — best for small batch sizes
+- **Data Parallelism (DP)** — best for large batch sizes
+
+#### Tensor Parallelism (TP)
 * The GEMM weights before and after the attention kernel are evenly sharded across GPUs, as are the attention `num_heads`.
-* Exceptions
+* Exceptions:
   1. **DeepSeek-R1**: the `fused_A` GEMM is *not* sharded.
   2. **GQA / MQA / MLA**: if `num_heads < tensor_parallel_size`, the KV-cache is replicated on every GPU.
-### Data Parallelism
+
+#### Data Parallelism (DP)
 * All GEMM weights are **replicated** on every GPU.
 * The KV-cache is **partitioned**, because different user requests are routed to different DP ranks.
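+
+Regardless of which attention strategy is selected, the resulting `parallel_config.yaml` is consumed the same way by `trtllm-serve` and `trtllm-bench`, as the sections below show. The launch sketch that follows is illustrative only: `--extra_llm_api_options` is assumed to be the flag that carries this file, and the model identifier and dataset path are placeholders, so verify both against each tool's `--help` output.
+
+```bash
+# Sketch: serve a model with the parallel mapping from parallel_config.yaml.
+# The model identifier is a placeholder.
+trtllm-serve <model_or_hf_repo> --extra_llm_api_options parallel_config.yaml
+
+# Sketch: benchmark throughput with the same mapping.
+# The dataset path is a placeholder.
+trtllm-bench --model <model_or_hf_repo> throughput \
+  --dataset dataset.jsonl \
+  --extra_llm_api_options parallel_config.yaml
+```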
-### How to enable
+
+#### How to Enable Attention Parallelism
+
+To serve a model with either attention strategy using `trtllm-serve`, or to benchmark it with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml`:
-```yaml
+
+```bash
+cat <<EOF > parallel_config.yaml
 # TP-8
 tensor_parallel_size: 8
 enable_attention_dp: false # default
 
 # DP-8
 tensor_parallel_size: 8
 enable_attention_dp: true
+EOF
 ```
-## FFN Module
+
+### FFN Module
-### Dense models
+
+#### Dense Models
 Tensor Parallelism is supported for the FFN layers of dense models.
-### Mixture of Experts (MoE)
+
+#### Mixture of Experts (MoE)
 MoE replaces a single FFN with multiple experts. A router selects the top-k experts for each token and dispatches the corresponding hidden states.
-TensorRT-LLM supports three execution patterns: Tensor Parallelism (TP), Expert Parallelism (EP) and Hybrid (TP × EP)
-* **TP** - Every expert’s weight matrix is sliced across all GPUs. Each GPU sees all tokens.
-* **EP** - Full weights of each expert reside on a single GPU. Each GPU only sees tokens routed to its local experts.
+TensorRT-LLM supports three execution patterns for MoE:
+
+* **TP** - Every expert's weight matrix is sliced across all GPUs. Each GPU sees all tokens.
+* **EP** - Full weights of each expert reside on a single GPU. Each GPU only sees tokens routed to its local experts.
 * **Hybrid ETP** - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency.
-#### How to enable
+
+#### How to Enable MoE Parallelism
+
+To deploy a model with one of these MoE patterns using `trtllm-serve`, or to benchmark it with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml` as follows:
-```yaml
+
+```bash
+cat <<EOF > parallel_config.yaml
 # TP only
 tensor_parallel_size: 8
 moe_tensor_parallel_size: 8
@@ -67,14 +114,67 @@ moe_expert_parallel_size: 8
 tensor_parallel_size: 8 # 4 × 2
 moe_tensor_parallel_size: 4
 moe_expert_parallel_size: 2
+EOF
 ```
-
+```{note}
 The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`.
+```
-## Pipeline Parallelism
+
+## Wide Expert Parallelism (Wide-EP)
-Enable pipeline parallelism with:
+
+Wide Expert Parallelism (Wide-EP) is TensorRT-LLM's advanced solution for large-scale MoE model inference. It addresses the challenges of traditional expert parallelism through intelligent load balancing and expert replication strategies.
-```yaml
-pipeline_parallel_size: 4 # example
-```
+
+### Motivation for Wide-EP
+
+Large-scale MoE models such as DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges:
+
+- **High memory demands** for expert weights
+- **Inherent expert-level workload imbalance** due to sparse execution patterns
+- **Communication overhead** in distributed expert parallelism
+- **Hot expert problem**, where certain experts receive significantly more tokens than others
+
+### Key Features of Wide-EP
+
+#### 1. Expert Replication and Load Balancing
+Wide-EP introduces the concept of **expert slots**, which are decoupled from specific experts. This allows:
+- Multiple replicas of hot experts across different GPUs
+- Dynamic expert placement based on workload patterns
+- Both offline and online load-balancing strategies
+
+#### 2. Custom EP Communication Kernels
+- Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
+- Efficient all-to-all communication for expert dispatch and combine
+- Reduced communication overhead compared to traditional EP
+
+#### 3. Expert Parallelism Load Balancer (EPLB)
+- **Offline EPLB**: Pre-computed expert placement based on historical workload statistics
+- **Online EPLB**: Dynamic expert placement that adapts to real-time traffic patterns
+- Layer-wise weight redistribution to minimize inference disruption
+
+### Architecture Overview
+
+Wide-EP separates the concepts of **experts** and **slots**:
+- **Expert**: The concept from the model's perspective (Expert 0, Expert 1, and so on)
+- **Slot**: The concept from the model engine's perspective (Slot 0, Slot 1, and so on)
+
+The system maintains a routing table that maps Expert IDs to Slot IDs, which can be updated by the load-balancing policy.
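+
+In deployment, Wide-EP is driven by the same kind of YAML options file used in the sections above. The sketch below is an illustration only: the field names (`moe_config.backend`, `load_balancer`, `num_slots`, `layer_updates_per_iter`) and the values shown are assumptions based on the Wide-EP examples, so verify them against `examples/wide_ep/` before relying on them.
+
+```bash
+# Hypothetical Wide-EP options sketch; the field names are assumptions,
+# see examples/wide_ep/ for the authoritative configuration.
+cat <<EOF > wide_ep_config.yaml
+moe_config:
+  backend: WIDEEP                         # assumed name of the Wide-EP MoE backend
+  load_balancer: moe_load_balancer.yaml   # assumed pointer to the EPLB settings below
+EOF
+
+cat <<EOF > moe_load_balancer.yaml
+num_slots: 288               # total expert slots; slots beyond the expert count hold replicas of hot experts
+layer_updates_per_iter: 1    # layers whose expert weights are redistributed per iteration (online EPLB)
+EOF
+```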
+
+### Best Practices
+
+1. **Start with offline EPLB** for production deployments with known workload patterns
+2. **Use online EPLB** for dynamic workloads or when traffic patterns change frequently
+3. **Monitor expert statistics** to understand the workload distribution
+4. **Tune `max_num_tokens`** based on your memory constraints and EP size
+5. **Test with representative datasets** to validate load-balancing effectiveness
+
+### References
+
+- [Technical Blog: Scaling Expert Parallelism in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
+- [DeepSeek-V3 Paper](https://arxiv.org/abs/2412.19437)
+- [EPLB Implementation](https://github.com/deepseek-ai/EPLB)
+
+For detailed implementation examples and advanced usage, see:
+
+- [`examples/wide_ep/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/): Complete Wide-EP examples
+- [`examples/wide_ep/ep_load_balancer/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer/): Load balancing tools
+- [`examples/wide_ep/slurm_scripts/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts/): Cluster deployment scripts