# Fused MoE Modular Kernel

## Introduction

`FusedMoEModularKernel` is implemented [here](gh-file:vllm/model_executor/layers/fused_moe/modular_kernel.py).
Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types:

* Contiguous / Standard / Non-Batched, and
* Batched

!!! note
    The terms Contiguous, Standard, and Non-Batched are used interchangeably throughout this document.

The input activation format depends entirely on the All2All Dispatch being used.

* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K), along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example.
* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example.
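
To make the Batched layout concrete, here is a small pure-Python sketch (illustrative only, not vLLM code; `batch_by_expert` is a made-up helper) that groups (M, K) tokens into the (num_experts, max_tokens, K) layout together with `expert_num_tokens`:

```python
def batch_by_expert(tokens, topk_ids, num_experts, max_tokens):
    """Group (M, K) tokens into a (num_experts, max_tokens, K) buffer."""
    K = len(tokens[0])
    # Unused slots stay zero-filled; only the first expert_num_tokens[e]
    # rows of batched[e] are valid.
    batched = [[[0.0] * K for _ in range(max_tokens)] for _ in range(num_experts)]
    expert_num_tokens = [0] * num_experts
    for tok, experts in zip(tokens, topk_ids):
        for e in experts:  # a token is replicated once per chosen expert
            slot = expert_num_tokens[e]
            batched[e][slot] = list(tok)
            expert_num_tokens[e] += 1
    return batched, expert_num_tokens

# 3 tokens (M=3, K=2), top-2 experts each, 4 experts total
tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
topk_ids = [[0, 1], [1, 2], [0, 3]]
batched, counts = batch_by_expert(tokens, topk_ids, num_experts=4, max_tokens=4)
print(counts)  # [2, 2, 1, 1]
```

A real Batched dispatch does this movement across ranks with All2All kernels; the sketch only shows the destination layout.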

The FusedMoE operation is generally composed of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below.

![](../assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png "FusedMoE Non-Batched")

![](../assets/design/fused_moe_modular_kernel/fused_moe_batched.png "FusedMoE Batched")

!!! note
    The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain the same.

## Motivation

As can be seen from the diagrams, there are many operations, and each operation can have a variety of implementations. The number of ways the operations can be put together into a valid FusedMoE implementation quickly becomes intractable. The Modular Kernel framework addresses this issue by grouping the operations into logical components. This broad categorization makes the combinations manageable and prevents code duplication. It also decouples the All2All Dispatch & Combine implementations from the FusedMoE implementations, allowing them to be developed and tested independently. Furthermore, the Modular Kernel framework introduces abstract classes for the different components, thus providing a well-defined skeleton for future implementations.

The rest of the document focuses on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straightforward.

## ModularKernel Components

`FusedMoEModularKernel` splits the FusedMoE operation into 3 parts:

1. `TopKWeightAndReduce`
2. `FusedMoEPrepareAndFinalize`
3. `FusedMoEPermuteExpertsUnpermute`
### TopKWeightAndReduce

The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in `FusedMoEPermuteExpertsUnpermute`, but some implementations choose to do it in `FusedMoEPrepareAndFinalize`. To enable this flexibility, we have the `TopKWeightAndReduce` abstract class.

Please find the implementations of `TopKWeightAndReduce` [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).

The `FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
`FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.

* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction.
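
As a concrete sketch of this contract, the pure-Python stand-ins below mirror the class names from the doc but operate on nested lists instead of tensors (the bodies are illustrative, not vLLM code):

```python
class TopKWeightAndReduceNoOp:
    """Returned when the experts kernel already applied the TopK weights
    and reduced across the topk dimension."""

    def apply(self, out, topk_weights):
        return out


class TopKWeightAndReduceContiguous:
    """Weights each per-expert output and reduces across the topk dim.
    out[m][j][k] is the j-th chosen expert's output for token m."""

    def apply(self, out, topk_weights):
        M, topk, K = len(out), len(out[0]), len(out[0][0])
        return [
            [sum(topk_weights[m][j] * out[m][j][k] for j in range(topk))
             for k in range(K)]
            for m in range(M)
        ]


def finalize(fe_out, topk_weights, weight_and_reduce):
    # Sketch of the hand-off inside FusedMoEPrepareAndFinalize::finalize():
    # whichever TopKWeightAndReduce object the experts implementation
    # returned is simply invoked here.
    return weight_and_reduce.apply(fe_out, topk_weights)


# One token (M=1, K=2), top-2 experts with outputs [1, 1] and [3, 3]:
fe_out = [[[1.0, 1.0], [3.0, 3.0]]]
weights = [[0.5, 0.5]]
print(finalize(fe_out, weights, TopKWeightAndReduceContiguous()))  # [[2.0, 2.0]]
```

Passing `TopKWeightAndReduceNoOp()` instead would return `fe_out` unchanged, which is the behavior the experts implementation requests when it has already applied the weights itself.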

### FusedMoEPrepareAndFinalize

The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and All2All Dispatch. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).

![](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png "FusedMoEPrepareAndFinalize Blocks")
57+
58+
### FusedMoEPermuteExpertsUnpermute
59+
The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happen. The `FusedMoEPermuteExpertsUnpermute` abstract class exposes a few important functions,
60+
61+
* apply()
62+
* workspace_shapes()
63+
* finalize_weight_and_reduce_impl()
64+
65+
#### apply()
66+
The `apply` method is where the implementations perform
67+
68+
* Permute
69+
* Matmul with weight W1
70+
* Act + Mul
71+
* Quantization
72+
* Matmul with weight W2
73+
* Unpermute
74+
* Maybe TopK Weight Application + Reduction

#### workspace_shapes()

The core FusedMoE implementation performs a series of operations. It would be inefficient to allocate output memory for each of these operations separately. To that end, implementations are required to declare 2 workspace shapes, the workspace datatype, and the FusedMoE output shape as outputs of the `workspace_shapes()` method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and is passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces can then be used as intermediate buffers in the FusedMoE implementation.
78+
79+
#### finalize_weight_and_reduce_impl()
80+
It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
81+
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use.
82+
83+
![](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png "FusedMoEPermuteExpertsUnpermute Blocks")
84+
85+
### FusedMoEModularKernel
86+
`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects.
87+
`FusedMoEModularKernel` pseudocode/sketch,
88+
89+
```python
class FusedMoEModularKernel:

    def __init__(self,
                 prepare_finalize: FusedMoEPrepareAndFinalize,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts

    def forward(self, DP_A):
        # quantize + All2All Dispatch
        Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...)

        workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...)

        # allocate workspaces
        workspace13 = torch.empty(workspace13_shape, ...)
        workspace2 = torch.empty(workspace2_shape, ...)

        # execute fused_experts
        fe_out = self.fused_experts.apply(Aq, A_scale, workspace13, workspace2, ...)

        # war_impl is an object of type TopKWeightAndReduceNoOp if the
        # fused_experts implementation performs the TopK Weight Application
        # and Reduction itself.
        war_impl = self.fused_experts.finalize_weight_and_reduce_impl()

        # maybe TopK Weight Application + Reduction, then All2All Combine
        output = self.prepare_finalize.finalize(fe_out, war_impl, ...)

        return output
```
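
To make the control flow concrete, the sketch below composes stub components end-to-end. It is pure Python with hypothetical stand-in classes (no vLLM, no torch, no quantization or All2All); only the call sequence mirrors the pseudocode above:

```python
class NoOpPrepareAndFinalize:
    """Stand-in for FusedMoEPrepareAndFinalize on a single rank:
    no quantization and a loopback 'dispatch'/'combine'."""

    def prepare(self, A):
        return A, None  # (Aq, A_scale): no real quantization here

    def finalize(self, fe_out, war_impl):
        return war_impl.apply(fe_out)


class NoOpWeightAndReduce:
    def apply(self, out):
        return out


class ScaleExperts:
    """Stand-in for FusedMoEPermuteExpertsUnpermute: 'experts' that just
    scale each value, with weights already applied and reduced."""

    def apply(self, Aq, A_scale):
        return [2.0 * x for x in Aq]

    def finalize_weight_and_reduce_impl(self):
        return NoOpWeightAndReduce()  # experts already did weight + reduce


class ToyModularKernel:
    def __init__(self, prepare_finalize, fused_experts):
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts

    def forward(self, DP_A):
        Aq, A_scale = self.prepare_finalize.prepare(DP_A)
        fe_out = self.fused_experts.apply(Aq, A_scale)
        war_impl = self.fused_experts.finalize_weight_and_reduce_impl()
        return self.prepare_finalize.finalize(fe_out, war_impl)


mk = ToyModularKernel(NoOpPrepareAndFinalize(), ScaleExperts())
print(mk.forward([1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```

Swapping in a different prepare/finalize or experts stub changes the behavior without touching `ToyModularKernel`, which is the decoupling the framework is after.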

## How-To

### How To Add a FusedMoEPrepareAndFinalize Type

Typically a `FusedMoEPrepareAndFinalize` type is backed by an All2All Dispatch & Combine implementation / kernel. For example,

* `PplxPrepareAndFinalize` is backed by the Pplx All2All kernels,
* `DeepEPHTPrepareAndFinalize` is backed by the DeepEP High-Throughput All2All kernels, and
* `DeepEPLLPrepareAndFinalize` is backed by the DeepEP Low-Latency All2All kernels.

#### Step 1: Add an All2All manager

The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).

#### Step 2: Add a FusedMoEPrepareAndFinalize Type

This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class.

`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.

`FusedMoEPrepareAndFinalize::finalize()`: Maybe perform TopK Weight Application and Reduction, and perform the All2All Combine. Typically the Combine function from the relevant All2All Manager is invoked.

`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.

`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoE::select_experts` function so it can be respected. If there are no strict requirements, return None.

`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: The maximum number of tokens that would be submitted to the All2All Dispatch at once.

`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K), where max_num_tokens = num_dispatchers() * max_num_tokens_per_rank().

We suggest picking an existing `FusedMoEPrepareAndFinalize` implementation that closely matches your All2All implementation and using it as a reference.
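
The methods above can be sketched as a minimal skeleton. Everything below is hypothetical (the handle class, the string stand-in for `FusedMoEActivationFormat`, all bodies); it only illustrates the shape of the interface:

```python
class MyAll2AllHandle:
    """Hypothetical single-rank loopback handle; a real one would come
    from an All2All Manager and move tokens across ranks."""

    def dispatch(self, Aq, A_scale, topk_ids):
        return Aq, A_scale

    def combine(self, out):
        return out


class MyPrepareAndFinalize:
    """Skeleton of a FusedMoEPrepareAndFinalize-style type."""

    def __init__(self, handle, max_tokens_per_rank, num_ranks):
        self.handle = handle
        self._max_tokens_per_rank = max_tokens_per_rank
        self._num_ranks = num_ranks

    def activation_format(self):
        return "Standard"  # stand-in for FusedMoEActivationFormat.Standard

    def topk_indices_dtype(self):
        return None  # no strict dtype requirement on TopK ids

    def max_num_tokens_per_rank(self):
        return self._max_tokens_per_rank

    def num_dispatchers(self):
        return self._num_ranks

    def prepare(self, A, topk_ids):
        Aq, A_scale = A, None  # quantization elided in this sketch
        return self.handle.dispatch(Aq, A_scale, topk_ids)

    def finalize(self, fe_out, war_impl):
        # maybe apply TopK weights + reduce, then All2All Combine
        return self.handle.combine(war_impl.apply(fe_out))


pf = MyPrepareAndFinalize(MyAll2AllHandle(), max_tokens_per_rank=64, num_ranks=4)
# max_num_tokens for the (num_local_experts, max_num_tokens, K) dispatch output
max_num_tokens = pf.num_dispatchers() * pf.max_num_tokens_per_rank()
print(max_num_tokens)  # 256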

### How To Add a FusedMoEPermuteExpertsUnpermute Type

`FusedMoEPermuteExpertsUnpermute` performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows:

`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats, i.e. Contiguous / Batched format.

`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically implementations whose input format is `FusedMoEActivationFormat.Standard` support chunking and those with `FusedMoEActivationFormat.BatchedExperts` do not.

`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert maps.

`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` /
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` /
`FusedMoEPermuteExpertsUnpermute::apply()`: Refer to the `FusedMoEPermuteExpertsUnpermute` section above.
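
A matching skeleton for an experts type (a pure-Python stand-in with made-up method bodies and workspace shapes; the real `apply()` runs the permute / matmul / activation / matmul / unpermute pipeline on GPU tensors):

```python
class MyExperts:
    """Skeleton of a FusedMoEPermuteExpertsUnpermute-style type.
    All bodies and shapes below are illustrative stand-ins."""

    def activation_formats(self):
        # (input format, output format)
        return ("Standard", "Standard")

    def supports_chunking(self):
        return True  # Standard-format inputs typically support chunking

    def supports_expert_map(self):
        return False

    def workspace_shapes(self, M, N, K, topk):
        # Illustrative shapes only: two reusable intermediate buffers,
        # the final output shape, and a workspace dtype.
        workspace13 = (M, topk, max(N, K))
        workspace2 = (M, topk, N // 2)
        output = (M, K)
        return workspace13, workspace2, output, "float16"

    def finalize_weight_and_reduce_impl(self):
        return "TopKWeightAndReduceNoOp"  # weights applied inside apply()

    def apply(self, Aq, A_scale, workspace13, workspace2):
        # A real implementation runs: permute -> matmul(W1) -> act + mul
        # -> quantize -> matmul(W2) -> unpermute (+ maybe weight + reduce).
        raise NotImplementedError


e = MyExperts()
print(e.workspace_shapes(M=8, N=32, K=16, topk=2))
```

The modular kernel would call `workspace_shapes()` once per forward, allocate the two buffers, and hand them to `apply()` as scratch space.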

### FusedMoEModularKernel Initialization

The `FusedMoEMethodBase` class has 2 methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are:

* `select_gemm_impl`, and
* `init_prepare_finalize`

#### select_gemm_impl

The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
Please refer to the implementations in the following derived classes:

* `UnquantizedFusedMoEMethod`
* `CompressedTensorsW8A8Fp8MoEMethod`
* `CompressedTensorsW8A8Fp8MoECutlassMethod`
* `Fp8MoEMethod`
* `ModelOptNvFp4FusedMoE`

#### init_prepare_finalize

Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object.

Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).

**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to which FusedMoE implementation is used.
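
A minimal sketch of this flow, with all selection logic elided and stand-in values everywhere (nothing below is vLLM code; see the linked `layer.py` for the real logic):

```python
class FusedMoEModularKernel:
    """Two-field stand-in for the real modular kernel."""

    def __init__(self, prepare_finalize, fused_experts):
        self.prepare_finalize = prepare_finalize
        self.fused_experts = fused_experts


class MyMoEMethod:
    """Stand-in for a FusedMoEMethodBase-derived class."""

    def __init__(self):
        self.fused_experts = "default_fused_experts_fn"  # fallback impl

    def select_gemm_impl(self, prepare_finalize):
        # a real derived class builds a FusedMoEPermuteExpertsUnpermute here
        return "my_experts_impl"

    def init_prepare_finalize(self):
        # 1. pick a FusedMoEPrepareAndFinalize from input/env settings
        #    (selection logic elided; may be None when settings don't permit)
        prepare_finalize = "my_prepare_finalize"
        if prepare_finalize is not None:
            # 2. query the derived class for the experts implementation
            experts = self.select_gemm_impl(prepare_finalize)
            # 3. compose, and override fused_experts so apply() uses it
            self.fused_experts = FusedMoEModularKernel(prepare_finalize, experts)


m = MyMoEMethod()
m.init_prepare_finalize()
print(type(m.fused_experts).__name__)  # FusedMoEModularKernel
```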

### How To Unit Test

We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).

The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and, if they are compatible, runs some correctness tests.
If you are adding a `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementation:

1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` or `MK_FUSED_EXPERT_TYPES`, respectively, in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py).
2. Update the `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`, `Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`, and `Config::is_fe_supports_chunking()` methods in [tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py).

Doing this will add the new implementation to the test suite.

### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility

The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.

Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`

As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked with incompatible types, the script will error out.
### How To Profile
205+
Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
206+
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
207+
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
208+
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
209+
210+
## FusedMoEPrepareAndFinalize Implementations
211+
The following table lists the `FusedMoEPrepareAndFinalize` implementations at the time of writing,
212+
213+
| Implementation | Type | Comments |
214+
| :--- | :--- | :--- |
215+
| DeepEPHTPrepareAndFinalize | Contiguous / Non-Batched | Uses the DeepEP High-Throughput all2all kernels. |
216+
| DeepEPLLPrepareAndFinalize | Batched | Uses the DeepEP Low-Latency all2all kernels. |
217+
| PplxPrepareAndFinalize | Batched | Uses the Perplexity all2all kernels. |
218+
| FlashInferCutlassMoEPrepareAndFinalize | Contiguous | |
219+
| MoEPrepareAndFinalizeNoEP | Contiguous | This implementation is used when there is no EP. i.e. no all2all kernels are invoked. |
220+
| BatchedPrepareAndFinalize | Batched | A reference prepare/finalize class that reorganizes the tokens into expert batched format, i.e. E x max_num_tokens x K. (Doesn’t use any all2all kernels. This is primarily used in unit testing) |
221+
222+
## FusedMoEPermuteExpertsUnpermute
223+
The following table lists the `FusedMoEPermuteExpertsUnpermute` implementations at the time of writing,
224+
225+
| Implementation | Type | Comment |
226+
| :--- | :--- | :--- |
227+
| BatchedDeepGemmExperts | Batched | Uses the DeepGemm’s Masked Grouped Gemm kernels for the fused_moe operation. |
228+
| BatchedTritonExperts | Batched | Uses a Triton Kernel for the Batched matmuls. |
229+
| BatchedTritonOrDeepGemmExperts | Batched | Chooses either the `BatchedDeepGemmExperts` or `BatchedTritonExperts` based on environment settings. |
230+
| DeepGemmExperts | Contiguous / Non-Batched | Uses DeepGemm’s Grouped Gemm kernels for fused_moe operation. |
231+
| TritonExperts | Contiguous / Non-Batched | Uses a Triton Kernel for fused_moe matmuls. |
232+
| TritonOrDeepGemmExperts | Contiguous / Non-Batched | Chooses either the `DeepGemmExperts` or `TritonExperts` based on fused_moe inputs. |
233+
| CutlassExpertsFP8 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp8 matmuls. |
234+
| CutlassExpertsFP4 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp4 matmuls. |
235+
| FlashInferExperts | Contiguous | Uses fused_moe operation from FlashInfer |
236+
| NaiveBatchedExperts | Batched | Reference Batched Experts implementation. Primarily used in unit tests. |
