# Fused MoE Modular Kernel

## Introduction
FusedMoEModularKernel is implemented [here](gh-file:vllm/model_executor/layers/fused_moe/modular_kernel.py).

Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types:

* Contiguous / Standard / Non-Batched, and
* Batched

!!! note
    The terms Contiguous, Standard, and Non-Batched are used interchangeably throughout the document.

The input activation format depends entirely on the All2All Dispatch being used.

* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K), along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example.
* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example. The sketch after this list illustrates the two layouts.
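
For illustration only, here is a minimal sketch (with made-up sizes) of the tensor shapes the two formats imply; the actual tensors are produced by the All2All Dispatch described above.

```python
import torch

M, K, num_topk = 8, 16, 2            # tokens, hidden size, experts chosen per token
num_experts, max_tokens = 4, 16      # experts and per-expert row budget (illustrative)

# Contiguous / Standard format: one row per token.
hidden_states = torch.randn(M, K)                                # (M, K)
topk_ids = torch.randint(0, num_experts, (M, num_topk))          # (M, num_topk)
topk_weights = torch.rand(M, num_topk)                           # (M, num_topk)

# Batched format: rows grouped by expert, padded up to max_tokens per expert.
batched_states = torch.zeros(num_experts, max_tokens, K)         # (num_experts, max_tokens, K)
expert_num_tokens = torch.zeros(num_experts, dtype=torch.int32)  # valid rows per expert
```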

The FusedMoE operation is generally made up of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below.

!!! note
    The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain the same.
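
As a rough illustration of what the Permute step does (a sketch, not the actual kernel code): each token is replicated once per chosen expert and the replicas are sorted by expert id, so that rows destined for the same expert become contiguous.

```python
import torch

M, num_topk, num_experts = 4, 2, 3
topk_ids = torch.randint(0, num_experts, (M, num_topk))

# Sort the (token, chosen-expert) pairs by expert id; rows that go to the same
# expert become contiguous, which is what the grouped GEMMs expect.
flat_expert_ids = topk_ids.reshape(-1)                     # (M * num_topk,)
permutation = torch.argsort(flat_expert_ids, stable=True)  # row order after Permute
src_token_of_row = permutation // num_topk                 # which input token each permuted row copies

# Unpermute is the inverse mapping: expert outputs are scattered back to
# (token, chosen-expert) order before the TopK weight application and reduction.
```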

## Motivation

As the diagrams show, a FusedMoE implementation involves many operations, and each operation can have several implementations. The number of ways these operations can be combined into a valid FusedMoE implementation quickly becomes intractable. The Modular Kernel framework addresses this issue by grouping the operations into logical components. This broad categorization makes the combinations manageable and prevents code duplication. It also decouples the All2All Dispatch & Combine implementations from the FusedMoE implementations, allowing them to be developed and tested independently. Furthermore, the Modular Kernel framework introduces abstract classes for the different components, thus providing a well-defined skeleton for future implementations.

The rest of the document focuses on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straightforward.

## ModularKernel Components
FusedMoEModularKernel splits the FusedMoE operation into 3 parts:

1. TopKWeightAndReduce
2. FusedMoEPrepareAndFinalize
3. FusedMoEPermuteExpertsUnpermute

### TopKWeightAndReduce
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`, but some implementations choose to do it in `FusedMoEPrepareAndFinalize` instead. In order to enable this flexibility, we have a `TopKWeightAndReduce` abstract class.

Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).

The `FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens:

* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction.
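
For clarity, here is a small sketch of what "TopK Weight Application and Reduction" computes, regardless of which component performs it (illustrative shapes only):

```python
import torch

M, K, num_topk = 4, 8, 2
# Per (token, chosen-expert) outputs after the Unpermute step.
expert_outputs = torch.randn(M, num_topk, K)
topk_weights = torch.rand(M, num_topk)

# Weight application: scale each chosen expert's output by its routing weight.
# Reduction: sum over the num_topk experts chosen for each token.
out = (expert_outputs * topk_weights.unsqueeze(-1)).sum(dim=1)   # (M, K)
```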

### FusedMoEPrepareAndFinalize
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and the All2All Dispatch. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).

### FusedMoEPermuteExpertsUnpermute
The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operation happens. The abstract class exposes a few important functions:

* apply()
* workspace_shapes()
* finalize_weight_and_reduce_impl()

#### apply()
The `apply` method is where the implementations perform the following steps (a naive reference sketch follows the list):

* Permute
* Matmul with weight W1
* Act + Mul
* Quantization
* Matmul with weight W2
* Unpermute
* Maybe TopK Weight Application + Reduction
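
Below is a naive, purely illustrative reference for these steps, assuming unquantized SiLU-gated (SwiGLU) experts with weights `w1` of shape (E, 2N, K) and `w2` of shape (E, K, N). Real implementations permute tokens by expert and use grouped/batched GEMMs instead of this per-token loop.

```python
import torch
import torch.nn.functional as F

def naive_experts_reference(a, w1, w2, topk_weights, topk_ids):
    """a: (M, K), w1: (E, 2N, K), w2: (E, K, N), topk_*: (M, num_topk)."""
    M, K = a.shape
    N = w1.shape[1] // 2
    out = torch.zeros(M, K, dtype=a.dtype, device=a.device)
    for m in range(M):
        for weight, e in zip(topk_weights[m], topk_ids[m]):
            gate_up = a[m] @ w1[e].t()               # Matmul with weight W1 -> (2N,)
            h = F.silu(gate_up[:N]) * gate_up[N:]    # Act + Mul
            y = h @ w2[e].t()                        # Matmul with weight W2 -> (K,)
            out[m] += weight * y                     # TopK weight application + reduction
    return out
```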

#### workspace_shapes()
The core FusedMoE implementation performs a series of operations. It would be inefficient to allocate output memory for each of these operations separately. Instead, implementations are required to declare two workspace shapes, the workspace datatype, and the FusedMoE output shape as outputs of the `workspace_shapes()` method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and is passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces can then be used as intermediate buffers in the FusedMoE implementation.
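
A hedged sketch of the idea follows; the real `workspace_shapes()` signature and return values are defined in modular_kernel.py, and the shapes below are hypothetical.

```python
import torch

def workspace_shapes(M, N, K, num_topk):
    # Hypothetical shapes: one larger buffer shared by the two GEMM outputs, a smaller
    # buffer for the Act + Mul output, plus the final output shape and workspace dtype.
    workspace13_shape = (M, num_topk, max(2 * N, K))
    workspace2_shape = (M, num_topk, N)
    output_shape = (M, K)
    return workspace13_shape, workspace2_shape, output_shape, torch.bfloat16

# FusedMoEModularKernel::forward() allocates the buffers once and hands them to apply().
ws13_shape, ws2_shape, out_shape, dtype = workspace_shapes(M=8, N=32, K=16, num_topk=2)
workspace13 = torch.empty(ws13_shape, dtype=dtype)
workspace2 = torch.empty(ws2_shape, dtype=dtype)
output = torch.empty(out_shape, dtype=dtype)
```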

#### finalize_weight_and_reduce_impl()
It is sometimes efficient to perform the TopK weight application and Reduction inside `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants `FusedMoEPrepareAndFinalize::finalize()` to use.

### FusedMoEModularKernel
`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects.
`FusedMoEModularKernel` pseudocode/sketch:

```
FusedMoEModularKernel::__init__(self,
            prepare_finalize: FusedMoEPrepareAndFinalize,
            fused_experts: FusedMoEPermuteExpertsUnpermute):

    self.prepare_finalize = prepare_finalize
    self.fused_experts = fused_experts

FusedMoEModularKernel::forward(self, DP_A):

    Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...)

    workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...)

    # allocate workspaces
    workspace13 = torch.empty(workspace13_shape, ...)
    workspace2 = torch.empty(workspace2_shape, ...)

    # execute fused_experts
    fe_out = self.fused_experts.apply(Aq, A_scale, workspace13, workspace2, ...)

    # war_impl is an object of type TopKWeightAndReduceNoOp if the fused_experts
    # implementation performs the TopK Weight Application and Reduction itself.
    war_impl = self.fused_experts.finalize_weight_and_reduce_impl()

    output = self.prepare_finalize.finalize(fe_out, war_impl, ...)

    return output
```

## How-To

### How To Add a FusedMoEPrepareAndFinalize Type
Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example,

* PplxPrepareAndFinalize is backed by the Pplx All2All kernels,
* DeepEPHTPrepareAndFinalize is backed by the DeepEP High-Throughput All2All kernels, and
* DeepEPLLPrepareAndFinalize is backed by the DeepEP Low-Latency All2All kernels.

#### Step 1: Add an All2All manager
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager and use it to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).

#### Step 2: Add a FusedMoEPrepareAndFinalize Type
This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class. A minimal skeleton follows the descriptions.

`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.

`FusedMoEPrepareAndFinalize::finalize()`: May perform the TopK Weight Application and Reduction, and invokes the All2All Combine. Typically the Combine function from the relevant All2All Manager is invoked.

`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.

`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoE::select_experts` function so that it can be respected. If there are no strict requirements, return None.

`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: The maximum number of tokens that would be submitted to the All2All Dispatch at once.

`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K), where max_num_tokens = num_dispatchers() * max_num_tokens_per_rank().
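
The skeleton below is purely illustrative: the method names follow the descriptions above, but the signatures are simplified and do not match the real abstract class in modular_kernel.py (which additionally deals with quantization configuration, expert maps, and so on).

```python
import torch

class MyPrepareAndFinalize:          # illustrative; the real class derives from FusedMoEPrepareAndFinalize
    def activation_format(self):
        # "standard" here; report the batched format if the dispatch output is batched.
        return "standard"

    def topk_indices_dtype(self):
        return torch.int32           # or None if the all2all kernel imposes no requirement

    def max_num_tokens_per_rank(self):
        return 1024                  # upper bound on tokens submitted to one Dispatch

    def num_dispatchers(self):
        return 1

    def prepare(self, a, topk_weights, topk_ids):
        # 1. Quantize the input activations. 2. Invoke the All2All Dispatch handle.
        aq, a_scale = a, None        # placeholder: no quantization, no dispatch
        return aq, a_scale, topk_weights, topk_ids

    def finalize(self, output, fused_expert_output, topk_weights, topk_ids, weight_and_reduce):
        # Optionally apply TopK weights / reduce (as directed by weight_and_reduce),
        # then invoke the All2All Combine handle.
        output.copy_(fused_expert_output)
```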

We suggest picking an existing `FusedMoEPrepareAndFinalize` implementation that closely matches your All2All implementation and using it as a reference.

### How To Add a FusedMoEPermuteExpertsUnpermute Type
FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows. A minimal skeleton follows the descriptions.

`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats, i.e. Contiguous / Batched format.

`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically implementations that take `FusedMoEActivationFormat.Standard` inputs support chunking, and `FusedMoEActivationFormat.BatchedExperts` implementations do not.

`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert maps.

`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` /
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` /
`FusedMoEPermuteExpertsUnpermute::apply()`: Refer to the `FusedMoEPermuteExpertsUnpermute` section above.
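
Again, the skeleton below is purely illustrative: the method names follow this document, the signatures are simplified, and the returned values are placeholders rather than the real types used in modular_kernel.py.

```python
import torch

class MyExperts:                     # illustrative; the real class derives from FusedMoEPermuteExpertsUnpermute
    def activation_formats(self):
        # (input activation format, output activation format)
        return ("standard", "standard")

    def supports_chunking(self):
        return True                  # Standard-format implementations typically do

    def supports_expert_map(self):
        return True

    def workspace_shapes(self, M, N, K, num_topk):
        # Two workspace shapes, the output shape, and the workspace dtype (hypothetical values).
        return (M, num_topk, max(2 * N, K)), (M, num_topk, N), (M, K), torch.bfloat16

    def finalize_weight_and_reduce_impl(self):
        # Return the TopKWeightAndReduce object that finalize() should use; e.g. a no-op
        # object if apply() already applied the TopK weights and reduced.
        return None                  # placeholder

    def apply(self, output, aq, a_scale, w1, w2, topk_weights, topk_ids, workspace13, workspace2):
        # Permute -> GEMM(W1) -> Act + Mul -> (quantize) -> GEMM(W2) -> Unpermute,
        # writing intermediates into the preallocated workspaces.
        raise NotImplementedError
```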

### FusedMoEModularKernel Initialization
The `FusedMoEMethodBase` class has 2 methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are,

* select_gemm_impl, and
* init_prepare_finalize

#### select_gemm_impl
The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
Please refer to the implementations in the following derived classes,

* `UnquantizedFusedMoEMethod`
* `CompressedTensorsW8A8Fp8MoEMethod`
* `CompressedTensorsW8A8Fp8MoECutlassMethod`
* `Fp8MoEMethod`
* `ModelOptNvFp4FusedMoE`

#### init_prepare_finalize
Based on the input and environment settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object.

Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).

**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to which FusedMoE implementation is used.

### How To Unit Test
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).

The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and, if they are compatible, runs some correctness tests.
If you are adding a `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementation,

1. Add the new `FusedMoEPrepareAndFinalize` or `FusedMoEPermuteExpertsUnpermute` type to `MK_ALL_PREPARE_FINALIZE_TYPES` or `MK_FUSED_EXPERT_TYPES`, respectively, in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py).
2. Update the `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`, and
`Config::is_fe_supports_chunking()` methods in [tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py).

Doing this will add the new implementation to the test suite.

### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.

Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`

As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked with incompatible types, the script will raise an error.

### How To Profile
Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py).
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.

Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`

## FusedMoEPrepareAndFinalize Implementations
The following table lists the `FusedMoEPrepareAndFinalize` implementations at the time of writing,

| Implementation | Type | Comments |
| :--- | :--- | :--- |
| DeepEPHTPrepareAndFinalize | Contiguous / Non-Batched | Uses the DeepEP High-Throughput all2all kernels. |
| DeepEPLLPrepareAndFinalize | Batched | Uses the DeepEP Low-Latency all2all kernels. |
| PplxPrepareAndFinalize | Batched | Uses the Perplexity all2all kernels. |
| FlashInferCutlassMoEPrepareAndFinalize | Contiguous | |
| MoEPrepareAndFinalizeNoEP | Contiguous | This implementation is used when there is no EP, i.e. no all2all kernels are invoked. |
| BatchedPrepareAndFinalize | Batched | A reference prepare/finalize class that reorganizes the tokens into expert-batched format, i.e. E x max_num_tokens x K. (Doesn't use any all2all kernels; primarily used in unit testing.) |

## FusedMoEPermuteExpertsUnpermute Implementations
The following table lists the `FusedMoEPermuteExpertsUnpermute` implementations at the time of writing,

| Implementation | Type | Comment |
| :--- | :--- | :--- |
| BatchedDeepGemmExperts | Batched | Uses DeepGemm's Masked Grouped Gemm kernels for the fused_moe operation. |
| BatchedTritonExperts | Batched | Uses a Triton kernel for the Batched matmuls. |
| BatchedTritonOrDeepGemmExperts | Batched | Chooses either `BatchedDeepGemmExperts` or `BatchedTritonExperts` based on environment settings. |
| DeepGemmExperts | Contiguous / Non-Batched | Uses DeepGemm's Grouped Gemm kernels for the fused_moe operation. |
| TritonExperts | Contiguous / Non-Batched | Uses a Triton kernel for the fused_moe matmuls. |
| TritonOrDeepGemmExperts | Contiguous / Non-Batched | Chooses either `DeepGemmExperts` or `TritonExperts` based on the fused_moe inputs. |
| CutlassExpertsFP8 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp8 matmuls. |
| CutlassExpertsFP4 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp4 matmuls. |
| FlashInferExperts | Contiguous | Uses the fused_moe operation from FlashInfer. |
| NaiveBatchedExperts | Batched | Reference Batched Experts implementation. Primarily used in unit tests. |