Conversation
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Pull request overview
This PR adds EPLB (Expert Parallel Load Balancing) support by introducing two new SYCL kernels (init_expert_map and remap_hidden_states) that replace the existing fused_moe_prologue operation. The new ops support expert_map for expert parallel deployment and offer better performance in both prefill and decode stages.
Changes:
- Added `init_expert_map` kernel to initialize expert-to-rank mapping and `remap_hidden_states` kernel to permute hidden states according to expert assignments.
- Replaced the `fused_moe_prologue` workspace-based approach in `xpu_fused_moe` with the new ops, adding `expert_map` parameter support.
- Updated `moe_gather` to use the new row-major `unpermuted_row_to_permuted_row` layout `[num_rows, TOPK]` instead of `[TOPK, num_rows]`.
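The exact semantics of `init_expert_map` are not shown in this summary. The pure-Python sketch below illustrates one plausible form of the expert-to-rank mapping it could compute, assuming experts are partitioned evenly across EP ranks and that experts hosted on other ranks are marked with `-1` (both conventions are assumptions, not taken from the PR).

```python
def init_expert_map(num_global_experts, ep_rank, ep_size):
    """Sketch of an expert map for expert parallelism (semantics assumed).

    Experts are split evenly across ep_size ranks; the map sends a
    global expert id to its local id on this rank, or -1 if the
    expert lives on another rank.
    """
    assert num_global_experts % ep_size == 0
    num_local = num_global_experts // ep_size
    start = ep_rank * num_local
    return [
        (e - start) if start <= e < start + num_local else -1
        for e in range(num_global_experts)
    ]

# Example: 8 experts across 2 ranks; rank 1 hosts global experts 4..7.
print(init_expert_map(8, 1, 2))  # [-1, -1, -1, -1, 0, 1, 2, 3]
```

The real kernel presumably writes this table on-device once per EP configuration, so routing can translate global expert ids to local ones without host round-trips.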
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `vllm_xpu_kernels/fused_moe_interface.py` | Replaces `fused_moe_prologue` with `init_expert_map` and `remap_hidden_states` calls; adds `expert_map` parameter |
| `csrc/moe/init_expert_map.cpp` | New SYCL kernel to initialize the expert map based on EP rank/size |
| `csrc/moe/remap_hidden_states.cpp` | New SYCL kernel to remap hidden states, compute expert offsets, and build permutation maps |
| `csrc/moe/moe_ops.h` | Declares the two new C++ functions |
| `csrc/moe/torch_bindings.cpp` | Registers the new ops with PyTorch |
| `csrc/moe/moe_gather.cpp` | Updates indexing to match the new `[num_rows, TOPK]` layout of `unpermuted_row_to_permuted_row` |
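The `moe_gather` change swaps `unpermuted_row_to_permuted_row` from a `[TOPK, num_rows]` to a row-major `[num_rows, TOPK]` layout, so entry `(row, k)` is addressed as `row * TOPK + k` instead of `k * num_rows + row`. The sketch below shows how such a map could be built from per-row top-k expert ids; the stable sort-by-expert and the `-1` handling of the real kernel are assumptions for illustration.

```python
def build_row_map(topk_ids):
    """Sketch: build the row-major unpermuted_row_to_permuted_row map
    (layout [num_rows, TOPK], flattened), semantics assumed.

    topk_ids: per-row expert ids, shape [num_rows][TOPK].
    Each (row, k) pair is stably sorted by expert id to get its
    destination slot in the permuted hidden-state buffer.
    """
    num_rows, topk = len(topk_ids), len(topk_ids[0])
    pairs = [(topk_ids[r][k], r, k)
             for r in range(num_rows) for k in range(topk)]
    pairs.sort(key=lambda p: p[0])  # stable sort groups rows by expert
    row_map = [0] * (num_rows * topk)
    for dest, (_, r, k) in enumerate(pairs):
        row_map[r * topk + k] = dest  # row-major: index = r * TOPK + k
    return row_map

# Two rows, TOPK=2: row 0 -> experts (1, 0), row 1 -> experts (0, 1).
print(build_row_map([[1, 0], [0, 1]]))  # [2, 0, 1, 3]
```

With the old `[TOPK, num_rows]` layout the same entry would sit at `k * num_rows + r`; the row-major form keeps a row's TOPK destinations contiguous, which is typically friendlier to coalesced gathers.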
Force-pushed from 1bd971a to 601cd23.
Please also modify test_moe_prologue.py to validate offset/remap_input on all dtypes.
@Liangliang-Ma |
Yeah, I mean add tests for your new kernels. You can reuse/modify test_moe_prologue or add new files.
Got it. I will add it later.
@Liangliang-Ma |
Add new SYCL ops init_expert_map and remap_hidden_states for the EPLB feature.
https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/?h=
vllm-xpu integration PR: https://github.com/intel-innersource/applications.ai.gpu.vllm-xpu/pull/157
Replace the op fused_moe_prologue, which cannot support expert_map, with the new ops.
And the new ops have better performance:
- Prefill stage: (benchmark chart not captured)
- Decode stage: (benchmark chart not captured)
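Once an expert map exists, routing output has to be translated from global to local expert ids before dispatch. This is a hypothetical sketch of that translation step (the function name `apply_expert_map` and the `-1` convention for remote experts are assumptions, not the PR's actual API):

```python
def apply_expert_map(topk_ids, expert_map):
    """Sketch: translate global expert ids to local ids via expert_map.

    Entries mapped to -1 denote experts hosted on another EP rank
    (assumed convention); those tokens would be skipped or routed
    elsewhere by the dispatch logic.
    """
    return [[expert_map[e] for e in row] for row in topk_ids]

# Rank 1 of 2 with 4 global experts hosts experts 2 and 3 locally.
expert_map = [-1, -1, 0, 1]
print(apply_expert_map([[0, 2], [3, 1]], expert_map))  # [[-1, 0], [1, -1]]
```

This is the capability `fused_moe_prologue` lacked: with no `expert_map` input it could only assume every expert was local, which breaks under expert-parallel deployment.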