[GPU][CPU] Use GatherMatmul as MOE building block #34591

Draft
EgorDuplensky wants to merge 1 commit into openvinotoolkit:master from EgorDuplensky:moe_batched_gather_matmul

Conversation

EgorDuplensky (Contributor) commented Mar 9, 2026

Details:

  • Align MOE processing and transformations between CPU and GPU plugins
  • Use existing CPU op GatherMatmul as GPU MOE op building block
  • Previous flow: TiledMoeBlock -> MOEOp -> MOECompressedOp -> MOECompressedOpWithRouting
  • New flow: TiledMoeBlock -> MoeBlockViaGatherMatmuls(compressed) -> MOEOp(compressed) -> MOECompressedOpWithRouting
  • The idea is that, if the complex MOE / MOEFused patterns are not matched, models should still perform well enough with GatherMatmuls as separate ops
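
Semantically, a GatherMatmul-style op selects each token's routed expert weight matrix by index and applies a per-token matmul. A scalar reference sketch of that behavior (the signature, layouts, and names here are illustrative, not the actual OpenVINO op definition):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics only: out[t] = W[expert_ids[t]] * x[t].
// x:       [tokens, in_dim], row-major
// weights: [experts, out_dim, in_dim], row-major
std::vector<float> gather_matmul_ref(const std::vector<float>& x,
                                     const std::vector<float>& weights,
                                     const std::vector<int>& expert_ids,
                                     std::size_t tokens,
                                     std::size_t in_dim,
                                     std::size_t out_dim) {
    std::vector<float> out(tokens * out_dim, 0.0f);
    for (std::size_t t = 0; t < tokens; ++t) {
        // Gather: locate the weight matrix of this token's expert.
        const float* w = weights.data() +
                         static_cast<std::size_t>(expert_ids[t]) * out_dim * in_dim;
        // Matmul: one [out_dim, in_dim] x [in_dim] product per token.
        for (std::size_t o = 0; o < out_dim; ++o) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < in_dim; ++i)
                acc += w[o * in_dim + i] * x[t * in_dim + i];
            out[t * out_dim + o] = acc;
        }
    }
    return out;
}
```

This is the primitive the fallback path leans on: even when the full MOE pattern is not matched, each expert GEMM can still be expressed as one gather-then-matmul op.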

TODO

  • Add an ASCII diagram for every transformation
  • Finalize GPU GatherMatmul implementation

@EgorDuplensky EgorDuplensky requested review from a team as code owners March 9, 2026 18:57
@github-actions bot added labels: category: Core (OpenVINO Core, aka ngraph), category: GPU (OpenVINO GPU plugin), category: CPU (OpenVINO CPU plugin), category: transformations (OpenVINO Runtime library - Transformations), category: CPP API (OpenVINO CPP API bindings) on Mar 9, 2026
@maxnick maxnick self-assigned this Mar 10, 2026
@EgorDuplensky EgorDuplensky marked this pull request as draft March 11, 2026 16:24
EgorDuplensky (Contributor, Author):

Converted to a draft for now to prevent CI executions

p-durandin (Contributor):

@EgorDuplensky please provide description of approach (either in PR or in ticket)

EgorDuplensky (Contributor, Author):

> @EgorDuplensky please provide description of approach (either in PR or in ticket)

The high-level description has already been provided. Could you please specify what extra information is needed?

Fix CI

Fix CI #2

Add nested braces for MOE::Config base subobject in aggregate
initialization of MOECompressed::Config to fix -Wmissing-braces
and constructor matching errors on Clang/Emscripten.
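
For reference, the brace issue in this fix can be reproduced with stand-in types (these are not the real MOE::Config / MOECompressed::Config): in C++17, a derived aggregate's base subobject gets its own braced initializer group, and relying on brace elision instead is what Clang flags with -Wmissing-braces.

```cpp
#include <cassert>

// Stand-in config types with the same shape as a base/derived
// aggregate pair; names are illustrative.
struct BaseConfig { int hidden; int experts; };
struct CompressedConfig : BaseConfig { int group_size; };

// The inner braces initialize the BaseConfig subobject explicitly.
// Writing `CompressedConfig cfg{64, 8, 32};` relies on brace elision,
// which triggers -Wmissing-braces on Clang and can disturb
// constructor matching on some front ends (e.g. Emscripten).
CompressedConfig cfg{{64, 8}, 32};
```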

Fix CI #3

Remove explicit Config() = default from MOECompressed::Config.
User-declared constructors prevent aggregate initialization in C++17,
causing build failures on Clang (Linux CC, WebAssembly, Android).
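
A stand-in reproduction of the aggregate rule behind this fix (one standards detail worth noting: C++17 formally keys on user-provided constructors, while C++20 also rejects user-declared ones such as `= default`; dropping the declaration keeps the type an aggregate under either rule and across compilers):

```cpp
#include <cassert>

// Aggregate: no declared constructors, so braced member-wise
// initialization works everywhere.
struct PlainConfig {
    int hidden;
    int experts;
};

// Counter-example kept as a comment:
// struct CtorConfig {
//     CtorConfig() = default;  // user-declared: disqualifies aggregate
//                              // initialization under C++20 rules
//     int hidden;
//     int experts;
// };
// CtorConfig c{64, 8};  // ill-formed under the C++20 aggregate rules

PlainConfig p{64, 8};  // fine: plain aggregate initialization
```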

Fix CI #4

Guard BGMRuntimeParams usage behind ENABLE_ONEDNN_FOR_GPU in
gather_matmul.cpp to fix Android build where oneDNN is not available.

Fix CI #5

Move BGMRuntimeParams out of ENABLE_ONEDNN_FOR_GPU guard in
gather_matmul_gen_micro.hpp — it has no oneDNN dependency.
Revert unnecessary guard around update_rt_params in gather_matmul.cpp.
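
The guard placement in the last two fixes follows one pattern: declarations with no oneDNN dependency stay outside the `ENABLE_ONEDNN_FOR_GPU` guard so builds without oneDNN (e.g. Android) still compile. A sketch with illustrative names (only the macro name mirrors the commit messages):

```cpp
#include <cassert>

// Plain shape data, no oneDNN types: safe to declare unguarded.
struct RuntimeShapeParams {
    int m, n, k;
};

#ifdef ENABLE_ONEDNN_FOR_GPU
// oneDNN-backed path, compiled only where oneDNN is available.
int execute_micro_gemm(const RuntimeShapeParams& p);
#else
// Fallback for builds without oneDNN (e.g. Android).
int execute_reference(const RuntimeShapeParams& p) {
    return p.m * p.n * p.k;  // placeholder: flop-count stand-in
}
#endif
```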

Fix CI #6

Remove redundant fuse_moe integration test (covered by GPU unit tests).
Fix size_t-to-int32_t narrowing warning in moe_e2e_pipeline_test.cpp.
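
The narrowing warning here is the usual size_t-to-int32_t case; a minimal reproduction, unrelated to the actual test code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int> tokens{1, 2, 3};
// int32_t n{tokens.size()};                      // error: size_t -> int32_t narrows in braced init
int32_t n = static_cast<int32_t>(tokens.size());  // explicit cast states the intent
```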

Fix params order

Remove debug serialization

Rename MoEMatMulsFusionTest to ConvertTiledMoeBlockToGatherMatmulsTest

Align test class, params type, and suite names with the transformation
name and file name convention.

Fix CI #7

Remove redundant Reshape wrapping from FuseMOE3GemmCompressed callback.
The output shape is now derived from input 0 directly, making the
Reshape unnecessary. Update FuseMOE3GemmCompressedTest1 to use the
new simplified routing pattern (Transpose+Unsqueeze) instead of the
old scatter-based path. Remove debug serialization block.

Add functional test

Add status 'changed / unchanged' to visualization

Fix functional MOE test

Optimized gather matmul version

Fix bias input handling

Allow exception to pass through

Keep GatherMatmul weights precision

Fix dynamic dimensions handling

Fix zero point handling

Update moe3gemm zp handling

Fix handling of transposed weights

Extend tests

[DEBUG][TMP] Add _model_name to gpu network

riverlijunjie (Contributor):

  1. GatherMatmul is a key building block for expert computation in full MoE, but it does not represent the entire MoE workflow. Should we consider introducing a dedicated op/primitive to encapsulate end-to-end MoE computation in the GPU plugin (similar to PA/SDPA)? This may reduce scalability to some extent, but it could improve performance both on the host side and in kernel fusion opportunities.

  2. In the GPU plugin, GatherMatmul currently uses two micro-GEMM kernel paths for prefill and decode phases. This design helps both compute-bound and memory-bound scenarios. We may further improve performance by fusing post-processing into micro-GEMM (similar to PR [GPU] enable post_proc for moe micro_gemm #33723) to avoid extra write-back/read-back traffic.

  3. gather_sort currently uses a single thread for sorting, which is a clear bottleneck. Performance degrades significantly for long token sequences and large top-k values. This path should be optimized with parallel sorting.

  4. oneDNN is introducing grouped_gemm as an alternative to micro-GEMM for MoE support. Should we consider adopting grouped_gemm in this path? If yes, what are the expected trade-offs and rationale (e.g., maintainability, portability, peak performance)? If no, why?

  5. Do we have performance data for these paths? It would be helpful to quantify current efficiency and estimate the gap to the roofline.
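
On point 3, one common way to remove the single-threaded comparison sort is a counting sort over expert ids: ids are small dense integers, so an O(tokens + experts) histogram/scan/scatter replaces the O(n log n) sort, and each phase maps naturally onto a parallel reduce, prefix sum, and scatter. A sequential sketch with illustrative names (not the actual gather_sort kernel):

```cpp
#include <vector>

// Returns token indices ordered by expert id (stable within an expert),
// i.e. the permutation a gather_sort-style stage needs.
std::vector<int> sort_tokens_by_expert(const std::vector<int>& expert_ids,
                                       int num_experts) {
    // Phase 1: histogram of tokens per expert (parallel reduce in practice).
    std::vector<int> counts(num_experts + 1, 0);
    for (int e : expert_ids)
        ++counts[e + 1];
    // Phase 2: exclusive prefix sum -> start offset of each expert's bucket.
    for (int i = 0; i < num_experts; ++i)
        counts[i + 1] += counts[i];
    // Phase 3: stable scatter of token indices into their buckets.
    std::vector<int> order(expert_ids.size());
    for (int t = 0; t < static_cast<int>(expert_ids.size()); ++t)
        order[counts[expert_ids[t]]++] = t;
    return order;
}
```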
