[GPU][CPU] Use GatherMatmul as MOE building block #34591

Draft
EgorDuplensky wants to merge 1 commit into openvinotoolkit:master from EgorDuplensky:moe_batched_gather_matmul

Conversation

EgorDuplensky (Contributor) commented Mar 9, 2026

Details:

  • Align MOE processing and transformations between CPU and GPU plugins
  • Use existing CPU op GatherMatmul as GPU MOE op building block
  • Previous flow: TiledMoeBlock -> MOEOp -> MOECompressedOp -> MOECompressedOpWithRouting
  • New flow: TiledMoeBlock -> MoeBlockViaGatherMatmuls(compressed) -> MOEOp(compressed) -> MOECompressedOpWithRouting
  • The idea is that, if the complex MOE / MOEFused patterns are not matched, models should still perform well enough with GatherMatmuls as separate ops
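
Semantically, a GatherMatmul-style op selects each token's routed expert weight matrix by index and applies a per-token matmul. A scalar reference sketch of that behavior (the signature, layouts, and names here are illustrative, not the actual OpenVINO op definition):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics only: out[t] = W[expert_ids[t]] * x[t].
// x:       [tokens, in_dim], row-major
// weights: [experts, out_dim, in_dim], row-major
std::vector<float> gather_matmul_ref(const std::vector<float>& x,
                                     const std::vector<float>& weights,
                                     const std::vector<int>& expert_ids,
                                     std::size_t tokens,
                                     std::size_t in_dim,
                                     std::size_t out_dim) {
    std::vector<float> out(tokens * out_dim, 0.0f);
    for (std::size_t t = 0; t < tokens; ++t) {
        // Gather: locate the weight matrix of this token's expert.
        const float* w = weights.data() +
                         static_cast<std::size_t>(expert_ids[t]) * out_dim * in_dim;
        // Matmul: one [out_dim, in_dim] x [in_dim] product per token.
        for (std::size_t o = 0; o < out_dim; ++o) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < in_dim; ++i)
                acc += w[o * in_dim + i] * x[t * in_dim + i];
            out[t * out_dim + o] = acc;
        }
    }
    return out;
}
```

This is the primitive the fallback path leans on: even when the full MOE pattern is not matched, each expert GEMM can still be expressed as one gather-then-matmul op.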

TODO

  • Add an ASCII diagram for every transformation
  • Finalize GPU GatherMatmul implementation

@EgorDuplensky EgorDuplensky requested review from a team as code owners March 9, 2026 18:57
@github-actions bot added labels: category: Core (OpenVINO Core, aka ngraph), category: GPU (OpenVINO GPU plugin), category: CPU (OpenVINO CPU plugin), category: transformations (OpenVINO Runtime library - Transformations), category: CPP API (OpenVINO CPP API bindings) on Mar 9, 2026
@maxnick maxnick self-assigned this Mar 10, 2026
@EgorDuplensky EgorDuplensky marked this pull request as draft March 11, 2026 16:24
EgorDuplensky (Contributor, Author):

Converted to a draft for now to prevent CI executions

p-durandin (Contributor):

@EgorDuplensky please provide description of approach (either in PR or in ticket)

EgorDuplensky (Contributor, Author):

> @EgorDuplensky please provide description of approach (either in PR or in ticket)

The high-level description has already been provided. Could you please specify what extra information is needed?

Fix CI

Fix CI #2

Add nested braces for MOE::Config base subobject in aggregate
initialization of MOECompressed::Config to fix -Wmissing-braces
and constructor matching errors on Clang/Emscripten.
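
For reference, the brace issue in this fix can be reproduced with stand-in types (these are not the real MOE::Config / MOECompressed::Config): in C++17, a derived aggregate's base subobject gets its own braced initializer group, and relying on brace elision instead is what Clang flags with -Wmissing-braces.

```cpp
#include <cassert>

// Stand-in config types with the same shape as a base/derived
// aggregate pair; names are illustrative.
struct BaseConfig { int hidden; int experts; };
struct CompressedConfig : BaseConfig { int group_size; };

// The inner braces initialize the BaseConfig subobject explicitly.
// Writing `CompressedConfig cfg{64, 8, 32};` relies on brace elision,
// which triggers -Wmissing-braces on Clang and can disturb
// constructor matching on some front ends (e.g. Emscripten).
CompressedConfig cfg{{64, 8}, 32};
```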

Fix CI #3

Remove explicit Config() = default from MOECompressed::Config.
User-declared constructors prevent aggregate initialization in C++17,
causing build failures on Clang (Linux CC, WebAssembly, Android).
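
A stand-in reproduction of the aggregate rule behind this fix (one standards detail worth noting: C++17 formally keys on user-provided constructors, while C++20 also rejects user-declared ones such as `= default`; dropping the declaration keeps the type an aggregate under either rule and across compilers):

```cpp
#include <cassert>

// Aggregate: no declared constructors, so braced member-wise
// initialization works everywhere.
struct PlainConfig {
    int hidden;
    int experts;
};

// Counter-example kept as a comment:
// struct CtorConfig {
//     CtorConfig() = default;  // user-declared: disqualifies aggregate
//                              // initialization under C++20 rules
//     int hidden;
//     int experts;
// };
// CtorConfig c{64, 8};  // ill-formed under the C++20 aggregate rules

PlainConfig p{64, 8};  // fine: plain aggregate initialization
```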

Fix CI #4

Guard BGMRuntimeParams usage behind ENABLE_ONEDNN_FOR_GPU in
gather_matmul.cpp to fix Android build where oneDNN is not available.

Fix CI #5

Move BGMRuntimeParams out of ENABLE_ONEDNN_FOR_GPU guard in
gather_matmul_gen_micro.hpp — it has no oneDNN dependency.
Revert unnecessary guard around update_rt_params in gather_matmul.cpp.
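
The guard placement in the last two fixes follows one pattern: declarations with no oneDNN dependency stay outside the `ENABLE_ONEDNN_FOR_GPU` guard so builds without oneDNN (e.g. Android) still compile. A sketch with illustrative names (only the macro name mirrors the commit messages):

```cpp
#include <cassert>

// Plain shape data, no oneDNN types: safe to declare unguarded.
struct RuntimeShapeParams {
    int m, n, k;
};

#ifdef ENABLE_ONEDNN_FOR_GPU
// oneDNN-backed path, compiled only where oneDNN is available.
int execute_micro_gemm(const RuntimeShapeParams& p);
#else
// Fallback for builds without oneDNN (e.g. Android).
int execute_reference(const RuntimeShapeParams& p) {
    return p.m * p.n * p.k;  // placeholder: flop-count stand-in
}
#endif
```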

Fix CI #6

Remove redundant fuse_moe integration test (covered by GPU unit tests).
Fix size_t-to-int32_t narrowing warning in moe_e2e_pipeline_test.cpp.
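
The narrowing warning here is the usual size_t-to-int32_t case; a minimal reproduction, unrelated to the actual test code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int> tokens{1, 2, 3};
// int32_t n{tokens.size()};                      // error: size_t -> int32_t narrows in braced init
int32_t n = static_cast<int32_t>(tokens.size());  // explicit cast states the intent
```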

Fix params order

Remove debug serialization

Rename MoEMatMulsFusionTest to ConvertTiledMoeBlockToGatherMatmulsTest

Align test class, params type, and suite names with the transformation
name and file name convention.

Fix CI #7

Remove redundant Reshape wrapping from FuseMOE3GemmCompressed callback.
The output shape is now derived from input 0 directly, making the
Reshape unnecessary. Update FuseMOE3GemmCompressedTest1 to use the
new simplified routing pattern (Transpose+Unsqueeze) instead of the
old scatter-based path. Remove debug serialization block.

Add functional test

Add status 'changed / unchanged' to visualization

Fix functional MOE test

Optimized gather matmul version

Fix bias input handling

Allow exception to pass through

Keep GatherMatmul weights precision

Fix dynamic dimensions handling

Fix zero point handling

Update moe3gemm zp handling

Fix handling of transposed weights

Extend tests

[DEBUG][TMP] Add _model_name to gpu network

riverlijunjie (Contributor):

  1. GatherMatmul is a key building block for expert computation in full MoE, but it does not represent the entire MoE workflow. Should we consider introducing a dedicated op/primitive to encapsulate end-to-end MoE computation in the GPU plugin (similar to PA/SDPA)? This may reduce scalability to some extent, but it could improve performance both on the host side and in kernel fusion opportunities.

  2. In the GPU plugin, GatherMatmul currently uses two micro-GEMM kernel paths for prefill and decode phases. This design helps both compute-bound and memory-bound scenarios. We may further improve performance by fusing post-processing into micro-GEMM (similar to PR [GPU] enable post_proc for moe micro_gemm #33723) to avoid extra write-back/read-back traffic.

  3. gather_sort currently uses a single thread for sorting, which is a clear bottleneck. Performance degrades significantly for long token sequences and large top-k values. This path should be optimized with parallel sorting.

  4. oneDNN is introducing grouped_gemm as an alternative to micro-GEMM for MoE support. Should we consider adopting grouped_gemm in this path? If yes, what are the expected trade-offs and rationale (e.g., maintainability, portability, peak performance)? If no, why?

  5. Do we have performance data for these paths? It would be helpful to quantify current efficiency and estimate the gap to the roofline.
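
On point 3, one common way to remove the single-threaded comparison sort is a counting sort over expert ids: ids are small dense integers, so an O(tokens + experts) histogram/scan/scatter replaces the O(n log n) sort, and each phase maps naturally onto a parallel reduce, prefix sum, and scatter. A sequential sketch with illustrative names (not the actual gather_sort kernel):

```cpp
#include <vector>

// Returns token indices ordered by expert id (stable within an expert),
// i.e. the permutation a gather_sort-style stage needs.
std::vector<int> sort_tokens_by_expert(const std::vector<int>& expert_ids,
                                       int num_experts) {
    // Phase 1: histogram of tokens per expert (parallel reduce in practice).
    std::vector<int> counts(num_experts + 1, 0);
    for (int e : expert_ids)
        ++counts[e + 1];
    // Phase 2: exclusive prefix sum -> start offset of each expert's bucket.
    for (int i = 0; i < num_experts; ++i)
        counts[i + 1] += counts[i];
    // Phase 3: stable scatter of token indices into their buckets.
    std::vector<int> order(expert_ids.size());
    for (int t = 0; t < static_cast<int>(expert_ids.size()); ++t)
        order[counts[expert_ids[t]]++] = t;
    return order;
}
```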
