[GPU]qwen3 moe fused compressed #32536
Open
riverlijunjie wants to merge 44 commits into openvinotoolkit:master from riverlijunjie:river/qwen3_moe_fused_compressed
+5,571
−102
Changes from all commits (44 commits):
af6463d qwen3 moe_compressed primitive_impl (riverlijunjie)
1b34668 MOECompressed internal op (chenhu-wang)
54bb288 update (riverlijunjie)
80a6509 Update weight management (riverlijunjie)
b891113 MOECompressed internal op (chenhu-wang)
f324a12 Remove softmax_top out of moe primitive implement (riverlijunjie)
4d1d72d MOE to MOECompressed (chenhu-wang)
cfcbacc update (riverlijunjie)
72ab1fd Enable moe transformation - FuseVectorizedMOE3GEMM and ConvertMOEToMO… (riverlijunjie)
7b4b2d5 update dnnl weight convert (riverlijunjie)
c05f001 Fix double free issue and kernel build errors (riverlijunjie)
d630503 Fix router weight gather issue (riverlijunjie)
d5ba90d MOECompressed to MOEFusedCompressed (chenhu-wang)
5d63e20 Fuse softmax_topk_oneshot with moe_compressed (riverlijunjie)
c19ca31 Fix windows compiling error (riverlijunjie)
5fc87c8 Switch on FuseMOE moc transformation (riverlijunjie)
d3b8778 align scale and zp format (chenhu-wang)
3527786 minor update (riverlijunjie)
078d825 Restore keeping FuseMOE off by default (riverlijunjie)
eaac162 Update moe kernel (riverlijunjie)
f814faf cleanup & optimizate intermediate memory (riverlijunjie)
a081a9c inherit from moe, keep moe const pass and code clean up (chenhu-wang)
420d3f4 Switch on FuseMOE (riverlijunjie)
cb9cbe2 WA: OCL OUT_OF_RESOURCE issue when input token size < 8 (riverlijunjie)
db436a8 add unit_test (zhaixuejun1993)
1b57d17 Fix OUT_OF_RESOURCE issue and remove WA (riverlijunjie)
e025c81 WA: OCL OUT_OF_RESOURCE when input token size < 8 (riverlijunjie)
71e6db6 add accuracy ut (zaixing-wang)
b277578 Fix typo issue (riverlijunjie)
79d6a13 add transformations test (chenhu-wang)
457c810 Fix 32/1024 out of resouce issue on PTL (riverlijunjie)
f7d8ead fix (zaixing-wang)
e15bed9 Remove WA for CVS-175938 (peterchen-intel)
2a10a4f Merge branch 'master' into river/qwen3_moe_fused_compressed (peterchen-intel)
5d0101f add supports_immad condition (zaixing-wang)
18b8cf3 Merge branch 'master' into river/qwen3_moe_fused_compressed (riverlijunjie)
0aa8c53 Update for review comments (riverlijunjie)
9465fc7 update ut (zaixing-wang)
0ead32d separate (zaixing-wang)
8a0fa98 clean (zaixing-wang)
6c32e83 Align moe cpp file path (riverlijunjie)
358a015 Update for reviewing comments (riverlijunjie)
0eb170e update code comment (chenhu-wang)
6f69933 Merge branch 'master' into river/qwen3_moe_fused_compressed (riverlijunjie)
src/plugins/intel_gpu/include/intel_gpu/op/moe_fused_compressed.hpp (new file, +46 lines)
// Copyright (C) 2018-2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include "intel_gpu/op/moe_compressed.hpp"

namespace ov::intel_gpu::op {

/// \brief MOEFusedCompressed supports compressed and fused MOE for GEMM3_SWIGLU.
class MOEFusedCompressed : public MOECompressed {
public:
    OPENVINO_OP("MOEFusedCompressed", "gpu_opset", MOECompressed);

    MOEFusedCompressed() = default;

    /// \brief Constructs a MOEFusedCompressed operation with config only
    /// \param args The input tensors, in the following order:
    ///        0: hidden_states - input tensor with hidden representations
    ///        1: routing_weights - [num_seq, num_experts] routing weights for all experts
    ///        2: w0_weight - expert weights for the first projection,
    ///           shape [num_experts, inter_size, group_num, group_size]
    ///        3: w0_scale - expert scale for the first projection for compressed experts,
    ///           shape [num_experts, inter_size, group_num, 1]
    ///        4: w0_zp - expert zp for the first projection for compressed experts,
    ///           shape [num_experts, inter_size, group_num, 1]
    ///        5: w1_weight - expert weights for the second projection,
    ///           shape [num_experts, inter_size, group_num, group_size]
    ///        6: w1_scale - expert scale for the second projection for compressed experts,
    ///           shape [num_experts, inter_size, group_num, 1]
    ///        7: w1_zp - expert zp for the second projection for compressed experts,
    ///           shape [num_experts, inter_size, group_num, 1]
    ///        8: w2_weight - expert weights for the final projection,
    ///           shape [num_experts, hidden_size, group_num, group_size]
    ///        9: w2_scale - expert scale for the final projection for compressed experts,
    ///           shape [num_experts, hidden_size, group_num, 1]
    ///        10: w2_zp - expert zp for the final projection for compressed experts,
    ///            shape [num_experts, hidden_size, group_num, 1]
    /// \param config Configuration for the MOE operation
    MOEFusedCompressed(const OutputVector& args, const MOECompressed::Config config);

    std::shared_ptr<Node> clone_with_new_inputs(const OutputVector& new_args) const override;
};

}  // namespace ov::intel_gpu::op
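For orientation only, the sketch below shows how a transformation pass might assemble the eleven inputs documented above and construct this op. It is not part of the PR: the helper name is hypothetical, the fields of MOECompressed::Config are not visible in this diff, and the scale/zp inputs are presumably consumed with the usual group-wise dequantization (weight = (quantized - zp) * scale, applied per group of group_size elements).

// Sketch only, not part of this PR: builds the 11-input OutputVector in the
// documented order and constructs the op. `config` is assumed to be filled in
// elsewhere, since MOECompressed::Config is not shown in this diff.
#include <memory>

#include "intel_gpu/op/moe_fused_compressed.hpp"

namespace {

std::shared_ptr<ov::intel_gpu::op::MOEFusedCompressed> make_moe_fused_compressed(
        const ov::Output<ov::Node>& hidden_states,
        const ov::Output<ov::Node>& routing_weights,
        const ov::OutputVector& expert_tensors,  // w0/w1/w2 weight, scale, zp in the documented order (9 outputs)
        const ov::intel_gpu::op::MOECompressed::Config& config) {
    ov::OutputVector args{hidden_states, routing_weights};
    args.insert(args.end(), expert_tensors.begin(), expert_tensors.end());
    return std::make_shared<ov::intel_gpu::op::MOEFusedCompressed>(args, config);
}

}  // namespace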
src/plugins/intel_gpu/include/intel_gpu/primitives/moe_fused_compressed.hpp (new file, +72 lines)
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once
#include <vector>

#include "intel_gpu/op/moe_fused_compressed.hpp"
#include "intel_gpu/runtime/engine.hpp"
#include "primitive.hpp"

namespace cldnn {
using MOEFusedCompressed = ov::intel_gpu::op::MOEFusedCompressed;

/// @brief moe compressed primitive
/// @details Performs compressed and fused MOE
struct moe_fused_compressed : public primitive_base<moe_fused_compressed> {
    CLDNN_DECLARE_PRIMITIVE(moe_fused_compressed)

    moe_fused_compressed() : primitive_base("", {}) {}

Review comment on the line above: Please modify the primitive name too, for the specific target pattern.

    // @brief Constructs moe primitive / layer.
    //
    // @param id An identifier of the new primitive.
    // @param inputs A list of input primitive ids, in the following order:
    //        0: hidden_states - input tensor with hidden representations
    //        1: routing_weights - [num_seq, num_experts] routing weights for all experts
    //        2: w0_weight - expert weights for the first projection,
    //           shape [num_experts, inter_size, group_num, group_size]
    //        3: w0_scale - expert scale for the first projection for compressed experts,
    //           shape [num_experts, inter_size, group_num, 1]
    //        4: w0_zp - expert zp for the first projection for compressed experts,
    //           shape [num_experts, inter_size, group_num, 1]
    //        5: w1_weight - expert weights for the second projection,
    //           shape [num_experts, inter_size, group_num, group_size]
    //        6: w1_scale - expert scale for the second projection for compressed experts,
    //           shape [num_experts, inter_size, group_num, 1]
    //        7: w1_zp - expert zp for the second projection for compressed experts,
    //           shape [num_experts, inter_size, group_num, 1]
    //        8: w2_weight - expert weights for the final projection,
    //           shape [num_experts, hidden_size, group_num, group_size]
    //        9: w2_scale - expert scale for the final projection for compressed experts,
    //           shape [num_experts, hidden_size, group_num, 1]
    //        10: w2_zp - expert zp for the final projection for compressed experts,
    //            shape [num_experts, hidden_size, group_num, 1]
    //
    moe_fused_compressed(const primitive_id& id, const std::vector<input_info>& inputs, const MOEFusedCompressed::Config& config)
        : primitive_base(id, inputs, 1, {optional_data_type()}),
          _config(config) {}

    MOEFusedCompressed::Config _config;

    bool operator==(const primitive& rhs) const override {
        if (!compare_common_params(rhs))
            return false;

        auto rhs_casted = downcast<const moe_fused_compressed>(rhs);

        return std::memcmp(&_config, &rhs_casted._config, sizeof(_config)) == 0;
    }

    void save(BinaryOutputBuffer& ob) const override {
        primitive_base<moe_fused_compressed>::save(ob);
        ob << make_data(&_config, sizeof(_config));
    }

    void load(BinaryInputBuffer& ib) override {
        primitive_base<moe_fused_compressed>::load(ib);
        ib >> make_data(&_config, sizeof(_config));
    }
};

}  // namespace cldnn
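As a usage illustration only, not part of the PR, the sketch below creates this primitive with the eleven inputs in the documented order and registers it in a cldnn topology. The input primitive ids, the topology.hpp include path, and the config value are assumptions; in practice the plugin builds this primitive when lowering the MOEFusedCompressed op.

// Sketch only, under the assumptions stated above: wires placeholder input ids
// into a moe_fused_compressed primitive and adds it to a topology.
#include <vector>

#include "intel_gpu/graph/topology.hpp"
#include "intel_gpu/primitives/moe_fused_compressed.hpp"

void add_moe_fused_compressed_example(cldnn::topology& topology,
                                      const cldnn::MOEFusedCompressed::Config& config) {
    // Placeholder ids for the 11 inputs, matching the documented order.
    std::vector<cldnn::input_info> inputs = {
        cldnn::input_info("hidden_states"),
        cldnn::input_info("routing_weights"),
        cldnn::input_info("w0_weight"), cldnn::input_info("w0_scale"), cldnn::input_info("w0_zp"),
        cldnn::input_info("w1_weight"), cldnn::input_info("w1_scale"), cldnn::input_info("w1_zp"),
        cldnn::input_info("w2_weight"), cldnn::input_info("w2_scale"), cldnn::input_info("w2_zp"),
    };
    topology.add(cldnn::moe_fused_compressed("moe_fused_compressed", inputs, config));
}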
Review comment on the constructor documentation: This description is for the 3gemm_Swiglu type only. Please mention that.