Flashinfer_CUTLASS_MOE fuses quantization for TP #27223
base: main
Conversation
Signed-off-by: Shu Wang. <[email protected]>
Force-pushed from 1057c27 to 4c37b7d
LGTM. Alternatively, you could prevent modular kernels from being created in this case and fall through to the direct call to flashinfer_cutlass_moe_fp4?
Yes. But do you prefer to have modular kernels anyways?
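For context, here is a minimal sketch of what the suggested fall-through could look like. The names (select_moe_forward, build_modular_kernel, needs_all2all) are hypothetical stand-ins for illustration, not the actual vLLM APIs:

```python
# Hypothetical sketch of the "fall through" suggestion above: only build a
# modular kernel when communication is involved; otherwise call the fused
# FlashInfer CUTLASS path directly. All names are illustrative stand-ins.

from typing import Callable


def select_moe_forward(
    needs_all2all: bool,
    build_modular_kernel: Callable[[], Callable],
    direct_flashinfer_call: Callable,
) -> Callable:
    """Pick the MoE forward implementation for this configuration."""
    if needs_all2all:
        # EP/DP: dispatch/combine is needed, so construct the modular
        # (prepare -> experts -> finalize) kernel.
        return build_modular_kernel()
    # Pure TP: skip modular-kernel construction and fall through to the
    # direct fused call (flashinfer_cutlass_moe_fp4 in the real code).
    return direct_flashinfer_call


if __name__ == "__main__":
    fwd = select_moe_forward(
        needs_all2all=False,
        build_modular_kernel=lambda: (lambda x: f"modular({x})"),
        direct_flashinfer_call=lambda x: f"flashinfer_cutlass_moe_fp4({x})",
    )
    print(fwd("hidden_states"))  # -> flashinfer_cutlass_moe_fp4(hidden_states)
```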
Commented diff lines:
assert self.moe_quant_config is not None
return flashinfer_cutlass_moe_fp4(
Can you update for compressed-tensors too?
vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py, line 516 in f9e7ad5:
return flashinfer_cutlass_moe_fp4(
But by deleting this elif clause (and, per @mgoin's suggestion, applying this change to compressed-tensors), doesn't it force the FlashInfer CUTLASS implementation to go through the modular kernels?
I'm just trying to understand if this is the plan for all cases that use FlashInfer, regardless of the distributed strategy or of whether self.flashinfer_moe_backend is FlashinferMoeBackend.TENSORRT_LLM or FlashinferMoeBackend.CUTLASS.
I don't see any reason to use the modular kernels for cases that aren't using some kind of all2all communication. In this particular case I think @wenscarl and @leejnau figured out that this was dead code because the CUTLASS case always created a modular kernel. I'm not sure if the same holds true for compressed_tensors.
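As a rough illustration of the policy described above (modular kernels only when some all2all communication is involved, independent of which FlashInfer backend is selected), here is a hedged sketch; the enum is redeclared locally for the example and is not imported from vLLM:

```python
# Hedged sketch of the policy above: whether to use a modular kernel depends
# on the presence of all2all communication, not on the FlashInfer backend.
# The enum is redeclared here for illustration only.

from enum import Enum


class FlashinferMoeBackend(Enum):
    TENSORRT_LLM = "trtllm"
    CUTLASS = "cutlass"


def should_use_modular_kernel(backend: FlashinferMoeBackend, has_all2all: bool) -> bool:
    # Under this policy the backend choice does not affect the decision.
    del backend
    return has_all2all


if __name__ == "__main__":
    for backend in FlashinferMoeBackend:
        for has_all2all in (False, True):
            print(backend.name, has_all2all, "->",
                  should_use_modular_kernel(backend, has_all2all))
```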
@bnellnm this PR is updated with an additional quant_dtype: nvfp4_skip_quantization.
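A hedged sketch of the idea behind such a quant_dtype sentinel, assuming it signals that the prepare step should leave activations unquantized so the FlashInfer CUTLASS kernel can quantize them internally; the constant and helper below are illustrative, not the actual vLLM code:

```python
# Illustrative sketch: a sentinel quant_dtype tells the prepare step to skip
# activation quantization and defer it to the fused kernel. The names here
# are hypothetical stand-ins for the real vLLM machinery.

import torch

NVFP4_SKIP_QUANTIZATION = "nvfp4_skip_quantization"  # hypothetical sentinel


def prepare_activations(hidden_states: torch.Tensor, quant_dtype: str):
    """Return (tensor, scale) as consumed by the experts kernel."""
    if quant_dtype == NVFP4_SKIP_QUANTIZATION:
        # Leave activations in high precision; the FlashInfer CUTLASS MoE
        # kernel fuses the nvfp4 quantization into its GEMMs.
        return hidden_states, None
    # Placeholder for an up-front quantization path (not real nvfp4 packing).
    scale = hidden_states.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6)
    return hidden_states / scale, scale


if __name__ == "__main__":
    x = torch.randn(4, 8, dtype=torch.bfloat16)
    y, s = prepare_activations(x, NVFP4_SKIP_QUANTIZATION)
    print(y.dtype, s is None)  # torch.bfloat16 True
```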
I'm just trying to understand if this is the plan for all cases that use FlashInfer
I vote for that, since FlashInfer CUTLASS MoE is at least a better option than the normal cutlass_moe, and the TRTLLM MoE can sometimes even win.
I've no preference for this particular case since there's no communication going on. My only concern would be that the
Force-pushed from c78959c to 4c37b7d
co-authored by @leejnau
cc. @bnellnm
Purpose
For the TP case, the nvfp4 quantization is fused into the flashinfer_cutlass_moe call. This fixes the accuracy issue for the nvidia/Deepseek-R1-0528-FP4-v2 model.
For the DP case, the fix should rely on #26135.
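As a rough sketch of the distinction drawn here (all helper names are hypothetical placeholders, not real vLLM call signatures): under pure TP there is no token dispatch, so activation quantization can be deferred into the FlashInfer CUTLASS MoE call, whereas the DP/all2all path still quantizes before dispatch and is addressed separately by #26135.

```python
# Hedged sketch of the TP vs. DP flows described above. All helpers are
# hypothetical placeholders, not real vLLM APIs.

def moe_forward_tp(hidden_states, fused_experts):
    # TP: no dispatch step, so hand bf16 activations to the kernel and let
    # it quantize to nvfp4 internally (the fusion this PR adds).
    return fused_experts(hidden_states, pre_quantized=False)


def moe_forward_dp(hidden_states, quantize, dispatch, fused_experts, combine):
    # DP/EP: tokens are exchanged across ranks, so (in this sketch) the
    # activations are quantized before dispatch and the kernel consumes
    # already-quantized inputs; see #26135 for the actual DP-side fix.
    q, scale = quantize(hidden_states)
    q, scale = dispatch(q, scale)
    out = fused_experts((q, scale), pre_quantized=True)
    return combine(out)


if __name__ == "__main__":
    stub_experts = lambda x, pre_quantized: ("experts", x, pre_quantized)
    print(moe_forward_tp("hidden_states_bf16", stub_experts))
```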
Test Plan
Test Result
Previously:
With this PR:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.