Update flashinfer CUTLASS MoE Kernel #21408

Merged · 2 commits · Jul 24, 2025
```diff
@@ -11,7 +11,7 @@
 from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig
 from vllm.model_executor.layers.fused_moe.utils import (
     extract_required_args, moe_kernel_quantize_input)
-from vllm.utils.flashinfer import block_scale_interleave
+from vllm.utils.flashinfer import nvfp4_block_scale_interleave


 def get_local_sizes(local_tokens):
@@ -92,7 +92,7 @@ def prepare(
                 dim=0,
                 sizes=get_local_sizes(local_tokens))
         a1_m, a1_n = a1q.shape
-        a1q_scale = block_scale_interleave(a1q_scale)
+        a1q_scale = nvfp4_block_scale_interleave(a1q_scale)

         return a1q, a1q_scale, None, topk_ids, topk_weights
```
vllm/model_executor/layers/quantization/modelopt.py (4 changes: 2 additions & 2 deletions)
```diff
@@ -1254,8 +1254,8 @@ def apply(
             x, layer.w13_weight, layer.w2_weight), (
                 "Flashinfer CUTLASS Fused MoE not applicable!")

-        a1_gscale = torch.min(layer.w13_input_scale_quant)
-        a2_gscale = torch.min(layer.w2_input_scale_quant)
+        a1_gscale = layer.w13_input_scale_quant
+        a2_gscale = layer.w2_input_scale_quant
```
Comment on lines +1257 to +1258 (Contributor, severity: high):

The change from `torch.min(layer.w13_input_scale_quant)` to `layer.w13_input_scale_quant` may be incorrect. The `a1_gscale` variable is used in `extra_prepare_args` and passed to `FlashInferCutlassMoEPrepareAndFinalize.prepare`, which uses the scale to quantize the input activation tensor (`hidden_states`) before the tokens are routed to the experts. The input activation tensor has shape `(num_tokens, hidden_dim)`, so quantizing it requires a single scalar scale. The original code, `torch.min(layer.w13_input_scale_quant)`, correctly produced that scalar by selecting the most conservative scale among all per-expert scales. The new code passes `layer.w13_input_scale_quant`, a tensor of per-expert scales with shape `(num_experts,)`. Using per-expert scales to quantize the entire activation tensor before expert routing is logically incorrect and may cause a runtime error or incorrect results from `flashinfer.fp4_quantize`. Please revert this change and use `torch.min` so that a correct scalar scale is used for the initial input quantization.

Suggested change:

```diff
-        a1_gscale = layer.w13_input_scale_quant
-        a2_gscale = layer.w2_input_scale_quant
+        a1_gscale = torch.min(layer.w13_input_scale_quant)
+        a2_gscale = torch.min(layer.w2_input_scale_quant)
```
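For readers less familiar with this code path, here is a minimal standalone sketch of the scalar-vs-per-expert distinction the comment is making. The shapes and variable names are illustrative assumptions, not the actual vLLM layer attributes:

```python
import torch

# Illustrative sizes only (assumptions, not taken from the PR).
num_experts, num_tokens, hidden_dim = 8, 4, 16

# Per-expert input scales, shape (num_experts,) -- the analogue of
# layer.w13_input_scale_quant in the comment above.
per_expert_scales = torch.rand(num_experts) + 0.5

# The activation tensor is quantized once, before tokens are routed to
# experts, so the quantization step expects a single scalar scale for the
# whole (num_tokens, hidden_dim) tensor.
hidden_states = torch.randn(num_tokens, hidden_dim)

# torch.min collapses the per-expert scales to the most conservative
# (smallest) one, yielding the 0-dim scalar the prepare path needs.
a1_gscale = torch.min(per_expert_scales)

assert per_expert_scales.shape == (num_experts,)  # per-expert tensor
assert a1_gscale.ndim == 0                         # scalar global scale
```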

```diff
         extra_expert_args = {
             'g1_alphas': layer.g1_alphas,
             'g2_alphas': layer.g2_alphas,
```
vllm/utils/flashinfer.py (8 changes: 4 additions & 4 deletions)
```diff
@@ -69,8 +69,8 @@ def wrapper(*args, **kwargs):
 flashinfer_cutlass_fused_moe = _lazy_import_wrapper("flashinfer.fused_moe",
                                                     "cutlass_fused_moe")
 fp4_quantize = _lazy_import_wrapper("flashinfer", "fp4_quantize")
-block_scale_interleave = _lazy_import_wrapper("flashinfer",
-                                              "block_scale_interleave")
+nvfp4_block_scale_interleave = _lazy_import_wrapper(
+    "flashinfer", "nvfp4_block_scale_interleave")

 # Special case for autotune since it returns a context manager
 autotune = _lazy_import_wrapper(
@@ -95,7 +95,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool:
     required_functions = [
         ("flashinfer.fused_moe", "cutlass_fused_moe"),
         ("flashinfer", "fp4_quantize"),
-        ("flashinfer", "block_scale_interleave"),
+        ("flashinfer", "nvfp4_block_scale_interleave"),
     ]

     for module_name, attr_name in required_functions:
@@ -110,7 +110,7 @@ def has_flashinfer_cutlass_fused_moe() -> bool:
     "flashinfer_trtllm_fp8_block_scale_moe",
     "flashinfer_cutlass_fused_moe",
     "fp4_quantize",
-    "block_scale_interleave",
+    "nvfp4_block_scale_interleave",
     "autotune",
     "has_flashinfer_moe",
     "has_flashinfer_cutlass_fused_moe",
```
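Aside from the call sites above, the rename only touches vLLM's lazy-import layer for FlashInfer. As a rough illustration of the pattern (an assumption for clarity; the real `_lazy_import_wrapper` in vllm/utils/flashinfer.py may cache imports and handle errors differently):

```python
import importlib
from typing import Any, Callable


def _lazy_import_wrapper(module_name: str, attr_name: str) -> Callable[..., Any]:
    """Return a callable that resolves module_name.attr_name on first call.

    Sketch only: the real helper may differ in caching and error handling.
    """

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            raise RuntimeError(
                f"{module_name}.{attr_name} requested but {module_name} "
                "is not installed") from exc
        return getattr(module, attr_name)(*args, **kwargs)

    return wrapper


# After this PR the wrapper targets FlashInfer's renamed kernel helper:
nvfp4_block_scale_interleave = _lazy_import_wrapper(
    "flashinfer", "nvfp4_block_scale_interleave")
```

With this indirection, importing vllm.utils.flashinfer stays cheap on machines without FlashInfer installed; an error surfaces only if the wrapped function is actually called.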