update vllm kernel benchmark scripts #176
1pikachu wants to merge 43 commits into vllm-project:main from
Conversation
Pull request overview
Updates and expands XPU kernel benchmark scripts to better support regular validation (correctness checks + perf reporting), with more consistent runtime configuration logging.
Changes:
- Added several new benchmark scripts for FP8 quant/GEMM, rotary embedding, MoE kernels, FlashAttention varlen, and MLA concat/cache.
- Updated existing benchmarks to print the active benchmark configuration/provider for easier CI log triage.
- Simplified RMSNorm benchmarking by removing optional IPEX path and expanding the config grid.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| benchmark/benchmark_topk.py | Adds per-run config/provider logging for CI traceability. |
| benchmark/benchmark_swigluoai_and_mul.py | Adds per-run config/provider logging for CI traceability. |
| benchmark/benchmark_static_scaled_fp8_quant.py | New correctness + perf benchmark for static scaled FP8 quant. |
| benchmark/benchmark_rotary_embedding.py | New correctness + perf benchmark for rotary embedding native vs vLLM paths. |
| benchmark/benchmark_rmsnorm.py | Removes IPEX path, expands config space, adds logging, refactors “naive” naming. |
| benchmark/benchmark_reshape_and_cache.py | Adds per-run config/provider logging for CI traceability. |
| benchmark/benchmark_moe_sum.py | New correctness + perf benchmark for moe_sum op. |
| benchmark/benchmark_moe_align_block_size.py | New correctness + perf benchmarks + opchecks for moe_align_block_size variants. |
| benchmark/benchmark_lora.py | Fixes imports to use benchmark.* package paths. |
| benchmark/benchmark_grouped_topk.py | Adds per-run config/provider logging for CI traceability. |
| benchmark/benchmark_fp8_gemm_w8a16.py | New correctness + perf benchmark for fp8_gemm_w8a16. |
| benchmark/benchmark_dynamic_per_token_scaled_fp8_quant.py | New correctness + perf benchmark for dynamic per-token FP8 quant. |
| benchmark/benchmark_cutlass_fused_moe.py | New correctness + perf benchmark for CUTLASS fused MoE vs reference. |
| benchmark/benchmark_cutlass_flash_attn_varlen.py | New correctness + perf benchmark for FlashAttention varlen vs native reference. |
| benchmark/benchmark_concat_and_cache_mla.py | New benchmark comparing torch.cat vs direct copy for MLA concat. |
```python
kv_lens=kv_lens,
block_tables=block_tables,
scale=scale,
casual=is_causal,
```
ref_paged_attn is called with keyword argument casual, which is very likely a typo for causal. If ref_paged_attn doesn’t accept casual, this will raise a TypeError and break the benchmark/correctness run. Rename the keyword to causal (or match the exact parameter name in ref_paged_attn).
```diff
-casual=is_causal,
+causal=is_causal,
```
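A minimal sketch of why the typo fails at runtime rather than silently: Python rejects unknown keyword arguments before the function body runs. The `ref_paged_attn` signature below is an assumption for illustration, not the real one.

```python
# Hypothetical stand-in for the real ref_paged_attn signature.
def ref_paged_attn(*, scale, causal=False):
    return causal

# Misspelled keyword -> TypeError at call time, before any attention math runs.
try:
    ref_paged_attn(scale=1.0, casual=True)
    raised = False
except TypeError:
    raised = True

print(raised)  # the misspelled call raises
print(ref_paged_attn(scale=1.0, causal=True))  # the corrected keyword works
```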
```python
kv_lens=kv_lens,
block_tables=block_tables,
scale=scale,
casual=is_causal,
```
Same issue as above: casual is likely an invalid keyword for ref_paged_attn and will crash at runtime. Use the correct keyword (likely causal).
```diff
-casual=is_causal,
+causal=is_causal,
```
```python
positions = torch.randint(0,
                          max_position, (batch_size, seq_len),
                          device=device)
head_stride = head_size + (64 if head_stride_is_contiguous else 0)
```
The head_stride_is_contiguous flag appears inverted here: when the stride is contiguous, head_stride typically equals head_size (no padding). Adding 64 when head_stride_is_contiguous is True makes the tensor less contiguous and undermines the intended layout coverage. Swap the condition so extra padding is added only when testing the non-contiguous stride case.
```diff
-head_stride = head_size + (64 if head_stride_is_contiguous else 0)
+head_stride = head_size + (0 if head_stride_is_contiguous else 64)
```
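A sketch of the corrected stride logic as a standalone function. Names mirror the diff; the 64-element pad comes from the original line, and wrapping it in a helper is only for illustration.

```python
# Hypothetical helper mirroring the corrected condition from the review.
def compute_head_stride(head_size: int, head_stride_is_contiguous: bool) -> int:
    # Contiguous layout: stride equals head_size, no padding.
    # Non-contiguous case: pad by 64 elements to force a strided layout.
    return head_size + (0 if head_stride_is_contiguous else 64)

print(compute_head_stride(128, True))   # contiguous: 128
print(compute_head_stride(128, False))  # strided: 192
```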
```python
                            dtype=dtype)
else:
    key_cache = torch.randn(sum(kv_lens),
                            num_query_heads,
```
In the non-paged KV path, key_cache is generated with num_query_heads. KV cache tensors are typically shaped with num_kv_heads (and the code already distinguishes num_query_heads vs num_kv_heads). Generating KV with query heads can make the reference and kernel consume mismatched shapes/semantics. Use num_kv_heads here (and ensure the reference path expects the same).
```diff
-num_query_heads,
+num_kv_heads,
```
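A shape-only sketch of the intended non-paged KV layout. The helper name is hypothetical; the point is that in grouped-query attention the cache carries `num_kv_heads` per cached token, which can be smaller than `num_query_heads`.

```python
# Hypothetical helper: the expected shape of a flat (non-paged) KV cache.
def kv_cache_shape(kv_lens, num_kv_heads, head_size):
    # One row per cached token across all sequences, KV heads only.
    return (sum(kv_lens), num_kv_heads, head_size)

print(kv_cache_shape([3, 5], 8, 128))  # (8, 8, 128)
```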
```python
args = parse_args()
seed_everything(0)

num_toknes = [1, 7, 83, 4096]
```
Correct the typo in the variable name num_toknes to num_tokens for readability and consistency (it’s printed as num_tokens and used to build configs).
```python
fp8_dtype = [torch.float8_e4m3fn]
group_shape = [(1, -1), (-1, 1)]
print("Final configuration:")
print(f"  num_tokens: {num_toknes}")
```
Correct the typo in the variable name num_toknes to num_tokens for readability and consistency (it’s printed as num_tokens and used to build configs).
```python
print(f"  group_shape: {group_shape}")

configs = list(
    itertools.product(num_toknes, hidden_size, dtype, fp8_dtype, group_shape))
```
Correct the typo in the variable name num_toknes to num_tokens for readability and consistency (it’s printed as num_tokens and used to build configs).
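A sketch of the config grid with `num_tokens` spelled consistently. Only `num_tokens` and `group_shape` values come from the diff; `hidden_size`, `dtype`, and `fp8_dtype` are placeholder strings standing in for the real torch dtypes.

```python
import itertools

# Values for num_tokens and group_shape are from the diff; the rest are
# placeholders for this sketch.
num_tokens = [1, 7, 83, 4096]
hidden_size = [1024]
dtype = ["bfloat16"]
fp8_dtype = ["float8_e4m3fn"]
group_shape = [(1, -1), (-1, 1)]

configs = list(
    itertools.product(num_tokens, hidden_size, dtype, fp8_dtype, group_shape))
print(len(configs))  # 4 * 1 * 1 * 1 * 2 = 8
```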
```python
from tests.utils import parse_args, opcheck, round_up, seed_everything
from tests.test_moe_align_block_size import torch_moe_align_block_size, _verify_expert_level_sorting

seed_everything(0)
```
Calling seed_everything(0) at import time introduces a module import side effect (and can interfere with other benchmarks/tests when this file is imported). Move seeding under the `if __name__ == "__main__":` guard (or into the specific routines) so importing the module doesn't alter global RNG state.
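A minimal sketch of the suggested restructuring. `seed_everything` here is a stdlib stand-in for the benchmark helper (an assumption; the real helper presumably also seeds torch): importing the module leaves global RNG state untouched, and seeding happens only when the script actually runs.

```python
import random

def seed_everything(seed: int) -> None:
    # Stand-in for the real helper, which would also seed torch/numpy.
    random.seed(seed)

def main() -> None:
    seed_everything(0)  # RNG state changes only when run as a script
    # ... build configs and run benchmarks ...

if __name__ == "__main__":
    main()
```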
benchmark/benchmark_rmsnorm.py
Outdated
```python
print(f"native output={output_native}")
print(f"vLLM output={output_vllm}")
```
Printing full output tensors for every correctness config can massively slow down benchmark runs and flood CI logs (especially with the expanded config grid). Consider printing only on mismatch (or printing summary stats like max/mean abs diff) and keeping the config identifier for debugging.
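A hedged sketch of the mismatch-only reporting pattern. The real benchmark would compare torch tensors with `torch.allclose`; this uses plain floats to keep the sketch self-contained, and the helper name is hypothetical.

```python
# Hypothetical helper: print summary stats plus the config tag only on
# mismatch, instead of dumping full output tensors for every config.
def report_correctness(native, vllm, tag, atol=1e-2):
    diffs = [abs(a - b) for a, b in zip(native, vllm)]
    max_diff = max(diffs)
    if max_diff > atol:
        print(f"[{tag}] MISMATCH max_abs={max_diff:.3e} "
              f"mean_abs={sum(diffs) / len(diffs):.3e}")
        return False
    return True

print(report_correctness([1.0, 2.0], [1.0, 2.0], "num_tokens=4096"))  # True
print(report_correctness([1.0, 2.0], [1.5, 2.0], "num_tokens=4096"))  # False
```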
Purpose
Update kernel benchmark scripts to set up regular validation pipelines.