
[fmha] support fp8 query#153

Open
xinyu-intel wants to merge 3 commits into vllm-project:main from xinyu-intel:dev/fp8-query

Conversation

@xinyu-intel
Collaborator

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Added FP8 query support. Currently, FP8 Q/K/V tensors are dequantized to the output dtype before the MMA.

depends on #150

vllm change: xinyu-intel/vllm@97c2151

example:

VLLM_WORKER_MULTIPROC_METHOD=spawn python examples/offline_inference/data_parallel.py --model /workspace/Qwen3-0.6B/ --no-enable-expert-parallel --enforce-eager --kv-cache-dtype fp8 --calculate-kv-scales
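The "dequantize to the output dtype before mma" approach described above can be sketched in pure Python (hypothetical helper names; the real kernel does this inside the CUTLASS mainloop on tensor fragments, not on whole matrices):

```python
# Minimal sketch of "dequantize before mma": fp8-stored Q and K values are
# scaled back to the compute/output dtype first, then multiplied, rather
# than performing the matmul in fp8 directly.
def dequantize(quantized, descale):
    """Apply a per-tensor descale factor to a 2-D list of fp8-stored values."""
    return [[x * descale for x in row] for row in quantized]

def matmul(a, b):
    """Plain 2-D matrix multiply (stand-in for the fused MMA)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def qk_scores(q_fp8, k_fp8, q_descale, k_descale):
    q = dequantize(q_fp8, q_descale)  # fp8 -> output dtype
    k = dequantize(k_fp8, k_descale)  # fp8 -> output dtype
    return matmul(q, k)               # K assumed pre-transposed in this sketch
```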

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Copilot AI review requested due to automatic review settings on February 11, 2026 13:16
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for FP8 query tensors in the flash attention implementation. Currently, FP8 Q/K/V tensors are dequantized to the output dtype before matrix multiplication operations.

Changes:

  • Modified flash attention interface to accept optional FP8 query scaling parameters
  • Updated kernel implementations to handle FP8 query dequantization
  • Enhanced test coverage for FP8 query scenarios

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Per-file summary:

  • vllm_xpu_kernels/flash_attn_interface.py: Added type hints for q/k/v_descale parameters and validation logic
  • tests/flash_attn/test_flash_attn_varlen_func.py: Extended tests to include FP8 query dtypes and updated the reference implementation
  • csrc/xpu/attn/xe_2/fmha_xe2.h: Changed k_scale and v_scale from float to optional Tensor
  • csrc/xpu/attn/xe_2/fmha_xe2.cpp: Updated to handle optional Tensor scaling parameters and FP8 query detection
  • csrc/xpu/attn/xe_2/fmha_utils.hpp: Extended CutlassQKType to CutlassQKOType to include the output dtype
  • csrc/xpu/attn/xe_2/collective/chunk_prefill_mainloop.hpp: Implemented FP8 query dequantization logic in the mainloop
  • csrc/xpu/attn/xe_2/collective/chunk_prefill_epilogue.hpp: Updated the sink element type to use the output dtype
  • csrc/xpu/attn/xe_2/chunk_prefill_utils.hpp: Updated type signatures for the new scaling parameter types
  • csrc/xpu/attn/xe_2/chunk_prefill_kernel_template.cpp.in: Updated the template parameter from CutlassQKType to CutlassQKOType
  • csrc/xpu/attn/xe_2/chunk_prefill_extern.hpp: Updated extern template declarations for the new type
  • csrc/xpu/attn/xe_2/chunk_prefill.hpp: Added FP8 query kernel configurations and updated type references
  • csrc/xpu/attn/attn_interface.h: Updated the interface signature for optional Tensor scaling
  • csrc/xpu/attn/attn_interface.cpp: Forwarded the new q_scale parameter to the implementation
  • csrc/flash_attn/flash_api.cpp: Modified dtype validation and added FP8 query support
  • .github/workflows/ut.yaml: Reduced MAX_JOB from 128 to 72
Comments suppressed due to low confidence (1)

tests/flash_attn/test_flash_attn_varlen_func.py:1

  • Line 267 checks q_descale is not None but should check v_descale is not None. This is a copy-paste error that will cause incorrect behavior when v_descale is provided without q_descale.
# SPDX-License-Identifier: Apache-2.0


Comment on lines +245 to +252
if is_fp8_query:
    q_descale = (torch.abs(query).max() / 200).to(torch.float32)
    maybe_quantized_query = (query / q_descale).to(q_dtype)
if is_fp8kv:
    k_descale = (torch.abs(key_cache).max() / 200).to(torch.float32)
    v_descale = (torch.abs(value_cache).max() / 200).to(torch.float32)
    maybe_quantized_key_cache = (key_cache / k_descale).to(fp8_dtype)
    maybe_quantized_value_cache = (value_cache / v_descale).to(fp8_dtype)

Copilot AI Feb 11, 2026


The magic number 200 appears three times without explanation. This should be extracted as a named constant (e.g., FP8_QUANTIZATION_DIVISOR) with a comment explaining its purpose in the FP8 scaling calculation.
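A pure-Python sketch of the refactor the comment suggests. The constant name, function names, and docstrings are illustrative and do not appear in the PR:

```python
# Hypothetical named constant replacing the magic number 200. Dividing the
# per-tensor amax by this value keeps quantized magnitudes at ~200, well
# inside the fp8 e4m3 representable range (max ~448), leaving headroom
# against rounding.
FP8_QUANT_DIVISOR = 200.0

def per_tensor_descale(values):
    """Per-tensor descale factor: amax / FP8_QUANT_DIVISOR.

    Pure-Python stand-in for `(torch.abs(t).max() / 200).to(torch.float32)`.
    """
    return max(abs(v) for v in values) / FP8_QUANT_DIVISOR

def quantize(values, descale):
    """Scale values down for fp8 storage (actual fp8 rounding omitted here)."""
    return [v / descale for v in values]
```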

Comment on lines +136 to +138
is_fp8_q ? q_scale.value().data_ptr() : nullptr,
is_fp8_kv ? k_scale.value().data_ptr() : nullptr,
is_fp8_kv ? v_scale.value().data_ptr() : nullptr,

Copilot AI Feb 11, 2026


Direct use of .value() without checking .has_value() first can lead to undefined behavior if the optional is empty. Add explicit checks or document the assumption that when is_fp8_q or is_fp8_kv is true, the corresponding scale tensors must have values.
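One way to make that precondition explicit, sketched with a stand-in `Tensor` type and a hypothetical `get_scale_ptr` helper (neither is from the PR; the real code would use `at::Tensor` and `TORCH_CHECK` rather than `assert`):

```cpp
#include <cassert>
#include <optional>

// Minimal stand-in for a tensor exposing data_ptr().
struct Tensor {
    float data;
    void* data_ptr() { return &data; }
};

// Guard the optional before dereferencing: when the fp8 path is taken,
// the caller must have supplied the corresponding scale tensor.
void* get_scale_ptr(bool is_fp8, std::optional<Tensor>& scale) {
    if (!is_fp8) return nullptr;
    assert(scale.has_value() && "fp8 path requires a scale tensor");
    return scale->data_ptr();
}
```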

Comment on lines +282 to +287
q_descale=q_descale.expand(scale_shape) if q_descale is not None else None,
k_descale=k_descale.expand(scale_shape) if k_descale is not None else None,
v_descale=v_descale.expand(scale_shape) if v_descale is not None else None,

Copilot AI Feb 11, 2026


Line 287 checks q_descale is not None but should check v_descale is not None. This appears to be a duplicate of the issue on line 267: the condition is checking the wrong variable.

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>