[WIP][Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend #23289

MatthewBonanni · 2025-08-20T21:33:33Z

Purpose

Enable FP8 KV cache support on Blackwell in the CUTLASS_MLA backend.

Based on #22668; that PR should be merged first.
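For context on what an FP8 KV cache involves, here is a minimal pure-Python sketch of per-tensor scaled quantization and dequantization. It is not the CUTLASS_MLA kernel or any code from this PR. It assumes the E4M3 format (largest finite value 448), which this description does not confirm, and the names `round_to_e4m3`, `quantize_fp8`, and `dequantize_fp8` are hypothetical helpers made up for illustration.

```python
import math

# Assumption: E4M3 "fn" variant, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0 or math.isnan(v):
        return v
    sign = -1.0 if v < 0 else 1.0
    a = min(abs(v), FP8_E4M3_MAX)
    # frexp returns a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    # -6 is the smallest normal exponent; below it the spacing stays fixed (subnormals).
    exp = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)


def quantize_fp8(values):
    """Hypothetical helper: pick a per-tensor scale, then round scaled values to E4M3."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize_fp8(quantized, scale):
    """Hypothetical helper: multiply by the scale to recover approximate values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    kv = [0.5, -1.25, 3.0, 0.01]
    q, scale = quantize_fp8(kv)
    print(q, scale)
    print(dequantize_fp8(q, scale))
```

The scale maps the largest magnitude in the tensor onto the top of the FP8 range, so the stored values use the available precision, and the same scale is applied again when the cache is read back.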

Test Plan

Test Result

(Optional) Documentation Update

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting a before-and-after comparison or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Matthew Bonanni <[email protected]>

Signed-off-by: breno.skuk <[email protected]> Signed-off-by: Breno Baldas Skuk <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Michael Goin <[email protected]>


…ng Tests (vllm-project#22871) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]>


github-actions · 2025-08-20T21:33:41Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify · 2025-08-20T21:34:22Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

MatthewBonanni and others added 30 commits August 11, 2025 18:38

Pass layer to _forward_decode, add q and k descale for FlashMLA backend

c840bc0

Signed-off-by: Matthew Bonanni <[email protected]>

Update tests

5854011

Signed-off-by: Matthew Bonanni <[email protected]>

Quantize Q and KV, fix shape error with scales

cce4e0a

Signed-off-by: Matthew Bonanni <[email protected]>

Update to reflect FP8 support in FLASHMLA backend

64febac

Signed-off-by: Matthew Bonanni <[email protected]>

Update cmake

b548b10

Signed-off-by: Matthew Bonanni <[email protected]>

Address comment

8ae24a3

Signed-off-by: Matthew Bonanni <[email protected]>

Address pre-commit hooks

860f3e0

Signed-off-by: Matthew Bonanni <[email protected]>

Dequant in chunked prefill

dd7977d

Signed-off-by: Matthew Bonanni <[email protected]>

Dequant within gather_cache kernel

8dfbf29

Signed-off-by: Matthew Bonanni <[email protected]>

Update test

926ba4d

Signed-off-by: Matthew Bonanni <[email protected]>

Merge branch 'main' into feature/fp8_mla_flashmla

9cb1836

Signed-off-by: Matthew Bonanni <[email protected]>

Merge branch 'main' into feature/fp8_mla_flashmla

b1d21c8

Update GIT_TAG

56e8135

Signed-off-by: Matthew Bonanni <[email protected]>

Merge branch 'vllm-project:main' into feature/fp8_mla_flashmla

d618243

Remove unnecessary contiguous() calls - tensors are already contiguous

7b86ffb

Signed-off-by: Matthew Bonanni <[email protected]>

Update fp8 platform/backend support logic

f59cf50

Signed-off-by: Matthew Bonanni <[email protected]>

Use Blackwell FlashInfer MXFP4 MoE by default if available (vllm-proj…

804bc10

…ect#23008) Signed-off-by: mgoin <[email protected]>

Install tpu_info==0.4.0 to fix core dump for TPU (vllm-project#23135)

279ac5c

[Misc] Minor refactoring for prepare_inputs (vllm-project#23116)

5759f9d

Signed-off-by: Woosuk Kwon <[email protected]>

[Spec Decode] Make propose_draft_token_ids non-blocking for lower T…

be1ab29

…TFT (vllm-project#23041) Signed-off-by: Woosuk Kwon <[email protected]>

[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-atten…

d414f5f

…tion related code (vllm-project#23122) Signed-off-by: Thomas Parnell <[email protected]>

[V0 Deprecation] Remove V0 FlashInfer attention backend (vllm-project…

99a371f

…#22776) Signed-off-by: Woosuk Kwon <[email protected]>

chore: disable enable_cpp_symbolic_shape_guards (vllm-project#23048)

fcfa758

Signed-off-by: Xiao Liu <[email protected]>

[TPU] make ptxla not imported when using tpu_commons (vllm-project#23081

07887d0

) Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: Chengji Yao <[email protected]>

[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (vllm…

27d601a

…-project#22725) Signed-off-by: Nikhil Suryawanshi <[email protected]>

Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema (v…

7344fad

…llm-project#22023) Signed-off-by: Benji Beck <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

[Log] Warning Once for Cutlass MLA (vllm-project#23137)

e753218

Signed-off-by: yewentao256 <[email protected]>

[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Think…

0977823

…ing-2506 (vllm-project#23114) Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

calvin0327 and others added 12 commits August 20, 2025 15:41

[Model] use autoWeightsLoader for gptoss (vllm-project#22446)

7832428

Signed-off-by: calvin chen <[email protected]>

Fix missing quotes (vllm-project#23242)

f07a407

Signed-off-by: Shiming Zhang <[email protected]>

[Model] Support deepseek with eagle (vllm-project#21086)

02b939e

Signed-off-by: Xin Yang <[email protected]>

[Bugfix] Ensure correctness of Cohere2Vision processing (vllm-project…

287c883

…#23245) Signed-off-by: DarkLight1337 <[email protected]>

Update to flashinfer-python==0.2.12 and disable AOT compile for non-r…

75cc6b8

…elease image (vllm-project#23129) Signed-off-by: mgoin <[email protected]>

[Model][V1] Support Ernie MTP (vllm-project#22169)

31aa116

Signed-off-by: zhouchong <[email protected]> Co-authored-by: zhouchong <[email protected]>

[Model] Improve olmo and olmo2 (vllm-project#23228)

059a042

Signed-off-by: Jee Jee Li <[email protected]>

[Fix] fix offline env use local mode path (vllm-project#22526)

fe37b01

Signed-off-by: rongfu.leng <[email protected]>

Enable FP8 in cutlass MLA impl

c7c64d6

Signed-off-by: Matthew Bonanni <[email protected]>

Update kv_cache_dtype support in CudaPlatformBase

35930d3

Signed-off-by: Matthew Bonanni <[email protected]>

Dequant the output for the up proj

433bd5a

Signed-off-by: Matthew Bonanni <[email protected]>

Update test, use style of test_flashmla.py

28d207d

Signed-off-by: Matthew Bonanni <[email protected]>

mergify bot added the needs-rebase label Aug 20, 2025

MatthewBonanni mentioned this pull request Aug 21, 2025

[Kernel] Add FP8 support with FlashMLA backend #22668

Open

4 tasks
