Conversation

@MatthewBonanni (Contributor) commented Feb 12, 2026

Purpose

DeepseekV32IndexerMetadataBuilder currently reports support only for UNIFORM_SINGLE_TOKEN_DECODE. As a result, when running a sparse MLA model with MTP, FULL cudagraphs are never captured.

In reality, the DeepGEMM kernel fp8_paged_mqa_logits does support MTP with num_speculative_tokens=1 (i.e. next_n = 2).

This PR changes the reported support to UNIFORM_BATCH and adds an explicit error for num_speculative_tokens > 1, rather than letting the kernel itself crash.
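
For concreteness, here is a minimal, self-contained sketch of the two changes. The enum and builder internals below are assumptions based on this description (the real AttentionCGSupport enum and DeepseekV32IndexerMetadataBuilder live in vLLM's DeepSeek-V3.2 indexer backend), so treat this as an illustration, not the actual diff:

from enum import Enum, auto

class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()  # pure decode, 1 token per request
    UNIFORM_BATCH = auto()                # uniform batches, incl. MTP decode
    ALWAYS = auto()

class DeepseekV32IndexerMetadataBuilder:
    # Before this PR: UNIFORM_SINGLE_TOKEN_DECODE, which excluded MTP
    # batches and so prevented FULL cudagraph capture.
    _cudagraph_support = AttentionCGSupport.UNIFORM_BATCH

    def __init__(self, num_speculative_tokens: int = 0) -> None:
        # fp8_paged_mqa_logits only handles next_n <= 2 (one speculative
        # token per step); fail fast with a clear error instead of letting
        # the kernel itself crash.
        if num_speculative_tokens > 1:
            raise ValueError(
                "fp8_paged_mqa_logits supports at most "
                f"num_speculative_tokens=1, got {num_speculative_tokens}")
        self.num_speculative_tokens = num_speculative_tokens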

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/main/data/spec_bench/question.jsonl
vllm bench serve \
    --dataset-name spec_bench \
    --dataset-path question.jsonl \
    --spec-bench-output-len 1024 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --skip-chat-template

Test Result

Main:

PR:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  234.50    
Total input tokens:                      269392    
Total generated tokens:                  1023063   
Request throughput (req/s):              4.26      
Output token throughput (tok/s):         4362.81   
Peak output token throughput (tok/s):    7000.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          5511.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          41407.32  
Median TTFT (ms):                        35815.64  
P99 TTFT (ms):                           95279.58  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          156.45    
Median TPOT (ms):                        154.36    
P99 TPOT (ms):                           192.94    
---------------Inter-token Latency----------------
Mean ITL (ms):                           305.20    
Median ITL (ms):                         145.98    
P99 ITL (ms):                            3598.89   
---------------Speculative Decoding---------------
Acceptance rate (%):                     95.25     
Acceptance length:                       1.95      
Drafts:                                  523691    
Draft tokens:                            523691    
Accepted tokens:                         498803    
Per-position acceptance (%):
  Position 0:                            95.25     
==================================================
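
As a sanity check, the speculative-decoding numbers above are internally consistent: 498803 accepted / 523691 draft tokens ≈ 95.25%, matching the reported acceptance rate, and with one draft token per step the expected acceptance length is 1 + 0.9525 ≈ 1.95 tokens per verification step, matching the reported value.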


Signed-off-by: Matthew Bonanni <[email protected]>
@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly enables FULL cudagraph support for sparse MLA models with MTP by changing _cudagraph_support to UNIFORM_BATCH. It also adds a necessary safeguard to prevent crashes by raising a ValueError for unsupported num_speculative_tokens > 1, which is a limitation of the fp8_paged_mqa_logits kernel. The changes are well-implemented, improving both functionality and robustness. The code is clean and the logic is sound.
