Conversation

@MatthewBonanni (Contributor) commented Feb 12, 2026

Purpose

DeepseekV32IndexerMetadataBuilder currently reports support only for UNIFORM_SINGLE_TOKEN_DECODE. As a result, when running a sparse MLA model with MTP, FULL cudagraphs are never captured.

In reality, the DeepGEMM kernel fp8_paged_mqa_logits does support MTP with num_speculative_tokens=1 (i.e. next_n = 2).

This PR changes the reported support to UNIFORM_BATCH and adds an explicit error for num_speculative_tokens > 1, rather than letting the kernel itself crash.
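
For concreteness, here is a minimal, self-contained sketch of the two changes. The enum and builder internals below are assumptions based on this description (the real AttentionCGSupport enum and DeepseekV32IndexerMetadataBuilder live in vLLM's DeepSeek-V3.2 indexer backend), so treat this as an illustration, not the actual diff:

from enum import Enum, auto

class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()  # pure decode, 1 token per request
    UNIFORM_BATCH = auto()                # uniform batches, incl. MTP decode
    ALWAYS = auto()

class DeepseekV32IndexerMetadataBuilder:
    # Before this PR: UNIFORM_SINGLE_TOKEN_DECODE, which excluded MTP
    # batches and so prevented FULL cudagraph capture.
    _cudagraph_support = AttentionCGSupport.UNIFORM_BATCH

    def __init__(self, num_speculative_tokens: int = 0) -> None:
        # fp8_paged_mqa_logits only handles next_n <= 2 (one speculative
        # token per step); fail fast with a clear error instead of letting
        # the kernel itself crash.
        if num_speculative_tokens > 1:
            raise ValueError(
                "fp8_paged_mqa_logits supports at most "
                f"num_speculative_tokens=1, got {num_speculative_tokens}")
        self.num_speculative_tokens = num_speculative_tokens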

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/main/data/spec_bench/question.jsonl
vllm bench serve \
    --dataset-name spec_bench \
    --dataset-path question.jsonl \
    --spec-bench-output-len 1024 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --skip-chat-template

Test Result

Main:

PR:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  234.50    
Total input tokens:                      269392    
Total generated tokens:                  1023063   
Request throughput (req/s):              4.26      
Output token throughput (tok/s):         4362.81   
Peak output token throughput (tok/s):    7000.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          5511.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          41407.32  
Median TTFT (ms):                        35815.64  
P99 TTFT (ms):                           95279.58  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          156.45    
Median TPOT (ms):                        154.36    
P99 TPOT (ms):                           192.94    
---------------Inter-token Latency----------------
Mean ITL (ms):                           305.20    
Median ITL (ms):                         145.98    
P99 ITL (ms):                            3598.89   
---------------Speculative Decoding---------------
Acceptance rate (%):                     95.25     
Acceptance length:                       1.95      
Drafts:                                  523691    
Draft tokens:                            523691    
Accepted tokens:                         498803    
Per-position acceptance (%):
  Position 0:                            95.25     
==================================================
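
As a sanity check, the speculative-decoding numbers above are internally consistent: 498803 accepted / 523691 draft tokens ≈ 95.25%, matching the reported acceptance rate, and with one draft token per step the expected acceptance length is 1 + 0.9525 ≈ 1.95 tokens per verification step, matching the reported value.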


Signed-off-by: Matthew Bonanni <[email protected]>
@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly enables FULL cudagraph support for sparse MLA models with MTP by changing _cudagraph_support to UNIFORM_BATCH. It also adds a necessary safeguard to prevent crashes by raising a ValueError for unsupported num_speculative_tokens > 1, which is a limitation of the fp8_paged_mqa_logits kernel. The changes are well-implemented, improving both functionality and robustness. The code is clean and the logic is sound.
