@JC-ut0 JC-ut0 commented Jul 17, 2025

What this PR does / why we need it?

Adapt the MTP model to torchair graph mode.

The MTP model uses the torchair graph only during the decode phase. Our padding strategy always runs a fixed 1+MTP tokens per batch, regardless of whether the main model accepts or rejects them. When generating the MTP hidden state, we take only the index of the last accepted token for each batch.
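
The index selection described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes that "per batch" means each request occupies a fixed slot of 1+MTP positions in a flattened token layout, and the names `last_accepted_indices`, `gather_hidden_states`, `num_accepted`, and `num_mtp_tokens` are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def last_accepted_indices(num_accepted: Sequence[int], num_mtp_tokens: int) -> List[int]:
    """Return, per request, the flat index of its last accepted token.

    Assumption (not confirmed by the PR): every request is padded to a fixed
    slot of 1 + num_mtp_tokens positions, laid out back to back.
    """
    slot = 1 + num_mtp_tokens  # fixed number of tokens run per request
    indices = []
    for req, accepted in enumerate(num_accepted):
        # At least one token is always kept and at most the whole slot.
        if not 1 <= accepted <= slot:
            raise ValueError(f"request {req}: accepted={accepted} outside 1..{slot}")
        indices.append(req * slot + accepted - 1)
    return indices


def gather_hidden_states(hidden_states: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Pick one hidden state per request at the given flat indices."""
    return [hidden_states[i] for i in indices]
```

For example, with `num_mtp_tokens=1` and `num_accepted=[2, 1, 2]`, each request owns two positions and the selected indices are `[1, 2, 5]`. The padded positions after a rejected token are still computed but never read, which keeps the input shape constant across decode steps.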

Does this PR introduce any user-facing change?

How was this patch tested?

Tested with DP4 TP4, TP16 DP1, TP8 DP2, and prefill-decode disaggregation.

@JC-ut0 JC-ut0 force-pushed the v0.9.1-dev branch 3 times, most recently from c201ddb to aec1751 Compare July 23, 2025 06:33
@JC-ut0 JC-ut0 requested a review from NeverRaR July 23, 2025 06:57
@JC-ut0 JC-ut0 force-pushed the v0.9.1-dev branch 2 times, most recently from ec6ac82 to bb12ac5 Compare July 23, 2025 07:15
@ganyi1996ppo ganyi1996ppo merged commit 4c369a8 into vllm-project:v0.9.1-dev Jul 23, 2025
17 checks passed