[Main] fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len #3910
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
This reverts commit adadd50. Signed-off-by: zouyida2052 <[email protected]>
… of uniform_decode_query_len Signed-off-by: zouyida2052 <[email protected]>
Signed-off-by: zouyida2052 <[email protected]>
Code Review
This pull request introduces several changes, including reverting a previous bugfix, adding a validation check for cudagraph_capture_sizes, and fixing a bug related to max_num_seqs. The changes in vllm_ascend/worker/model_runner_v1.py to raise an error for invalid cudagraph_capture_sizes are well-implemented with a clear error message. The refactoring in vllm_ascend/utils.py and vllm_ascend/torchair/torchair_model_runner.py improves code structure.
However, I've identified a critical issue in vllm_ascend/torchair/torchair_model_runner.py where mc2_tokens_capacity is calculated incorrectly, which could lead to insufficient memory allocation. Additionally, there's an opportunity to improve error handling for hardware limit violations by raising exceptions instead of just logging errors. Please see my detailed comments for suggestions.
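To make the new validation concrete, here is a minimal standalone sketch of the kind of check this PR adds. The function name and signature are assumptions for illustration; identifiers such as `cudagraph_capture_sizes` and `uniform_decode_query_len` mirror the PR, but this is not the actual vLLM Ascend implementation.

```python
# Hypothetical sketch: reject any captured graph batch size that is not an
# integer multiple of uniform_decode_query_len, as the PR's new check does.
def validate_cudagraph_capture_sizes(capture_sizes, uniform_decode_query_len):
    for size in capture_sizes:
        if size % uniform_decode_query_len != 0:
            raise ValueError(
                f"cudagraph_capture_sizes entry {size} must be an integer "
                f"multiple of uniform_decode_query_len "
                f"({uniform_decode_query_len}).")

# With mtp=2, each decode step handles 3 tokens per request, so sizes
# like 3, 6, 12 are valid while 4 would be rejected.
validate_cudagraph_capture_sizes([3, 6, 12], uniform_decode_query_len=3)
```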
```python
max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
    self.max_num_reqs, tp_size)
```
There appears to be a bug in the calculation of max_graph_batch_size. The function calculate_new_torchair_graph_batch_size expects a token count, but it's being called with self.max_num_reqs, which is a request count. This will lead to an incorrect and much smaller mc2_tokens_capacity. The previous implementation correctly calculated max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len. This logic should be restored before calling the new helper function.
```diff
-max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
-    self.max_num_reqs, tp_size)
+max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
+max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
+    max_num_tokens, tp_size)
```
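A quick numeric illustration of why the distinction matters (the numbers are assumed for this PR's `max_seqs=14`, `mtp=2` scenario, where each decode step would process 3 tokens per request):

```python
# Illustrative values only, not taken from the actual runtime:
max_num_reqs = 14            # request count (what the buggy call passes)
uniform_decode_query_len = 3  # tokens per request per decode step with mtp=2

# Buggy call sizes the capacity from the request count alone ...
buggy_input = max_num_reqs                                # 14
# ... but the graph must hold every decode token across all requests.
correct_input = max_num_reqs * uniform_decode_query_len   # 42

print(buggy_input, correct_input)  # 14 42
```

Passing the request count therefore undersizes `mc2_tokens_capacity` by a factor of `uniform_decode_query_len`.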
```python
if get_ascend_soc_version() == AscendSocVersion.A3 and self.mc2_tokens_capacity > 512:
    logger.error(
        f"A3: the max number of tokens must smaller then 512, but now is {self.mc2_tokens_capacity}"
    )
if get_ascend_soc_version() == AscendSocVersion.A2 and self.mc2_tokens_capacity > 256:
    logger.error(
        f"A2: the max number of tokens must smaller then 256, but now is {self.mc2_tokens_capacity}"
    )
```
Using logger.error for violations of hard hardware limits might not be sufficient. An error log will be printed, but the execution will continue, potentially leading to more obscure failures later on. It would be better to raise a ValueError to halt execution immediately and provide a clear error message to the user. This also provides an opportunity to improve the error messages for clarity and grammatical correctness. Additionally, calling get_ascend_soc_version() once and storing it in a local variable would be more efficient.
```python
soc_version = get_ascend_soc_version()
if soc_version == AscendSocVersion.A3 and self.mc2_tokens_capacity > 512:
    raise ValueError(
        f"On Ascend A3, the max number of tokens for mc2 must be smaller than or equal to 512, but it is {self.mc2_tokens_capacity}"
    )
if soc_version == AscendSocVersion.A2 and self.mc2_tokens_capacity > 256:
    raise ValueError(
        f"On Ascend A2, the max number of tokens for mc2 must be smaller than or equal to 256, but it is {self.mc2_tokens_capacity}"
    )
```
Signed-off-by: zouyida2052 <[email protected]>
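The fail-fast pattern suggested above can be exercised in isolation. The sketch below is a self-contained stand-in, not the vLLM Ascend code: `AscendSocVersion` is re-declared as a plain enum, the 512/256 limits follow the review comment, and the function name `check_mc2_tokens_capacity` is invented for illustration.

```python
from enum import Enum


class AscendSocVersion(Enum):
    A2 = "A2"
    A3 = "A3"


# Per-SoC hard limits on mc2 token capacity, per the review comment.
_MC2_TOKEN_LIMITS = {AscendSocVersion.A3: 512, AscendSocVersion.A2: 256}


def check_mc2_tokens_capacity(soc_version, mc2_tokens_capacity):
    """Raise immediately instead of logging, so misconfiguration
    surfaces at startup rather than as an obscure failure later."""
    limit = _MC2_TOKEN_LIMITS.get(soc_version)
    if limit is not None and mc2_tokens_capacity > limit:
        raise ValueError(
            f"On Ascend {soc_version.value}, the max number of tokens for "
            f"mc2 must be smaller than or equal to {limit}, but it is "
            f"{mc2_tokens_capacity}")
```

A boundary value such as 512 on A3 passes, while 513 raises, which is the behavior the suggested change intends.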
What this PR does / why we need it?
Does this PR introduce any user-facing change?
no
How was this patch tested?