
Conversation

@zouyida2052 (Contributor) commented on Oct 30, 2025

What this PR does / why we need it?

  1. Revert the bugfix for MTP in full graph mode; it will be re-enabled once vLLM supports it upstream.
  2. Raise an error when a cudagraph_capture_sizes entry is not an integer multiple of uniform_decode_query_len (see the sketch after this list).
  3. Fix a bug that occurs with max_num_seqs=14 in the mtp=2 scenario.
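
A minimal sketch of the check in item 2, assuming it iterates over the configured capture sizes; the attribute names come from the PR description, while the function name and surrounding structure are hypothetical:

    def validate_capture_sizes(cudagraph_capture_sizes: list[int],
                               uniform_decode_query_len: int) -> None:
        # Each captured decode graph covers batch_size * uniform_decode_query_len
        # tokens, so every capture size must be a whole multiple of that length.
        for size in cudagraph_capture_sizes:
            if size % uniform_decode_query_len != 0:
                raise ValueError(
                    f"cudagraph_capture_sizes entry {size} must be an integer "
                    f"multiple of uniform_decode_query_len "
                    f"({uniform_decode_query_len}).")

For example, with MTP proposing 2 tokens per step, uniform_decode_query_len is presumably 3, so a capture size of 14 would be rejected while 15 passes.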

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several changes, including reverting a previous bugfix, adding a validation check for cudagraph_capture_sizes, and fixing a bug related to max_num_seqs. The changes in vllm_ascend/worker/model_runner_v1.py to raise an error for invalid cudagraph_capture_sizes are well-implemented with a clear error message. The refactoring in vllm_ascend/utils.py and vllm_ascend/torchair/torchair_model_runner.py improves code structure.

However, I've identified a critical issue in vllm_ascend/torchair/torchair_model_runner.py where mc2_tokens_capacity is calculated incorrectly, which could lead to insufficient memory allocation. Additionally, there's an opportunity to improve error handling for hardware limit violations by raising exceptions instead of just logging errors. Please see my detailed comments for suggestions.

Comment on lines 122 to 123
max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
    self.max_num_reqs, tp_size)

critical

There appears to be a bug in the calculation of max_graph_batch_size. The function calculate_new_torchair_graph_batch_size expects a token count, but it's being called with self.max_num_reqs, which is a request count. This will lead to an incorrect and much smaller mc2_tokens_capacity. The previous implementation correctly calculated max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len. This logic should be restored before calling the new helper function.

Suggested change:

-max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
-    self.max_num_reqs, tp_size)
+max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
+max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
+    max_num_tokens, tp_size)
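
For scale: with max_num_reqs=16 and MTP proposing 2 tokens per step (so uniform_decode_query_len is presumably 3), the current call would size mc2_tokens_capacity for 16 tokens where 48 are needed, a 3x shortfall.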

Comment on lines +126 to +135
if get_ascend_soc_version(
) == AscendSocVersion.A3 and self.mc2_tokens_capacity > 512:
    logger.error(
        f"A3: the max number of tokens must smaller then 512, but now is {self.mc2_tokens_capacity}"
    )
if get_ascend_soc_version(
) == AscendSocVersion.A2 and self.mc2_tokens_capacity > 256:
    logger.error(
        f"A2: the max number of tokens must smaller then 256, but now is {self.mc2_tokens_capacity}"
    )

high

Using logger.error for violations of hard hardware limits might not be sufficient. An error log will be printed, but the execution will continue, potentially leading to more obscure failures later on. It would be better to raise a ValueError to halt execution immediately and provide a clear error message to the user. This also provides an opportunity to improve the error messages for clarity and grammatical correctness. Additionally, calling get_ascend_soc_version() once and storing it in a local variable would be more efficient.

        soc_version = get_ascend_soc_version()
        if soc_version == AscendSocVersion.A3 and self.mc2_tokens_capacity > 512:
            raise ValueError(
                f"On Ascend A3, the max number of tokens for mc2 must be smaller than or equal to 512, but it is {self.mc2_tokens_capacity}"
            )
        if soc_version == AscendSocVersion.A2 and self.mc2_tokens_capacity > 256:
            raise ValueError(
                f"On Ascend A2, the max number of tokens for mc2 must be smaller than or equal to 256, but it is {self.mc2_tokens_capacity}"
            )

Signed-off-by: zouyida2052 <[email protected]>
@whx-sjtu added the ready (read for review) and ready-for-test (start test by label for PR) labels on Oct 30, 2025
@wangxiyuan re-applied the ready-for-test (start test by label for PR) label on Oct 30, 2025
@wangxiyuan wangxiyuan merged commit 1966885 into vllm-project:main Oct 31, 2025
57 of 75 checks passed

Labels

module:core, ready (read for review), ready-for-test (start test by label for PR)
