
Conversation

@lilinsiman
Contributor

@lilinsiman lilinsiman commented Oct 30, 2025

What this PR does / why we need it?

Add a new test case for aclgraph capture and replay.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Unit tests.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds an end-to-end test for aclgraph capture and replay with data parallelism. The test logic is comprehensive, but there are a few areas for improvement regarding test performance, maintainability, and robustness. I've identified a critical performance issue with a long sleep, and some high-severity issues related to magic numbers and brittle test implementation that should be addressed.

Comment on lines +166 to +169
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = 'spawn'
sleep(600)

critical

This sleep(600) call will cause the test to hang for 10 minutes after all assertions have passed, which severely impacts CI performance and should be removed.

Additionally, modifying os.environ at the end of a test is an anti-pattern as it can affect other tests in unpredictable ways. If you need to manage environment variables for tests, it's better to use pytest fixtures to ensure proper setup and teardown for each test, maintaining test isolation.
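
For illustration, a minimal sketch of that pattern using pytest's built-in monkeypatch fixture (the fixture and test names here are hypothetical):

import pytest


@pytest.fixture
def spawn_worker_method(monkeypatch):
    # monkeypatch restores the original environment automatically after the test
    monkeypatch.setenv("VLLM_WORKER_MULTIPROC_METHOD", "spawn")


def test_aclgraph_capture_replay(spawn_worker_method):
    ...  # test body runs with the variable set; no manual cleanup or sleep needed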

Comment on lines +92 to +104
def trace_calls(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        func_name = code.co_name
        file_name = code.co_filename
        if func_name == 'execute_dummy_batch' and 'worker_v1.py' in file_name:
            with num_execute_model_shared.get_lock():
                num_execute_model_shared.value += 1
    return trace_calls

sys.settrace(trace_calls)

high

Using sys.settrace to count method calls is brittle as it depends on string matching for function and file names (execute_dummy_batch, worker_v1.py). This can easily break if the target code is refactored. A more robust and idiomatic approach is to use unittest.mock.patch to wrap the method, similar to how NPUGraph.replay and NPUGraph.__init__ are already being tracked in this test.

This block and the corresponding sys.settrace(None) on line 118 should be replaced by patching vllm_ascend.worker.worker_v1.NPUWorker.execute_dummy_batch.
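
For illustration, a minimal sketch of that approach, assuming the counter only needs to be updated in the process where the patch is applied (the wrapper name is hypothetical; num_execute_model_shared is the shared counter already used by the test):

from unittest.mock import patch

from vllm_ascend.worker.worker_v1 import NPUWorker

original_execute_dummy_batch = NPUWorker.execute_dummy_batch


def counting_execute_dummy_batch(self, *args, **kwargs):
    # Increment the shared counter, then delegate to the real implementation.
    with num_execute_model_shared.get_lock():
        num_execute_model_shared.value += 1
    return original_execute_dummy_batch(self, *args, **kwargs)


with patch.object(NPUWorker, "execute_dummy_batch", counting_execute_dummy_batch):
    ...  # run the engine / generate calls under test here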

Comment on lines +153 to +157
max_num_batch_sizes = math.floor(
    (1800 - num_comm_groups * 40) / num_acl_graphs /
    (1 + num_comm_groups * 2))

high

The calculation for max_num_batch_sizes uses magic numbers (1800, 40, 2), which makes the logic difficult to understand and maintain. These values should be defined as named constants with descriptive names explaining their significance. This will improve code readability and make it easier to update if the underlying assumptions change.

For example:

# At the top of the file or function
_ACL_GRAPH_MEM_LIMIT = 1800
_COMM_GROUP_MEM_OVERHEAD = 40
_BATCH_SIZE_FACTOR = 2

max_num_batch_sizes = math.floor(
    (_ACL_GRAPH_MEM_LIMIT - num_comm_groups * _COMM_GROUP_MEM_OVERHEAD) / num_acl_graphs /
    (1 + num_comm_groups * _BATCH_SIZE_FACTOR))

@linfeng-yuan linfeng-yuan added ready read for review ready-for-test start test by label for PR labels Oct 30, 2025
@lilinsiman lilinsiman force-pushed the aclgraph_capture_replay branch from 5dfe4ad to 82d1e90 Compare October 30, 2025 07:08