add new test case for aclgraph capture and replay #3886

lilinsiman · 2025-10-30T01:54:07Z

What this PR does / why we need it?

Adds a new test case for aclgraph capture and replay.

Does this PR introduce any user-facing change?

no

How was this patch tested?

ut

vLLM version: v0.11.0rc3
vLLM main: vllm-project/vllm@83f478b

github-actions · 2025-10-30T01:54:14Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request adds an end-to-end test for aclgraph capture and replay with data parallelism. The test logic is comprehensive, but there are a few areas for improvement regarding test performance, maintainability, and robustness. I've identified a critical performance issue with a long sleep, and some high-severity issues related to magic numbers and brittle test implementation that should be addressed.

gemini-code-assist · 2025-10-30T01:55:34Z

tests/e2e/multicard/test_aclgraph_capture_replay.py

+    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = 'spawn'
+    sleep(600)


This sleep(600) call will cause the test to hang for 10 minutes after all assertions have passed, which severely impacts CI performance and should be removed.

Additionally, modifying os.environ at the end of a test is an anti-pattern as it can affect other tests in unpredictable ways. If you need to manage environment variables for tests, it's better to use pytest fixtures to ensure proper setup and teardown for each test, maintaining test isolation.
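A minimal sketch of the scoped-environment idea described above, using only the standard library. The helper names `worker_multiproc_method` and `read_multiproc_method` are hypothetical and do not come from the PR; only the variable name `VLLM_WORKER_MULTIPROC_METHOD` is taken from the diff. In a pytest suite the same effect could be achieved with a fixture built on `monkeypatch.setenv`, which also undoes the change at teardown.

```python
import os
from contextlib import contextmanager
from unittest.mock import patch


@contextmanager
def worker_multiproc_method(method="spawn"):
    """Set VLLM_WORKER_MULTIPROC_METHOD for the duration of a block only."""
    # patch.dict restores os.environ to its previous state on exit,
    # even if the body raises, so other tests never see the change.
    with patch.dict(os.environ, {"VLLM_WORKER_MULTIPROC_METHOD": method}):
        yield


def read_multiproc_method():
    # Hypothetical stand-in for the code under test that reads the variable.
    return os.environ.get("VLLM_WORKER_MULTIPROC_METHOD")


before = read_multiproc_method()
with worker_multiproc_method("spawn"):
    inside = read_multiproc_method()
after = read_multiproc_method()
```

With this shape, the variable is set before the code that needs it runs and is restored afterwards, so there is no trailing `os.environ` assignment and no need for a `sleep` at the end of the test.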

gemini-code-assist · 2025-10-30T01:55:34Z

tests/e2e/multicard/test_aclgraph_capture_replay.py

+            def trace_calls(frame, event, arg):
+                if event == 'call':
+                    code = frame.f_code
+                    func_name = code.co_name
+                    file_name = code.co_filename
+                    if func_name == 'execute_dummy_batch' and 'worker_v1.py' in file_name:
+                        with num_execute_model_shared.get_lock():
+                            num_execute_model_shared.value += 1
+                return trace_calls
+
+            sys.settrace(trace_calls)


Using sys.settrace to count method calls is brittle as it depends on string matching for function and file names (execute_dummy_batch, worker_v1.py). This can easily break if the target code is refactored. A more robust and idiomatic approach is to use unittest.mock.patch to wrap the method, similar to how NPUGraph.replay and NPUGraph.__init__ are already being tracked in this test.

This block and the corresponding sys.settrace(None) on line 118 should be replaced by patching vllm_ascend.worker.worker_v1.NPUWorker.execute_dummy_batch.
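The suggestion above can be sketched as follows. `FakeWorker` and `run_workload` are hypothetical stand-ins so the example runs without vllm_ascend; in the actual test the patch target would be the `NPUWorker.execute_dummy_batch` path named in the comment. The sketch assumes the call happens in the same process as the patch. The original snippet increments what looks like a shared multiprocessing counter, so if the workers run in separate processes, the wrapper would still need to update that shared value rather than rely on `call_count`.

```python
from unittest.mock import patch


class FakeWorker:
    """Hypothetical stand-in for NPUWorker; not the real vllm_ascend class."""

    def execute_dummy_batch(self):
        return "dummy batch executed"


def run_workload(worker, steps):
    # Stand-in for the engine loop that triggers dummy batches.
    return [worker.execute_dummy_batch() for _ in range(steps)]


# Keep a reference to the real method so the spy can delegate to it.
original = FakeWorker.execute_dummy_batch

# autospec=True keeps the method signature (including self), and
# side_effect=original makes the spy call through to the real code.
with patch.object(
        FakeWorker,
        "execute_dummy_batch",
        autospec=True,
        side_effect=original,
) as spy:
    results = run_workload(FakeWorker(), steps=3)

call_count = spy.call_count
```

Unlike the `sys.settrace` hook, this does not match on file or function name strings: if the method is renamed or moved, the patch fails loudly at setup time instead of silently counting zero calls.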

gemini-code-assist · 2025-10-30T01:55:34Z

tests/e2e/multicard/test_aclgraph_capture_replay.py

+    max_num_batch_sizes = math.floor(
+        (1800 - num_comm_groups * 40) / num_acl_graphs /
+        (1 + num_comm_groups * 2))


The calculation for max_num_batch_sizes uses magic numbers (1800, 40, 2), which makes the logic difficult to understand and maintain. These values should be defined as named constants with descriptive names explaining their significance. This will improve code readability and make it easier to update if the underlying assumptions change.

For example:

    # At the top of the file or function
    _ACL_GRAPH_MEM_LIMIT = 1800
    _COMM_GROUP_MEM_OVERHEAD = 40
    _BATCH_SIZE_FACTOR = 2
    max_num_batch_sizes = math.floor(
        (_ACL_GRAPH_MEM_LIMIT - num_comm_groups * _COMM_GROUP_MEM_OVERHEAD) /
        num_acl_graphs / (1 + num_comm_groups * _BATCH_SIZE_FACTOR))

Signed-off-by: lilinsiman <[email protected]>

github-actions bot added the module:tests label Oct 30, 2025

gemini-code-assist bot reviewed Oct 30, 2025

linfeng-yuan added the ready and ready-for-test labels Oct 30, 2025

add new test case for aclgraph capture and replay

82d1e90

Signed-off-by: lilinsiman <[email protected]>

lilinsiman force-pushed the aclgraph_capture_replay branch from 5dfe4ad to 82d1e90 Compare October 30, 2025 07:08
