add new test case for aclgraph capture and replay #3886
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds an end-to-end test for aclgraph capture and replay with data parallelism. The test logic is comprehensive, but there are a few areas for improvement regarding test performance, maintainability, and robustness. I've identified a critical performance issue with a long sleep, and some high-severity issues related to magic numbers and brittle test implementation that should be addressed.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = 'spawn'
sleep(600)
This sleep(600) call will cause the test to hang for 10 minutes after all assertions have passed, which severely impacts CI performance and should be removed.
Additionally, modifying os.environ at the end of a test is an anti-pattern as it can affect other tests in unpredictable ways. If you need to manage environment variables for tests, it's better to use pytest fixtures to ensure proper setup and teardown for each test, maintaining test isolation.
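A minimal sketch of the scoped setup and teardown suggested above, using only the standard library. The helper names `worker_spawn_env` and `run_with_spawn` are hypothetical and not part of this PR; the point is only that the variable is restored when the test body exits.

```python
import os
from contextlib import contextmanager
from unittest import mock


@contextmanager
def worker_spawn_env():
    """Set VLLM_WORKER_MULTIPROC_METHOD for one test, then restore the old state."""
    # patch.dict snapshots os.environ on entry and restores it on exit,
    # even if the test body raises.
    with mock.patch.dict(os.environ, {"VLLM_WORKER_MULTIPROC_METHOD": "spawn"}):
        yield


def run_with_spawn():
    # Hypothetical test body: the variable is visible only inside the block.
    with worker_spawn_env():
        return os.environ["VLLM_WORKER_MULTIPROC_METHOD"]
```

In a pytest suite the same effect is available through the built-in `monkeypatch` fixture, where `monkeypatch.setenv("VLLM_WORKER_MULTIPROC_METHOD", "spawn")` is undone automatically at the end of each test.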
def trace_calls(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        func_name = code.co_name
        file_name = code.co_filename
        if func_name == 'execute_dummy_batch' and 'worker_v1.py' in file_name:
            with num_execute_model_shared.get_lock():
                num_execute_model_shared.value += 1
    return trace_calls

sys.settrace(trace_calls)
Using sys.settrace to count method calls is brittle as it depends on string matching for function and file names (execute_dummy_batch, worker_v1.py). This can easily break if the target code is refactored. A more robust and idiomatic approach is to use unittest.mock.patch to wrap the method, similar to how NPUGraph.replay and NPUGraph.__init__ are already being tracked in this test.
This block and the corresponding sys.settrace(None) on line 118 should be replaced by patching vllm_ascend.worker.worker_v1.NPUWorker.execute_dummy_batch.
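The patching approach could look roughly like the sketch below. `FakeWorker` is a hypothetical stand-in for `NPUWorker` so the example runs without vLLM Ascend installed, and `run_and_count_dummy_batches` is an illustrative helper, not code from this PR. The wrapper increments a shared counter and then delegates to the original method, so behavior is unchanged apart from the counting.

```python
import multiprocessing
from unittest.mock import patch


class FakeWorker:
    """Hypothetical stand-in for vllm_ascend.worker.worker_v1.NPUWorker."""

    def execute_dummy_batch(self):
        return "ok"


def run_and_count_dummy_batches(n):
    # Shared counter, mirroring the multiprocessing.Value used in the test.
    counter = multiprocessing.Value("i", 0)
    original = FakeWorker.execute_dummy_batch

    def counting_wrapper(self, *args, **kwargs):
        # Count the call, then delegate to the real method.
        with counter.get_lock():
            counter.value += 1
        return original(self, *args, **kwargs)

    # The patch is reverted automatically when the with-block exits.
    with patch.object(FakeWorker, "execute_dummy_batch", new=counting_wrapper):
        worker = FakeWorker()
        results = [worker.execute_dummy_batch() for _ in range(n)]
    return counter.value, results
```

One caveat worth checking in the real test: a patch applied in the parent process is only seen by worker processes if it is in effect inside them, which depends on how the workers are started.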
max_num_batch_sizes = math.floor(
    (1800 - num_comm_groups * 40) / num_acl_graphs /
    (1 + num_comm_groups * 2))
The calculation for max_num_batch_sizes uses magic numbers (1800, 40, 2), which makes the logic difficult to understand and maintain. These values should be defined as named constants with descriptive names explaining their significance. This will improve code readability and make it easier to update if the underlying assumptions change.
For example:
# At the top of the file or function
_ACL_GRAPH_MEM_LIMIT = 1800
_COMM_GROUP_MEM_OVERHEAD = 40
_BATCH_SIZE_FACTOR = 2
max_num_batch_sizes = math.floor(
    (_ACL_GRAPH_MEM_LIMIT - num_comm_groups * _COMM_GROUP_MEM_OVERHEAD) / num_acl_graphs /
    (1 + num_comm_groups * _BATCH_SIZE_FACTOR))

Signed-off-by: lilinsiman <[email protected]>
5dfe4ad to 82d1e90
What this PR does / why we need it?
Add a new test case for aclgraph capture and replay.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests (UT).