add new test model for aclgraph single_request v0.11.0 #3889
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds a new test model, vllm-ascend/DeepSeek-V2-Lite-W8A8, to the end-to-end tests for aclgraph with single requests. The changes are confined to the test file. However, I've found a critical issue in how the server arguments are constructed for this new quantized model. The arguments are malformed, which would cause the test server to fail on startup. I've provided a code suggestion to fix this issue, which also refactors the code to remove duplication and improve maintainability.
```python
if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
    server_args = [
        "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
        "--data-parallel-size",
        "--data-parallel-size", "quantization", "ascend",
        str(dp_size), "--port",
        str(port), "--trust-remote-code", "--gpu-memory-utilization", "0.9"
    ]
else:
    server_args = [
        "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
        "--data-parallel-size",
        str(dp_size), "--port",
        str(port), "--trust-remote-code", "--gpu-memory-utilization", "0.9"
    ]
```
There are a couple of issues in this block:
- Incorrect Server Arguments: The server arguments for the new model `vllm-ascend/DeepSeek-V2-Lite-W8A8` appear to be incorrect. The arguments `"quantization"` and `"ascend"` are passed as values to `--data-parallel-size`, which expects an integer. This will likely cause the server to fail to start (see the sketch after this list). The correct way to enable quantization for this model is probably by using the `--quantization ascend` flag.
- Code Duplication: The `if/else` block for constructing `server_args` contains a lot of duplicated code. This can be refactored to improve readability and maintainability.
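To make the failure mode concrete, here is a minimal sketch (not from the PR) that uses Python's standard argparse to mimic the integer-typed flag; vLLM's real CLI parser defines many more options, so treat this as illustrative only:

```python
import argparse

# Illustrative stand-in for the server's CLI parser; the real vLLM
# parser has many more flags, but --data-parallel-size is integer-typed.
parser = argparse.ArgumentParser()
parser.add_argument("--data-parallel-size", type=int)
parser.add_argument("--quantization", type=str)

# The malformed argument list feeds the string "quantization" to the
# int-typed flag, so parsing aborts before the server can start:
try:
    parser.parse_args(["--data-parallel-size", "quantization"])
except SystemExit:
    # argparse prints:
    #   error: argument --data-parallel-size: invalid int value: 'quantization'
    pass

# The corrected form parses cleanly:
args = parser.parse_args(["--data-parallel-size", "1", "--quantization", "ascend"])
print(args)  # Namespace(data_parallel_size=1, quantization='ascend')
```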
I've suggested a change that addresses both issues by defining a base list of arguments and conditionally adding the quantization arguments for the specific model.
Suggested change:

```diff
-if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
-    server_args = [
-        "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
-        "--data-parallel-size",
-        "--data-parallel-size", "quantization", "ascend",
-        str(dp_size), "--port",
-        str(port), "--trust-remote-code", "--gpu-memory-utilization", "0.9"
-    ]
-else:
-    server_args = [
-        "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
-        "--data-parallel-size",
-        str(dp_size), "--port",
-        str(port), "--trust-remote-code", "--gpu-memory-utilization", "0.9"
-    ]
+server_args = [
+    "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
+    "--data-parallel-size", str(dp_size),
+    "--port", str(port),
+    "--trust-remote-code", "--gpu-memory-utilization", "0.9",
+]
+if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
+    server_args.extend(["--quantization", "ascend"])
```
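For illustration, with the suggested refactor the quantized model would end up with an argument list equivalent to the one below; dp_size = 2 and port = 8000 are made-up values for this sketch, not values from the PR:

```python
# Hypothetical final server_args for vllm-ascend/DeepSeek-V2-Lite-W8A8,
# assuming dp_size = 2 and port = 8000 (illustrative values only):
server_args = [
    "--no-enable-prefix-caching", "--tensor-parallel-size", "1",
    "--data-parallel-size", "2",
    "--port", "8000",
    "--trust-remote-code", "--gpu-memory-utilization", "0.9",
    "--quantization", "ascend",
]
```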
Force-pushed from 66cca63 to 4ba2acc

Signed-off-by: lilinsiman <[email protected]>

Force-pushed from 4ba2acc to 55cc947
What this PR does / why we need it?
Adds a new test model, vllm-ascend/DeepSeek-V2-Lite-W8A8, to the aclgraph single_request end-to-end tests for v0.11.0.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.