[Bugfix] Fix the performance gap between 0.9.2rc1 and 0.9.1 #1811
Conversation
vllm_ascend/core/scheduler.py
new_blocks = self.kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_draft_tokens=num_draft_tokens,
In the main branch of vllm, the self.kv_cache_manager.allocate_slots method no longer has the num_draft_tokens parameter.
I originally attempted to fix this issue using the kwargs approach, but encountered runtime failures. As a fallback, I reverted to the original implementation, preserving only the version check. This solution still provides the performance improvement. Please review and confirm whether it resolves the issue satisfactorily.
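For illustration, a minimal sketch of the version-gated call being described; the helper name, import path, and wrapper function are assumptions, not the exact code in this PR:

```python
from vllm_ascend.utils import vllm_version_is  # version-check helper; import path assumed


def allocate_new_blocks(scheduler, request, num_new_tokens, num_draft_tokens):
    """Hypothetical wrapper showing how the call can branch on the installed vllm version."""
    if vllm_version_is("0.9.1"):
        # 0.9.1-era vllm: allocate_slots() still accepts num_draft_tokens.
        return scheduler.kv_cache_manager.allocate_slots(
            request, num_new_tokens, num_draft_tokens=num_draft_tokens)
    # Newer vllm (main): the parameter was removed, so it must not be passed.
    return scheduler.kv_cache_manager.allocate_slots(request, num_new_tokens)
```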
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
+ Coverage   54.18%   54.23%   +0.05%
==========================================
  Files          74       74
  Lines        9235     9246      +11
==========================================
+ Hits         5004     5015      +11
  Misses       4231     4231
Force-pushed from e3b9487 to 108b4e3.
Thanks for the change. I was surprised by this change. Maybe the schedule function is called so many times that calling vllm_version_is inside it is too costly. @Potabk Can you run the perf test to make sure the change is good to go? Thanks
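To make the cost concern concrete, a small standalone timing sketch; the version strings and comparison logic are illustrative stand-ins, not vllm-ascend's actual helper:

```python
import functools
import timeit

from packaging.version import Version

INSTALLED = "0.9.2rc1"  # stand-in for the installed vllm version


def version_is_uncached(target: str) -> bool:
    # Re-parses both version strings on every call, which adds up if the
    # scheduler checks the version on every schedule() invocation.
    return Version(INSTALLED) == Version(target)


@functools.cache
def version_is_cached(target: str) -> bool:
    # Same comparison, but memoized per target string.
    return Version(INSTALLED) == Version(target)


if __name__ == "__main__":
    n = 100_000
    print("uncached:", timeit.timeit(lambda: version_is_uncached("0.9.1"), number=n))
    print("cached:  ", timeit.timeit(lambda: version_is_cached("0.9.1"), number=n))
```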
Server:

vllm serve Qwen/Qwen3-32B \
--served-model-name Qwen3-32B \
--tensor-parallel-size 4 \
--swap-space 16 \
--max-model-len 6000 \
--load-format dummy \
--disable-log-stats \
--disable-log-requests

qps 1:

vllm bench serve --model Qwen/Qwen3-32B \
--endpoint-type "vllm" --dataset-name random \
--random-input-len 128 \
--served-model-name Qwen3-32B \
--num-prompts 200 \
--request-rate 1

result:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 189.63
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 1.05
Output token throughput (tok/s): 135.00
Total Token throughput (tok/s): 269.75
---------------Time to First Token----------------
Mean TTFT (ms): 1691.98
Median TTFT (ms): 273.25
P99 TTFT (ms): 8373.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.52
Median TPOT (ms): 39.08
P99 TPOT (ms): 68.64
---------------Inter-token Latency----------------
Mean ITL (ms): 56.88
Median ITL (ms): 41.36
P99 ITL (ms): 267.80
==================================================

patch this pr:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 189.61
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 1.05
Output token throughput (tok/s): 135.02
Total Token throughput (tok/s): 269.78
---------------Time to First Token----------------
Mean TTFT (ms): 831.29
Median TTFT (ms): 590.72
P99 TTFT (ms): 2931.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 37.65
Median TPOT (ms): 38.02
P99 TPOT (ms): 49.79
---------------Inter-token Latency----------------
Mean ITL (ms): 52.87
Median ITL (ms): 38.10
P99 ITL (ms): 243.91
==================================================

qps 200:

before:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 38.26
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 5.23
Output token throughput (tok/s): 669.09
Total Token throughput (tok/s): 1336.95
---------------Time to First Token----------------
Mean TTFT (ms): 15209.68
Median TTFT (ms): 17573.13
P99 TTFT (ms): 33862.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 53.90
Median TPOT (ms): 65.05
P99 TPOT (ms): 76.68
---------------Inter-token Latency----------------
Mean ITL (ms): 93.19
Median ITL (ms): 65.41
P99 ITL (ms): 416.54
==================================================

patch:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 13.82
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 14.47
Output token throughput (tok/s): 1852.70
Total Token throughput (tok/s): 3702.00
---------------Time to First Token----------------
Mean TTFT (ms): 2701.13
Median TTFT (ms): 2654.85
P99 TTFT (ms): 4468.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 85.24
Median TPOT (ms): 87.24
P99 TPOT (ms): 93.39
---------------Inter-token Latency----------------
Mean ITL (ms): 86.77
Median ITL (ms): 74.63
P99 ITL (ms): 282.23
==================================================
@ApsarasX Would you mind taking another look? Otherwise I will merge this soon. Also cc @jianzs @ganyi1996ppo
This PR makes me a little confused: how do these changes fix the performance gap? @lianyiibo
I think that when the
Is using
This might be a good solution. If the test proves effective, should I make any adjustments to this PR?
I suggest that using
The modification using
Got it, looks good.
@@ -280,6 +281,7 @@ def adapt_patch(is_global_patch: bool = False):
    from vllm_ascend.patch import worker  # noqa: F401


@functools.cache
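Presumably the decorator lands on the version-check helper. A minimal sketch of the resulting behavior, assuming vllm_version_is compares the installed vllm.__version__ string against a target (the exact comparison logic is an assumption):

```python
import functools

import vllm


@functools.cache
def vllm_version_is(target: str) -> bool:
    # The installed version never changes at runtime, so memoizing the
    # comparison means hot paths such as schedule() pay the cost only once
    # per distinct target string; later calls are a dictionary lookup.
    return vllm.__version__ == target
```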
test_vllm_version_is should be updated as well
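A minimal sketch of the kind of test update being asked for, assuming the helper is cached with functools.cache and reads vllm.__version__ at call time (names and import paths are illustrative):

```python
from unittest import mock

from vllm_ascend.utils import vllm_version_is  # import path assumed


def test_vllm_version_is():
    # With functools.cache on the helper, the cache must be cleared after
    # patching the version, otherwise a stale cached result is returned.
    with mock.patch("vllm.__version__", "0.9.1"):
        vllm_version_is.cache_clear()
        assert vllm_version_is("0.9.1")
        assert not vllm_version_is("0.9.2")

    with mock.patch("vllm.__version__", "0.9.2"):
        vllm_version_is.cache_clear()
        assert vllm_version_is("0.9.2")
```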
Fix completed. Please review.
Thanks, using a cache here is better, so that in the future we won't hit a similar issue again.
Please rebase to make CI happy.
Signed-off-by: lianyibo <[email protected]>
Done. PTAL.
What this PR does / why we need it?
Maybe fixes #1728.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested Qwen3-32B with tp=4, sending requests with batch_size=128 and input/output tokens=1024, once on 0.9.2rc1 and once with this PR applied.
The TPOT performance gap between these two sets of data is about 3%.