[Bugfix] Fix the performance gap between 0.9.2rc1 and 0.9.1 #1811
Conversation
vllm_ascend/core/scheduler.py
new_blocks = self.kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_draft_tokens=num_draft_tokens,
In the main branch of vllm, the self.kv_cache_manager.allocate_slots method no longer has the num_draft_tokens parameter.
I originally attempted to fix this issue using the kwargs approach, but encountered runtime failures. As a fallback, I reverted to the original implementation, preserving only the version check. This solution still provides the performance improvement. Please review and confirm whether it resolves the issue satisfactorily.
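For illustration, a minimal sketch of the version-gated call being described; the helper name, import path, and wrapper function are assumptions, not the exact code in this PR:

```python
from vllm_ascend.utils import vllm_version_is  # version-check helper; import path assumed


def allocate_new_blocks(scheduler, request, num_new_tokens, num_draft_tokens):
    """Hypothetical wrapper showing how the call can branch on the installed vllm version."""
    if vllm_version_is("0.9.1"):
        # 0.9.1-era vllm: allocate_slots() still accepts num_draft_tokens.
        return scheduler.kv_cache_manager.allocate_slots(
            request, num_new_tokens, num_draft_tokens=num_draft_tokens)
    # Newer vllm (main): the parameter was removed, so it must not be passed.
    return scheduler.kv_cache_manager.allocate_slots(request, num_new_tokens)
```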
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
+ Coverage   54.18%   54.23%   +0.05%
==========================================
  Files          74       74
  Lines        9235     9246      +11
==========================================
+ Hits         5004     5015      +11
  Misses       4231     4231
Force-pushed from e3b9487 to 108b4e3.
Thanks for the change. I was surprised by this change. Maybe the schedule function is called so many times that calling vllm_version_is inside it is too costly. @Potabk Can you run the perf test to make sure the change is good to go? Thanks
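To make the cost concern concrete, a small standalone timing sketch; the version strings and comparison logic are illustrative stand-ins, not vllm-ascend's actual helper:

```python
import functools
import timeit

from packaging.version import Version

INSTALLED = "0.9.2rc1"  # stand-in for the installed vllm version


def version_is_uncached(target: str) -> bool:
    # Re-parses both version strings on every call, which adds up if the
    # scheduler checks the version on every schedule() invocation.
    return Version(INSTALLED) == Version(target)


@functools.cache
def version_is_cached(target: str) -> bool:
    # Same comparison, but memoized per target string.
    return Version(INSTALLED) == Version(target)


if __name__ == "__main__":
    n = 100_000
    print("uncached:", timeit.timeit(lambda: version_is_uncached("0.9.1"), number=n))
    print("cached:  ", timeit.timeit(lambda: version_is_cached("0.9.1"), number=n))
```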
Server:

vllm serve Qwen/Qwen3-32B \
--served-model-name Qwen3-32B \
--tensor-parallel-size 4 \
--swap-space 16 \
--max-model-len 6000 \
--load-format dummy \
--disable-log-stats \
--disable-log-requests

qps 1:

vllm bench serve --model Qwen/Qwen3-32B \
--endpoint-type "vllm" --dataset-name random \
--random-input-len 128 \
--served-model-name Qwen3-32B \
--num-prompts 200 \
--request-rate 1

result:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 189.63
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 1.05
Output token throughput (tok/s): 135.00
Total Token throughput (tok/s): 269.75
---------------Time to First Token----------------
Mean TTFT (ms): 1691.98
Median TTFT (ms): 273.25
P99 TTFT (ms): 8373.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.52
Median TPOT (ms): 39.08
P99 TPOT (ms): 68.64
---------------Inter-token Latency----------------
Mean ITL (ms): 56.88
Median ITL (ms): 41.36
P99 ITL (ms): 267.80
==================================================

patch this pr:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 189.61
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 1.05
Output token throughput (tok/s): 135.02
Total Token throughput (tok/s): 269.78
---------------Time to First Token----------------
Mean TTFT (ms): 831.29
Median TTFT (ms): 590.72
P99 TTFT (ms): 2931.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 37.65
Median TPOT (ms): 38.02
P99 TPOT (ms): 49.79
---------------Inter-token Latency----------------
Mean ITL (ms): 52.87
Median ITL (ms): 38.10
P99 ITL (ms): 243.91
==================================================

qps 200:

before:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 38.26
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 5.23
Output token throughput (tok/s): 669.09
Total Token throughput (tok/s): 1336.95
---------------Time to First Token----------------
Mean TTFT (ms): 15209.68
Median TTFT (ms): 17573.13
P99 TTFT (ms): 33862.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 53.90
Median TPOT (ms): 65.05
P99 TPOT (ms): 76.68
---------------Inter-token Latency----------------
Mean ITL (ms): 93.19
Median ITL (ms): 65.41
P99 ITL (ms): 416.54
==================================================

patch:

============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 13.82
Total input tokens: 25553
Total generated tokens: 25600
Request throughput (req/s): 14.47
Output token throughput (tok/s): 1852.70
Total Token throughput (tok/s): 3702.00
---------------Time to First Token----------------
Mean TTFT (ms): 2701.13
Median TTFT (ms): 2654.85
P99 TTFT (ms): 4468.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 85.24
Median TPOT (ms): 87.24
P99 TPOT (ms): 93.39
---------------Inter-token Latency----------------
Mean ITL (ms): 86.77
Median ITL (ms): 74.63
P99 ITL (ms): 282.23
==================================================
@ApsarasX Would you mind taking another look? Otherwise I will merge this soon. Also cc @jianzs @ganyi1996ppo
This PR makes me a little confused: how do these changes fix the performance gap? @lianyiibo
I think that when the
Is using
This might be a good solution. If the test proves effective, should I make any adjustments to this PR?
I suggest that using
The modification using
Got it, looks good.
@@ -280,6 +281,7 @@ def adapt_patch(is_global_patch: bool = False):
    from vllm_ascend.patch import worker  # noqa: F401


@functools.cache
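Presumably the decorator lands on the version-check helper. A minimal sketch of the resulting behavior, assuming vllm_version_is compares the installed vllm.__version__ string against a target (the exact comparison logic is an assumption):

```python
import functools

import vllm


@functools.cache
def vllm_version_is(target: str) -> bool:
    # The installed version never changes at runtime, so memoizing the
    # comparison means hot paths such as schedule() pay the cost only once
    # per distinct target string; later calls are a dictionary lookup.
    return vllm.__version__ == target
```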
test_vllm_version_is should be updated as well
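A minimal sketch of the kind of test update being asked for, assuming the helper is cached with functools.cache and reads vllm.__version__ at call time (names and import paths are illustrative):

```python
from unittest import mock

from vllm_ascend.utils import vllm_version_is  # import path assumed


def test_vllm_version_is():
    # With functools.cache on the helper, the cache must be cleared after
    # patching the version, otherwise a stale cached result is returned.
    with mock.patch("vllm.__version__", "0.9.1"):
        vllm_version_is.cache_clear()
        assert vllm_version_is("0.9.1")
        assert not vllm_version_is("0.9.2")

    with mock.patch("vllm.__version__", "0.9.2"):
        vllm_version_is.cache_clear()
        assert vllm_version_is("0.9.2")
```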
Fix completed. Please review.
Thanks, using a cache here is better, so that in the future we won't hit a similar issue again.
Please rebase to make CI happy.
Signed-off-by: lianyibo <[email protected]>
Done. PTAL.
What this PR does / why we need it?
Maybe fixes #1728.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested Qwen3-32B with tp=4, sending requests with batch_size=128 and input/output tokens=1024, once on 0.9.2rc1 and once with this PR applied.
The TPOT performance gap between these two sets of data is about 3%.