
[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 #1811


Open
wants to merge 1 commit into base: main

Conversation

lianyiibo

@lianyiibo lianyiibo commented Jul 15, 2025

What this PR does / why we need it?

maybe fixes #1728

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Test Qwen3-32B tp=4 with:

vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests

Requests: batch_size=128, input/output tokens=1024

In 0.9.2rc1

=====================================================
Total TPS with    prefill(tokens/s)         : 785.1395
Total TPS without prefill                   : 846.6809
Mean TPS with    prefill                    : 6.1339
Mean TPS without prefill                    : 6.6147
=====================================================
Mean TTFT(ms)                               : 10307.8123
Max  TTFT(ms)                               : 21423.0733
Min  TTFT(ms)                               : 362.3602
=====================================================
Mean TPOT(ms)                               : 151.3051
Max  TPOT(ms)                               : 159.4649
Min  TPOT(ms)                               : 140.899
=====================================================
Total Time(s)                               : 175.6032
Request Throughput(requests/s)              : 0.7289
=====================================================

With this PR applied

=====================================================
Total TPS with    prefill(tokens/s)         : 811.0014
Total TPS without prefill                   : 876.4423
Mean TPS with    prefill                    : 6.3359
Mean TPS without prefill                    : 6.8472
=====================================================
Mean TTFT(ms)                               : 10263.8382
Max  TTFT(ms)                               : 21151.2547
Min  TTFT(ms)                               : 375.9136
=====================================================
Mean TPOT(ms)                               : 146.1686
Max  TPOT(ms)                               : 154.0957
Min  TPOT(ms)                               : 136.8879
=====================================================
Total Time(s)                               : 169.8579
Request Throughput(requests/s)              : 0.7536
=====================================================

The TPOT performance gap between these two sets of data is about 3%.

@lianyiibo lianyiibo changed the title Simplify the conditional branching of version information in scheduler [Bugfix]Simplify the conditional branching of version information in scheduler Jul 15, 2025
@lianyiibo lianyiibo changed the title [Bugfix]Simplify the conditional branching of version information in scheduler [Bugfix]Fixed the performance gap between 0.9.2rc1 and 0.9.1 Jul 15, 2025
@lianyiibo lianyiibo changed the title [Bugfix]Fixed the performance gap between 0.9.2rc1 and 0.9.1 [Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 Jul 15, 2025
new_blocks = self.kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_draft_tokens=num_draft_tokens,
Collaborator

In the main branch of vllm, the self.kv_cache_manager.allocate_slots method no longer has the num_draft_tokens parameter.

Author

I originally attempted to fix this issue using the kwargs approach, but encountered runtime failures. As a fallback, I reverted to the original implementation and only preserved the version branching. This solution can also provide performance improvements. Please review and confirm whether this resolves the issue satisfactorily.


codecov bot commented Jul 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.23%. Comparing base (f9dfde0) to head (359e6df).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
+ Coverage   54.18%   54.23%   +0.05%     
==========================================
  Files          74       74              
  Lines        9235     9246      +11     
==========================================
+ Hits         5004     5015      +11     
  Misses       4231     4231              
Flag Coverage Δ
unittests 54.23% <100.00%> (+0.05%) ⬆️


@lianyiibo lianyiibo force-pushed the main branch 5 times, most recently from e3b9487 to 108b4e3 on July 16, 2025 02:08
Collaborator

@wangxiyuan wangxiyuan left a comment

Thanks for the change. I was surprised by this change. Maybe the schedule func is called many times, so vllm_version_is is not a good thing to call there. @Potabk Can you run the perf test to make sure the change is good to go? Thanks

@Potabk
Contributor

Potabk commented Jul 16, 2025

vllm serve Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests

qps 1:

 vllm bench serve --model Qwen/Qwen3-32B \
 --endpoint-type "vllm" --dataset-name random \
 --random-input-len 128 \
 --served-model-name Qwen3-32B \
 --num-prompts 200 \
 --request-rate 1

result
before

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  189.63
Total input tokens:                      25553
Total generated tokens:                  25600
Request throughput (req/s):              1.05
Output token throughput (tok/s):         135.00
Total Token throughput (tok/s):          269.75
---------------Time to First Token----------------
Mean TTFT (ms):                          1691.98
Median TTFT (ms):                        273.25
P99 TTFT (ms):                           8373.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.52
Median TPOT (ms):                        39.08
P99 TPOT (ms):                           68.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.88
Median ITL (ms):                         41.36
P99 ITL (ms):                            267.80
==================================================

with this PR applied

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  189.61
Total input tokens:                      25553
Total generated tokens:                  25600
Request throughput (req/s):              1.05
Output token throughput (tok/s):         135.02
Total Token throughput (tok/s):          269.78
---------------Time to First Token----------------
Mean TTFT (ms):                          831.29
Median TTFT (ms):                        590.72
P99 TTFT (ms):                           2931.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.65
Median TPOT (ms):                        38.02
P99 TPOT (ms):                           49.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.87
Median ITL (ms):                         38.10
P99 ITL (ms):                            243.91
==================================================

qps 200:

before:

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  38.26
Total input tokens:                      25553
Total generated tokens:                  25600
Request throughput (req/s):              5.23
Output token throughput (tok/s):         669.09
Total Token throughput (tok/s):          1336.95
---------------Time to First Token----------------
Mean TTFT (ms):                          15209.68
Median TTFT (ms):                        17573.13
P99 TTFT (ms):                           33862.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.90
Median TPOT (ms):                        65.05
P99 TPOT (ms):                           76.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           93.19
Median ITL (ms):                         65.41
P99 ITL (ms):                            416.54
==================================================

with this PR applied:

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  13.82
Total input tokens:                      25553
Total generated tokens:                  25600
Request throughput (req/s):              14.47
Output token throughput (tok/s):         1852.70
Total Token throughput (tok/s):          3702.00
---------------Time to First Token----------------
Mean TTFT (ms):                          2701.13
Median TTFT (ms):                        2654.85
P99 TTFT (ms):                           4468.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          85.24
Median TPOT (ms):                        87.24
P99 TPOT (ms):                           93.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           86.77
Median ITL (ms):                         74.63
P99 ITL (ms):                            282.23
==================================================

@Yikun
Collaborator

Yikun commented Jul 16, 2025


Confirmed this improves the perf (Qwen3-8B)

# Start vLLM V1
export MODEL=Qwen/Qwen3-8B
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
         --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
         --disable-log-requests  --load-format dummy
# Benchmark
docker exec -it  yikun-test bash
export MODEL=Qwen/Qwen3-8B
export VLLM_USE_MODELSCOPE=true
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r /vllm-workspace/vllm-ascend/benchmarks/requirements-bench.txt
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
         --random-input-len 200 --num-prompts 200 --request-rate 1 \
         --save-result --result-dir ./

@Yikun
Collaborator

Yikun commented Jul 16, 2025

@ApsarasX Would you mind taking another look? Otherwise I will merge this soon.

also cc @jianzs @ganyi1996ppo

@ganyi1996ppo
Collaborator

ganyi1996ppo commented Jul 16, 2025

This PR makes me a little bit confused; how do those changes fix the performance gap? @lianyiibo

@lianyiibo
Author

This PR makes me a little bit confused; how do those changes fix the performance gap? @lianyiibo

I think that when the VLLM_VERSION environment variable is not set, the version check in utils.py that runs inside the scheduler's while True loop is time-consuming. A more general solution might be to add a global static variable, like vllm.__version__, in utils.py to avoid similar potential issues in the future.
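As a rough illustration of that idea, the version could be resolved once at import time and reused from a module-level constant. This is only a hypothetical sketch: the helper name vllm_version_is comes from this discussion, while the use of vllm.__version__ as the fallback and the exact comparison logic are assumptions, not the actual vllm_ascend code.

# Hypothetical sketch of a module-level constant in utils.py.
import os

from vllm import __version__ as _installed_vllm_version

# Resolve the effective vLLM version once at import time instead of
# re-reading the environment on every scheduler step.
_VLLM_VERSION = os.getenv("VLLM_VERSION", _installed_vllm_version)


def vllm_version_is(target_version: str) -> bool:
    # Cheap string comparison against the precomputed constant.
    return _VLLM_VERSION == target_version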

@jianzs
Collaborator

jianzs commented Jul 16, 2025

Is using functools.cache to decorate vllm_version_is a better approach?

@lianyiibo
Author

lianyiibo commented Jul 16, 2025

Is using functools.cache to decorate vllm_version_is a better approach?

This might be a good solution. If the test proves effective, should I make any adjustments to this PR?

@jianzs
Collaborator

jianzs commented Jul 16, 2025

Is using functools.cache to decorate vllm_version_is a better approach?

This might be a good solution. If the test proves effective, should I make any adjustments to this PR?

I suggest using functools.cache, which is a cleaner approach and works elsewhere as well. cc @ganyi1996ppo @Yikun @ApsarasX

@lianyiibo
Author

lianyiibo commented Jul 16, 2025

Is using functools.cache to decorate vllm_version_is a better approach?

This might be a good solution. If the test proves effective, should I make any adjustments to this PR?

I suggest using functools.cache, which is a cleaner approach and works elsewhere as well. cc @ganyi1996ppo @Yikun @ApsarasX

The modification using functools.cache has been verified as effective in local testing. After reverting the previous commit, I have submitted the revised code in a new commit.
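For reference, the cached helper might look roughly like this. This is a minimal sketch under the assumption that vllm_version_is compares the VLLM_VERSION override (or the installed vllm version) against the requested version string; the actual body in vllm_ascend/utils.py may differ.

import functools
import os

from vllm import __version__ as _installed_vllm_version


@functools.cache
def vllm_version_is(target_version: str) -> bool:
    # The body runs only once per distinct target_version; later calls
    # from the scheduler loop return the memoized result.
    current = os.getenv("VLLM_VERSION", _installed_vllm_version)
    return current == target_version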

@ganyi1996ppo
Collaborator

I think that when the VLLM_VERSION environment variable is not set, the version check in utils.py that runs inside the scheduler's while True loop is time-consuming. A more general solution might be to add a global static variable, like vllm.__version__, in utils.py to avoid similar potential issues in the future.

Got it, looks good.

@@ -280,6 +281,7 @@ def adapt_patch(is_global_patch: bool = False):
from vllm_ascend.patch import worker # noqa: F401


@functools.cache
Collaborator

test_vllm_version_is should be updated as well
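Because functools.cache memoizes per argument, a test that patches the version source also has to clear the cache between cases. A hypothetical sketch only: the import path and the assumption that vllm_version_is reads the VLLM_VERSION environment variable are illustrative, not taken from the actual test_vllm_version_is.

from vllm_ascend.utils import vllm_version_is


def test_vllm_version_is(monkeypatch):
    # Drop results memoized by @functools.cache so each case re-evaluates.
    vllm_version_is.cache_clear()
    monkeypatch.setenv("VLLM_VERSION", "0.9.1")
    assert vllm_version_is("0.9.1")

    vllm_version_is.cache_clear()
    monkeypatch.setenv("VLLM_VERSION", "0.9.2rc1")
    assert not vllm_version_is("0.9.1")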

Author

Fix completed. Please review.

Collaborator

Thanks, using cache here is better, so that in the future we won't hit a similar issue again.

@jianzs
Collaborator

jianzs commented Jul 17, 2025

Please rebase to make CI happy

@lianyiibo
Author

Please rebase to make CI happy

Done. PTAL.


Successfully merging this pull request may close these issues.

[Performance]: performance is down 18% after update
7 participants