Conversation

@NickLucche (Collaborator)

Fix prompt throughput stats in the CLI logger by counting only tokens that were prefilled locally.

In a P/D (disaggregated prefill/decode) setup, the KV cache is copied over from P to D, which currently results in the following output on the decode side:

# Sending 2 different reqs one after another
(APIServer pid=3322894) INFO:     Started server process [3322894]
(APIServer pid=3322894) INFO:     Waiting for application startup.
(APIServer pid=3322894) INFO:     Application startup complete.
(APIServer pid=3322894) INFO:     127.0.0.1:38042 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3322894) INFO 10-27 11:21:40 [loggers.py:208] Engine 000: Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
(APIServer pid=3322894) INFO 10-27 11:21:50 [loggers.py:208] Engine 000: Avg prompt throughput: 10.3 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

This is plainly wrong: we have not actually prefilled those tokens on the decode side, we have merely copied their KV cache over.

After this PR:

# Same setup
(APIServer pid=3318553) INFO:     Started server process [3318553]
(APIServer pid=3318553) INFO:     Waiting for application startup.
(APIServer pid=3318553) INFO:     Application startup complete.
(APIServer pid=3318553) INFO:     127.0.0.1:54610 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:05 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

(APIServer pid=3318553) INFO:     127.0.0.1:35638 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:15 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
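To make the accounting change concrete, here is a minimal sketch of the idea. The class and field names below are illustrative stand-ins, not the exact vLLM internals: prompt throughput on the decode side should be derived only from the tokens it prefilled itself, i.e. the prompt length minus the tokens whose KV cache arrived from P.

    from dataclasses import dataclass

    @dataclass
    class RequestStatsSketch:
        num_prompt_tokens: int             # full prompt length
        num_external_computed_tokens: int  # tokens whose KV cache was copied from P

    def locally_prefilled_tokens(stats: RequestStatsSketch) -> int:
        # Only tokens actually prefilled on this worker count toward
        # "Avg prompt throughput" on the decode side.
        return stats.num_prompt_tokens - stats.num_external_computed_tokens

    def avg_prompt_throughput(finished: list[RequestStatsSketch],
                              interval_s: float) -> float:
        total = sum(locally_prefilled_tokens(s) for s in finished)
        return total / interval_s

With this accounting, a request whose entire prefill was received from P contributes 0 tokens to prompt throughput, matching the logs above.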

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a bug in the prompt throughput stats calculation for disaggregated setups, where tokens whose KV cache was copied from the prefill instance to the decode instance were incorrectly counted as prefilled tokens. The changes involve modifying the scheduler to track locally prefilled tokens and updating the logging to reflect the corrected throughput. The review focuses on ensuring the correctness of the fix and the clarity of the code changes.

self._update_connector_prefix_cache_stats(
    request, num_external_computed_tokens
)
request.num_external_computed_tokens += num_external_computed_tokens
@gemini-code-assist bot (Contributor), severity: high

This line adds the external computed tokens to the request's num_external_computed_tokens. It's crucial to ensure that this addition doesn't lead to double-counting in subsequent calculations or logging. Verify that this value is used correctly and doesn't inadvertently inflate the count of external tokens; it may be better to assign the value directly instead of adding to it, depending on how it's used elsewhere.

Suggested change:
- request.num_external_computed_tokens += num_external_computed_tokens
+ request.num_external_computed_tokens = num_external_computed_tokens

@tlrmchlsmth (Member)

+1, why is it += instead of =?

@NickLucche (Collaborator, Author) Oct 27, 2025

@tlrmchlsmth the flow is WAITING (get num_external_computed_tokens) => WAITING_FOR_REMOTE_KVS => WAITING, at which point request.num_computed_tokens = N-1 (prefill done, so those tokens figure as if they had been processed). On that second WAITING pass num_external_computed_tokens comes back as 0, so assigning it again would overwrite the recorded value.

I've moved the assignment into the first WAITING iteration so that it's clearer, and so it only goes through that path once.
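A hypothetical sketch of that flow, keeping the state names from the comment above; the connector call, the flag, and everything else are made up for illustration:

    def schedule_waiting_request(request, connector):
        if request.status == "WAITING" and not request.remote_kvs_requested:
            # First WAITING pass: ask the connector how many tokens can be
            # received from P instead of being prefilled here, and record
            # it exactly once.
            request.num_external_computed_tokens = (
                connector.get_num_external_computed_tokens(request)
            )
            request.remote_kvs_requested = True
            request.status = "WAITING_FOR_REMOTE_KVS"
        elif request.status == "WAITING":
            # Second WAITING pass, after the KV transfer: num_computed_tokens
            # is already N-1 and the connector now reports 0. Re-assigning
            # here would wipe the value recorded above, which is why the
            # original code used += and the fix moves the assignment into
            # the first pass instead.
            pass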


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1496 to +1497
# Prefill is to be recomputed locally.
request.num_external_computed_tokens = 0
@NickLucche (Collaborator, Author)

@sdavidbd can you please double-check this? My understanding is that we now have to re-compute the whole prefill locally, so those tokens can be tracked as prompt throughput.
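For context, a sketch of the fallback path under discussion, with assumed function and attribute names; the point is that once the prefill must be recomputed locally, those tokens are no longer external and should count toward prompt throughput again:

    def on_remote_kv_load_failed(request):
        # The KV blocks never arrived (or are unusable), so the decode
        # side must run the prefill itself, starting from scratch.
        request.num_computed_tokens = 0
        # Prefill is to be recomputed locally: none of the prompt tokens
        # are external anymore, so they will be counted as locally
        # prefilled in the throughput stats.
        request.num_external_computed_tokens = 0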
