Conversation

@NickLucche (Collaborator)

Fix prompt throughput stats in the CLI logger by counting only tokens that were prefilled locally.

In a P/D (disaggregated prefill/decode) setup, the KV cache is copied over from P to D, which currently results in the following output on the decode side:

# Sending 2 different reqs one after another
(APIServer pid=3322894) INFO:     Started server process [3322894]
(APIServer pid=3322894) INFO:     Waiting for application startup.
(APIServer pid=3322894) INFO:     Application startup complete.
(APIServer pid=3322894) INFO:     127.0.0.1:38042 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3322894) INFO 10-27 11:21:40 [loggers.py:208] Engine 000: Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
(APIServer pid=3322894) INFO 10-27 11:21:50 [loggers.py:208] Engine 000: Avg prompt throughput: 10.3 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

This is plainly wrong: we have not actually prefilled those tokens on the decode side, we have merely copied their KV cache over.

After this PR:

# Same setup
(APIServer pid=3318553) INFO:     Started server process [3318553]
(APIServer pid=3318553) INFO:     Waiting for application startup.
(APIServer pid=3318553) INFO:     Application startup complete.
(APIServer pid=3318553) INFO:     127.0.0.1:54610 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:05 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

(APIServer pid=3318553) INFO:     127.0.0.1:35638 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:15 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
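To make the accounting change concrete, here is a minimal sketch of the idea. The class and field names below are illustrative stand-ins, not the exact vLLM internals: prompt throughput on the decode side should be derived only from the tokens it prefilled itself, i.e. the prompt length minus the tokens whose KV cache arrived from P.

    from dataclasses import dataclass

    @dataclass
    class RequestStatsSketch:
        num_prompt_tokens: int             # full prompt length
        num_external_computed_tokens: int  # tokens whose KV cache was copied from P

    def locally_prefilled_tokens(stats: RequestStatsSketch) -> int:
        # Only tokens actually prefilled on this worker count toward
        # "Avg prompt throughput" on the decode side.
        return stats.num_prompt_tokens - stats.num_external_computed_tokens

    def avg_prompt_throughput(finished: list[RequestStatsSketch],
                              interval_s: float) -> float:
        total = sum(locally_prefilled_tokens(s) for s in finished)
        return total / interval_s

With this accounting, a request whose entire prefill was received from P contributes 0 tokens to prompt throughput, matching the logs above.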

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a bug in the prompt throughput stats calculation for disaggregated setups, where tokens whose KV cache was copied from the prefill instance to the decode instance were incorrectly counted as prefilled tokens. The changes involve modifying the scheduler to track locally prefilled tokens and updating the logging to reflect the corrected throughput. The review focuses on ensuring the correctness of the fix and the clarity of the code changes.

self._update_connector_prefix_cache_stats(
    request, num_external_computed_tokens
)
request.num_external_computed_tokens += num_external_computed_tokens
@gemini-code-assist bot (Contributor), severity: high

This line adds the external computed tokens to the request's num_external_computed_tokens. It's crucial to ensure that this addition doesn't lead to double-counting in subsequent calculations or logging. Verify that this value is used correctly and doesn't inadvertently inflate the count of external tokens; it may be better to assign the value directly instead of adding to it, depending on how it's used elsewhere.

Suggested change:
- request.num_external_computed_tokens += num_external_computed_tokens
+ request.num_external_computed_tokens = num_external_computed_tokens

@tlrmchlsmth (Member)

+1, why is it += instead of =?

@NickLucche (Collaborator, Author) Oct 27, 2025

@tlrmchlsmth the flow is WAITING (get num_external_computed_tokens) => WAITING_FOR_REMOTE_KVS => WAITING, at which point request.num_computed_tokens = N-1 (prefill done, so those tokens figure as if they had been processed). On that second WAITING pass num_external_computed_tokens comes back as 0, so assigning it again would overwrite the recorded value.

I've moved the assignment into the first WAITING iteration so that it's clearer, and so it only goes through that path once.
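A hypothetical sketch of that flow, keeping the state names from the comment above; the connector call, the flag, and everything else are made up for illustration:

    def schedule_waiting_request(request, connector):
        if request.status == "WAITING" and not request.remote_kvs_requested:
            # First WAITING pass: ask the connector how many tokens can be
            # received from P instead of being prefilled here, and record
            # it exactly once.
            request.num_external_computed_tokens = (
                connector.get_num_external_computed_tokens(request)
            )
            request.remote_kvs_requested = True
            request.status = "WAITING_FOR_REMOTE_KVS"
        elif request.status == "WAITING":
            # Second WAITING pass, after the KV transfer: num_computed_tokens
            # is already N-1 and the connector now reports 0. Re-assigning
            # here would wipe the value recorded above, which is why the
            # original code used += and the fix moves the assignment into
            # the first pass instead.
            pass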


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1496 to +1497
# Prefill is to be recomputed locally.
request.num_external_computed_tokens = 0
@NickLucche (Collaborator, Author)

@sdavidbd can you please double-check this? My understanding is that we now have to re-compute the whole prefill locally, so those tokens can be tracked as prompt throughput.
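For context, a sketch of the fallback path under discussion, with assumed function and attribute names; the point is that once the prefill must be recomputed locally, those tokens are no longer external and should count toward prompt throughput again:

    def on_remote_kv_load_failed(request):
        # The KV blocks never arrived (or are unusable), so the decode
        # side must run the prefill itself, starting from scratch.
        request.num_computed_tokens = 0
        # Prefill is to be recomputed locally: none of the prompt tokens
        # are external anymore, so they will be counted as locally
        # prefilled in the throughput stats.
        request.num_external_computed_tokens = 0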
