[Bugfix][P/D] Fix throughput stats in disaggregated setup #27569
base: main
Conversation
Signed-off-by: NickLucche <[email protected]>
Code Review
This pull request addresses a bug in the prompt throughput stats calculation for disaggregated setups, where tokens copied from the primary to the decoder were incorrectly counted as prefilled tokens. The changes involve modifying the scheduler to track locally prefilled tokens and updating the logging to reflect the corrected throughput. The review focuses on ensuring the correctness of the fix and the clarity of the code changes.
vllm/v1/core/sched/scheduler.py
Outdated
```python
self._update_connector_prefix_cache_stats(
    request, num_external_computed_tokens
)
request.num_external_computed_tokens += num_external_computed_tokens
```
This line adds the external computed tokens to the request's `num_external_computed_tokens`. Verify that this addition doesn't lead to double-counting in subsequent calculations or logging and doesn't inadvertently inflate the count of external tokens. Depending on how the value is used elsewhere, a direct assignment may be preferable to accumulation.
```diff
- request.num_external_computed_tokens += num_external_computed_tokens
+ request.num_external_computed_tokens = num_external_computed_tokens
```
+1, why is it `+=` instead of `=`?
@tlrmchlsmth the flow is WAITING (where `num_external_computed_tokens` is obtained) => WAITING_FOR_REMOTE_KVS => WAITING, at which point `request.num_computed_tokens = N-1` (prefill done; those tokens are counted as if they had been processed locally). On that second WAITING pass `num_external_computed_tokens` is 0, so assigning again would overwrite the recorded value.
I've moved the assignment into the first WAITING iteration so that it's clearer, and it will only go through that path once.
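The round trip described above can be sketched with a minimal toy state machine (all names here are illustrative, not vLLM's actual scheduler API): the external token count is recorded on the first WAITING pass, and must not be reset when the request re-enters WAITING after the remote KV transfer completes.

```python
# Toy sketch (hypothetical names) of the WAITING ->
# WAITING_FOR_REMOTE_KVS -> WAITING round trip: the external token
# count is recorded once and must survive the second WAITING pass,
# where the connector reports 0 external tokens.
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    WAITING_FOR_REMOTE_KVS = auto()


class Request:
    def __init__(self, num_prompt_tokens: int):
        self.status = Status.WAITING
        self.num_prompt_tokens = num_prompt_tokens
        self.num_external_computed_tokens = 0


def schedule_step(req: Request, external_tokens: int) -> None:
    if req.status is Status.WAITING:
        if req.num_external_computed_tokens == 0 and external_tokens > 0:
            # First WAITING pass: record tokens arriving from the P node
            # and wait for the remote KV cache transfer.
            req.num_external_computed_tokens = external_tokens
            req.status = Status.WAITING_FOR_REMOTE_KVS
    elif req.status is Status.WAITING_FOR_REMOTE_KVS:
        # Transfer finished; the request re-enters WAITING. An
        # unconditional `=` on the next pass would wipe the count.
        req.status = Status.WAITING
```

Because the assignment is guarded to the first WAITING iteration, the recorded count is preserved across the state round trip.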
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
```python
# Prefill is to be recomputed locally.
request.num_external_computed_tokens = 0
```
@sdavidbd can you please double-check this? My understanding is that we now have to re-compute the whole prefill locally, so we can track prompt throughput.
Fix prompt throughput stats in the CLI logger by only accounting for tokens that were prefilled locally.
In a P/D setup, the KV cache is copied from P to D, and this currently results in the following output on the decoder side:
which is plainly wrong, given that we have not actually prefilled those tokens but merely "copied" them over.
After this PR:
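The intended accounting can be sketched as follows (hypothetical helper and parameter names, not the actual scheduler/logger code): prefill throughput should count only tokens actually prefilled on this instance, excluding tokens whose KV cache was copied from the P node. On the decoder, nearly the whole prompt arrives as copied KV cache, so the reported prefill throughput should be near zero.

```python
# Hypothetical sketch of the corrected stat: subtract externally
# computed (copied) tokens before dividing by the logging interval.
def prompt_throughput_tok_per_s(num_prompt_tokens: int,
                                num_external_computed_tokens: int,
                                interval_s: float) -> float:
    # Only locally prefilled tokens count toward prefill throughput.
    locally_prefilled = num_prompt_tokens - num_external_computed_tokens
    return locally_prefilled / interval_s
```

For example, a decoder that received the entire 1000-token prompt's KV cache from the P instance would report 0 tok/s of prefill, while a standalone instance prefilling all 1000 tokens in 2 s would report 500 tok/s.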