[BugFix] Fix KVConnector TP worker aggregation #21473
Conversation
The ray PD compatibility fix #21072 broke non-ray TP PD.

Discovered by @robertgshaw2-redhat

Signed-off-by: Nick Hill <[email protected]>
Code Review

This pull request fixes an issue with KVConnector TP worker aggregation for non-Ray pipeline-decoupled setups. The changes primarily adjust the return logic in `gpu_worker.py` for non-last pipeline-parallel ranks so that KV cache transfer status is handled correctly. One potential issue was identified: the code could modify a shared mutable global constant. A code suggestion has been provided to address this and make the implementation more robust.
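For context on the "shared mutable global constant" concern, the hazard is generic Python behavior: a module-level sentinel object is shared by reference, so assigning to its attributes mutates it for every later user. A minimal sketch of both the bug and the copy-first fix; the dataclass below is a simplified stand-in for vLLM's real `ModelRunnerOutput`, not the actual type:

```python
import copy
from dataclasses import dataclass, field

@dataclass
class ModelRunnerOutput:
    # Simplified stand-in for vLLM's real ModelRunnerOutput.
    finished_sending: set[str] = field(default_factory=set)
    finished_recving: set[str] = field(default_factory=set)

# Module-level sentinel shared by every caller.
EMPTY_MODEL_RUNNER_OUTPUT = ModelRunnerOutput()

# BUG: assigning through an alias mutates the shared constant for everyone.
bad = EMPTY_MODEL_RUNNER_OUTPUT
bad.finished_sending = {"req-1"}
assert EMPTY_MODEL_RUNNER_OUTPUT.finished_sending == {"req-1"}  # polluted!

# FIX: copy first, then assign; the shared constant stays empty.
EMPTY_MODEL_RUNNER_OUTPUT = ModelRunnerOutput()  # reset for the demo
good = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT)
good.finished_sending = {"req-1"}
assert not EMPTY_MODEL_RUNNER_OUTPUT.finished_sending  # untouched
```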
```diff
-                empty_output.finished_sending = output.finished_sending
-                empty_output.finished_recving = output.finished_recving
-                output = empty_output
+                new_output = copy.copy(new_output)
```
This needs to be a deepcopy, right?
since we updated the value of `EMPTY_MODEL_RUNNER_OUTPUT`?
I don't completely follow what you mean, but I'm pretty sure it doesn't need to be a deep copy. We are just copying because we don't want to modify the shared `EMPTY_MODEL_RUNNER_OUTPUT` itself.
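A small illustration of the distinction (the `Output` class is a hypothetical stand-in for `ModelRunnerOutput`): a shallow copy is sufficient as long as attributes on the copy are *reassigned* rather than mutated in place, which appears to be the pattern here.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Output:
    # Hypothetical stand-in; the real object is vLLM's ModelRunnerOutput.
    finished_sending: set[str] = field(default_factory=set)

EMPTY = Output()

# Rebinding an attribute on the shallow copy points it at a *new* object;
# the set still referenced by EMPTY is never touched.
shallow = copy.copy(EMPTY)
shallow.finished_sending = {"req-1"}
assert EMPTY.finished_sending == set()

# A deep copy would only matter if nested objects were mutated in place:
aliased = copy.copy(EMPTY)
aliased.finished_sending.add("req-2")     # in-place mutation...
assert "req-2" in EMPTY.finished_sending  # ...leaks into the shared constant
```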
```diff
             output = new_output

         assert isinstance(output, ModelRunnerOutput)
-        # return output only from the driver worker
-        return output if self.is_driver_worker else None
+        return output
```
This change modifies the original behavior: if no KV connector is present, non-driver workers in the last PP rank now return output. I'm not certain this has any practical impact, though, since under `MultiprocExecutor` only the worker with `output_rank` sends its output back via `WorkerProc`.
Suggested fix:

```python
            return new_output

        assert isinstance(output, ModelRunnerOutput)
        return_output = self.is_driver_worker or has_kv_transfer_group()
        return output if return_output else None
```
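To make the suggested gating concrete, here is a minimal, self-contained sketch of how it would sit in a worker's `execute_model`; the `Worker` skeleton and the `has_kv_transfer_group` stub are simplified assumptions for illustration, not vLLM's actual implementation:

```python
from typing import Optional

def has_kv_transfer_group() -> bool:
    # Stub: in vLLM this reports whether a KV connector is configured.
    return False

class ModelRunnerOutput:
    ...

class Worker:
    # Simplified sketch of the gating pattern from the suggestion above.
    def __init__(self, is_driver_worker: bool):
        self.is_driver_worker = is_driver_worker

    def execute_model(self) -> Optional[ModelRunnerOutput]:
        output = ModelRunnerOutput()  # stand-in for the real model run
        assert isinstance(output, ModelRunnerOutput)
        # Drivers always report back; non-drivers only need to when a KV
        # connector is active, so their transfer status can be aggregated.
        return_output = self.is_driver_worker or has_kv_transfer_group()
        return output if return_output else None
```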
…nch (#2122)

### What this PR does / why we need it?
vLLM's main branch merged vllm-project/vllm#21072 and vllm-project/vllm#21473 to support the Ray backend and to fix rebase bugs from a previous change. Those changes broke disaggregated PD in vLLM Ascend in some scenarios. This PR adopts them so that `llmdatddist_c_mgr_connector` keeps working on the newest vLLM main branch.

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Relevant unit tests will be added to cover these changes.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@ad57f23

Signed-off-by: ganyi <[email protected]>