[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend #21525
Conversation
Code Review

This pull request fixes a potential None pointer issue with the workspace buffer in the FlashInfer TRTLLM backend by removing a redundant workspace_buffer attribute from FlashInferMetadata and instead sourcing the buffer directly from the decode_wrapper. The change is logical and improves robustness. However, I've identified a critical potential issue: the code now uses _float_workspace_buffer, while the older, more established backend implementation uses _int_workspace_buffer for the same function call. This discrepancy needs to be addressed to prevent potential runtime errors or incorrect computations.
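The pattern the review describes — dropping the duplicated metadata field and reading the buffer through the wrapper — can be sketched as follows. FlashInferMetadata, decode_wrapper, and _float_workspace_buffer are names from the PR; the surrounding class shapes are hypothetical simplifications (the real buffer is a torch.Tensor, replaced here with a bytearray to keep the sketch dependency-free):

```python
from dataclasses import dataclass


class DecodeWrapper:
    """Hypothetical stand-in for FlashInfer's decode wrapper, which owns an
    always-valid workspace buffer (a torch.Tensor in the real code)."""

    def __init__(self, size: int = 1024) -> None:
        self._float_workspace_buffer = bytearray(size)


@dataclass
class FlashInferMetadata:
    decode_wrapper: DecodeWrapper

    @property
    def workspace_buffer(self) -> bytearray:
        # Source the buffer from the wrapper instead of storing a separate,
        # possibly-None copy on the metadata object.
        return self.decode_wrapper._float_workspace_buffer


meta = FlashInferMetadata(decode_wrapper=DecodeWrapper())
assert meta.workspace_buffer is meta.decode_wrapper._float_workspace_buffer
```

Because the property always dereferences the wrapper, there is no window in which the metadata holds a stale or unset buffer reference.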
CC @mgoin for review. Thanks.
LGTM and verified locally, thank you!
Failures in CI look related.
Signed-off-by: elvischenv <[email protected]>
Thanks @mgoin, fixed that in the latest commit. There is still one failure, but it seems to be unrelated?
Retrying that test now.
Purpose
There is an always-valid workspace_buffer initialized in attn_metadata.decode_wrapper._float_workspace_buffer, so there is no need to store an extra attn_metadata.workspace_buffer; they should be the same.

The original PR (#19825) initialized attn_metadata in a way where self._workspace_buffer is None at the beginning and is only allocated after one round of self._plan(), so it only gets the correct assignment on the second round of build().

A later PR (#21137) likely fixed the issue a while ago, but I think it is better to just remove this duplicated attribute.

Also fixed the unit test broken by the API change in #21485.
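The failure mode described above (a buffer snapshot that is None on the first build() and only valid from the second round onward) can be illustrated with a minimal sketch. Only _plan(), build(), and _workspace_buffer are names from the PR; the classes and allocation details are hypothetical:

```python
class LazyWrapper:
    """Hypothetical wrapper mirroring the original #19825 flow: the
    workspace buffer is only allocated inside _plan()."""

    def __init__(self) -> None:
        self._workspace_buffer = None  # not yet allocated

    def _plan(self) -> None:
        if self._workspace_buffer is None:
            self._workspace_buffer = bytearray(1024)


class Builder:
    def __init__(self) -> None:
        self.wrapper = LazyWrapper()

    def build(self):
        # Original ordering: the metadata snapshot copies the buffer
        # reference *before* _plan() runs, so the first snapshot sees None.
        snapshot = self.wrapper._workspace_buffer
        self.wrapper._plan()
        return snapshot


b = Builder()
assert b.build() is None      # 1st round: stale None snapshot
assert b.build() is not None  # 2nd round: buffer allocated during round 1
```

Reading the buffer through the wrapper at use time, rather than copying the reference into the metadata at build time, removes this one-round lag entirely.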
Test Plan
tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py
lm_eval
Test Result
(Optional) Documentation Update