Support token_type_ids in V1 with less code changes #21985
Conversation
```python
    post_process_tokens(model_config, engine_prompt)

    if mm_data is not None:
        engine_prompt["multi_modal_data"] = mm_data
    return full_prompt, engine_prompt


def compress_token_type_ids(token_type_ids: list[int]) -> int:
```
This is to minimize the amount of data that is transferred between processes.
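For illustration, here is a minimal sketch of what such a compression helper could look like; the PR's actual implementation may differ. It assumes the `token_type_ids` of a cross-encoder prompt form a run of 0s followed by a run of 1s, so the whole list can be represented by the index where the second segment starts.

```python
def compress_token_type_ids(token_type_ids: list[int]) -> int:
    """Represent a 000...111 token_type_ids list by the index of the first 1.

    Sketch only: assumes two contiguous segments, the typical
    cross-encoder (query, document) layout.
    """
    try:
        first_one = token_type_ids.index(1)
    except ValueError:
        first_one = len(token_type_ids)  # all zeros
    # Everything before `first_one` must be 0 and everything from it on must be 1.
    assert all(t == 0 for t in token_type_ids[:first_one])
    assert all(t == 1 for t in token_type_ids[first_one:])
    return first_one


def decompress_token_type_ids(split_point: int, num_tokens: int) -> list[int]:
    """Inverse of the sketch above: rebuild the 0/1 list from the split point."""
    return [0] * split_point + [1] * (num_tokens - split_point)
```

With this representation, only a single integer per request crosses the process boundary instead of one integer per token.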
Code Review

This pull request introduces a clever optimization for passing `token_type_ids` to models by bit-packing them into the `input_ids` tensor. This avoids changing many function signatures across the codebase. The overall approach is sound and the implementation appears correct.

My main feedback focuses on improving the maintainability of this new bit-packing mechanism in vllm/model_executor/models/bert.py. The functions for encoding and decoding `token_type_ids` have in-place side effects that are not obvious from their names, and the code could benefit from comments explaining the bit-packing logic. Addressing these points will make the code easier to understand and safer for future modifications.
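To make the bit-packing idea concrete, here is a hedged sketch of how a 0/1 token type could be folded into and recovered from the token IDs. The helper names and the choice of bit are illustrative, not the PR's actual code, and the sketch assumes real vocabulary indices never use the chosen high bit. Unlike the in-place functions the review mentions, this version is out of place for clarity.

```python
import torch

# Illustrative only: pick a bit well above any realistic vocabulary index.
TOKEN_TYPE_SHIFT = 30


def pack_token_types(input_ids: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
    """Fold the 0/1 token type into a high bit of each token id."""
    return input_ids | (token_type_ids << TOKEN_TYPE_SHIFT)


def unpack_token_types(packed_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split the packed tensor back into clean token ids and token type ids."""
    token_type_ids = packed_ids >> TOKEN_TYPE_SHIFT
    input_ids = packed_ids & ((1 << TOKEN_TYPE_SHIFT) - 1)
    return input_ids, token_type_ids


if __name__ == "__main__":
    ids = torch.tensor([101, 2023, 102, 3231, 102])
    types = torch.tensor([0, 0, 0, 1, 1])
    packed = pack_token_types(ids, types)
    restored_ids, restored_types = unpack_token_types(packed)
    assert torch.equal(restored_ids, ids) and torch.equal(restored_types, types)
```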
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Can you merge from main again? Thanks
Pooling models tests are failing, please check
In the `BertWithRope` model the `position_ids` argument was renamed to `positions`.
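As a purely illustrative sketch (the real `BertWithRope` forward signature carries more arguments), the rename amounts to something like this:

```python
import torch
from torch import nn


class BertWithRopeSketch(nn.Module):
    """Toy stand-in for BertWithRope, only to show the renamed argument."""

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # This second argument was previously called `position_ids`; renaming it
        # to `positions` matches what the V1 model runner passes to other models.
        return input_ids + positions  # placeholder computation
```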
@WoosukKwon is quite busy lately, so I'll just merge this, since the changes to the model runner are really minimal and the pooling models tests have passed.
Can you open a follow-up PR to update the docs accordingly?
This PR is yet another follow-up to #16188 and #21270. It adds support for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require `token_type_ids` to be passed from the tokenizer to the model.
Since passing the `token_type_ids` up the chain from the entrypoints to the model runner is quite invasive, I'm also exploring other implementation alternatives such as #19988 and #20026.
PR #19988 tries the same approach as V0, but the problem is that it has to touch too many places in the code. #20026 tries to minimize the code impact by passing the token types as multimodal args, which is admittedly a bit weird. This one adds the token type ids to the pooling params, thereby removing the need to touch too many places in the code. It also avoids allocating persistent tensors by encoding the token types together with the token ids.
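As a rough sketch of the data flow described here (the field and helper names below are hypothetical, not necessarily what this PR uses): the entrypoint compresses the `token_type_ids` and attaches the result to the pooling params, and the model-runner side expands it again next to the token IDs.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PoolingParamsSketch:
    """Toy stand-in for PoolingParams with a hypothetical extra slot."""
    extra_args: dict[str, Any] = field(default_factory=dict)


def prepare_pooling_request(
    token_ids: list[int], token_type_ids: list[int]
) -> tuple[list[int], PoolingParamsSketch]:
    """Entrypoint side: ship a single split index instead of one int per token."""
    split = token_type_ids.index(1) if 1 in token_type_ids else len(token_type_ids)
    params = PoolingParamsSketch()
    params.extra_args["compressed_token_type_ids"] = split
    return token_ids, params


def rebuild_token_type_ids(params: PoolingParamsSketch, num_tokens: int) -> list[int]:
    """Model-runner side: expand the split index back into a 0/1 list."""
    split = params.extra_args["compressed_token_type_ids"]
    return [0] * split + [1] * (num_tokens - split)
```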
cc: @DarkLight1337