Support encoder_only attention for FlexAttention #22273
Conversation
Signed-off-by: Max de Bayser <[email protected]>
- Fix boolean condition
- Reduce max tokens in generative tests to reduce the chance of divergence
- Clean up memory after LLM run

Signed-off-by: Max de Bayser <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@@ -36,7 +38,7 @@ def test_flex_attention_vs_default_backend(monkeypatch):
     """
     model_name = "Qwen/Qwen2.5-1.5B-Instruct"
     seed = 42
-    max_tokens = 32
+    max_tokens = 24
I've reduced the max tokens a bit because as the sequence length grows, the chance of divergence increases. On the A100 where I'm testing this, I get the following output on the main branch:
flex_text= ' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) United Kingdom\nE'
default_text=' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) Belgium\nE)'
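As a side note on why shortening the generation helps (a toy illustration, not code from this PR): with greedy sampling, a tiny numerical difference between two backends is enough to flip the argmax at a single step, and the generations diverge from that token onward, so longer sequences give more opportunities for a flip. The logit values below are made up purely to show the effect.

```python
import torch

# Made-up logits for one decoding step from two numerically equivalent backends.
logits_a = torch.tensor([10.1234, 10.1233, 3.0])
logits_b = logits_a + torch.tensor([0.0, 2e-4, 0.0])  # tiny rounding difference

print(torch.argmax(logits_a).item())  # 0
print(torch.argmax(logits_b).item())  # 1 -> greedy outputs diverge from here on
```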
key_cache,
value_cache,
key_tensor,
value_tensor,
I've renamed the variables here because in one case they are key and value, and in the other case key_cache and value_cache.
Code Review

This pull request adds support for encoder-only attention to FlexAttention, which is a key feature for running embedding models like BERT at high precision. The changes are well-structured, introducing a `causal` flag to differentiate between decoder and encoder attention paths. The new test case for encoder models is a great addition. My review includes a suggestion to enhance the new test to cover `float32`, as mentioned in the PR description, and a refactoring suggestion to improve code maintainability by reducing duplication.
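As background on the mechanism under review, here is a minimal sketch using PyTorch's FlexAttention API (not code from this PR; shapes, device, and mask construction are assumptions): the difference between the decoder and encoder-only paths comes down to whether a causal block mask is applied or attention is left fully bidirectional.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Assumed shapes/device purely for illustration (requires PyTorch >= 2.5 and a GPU).
B, H, S, D = 1, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# Decoder path: a causal block mask so each query only attends to earlier tokens.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
decoder_out = flex_attention(q, k, v, block_mask=block_mask)

# Encoder-only path: no mask at all, i.e. fully bidirectional attention.
encoder_out = flex_attention(q, k, v)
```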
LGTM as long as tests pass but I'll have @drisspg take a look at this as well.
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Looks good
Can you merge from main to fix the build errors?
Test failures are not related, merging
Purpose
This PR adds support for encoder-only attention, which is bidirectional and doesn't use the KV cache. This type of attention is used in embedding and classifier models such as those based on the BERT and RoBERTa architectures. These models are already supported with FlashAttention since PR #21270, but FlashAttention only supports float16 and bfloat16. To be able to run embedding benchmarks such as MTEB at the highest precision, we need float32 support.
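For illustration, a minimal usage sketch of the scenario this enables (not taken from this PR; the model name, the `task` argument, and the backend selection via environment variable are assumptions):

```python
import os

from vllm import LLM

# Assumption: the backend is selected via the VLLM_ATTENTION_BACKEND environment
# variable; float32 is the dtype that FlashAttention cannot serve for these models.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimension of the model
```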
Test Plan
I've added an embedding test in the kernel tests.
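For context, a rough sketch of what such a comparison test could look like (hypothetical: the model, backend names, dtypes, and tolerance are assumptions, and the actual test added in this PR may be structured differently):

```python
import gc
from typing import Optional

import torch
from vllm import LLM

PROMPTS = ["The capital of France is Paris.", "FlexAttention supports float32."]
MODEL = "BAAI/bge-base-en-v1.5"  # hypothetical BERT-style embedding model


def _embed(monkeypatch, dtype: str, backend: Optional[str] = None) -> torch.Tensor:
    if backend is not None:
        monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend)
    else:
        monkeypatch.delenv("VLLM_ATTENTION_BACKEND", raising=False)
    llm = LLM(model=MODEL, task="embed", dtype=dtype, enforce_eager=True)
    embeddings = torch.tensor([o.outputs.embedding for o in llm.embed(PROMPTS)])
    # Free the engine before the next run, mirroring the memory cleanup
    # mentioned in one of the commits above.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return embeddings


def test_flex_attention_embedding(monkeypatch):
    # FlexAttention at float32 vs. whatever backend vLLM picks at float16;
    # the resulting embeddings should point in (nearly) the same direction.
    flex = _embed(monkeypatch, dtype="float32", backend="FLEX_ATTENTION")
    ref = _embed(monkeypatch, dtype="float16")
    sim = torch.nn.functional.cosine_similarity(flex, ref.to(flex.dtype), dim=-1)
    assert torch.all(sim > 0.98)
```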
Test Results
I've verified that the tests are passing on an A100.
cc: @DarkLight1337, @russellb, @drisspg