
Conversation

maxdebayser (Contributor) commented on Aug 5, 2025

Purpose

This PR adds support for encoder-only attention, which is bidirectional and doesn't use the KV cache. This type of attention is used in embedding and classifier models such as the BERT- and RoBERTa-architecture models. They are already supported with flash attention after PR #21270, but flash attention only supports float16 and bfloat16. To be able to run embedding benchmarks such as MTEB at the highest precision, we need support for float32.
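To make the use case concrete, here is a minimal sketch of running a BERT-style embedding model at full precision, assuming vLLM's offline embedding API; the model name and the `VLLM_ATTENTION_BACKEND=FLEX_ATTENTION` selector are assumptions for illustration and may differ across vLLM versions:

```python
import os

from vllm import LLM

# Assumption: the FlexAttention backend can be forced via this env var.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

# float32 keeps full precision for embedding benchmarks such as MTEB;
# flash attention would reject this dtype.
llm = LLM(
    model="BAAI/bge-base-en-v1.5",  # illustrative BERT-style embedding model
    task="embed",
    dtype="float32",
)

outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimension
```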

Test Plan

I've added an embedding test in the kernel tests.
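As a rough sketch of what such a kernel-level check can look like (not the exact test added in this PR), bidirectional FlexAttention at float32 should match PyTorch's SDPA reference when no causal mask is applied; a CUDA device is assumed:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention


def test_encoder_flex_attention_matches_sdpa():
    torch.manual_seed(0)
    device = "cuda"  # assumption: a CUDA device, as on the A100 used below
    q, k, v = (
        torch.randn(1, 4, 64, 32, dtype=torch.float32, device=device)
        for _ in range(3)
    )

    # Encoder-only attention is bidirectional: no causal mask on either side.
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=False)
    out = flex_attention(q, k, v)  # no block mask -> full bidirectional attention

    torch.testing.assert_close(out, ref, atol=1e-4, rtol=1e-4)
```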

Test Results

I've verified that the tests pass on an A100.

cc: @DarkLight1337, @russellb, @drisspg

- Fix boolean condition
- Reduce max tokens in generative tests to reduce the chance of
  divergence
- Clean up memory after LLM run

Signed-off-by: Max de Bayser <[email protected]>
github-actions (bot) commented on Aug 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@@ -36,7 +38,7 @@ def test_flex_attention_vs_default_backend(monkeypatch):
     """
     model_name = "Qwen/Qwen2.5-1.5B-Instruct"
     seed = 42
-    max_tokens = 32
+    max_tokens = 24
maxdebayser (Contributor, Author) commented:

I've reduced the max tokens a bit because as the sequence length grows, the chance of divergence increases. On the A100 where I'm testing this, I get the following output on the main branch:

flex_text=   ' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) United Kingdom\nE'
default_text=' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) Belgium\nE)'

-    key_cache,
-    value_cache,
+    key_tensor,
+    value_tensor,
maxdebayser (Contributor, Author) commented:

I've renamed the variables here because in one case they are key and value and in the other case key_cache and value_cache.

gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for encoder-only attention to FlexAttention, which is a key feature for running embedding models like BERT at high precision. The changes are well-structured, introducing a causal flag to differentiate between decoder and encoder attention paths. The new test case for encoder models is a great addition. My review includes a suggestion to enhance the new test to cover float32 as mentioned in the PR description, and a refactoring suggestion to improve code maintainability by reducing duplication.
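As a rough illustration of the mechanism described above (not vLLM's actual implementation), a causal flag can simply select between a causal and an all-true mask_mod when building the FlexAttention block mask; the names, shapes, and device below are assumptions:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def build_block_mask(seq_len: int, causal: bool, device: str = "cuda"):
    """Build a FlexAttention block mask: causal (decoder) or full (encoder)."""

    def causal_mod(b, h, q_idx, kv_idx):
        return q_idx >= kv_idx  # decoder path: attend only to earlier tokens

    def bidirectional_mod(b, h, q_idx, kv_idx):
        return q_idx >= 0  # always true: every token attends to every token

    mask_mod = causal_mod if causal else bidirectional_mod
    return create_block_mask(
        mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device
    )


q = torch.randn(1, 8, 128, 64, dtype=torch.float32, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Encoder-only (e.g. BERT) path: bidirectional attention, no KV cache involved.
out = flex_attention(q, k, v, block_mask=build_block_mask(128, causal=False))
```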

DarkLight1337 (Member) left a comment

LGTM as long as tests pass, but I'll have @drisspg take a look at this as well.

Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 6, 2025
drisspg (Contributor) left a comment

Looks good

DarkLight1337 (Member) commented:

Can you merge from main to fix the build errors?

DarkLight1337 (Member) commented on Aug 7, 2025

Test failures are not related, merging

vllm-bot merged commit f825c6b into vllm-project:main on Aug 7, 2025
34 of 41 checks passed
nvjullin pushed a commit to nvjullin/vllm that referenced this pull request Aug 7, 2025
jingyu-ml pushed a commit to jingyu-ml/vllm that referenced this pull request Aug 8, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
yyihuang pushed a commit to yyihuang/vllm that referenced this pull request Aug 11, 2025
wuhang2014 pushed a commit to wuhang2014/vllm that referenced this pull request Aug 12, 2025
aarnphm pushed a commit to aarnphm/vllm that referenced this pull request Aug 13, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
BoyuanFeng pushed a commit to BoyuanFeng/vllm that referenced this pull request Aug 14, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
juuice-lee pushed a commit to juuice-lee/vllm-moe.code that referenced this pull request Aug 18, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
dumb0002 pushed a commit to dumb0002/vllm that referenced this pull request Aug 28, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels: ready, v1