Support encoder_only attention for FlexAttention #22273
Conversation
Signed-off-by: Max de Bayser <[email protected]>
- Fix boolean condition
- Reduce max tokens in generative tests to reduce the chance of divergence
- Clean up memory after LLM run

Signed-off-by: Max de Bayser <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@@ -36,7 +38,7 @@ def test_flex_attention_vs_default_backend(monkeypatch):
     """
     model_name = "Qwen/Qwen2.5-1.5B-Instruct"
     seed = 42
-    max_tokens = 32
+    max_tokens = 24
I've reduced the max tokens a bit because as the sequence length grows, the chance of divergence increases. On the A100 where I'm testing this, I get the following output on the main branch:
flex_text= ' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) United Kingdom\nE'
default_text=' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) Belgium\nE)'
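As a side note on why shortening the generation helps (a toy illustration, not code from this PR): with greedy sampling, a tiny numerical difference between two backends is enough to flip the argmax at a single step, and the generations diverge from that token onward, so longer sequences give more opportunities for a flip. The logit values below are made up purely to show the effect.

```python
import torch

# Made-up logits for one decoding step from two numerically equivalent backends.
logits_a = torch.tensor([10.1234, 10.1233, 3.0])
logits_b = logits_a + torch.tensor([0.0, 2e-4, 0.0])  # tiny rounding difference

print(torch.argmax(logits_a).item())  # 0
print(torch.argmax(logits_b).item())  # 1 -> greedy outputs diverge from here on
```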
key_cache,
value_cache,
key_tensor,
value_tensor,
I've renamed the variables here because in one case they are key and value, and in the other case key_cache and value_cache.
Code Review

This pull request adds support for encoder-only attention to FlexAttention, which is a key feature for running embedding models like BERT at high precision. The changes are well-structured, introducing a `causal` flag to differentiate between decoder and encoder attention paths. The new test case for encoder models is a great addition. My review includes a suggestion to enhance the new test to cover `float32`, as mentioned in the PR description, and a refactoring suggestion to improve code maintainability by reducing duplication.
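As background on the mechanism under review, here is a minimal sketch using PyTorch's FlexAttention API (not code from this PR; shapes, device, and mask construction are assumptions): the difference between the decoder and encoder-only paths comes down to whether a causal block mask is applied or attention is left fully bidirectional.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Assumed shapes/device purely for illustration (requires PyTorch >= 2.5 and a GPU).
B, H, S, D = 1, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# Decoder path: a causal block mask so each query only attends to earlier tokens.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
decoder_out = flex_attention(q, k, v, block_mask=block_mask)

# Encoder-only path: no mask at all, i.e. fully bidirectional attention.
encoder_out = flex_attention(q, k, v)
```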
LGTM as long as tests pass but I'll have @drisspg take a look at this as well.
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Looks good
Can you merge from main to fix the build errors?
Test failures are not related, merging
Purpose
This PR adds support for encoder-only attention, which is bidirectional and doesn't use the KV cache. This type of attention is used in embedding and classifier models such as those based on the BERT and RoBERTa architectures. These models are already supported with FlashAttention since PR #21270, but FlashAttention only supports float16 and bfloat16. To be able to run embedding benchmarks such as MTEB at the highest precision, we need float32 support.
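For illustration, a minimal usage sketch of the scenario this enables (not taken from this PR; the model name, the `task` argument, and the backend selection via environment variable are assumptions):

```python
import os

from vllm import LLM

# Assumption: the backend is selected via the VLLM_ATTENTION_BACKEND environment
# variable; float32 is the dtype that FlashAttention cannot serve for these models.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimension of the model
```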
Test Plan
I've added an embedding test in the kernel tests.
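For context, a rough sketch of what such a comparison test could look like (hypothetical: the model, backend names, dtypes, and tolerance are assumptions, and the actual test added in this PR may be structured differently):

```python
import gc
from typing import Optional

import torch
from vllm import LLM

PROMPTS = ["The capital of France is Paris.", "FlexAttention supports float32."]
MODEL = "BAAI/bge-base-en-v1.5"  # hypothetical BERT-style embedding model


def _embed(monkeypatch, dtype: str, backend: Optional[str] = None) -> torch.Tensor:
    if backend is not None:
        monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend)
    else:
        monkeypatch.delenv("VLLM_ATTENTION_BACKEND", raising=False)
    llm = LLM(model=MODEL, task="embed", dtype=dtype, enforce_eager=True)
    embeddings = torch.tensor([o.outputs.embedding for o in llm.embed(PROMPTS)])
    # Free the engine before the next run, mirroring the memory cleanup
    # mentioned in one of the commits above.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return embeddings


def test_flex_attention_embedding(monkeypatch):
    # FlexAttention at float32 vs. whatever backend vLLM picks at float16;
    # the resulting embeddings should point in (nearly) the same direction.
    flex = _embed(monkeypatch, dtype="float32", backend="FLEX_ATTENTION")
    ref = _embed(monkeypatch, dtype="float16")
    sim = torch.nn.functional.cosine_similarity(flex, ref.to(flex.dtype), dim=-1)
    assert torch.all(sim > 0.98)
```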
Test Results
I've verified that the tests are passing on an A100.
cc: @DarkLight1337, @russellb, @drisspg